How to monitor a Fly.io app

Fly.io's pitch is running your app close to your users, on machines distributed across the globe. It's a compelling model. Your API in Amsterdam, another instance in Chicago, another in Sydney. Requests get routed to the nearest healthy machine.

The "healthy" part is where monitoring comes in. Fly runs health checks internally to decide whether to route traffic to a machine. But those checks are about whether Fly should send traffic to the machine, not about whether your users are having a good experience. Your app can pass Fly's health checks while returning wrong data, being painfully slow from certain regions, or serving an expired SSL certificate.

What to monitor

Your public hostname. The .fly.dev domain or your custom domain. This is the URL your users hit, and it's the URL you should monitor. An HTTP check against this URL goes through Fly's proxy layer, which means it tests the full path a real user would take, including routing, TLS termination, and machine selection.

Region-specific performance. This is where Fly monitoring gets interesting. Your app might be healthy in Frankfurt but struggling in Sydney because the Sydney machine has less memory, a cold volume mount, or is running an older deploy. Monitoring from multiple locations lets you see these regional differences.

If you only monitor from one location, you're testing one region's machine. You have no idea what the other regions look like.

Health check endpoints. If your app has a /health or /up endpoint, use it. A good health check on Fly should verify that the app can connect to its database. On Fly, this often means connecting to a Postgres cluster over the internal network, and that connection can fail independently of the app itself.

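If you're using Fly Postgres, a health check that runs a simple query (SELECT 1) catches connection pool exhaustion and networking issues between your app machine and the database machine. It won't reveal replication lag on its own, because SELECT 1 succeeds on a replica no matter how far behind it is. Detecting lag takes a separate query against the replica.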
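As a rough illustration, here is a minimal sketch of a /health endpoint that runs SELECT 1 and reports the result. The names check_database, health_response, and connect_db are made up for this example, and sqlite3 stands in for the database so the sketch runs on its own. With Postgres you would swap connect_db for your driver's connect call, and some drivers need a cursor instead of conn.execute.

```python
import json
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_database(connect):
    """Open a fresh connection, run SELECT 1, and report whether it worked."""
    try:
        conn = connect()
        try:
            conn.execute("SELECT 1").fetchone()
        finally:
            conn.close()
        return True
    except Exception:
        # Any failure (network, auth, pool exhaustion) counts as unhealthy.
        return False


def health_response(db_ok):
    """Map the database check result to an HTTP status and a JSON body."""
    status = 200 if db_ok else 503
    body = {
        "status": "ok" if db_ok else "error",
        "db": "connected" if db_ok else "unreachable",
    }
    return status, json.dumps(body, separators=(",", ":"))


def connect_db():
    # Stand-in for illustration only: replace with your Postgres connect call.
    return sqlite3.connect(":memory:")


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = health_response(check_database(connect_db))
        payload = body.encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


if __name__ == "__main__":
    # Port 8080 is an assumption; use whatever internal port your app listens on.
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```

Returning 503 when the query fails means a plain status-code check already catches a dead database connection, even before you add any body validation.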

Custom domain SSL. If you've added a custom domain, Fly provisions an SSL certificate for it. Like any automated certificate, renewal can fail. DNS misconfiguration, CAA record issues, or Fly platform problems can all prevent renewal. Monitoring the certificate expiry date means you hear about it at 28 days out, not when browsers start blocking your site.
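If you want to see what a certificate expiry check does under the hood, the sketch below reads the notAfter field from the certificate a host presents and converts it to days remaining. The function names and the example.com placeholder are assumptions for illustration. The 28-day threshold simply mirrors the figure above.

```python
import socket
import ssl
import time

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after, now=None):
    """Whole days between `now` (epoch seconds) and a cert's notAfter string."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return int((expires_at - now) // SECONDS_PER_DAY)


def fetch_not_after(hostname, port=443, timeout=10):
    """Open a TLS connection and return the leaf certificate's notAfter field."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]


if __name__ == "__main__":
    host = "example.com"  # placeholder: use your custom domain
    remaining = days_until_expiry(fetch_not_after(host))
    print(f"{host}: certificate expires in {remaining} days")
    if remaining <= 28:
        print("WARNING: renewal may have failed")
```

Because the default SSL context verifies the certificate, an already-expired or mismatched certificate raises an error at connection time rather than returning a date. That error is itself a signal worth alerting on.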

Fly-specific failure modes

Machine restarts. Fly machines get restarted. Deployments, host maintenance, resource limits, OOM kills. Fly handles this automatically by starting a new machine, but there's a gap. The old machine is gone, the new one is booting, and for a few seconds (or longer, depending on your app's startup time), that region might not have a healthy machine.

If you only have one machine per region, this restart window means downtime for users in that region. Monitoring from multiple global locations surfaces this. You'll see a check fail from one region while others pass.

Volume issues. Fly volumes are local to a specific host. If the host goes down or your machine gets moved to a different host, the volume might not follow. Your app starts, can't find its data directory, and either crashes or runs in a degraded state. A health check that reads from the volume catches this.
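One way to fold the volume into a health check is to read a file you expect to exist on it. The sketch below assumes your app writes a sentinel file to the volume at setup time. The function name, the READY filename, and the /data path in the comment are all hypothetical.

```python
import os


def volume_ok(data_dir, sentinel="READY"):
    """Return True if the sentinel file under `data_dir` exists and is readable."""
    path = os.path.join(data_dir, sentinel)
    try:
        with open(path, "rb") as f:
            f.read(1)  # force an actual read, not just a stat
        return True
    except OSError:
        # Missing mount, missing file, permission or I/O error.
        return False


# Example: have your health endpoint return 503 unless volume_ok("/data") is True.
```

Checking for a known file rather than just the directory matters because the mount point can exist as an empty directory even when the volume itself is not attached.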

Deploy propagation. When you deploy on Fly, machines are not all updated at the same instant. With a rolling deploy they are replaced one after another, so there's a window where some machines (and therefore some regions) are running the new version and others are running the old one. This is usually fine, but if the new version has a breaking change (a database migration that the old version doesn't understand, a changed API contract), you can have inconsistent behavior across regions.

Internal networking. Fly apps communicate over an internal WireGuard network (6PN). Your app talks to Postgres, Redis, or other services over this network. This network is separate from the public internet, and it can have issues independently. Your app might be reachable from outside but unable to connect to its database internally.

Setting it up

Create an HTTP monitor in Larm for your Fly app's public URL. Larm checks from multiple global locations, so you'll see regional issues that single-location monitoring misses.

Things to consider:

Timeout. Fly machines can have cold starts, especially if you're using auto-stop and auto-start. If your machines scale to zero, the first request after idle time has to wait for a machine to boot, which can take several seconds. Set a timeout that accounts for this: 10-15 seconds for apps with cold starts, 5 seconds for always-on machines.

Check interval. For production apps, 1-minute checks give you fast detection. If you're on Larm's free plan, 3-minute checks are available and still catch most issues within a reasonable window.

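Keyword validation. If your health endpoint returns structured JSON like {"status":"ok","db":"connected"}, check for the keyword "db":"connected", quotes included. A bare match on connected would also match "disconnected", so anchor the keyword on the opening quote. This catches cases where the endpoint returns 200 but the database connection is actually down.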
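To make the timeout and keyword rules concrete, here is a minimal sketch of what a single check does. It is not Larm's implementation. The function names are invented for the example, and the 15-second default assumes an app with cold starts.

```python
import urllib.request


def evaluate(status, body, keyword):
    """Pass only on a 2xx status AND when the keyword appears in the body."""
    return 200 <= status < 300 and keyword in body


def check(url, keyword, timeout=15):
    """Fetch `url` with a timeout and apply the status and keyword rules."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return evaluate(resp.status, body, keyword)
    except OSError:
        # Covers HTTP error statuses, DNS failures, refused connections,
        # TLS errors, and timeouts: all of them count as a failed check.
        return False


# The keyword includes the JSON quotes so "disconnected" cannot match.
KEYWORD = '"db":"connected"'
```

A plain substring match on connected would pass a body containing "db":"disconnected", which is why the keyword here carries the surrounding quotes.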

Multiple services on Fly

If you're running more than one Fly app (API, frontend, workers), each one needs its own monitor. For HTTP services, that's an HTTP monitor per service. For background workers that process queues but don't serve HTTP, use heartbeat monitoring. Have the worker ping a URL after each successful job cycle, and get alerted when the pings stop.
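For a worker, the heartbeat pattern can be sketched roughly like this. HEARTBEAT_URL is a hypothetical placeholder for whatever ping URL your monitoring tool gives you, and run_cycle and send_ping are names made up for the example.

```python
import urllib.request

# Hypothetical placeholder: substitute the ping URL from your heartbeat monitor.
HEARTBEAT_URL = "https://example.com/heartbeat/your-token"


def send_ping(url=HEARTBEAT_URL, timeout=10):
    """Send the heartbeat; return False instead of raising if the ping fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except OSError:
        return False


def run_cycle(process_jobs, ping=send_ping):
    """Run one job cycle and ping only if it finished without raising."""
    process_jobs()  # an exception here propagates, so no ping is sent
    return ping()
```

The ping comes after the work, not before it, so a worker that is running but failing every job goes quiet and triggers the alert. A failed ping is swallowed on purpose: a monitoring outage shouldn't crash the worker.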

If your services depend on each other, the health check for each service should verify its own dependencies. The API's health check verifies the database. The frontend's health check verifies the API. This way, a failure in one layer surfaces through monitoring of the dependent layer.

Fly.io gives you multi-region infrastructure. Multi-region monitoring is the other half of that. Larm's free plan includes 15 monitors checked from all probe locations, which covers a typical Fly project and then some.
