How multi-probe voting eliminates false alerts

If you've ever been woken up at 3 AM by a monitoring alert, only to find your site was perfectly fine, you know the feeling. You check the dashboard, everything is green, and you go back to bed wondering why you're paying for a service that can't tell the difference between an outage and a network hiccup.

This happens a lot. The industry number that gets thrown around is that 85% of uptime alerts are false positives. I don't know if that's exactly right, but having been on the receiving end of monitoring alerts for years, it feels right.

The root cause is almost always the same: a single probe in a single location runs a check, gets a timeout or a connection blip, and fires an alert. One bad data point, one page. That's how most monitoring tools work, and it's why most teams eventually start ignoring their alerts.

The problem with single-probe checks

Think about what happens when your monitoring service checks your site from a single server in Virginia. That check has to travel through the public internet, hit your server, and come back. There are a lot of things that can go wrong along the way that have nothing to do with your service being down.

The probe's ISP has a routing issue. A submarine cable is having a bad day. There's packet loss at a peering point. Your CDN is slow to respond from that specific region. Any of these will cause a timeout, and a timeout means an alert.

The monitoring service doesn't know why the check failed. It just knows it did. So it alerts you.

And it gets worse. Some monitoring tools try to mitigate this by retrying from the same location. But if the problem is a regional network issue, the retry fails too, and now you've got a "confirmed" outage that isn't one.

Majority voting

The fix is to check from multiple locations at the same time, and only alert when a majority of probes agree that something is wrong.

Larm runs every check from multiple global probes. Every 5 seconds, the evaluator collects the most recent result from each probe within a lookback window and tallies the votes. If more than half the probes report failure, the vote is :majority_fail. If more than half report success, it's :majority_pass. Ties produce no state change.

Nuremberg   → timeout    ✗
Helsinki    → 200 OK     ✓
Virginia    → 200 OK     ✓
Singapore   → 200 OK     ✓

Votes: 3 pass, 1 fail → majority_pass → no alert

One probe timed out. Three others confirmed the service is up. The system correctly identifies this as a probe-level or network-level issue, not an outage.
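To make the tally concrete, here is a minimal sketch in Python. It is not Larm's actual code: the ProbeResult type, the tally function, and the "tie" return value are hypothetical names chosen for illustration. It keeps the most recent result per probe inside the lookback window and requires a strict majority either way.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ProbeResult:
    probe: str            # probe location, e.g. "helsinki" (illustrative)
    checked_at: datetime  # when the probe ran the check
    ok: bool              # True if the check passed


def tally(results, now, lookback):
    """Return "majority_fail", "majority_pass", or "tie" (hypothetical labels)."""
    # Keep only the most recent result per probe inside the lookback window.
    latest = {}
    for r in results:
        if now - r.checked_at > lookback:
            continue  # too old to count
        seen = latest.get(r.probe)
        if seen is None or r.checked_at > seen.checked_at:
            latest[r.probe] = r

    fails = sum(1 for r in latest.values() if not r.ok)
    passes = len(latest) - fails
    if fails * 2 > len(latest):
        return "majority_fail"
    if passes * 2 > len(latest):
        return "majority_pass"
    return "tie"  # no strict majority: the monitor state stays as it is
```

Fed the four results from the example above, this returns "majority_pass". A 2-2 split returns "tie", which matches the rule that ties produce no state change.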

The lookback window is adaptive. It's derived from the monitor's check interval in seconds: max(interval * 2 / 60 + 1, 3) minutes. A monitor checking every 60 seconds gets a 3-minute window. A monitor checking every 5 minutes (300 seconds) gets an 11-minute window. This ensures at least 2 check cycles are considered and gives probes time to report in, accounting for scheduling drift and network delays.
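The window formula is easy to check by hand. The sketch below is a direct Python transcription, with one assumption that the formula itself doesn't settle: the division is treated as integer division. The function name is made up for illustration, and both examples in the text come out the same either way.

```python
def lookback_minutes(interval_seconds: int) -> int:
    """Lookback window in minutes for a given check interval in seconds."""
    # Two check intervals converted to minutes, plus one minute of slack,
    # with a floor of 3 minutes. Integer division is an assumption.
    return max(interval_seconds * 2 // 60 + 1, 3)


print(lookback_minutes(60))   # 60-second interval  -> 3
print(lookback_minutes(300))  # 5-minute interval   -> 11
```

The floor of 3 only matters for intervals shorter than 60 seconds. Anything longer is governed by the interval itself.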

Confirmation windows

Voting by itself handles the common case: a single probe has a bad moment. But network issues can affect multiple probes temporarily. A BGP route change can make your server unreachable from half the internet for 30 seconds. A brief majority of probes might report failure, but it resolves on its own.

Confirmation windows handle this. When the vote flips to :majority_fail, Larm doesn't immediately transition the monitor to down. It starts a timer. If the majority keeps reporting failure for the entire confirmation window, the monitor transitions to down. If the vote flips back to pass at any point during the window, the timer resets and the blip is absorbed.

The default confirmation window for going down is 1 minute. For recovery, it's 3 minutes. The asymmetry is intentional: recovering too quickly from an outage and then going back down creates noisy, flapping alerts. Both are configurable per monitor, from 0 (instant) to 30 minutes.

This works in combination with majority voting. Voting filters out single-probe network issues. Confirmation windows filter out brief multi-probe disruptions. Together, they mean that when an alert fires, the service has been confirmed down by most probes for a sustained period.
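Here is how a confirmation window can sit on top of the vote, again as a rough Python sketch rather than Larm's implementation. The class and method names are hypothetical. The defaults mirror the 1-minute and 3-minute windows described above. One assumption to flag: a tie leaves a running timer untouched, which the rules above don't spell out.

```python
from datetime import datetime, timedelta


class ConfirmedMonitor:
    """Tracks up/down state, requiring a sustained majority before flipping."""

    def __init__(self, down_after=timedelta(minutes=1),
                 up_after=timedelta(minutes=3)):
        self.state = "up"
        self.down_after = down_after
        self.up_after = up_after
        self.pending_since = None  # when the opposing majority was first seen

    def observe(self, vote, now):
        """Feed one vote; return the new state on a transition, else None."""
        is_up = self.state == "up"
        opposing = "majority_fail" if is_up else "majority_pass"
        agreeing = "majority_pass" if is_up else "majority_fail"

        if vote == agreeing:
            self.pending_since = None  # vote flipped back: timer resets
            return None
        if vote != opposing:
            return None  # tie: assumed to leave any running timer untouched

        if self.pending_since is None:
            self.pending_since = now  # start the confirmation timer
        window = self.down_after if is_up else self.up_after
        if now - self.pending_since >= window:
            self.state = "down" if is_up else "up"
            self.pending_since = None
            return self.state
        return None
```

With a window of zero, the first opposing vote transitions immediately, which corresponds to the "instant" setting.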

What happens when a real outage is confirmed

When the confirmation window expires and the monitor transitions from up to down, several things happen in the same code path:

The state transition is persisted to Postgres with a timestamp and the previous state.
An alert is dispatched through all configured channels (Slack, email, PagerDuty, etc.) via an Oban job.
If the monitor is linked to a status page component, a disruption is automatically created.
The state change is broadcast via Phoenix PubSub, so any user looking at the dashboard sees it in real-time.

The alert includes the check results from all probes, so the engineer receiving it can immediately see which probes are failing and what the request waterfall looks like from each location. Instead of "your site is down," they get "TCP connection is timing out from all 4 locations, DNS and TLS are fine." That's a server problem, not a network problem. Or they see "timeout from Singapore and Sao Paulo, 200 OK from Europe and US." That's probably a routing issue in those regions, not a server problem.
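The reasoning an engineer applies to those per-probe results can be written down as a simple heuristic. The Python function below is purely illustrative: the name, the input shape, and the wording of the hints are all made up for this example. It is not how Larm builds its alert text.

```python
def summarize(results):
    """results: mapping of probe location -> True (pass) or False (fail)."""
    failing = sorted(p for p, ok in results.items() if not ok)
    if not failing:
        return "all probes passing"
    if len(failing) == len(results):
        # Every location agrees, so the network path is an unlikely culprit.
        return f"failing from all {len(results)} locations: likely a server problem"
    # Only some regions fail, which points at the path rather than the server.
    return f"failing from {', '.join(failing)} only: likely a regional routing issue"
```

A result set where only Singapore and Sao Paulo fail produces the regional hint. One where every probe fails produces the server hint.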

The tradeoff

We don't mark a monitor as down immediately. That's a feature, not a limitation. With the default 1-minute confirmation window, a real outage is confirmed and alerted on about a minute after a majority of probes first report failure. In exchange, you get alerts you can trust.

That trust is the whole point. When alerts are reliable, people respond to them. When they're not, people start ignoring all of them, including the real ones. A monitoring system that cries wolf once too often is worse than no monitoring at all. Reliable detection means reliable response, and that's what actually keeps your services up.

This is the core of how Larm approaches monitoring. If you're tired of alert fatigue, give it a try. The free plan includes 15 monitors with multi-probe voting from all locations.
