How to monitor cron jobs and background workers

The thing about cron jobs is that when they work, nobody notices. They run in the background, do their thing, and life goes on. When they stop working, nobody notices either. At least not right away.

A web server that goes down gets caught quickly. Users see errors, monitoring tools detect failed HTTP checks, someone gets paged. But a cron job that stops running? There's no error. There's no failed request. There's just... silence. The job that was supposed to run every hour at :15 just doesn't run. Nothing breaks immediately. The database doesn't complain. The logs don't show anything because there's nothing to log.

You find out three days later when someone asks why the nightly report didn't go out, or why the queue has 50,000 unprocessed items, or why the billing sync hasn't run since Tuesday.

Why cron jobs stop running

Usually it's not the job itself that breaks. It's the thing that runs the job.

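The server rebooted or was rebuilt, and the cron setup didn't survive: the crontab was lost with the old machine image, or the cron daemon never came back up. This happens more often than it should, especially on machines where cron entries are added manually instead of managed through configuration.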

A deploy changed the environment. The job runs fine locally but the production container doesn't have the right env vars, or the path changed, or a dependency is missing.

The job is running but failing. It starts, hits an error, exits with a non-zero code. Cron doesn't care. It ran the command, that's its job. Whether the command succeeded is not cron's problem.
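
To make the exit-code point concrete, here is a minimal Python sketch of a wrapper that runs a command and checks its exit status, which is the check cron itself never performs. The function name run_and_report is hypothetical, and printing to stderr stands in for whatever alerting you actually use.

```python
import subprocess
import sys


def run_and_report(command):
    """Run a command and return True only if it exited with status 0."""
    result = subprocess.run(command)
    if result.returncode != 0:
        # Cron would stop here without comment; the wrapper makes the failure visible.
        print(f"job failed with exit code {result.returncode}", file=sys.stderr)
        return False
    return True
```

Calling run_and_report(["python", "/app/generate_report.py"]) would return False for any non-zero exit, which gives you a place to hang an alert or skip a success ping.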

Resource contention. The job takes longer than the interval between runs. The previous instance is still running when the next one starts. Depending on how the job is written, this can cause duplicate work, lock contention, or cascading failures.
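
One common guard against overlapping runs is an exclusive, non-blocking file lock taken when the job starts. The sketch below assumes a Unix host (the fcntl module is Unix-only); the function name try_acquire_lock and any lock path you pass to it are hypothetical.

```python
import fcntl


def try_acquire_lock(path):
    """Return an open file holding an exclusive lock, or None if another run holds it."""
    handle = open(path, "w")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting for the other run.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    # Keep the handle open for the whole job; closing it releases the lock.
    return handle
```

A job would call this first and exit early when it gets None, so a slow previous run causes a skipped run rather than duplicate work. Note that a skipped run also means no success ping, which is usually what you want: a job that is permanently stuck will eventually trip the heartbeat alert.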

Kubernetes CronJob edge cases. The job's pod gets evicted, the node is under pressure, the job hits its activeDeadlineSeconds, or the concurrency policy drops it. Kubernetes logs this, but you have to be looking.

The heartbeat pattern

The idea is simple. Instead of trying to monitor the job from outside (which is hard, because there's nothing to observe when a job doesn't run), you flip it around. The job announces that it ran successfully, and you monitor for the absence of that announcement.

It works like this:

1. You create a heartbeat monitor with an expected interval (say, every hour).
2. The monitor gives you a unique URL.
3. At the end of your job, you ping that URL. A simple HTTP GET or POST.
4. If the monitor doesn't receive a ping within the expected interval, it alerts you.

That's it. The monitor doesn't know or care what your job does. It just knows that your job is supposed to check in every hour, and if it doesn't, something is wrong.

This is sometimes called a "dead man's switch." If the job is alive and healthy, it keeps pinging. If it stops pinging, it's dead (or stuck, or failing, or not running at all). The absence of a signal is the signal.
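
The monitor's side of this can be illustrated in a few lines of Python. This is only a sketch of the idea, not how any particular monitoring service implements it; the function name is_overdue and the timestamps are made up for the example.

```python
from datetime import datetime, timedelta, timezone


def is_overdue(last_ping, expected_interval, now):
    """True when more time has passed since the last ping than the monitor allows."""
    return now - last_ping > expected_interval


last_ping = datetime(2024, 1, 1, 12, 15, tzinfo=timezone.utc)
hourly = timedelta(hours=1)

print(is_overdue(last_ping, hourly, last_ping + timedelta(minutes=50)))  # False
print(is_overdue(last_ping, hourly, last_ping + timedelta(minutes=90)))  # True
```

The check never looks at the job itself. It only compares the time of the last ping with the clock, which is why it catches jobs that never started as well as jobs that failed.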

Adding the ping to your job

The ping should go at the end of your job, after the actual work is done. If the job starts but fails halfway through, you don't want it to report success.

For a shell script in cron:

#!/bin/bash
set -e
 
# Do the actual work
python /app/generate_report.py
rsync -a /data/reports/ /backup/reports/
 
# Ping heartbeat on success
curl -fsS --retry 3 https://app.larm.dev/heartbeat/your-unique-id

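The -f flag makes curl exit with a non-zero status on HTTP errors, -s hides the progress meter, and -S still prints an error message if the request fails. --retry 3 handles transient network issues. The set -e at the top means the script exits as soon as a command fails, so the curl only runs if everything above it succeeded.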

For an Oban worker in Elixir:

defmodule MyApp.Workers.NightlyReport do
  use Oban.Worker, queue: :reports
 
  @heartbeat_url "https://app.larm.dev/heartbeat/your-unique-id"
 
  @impl Oban.Worker
  def perform(_job) do
    with :ok <- generate_report(),
         :ok <- send_notifications() do
      ping_heartbeat()
      :ok
    end
  end
 
  defp ping_heartbeat do
    Req.get!(@heartbeat_url)
  end
end

For Sidekiq in Ruby:

require "net/http"

class NightlyReportJob
  include Sidekiq::Job
 
  HEARTBEAT_URL = "https://app.larm.dev/heartbeat/your-unique-id"
 
  def perform
    generate_report
    send_notifications
    Net::HTTP.get(URI(HEARTBEAT_URL))
  end
end

The pattern is the same regardless of language or framework. Do the work, then ping.
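
As one more illustration, here is a minimal Python sketch using only the standard library. The names run_job and ping_heartbeat are hypothetical, and the URL is the same placeholder used in the other examples. The ping is passed in as a parameter so the ordering can be exercised without touching the network.

```python
import urllib.request

HEARTBEAT_URL = "https://app.larm.dev/heartbeat/your-unique-id"  # placeholder


def ping_heartbeat(url=HEARTBEAT_URL, timeout=10):
    """Send a GET to the heartbeat URL; raises on network or HTTP errors."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def run_job(steps, ping=ping_heartbeat):
    """Run each step in order, then ping. An exception in any step skips the ping."""
    for step in steps:
        step()
    ping()
```

A nightly report might then be run as run_job([generate_report, send_notifications]), where both functions are your own and raise on failure.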

What about systemd timers?

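If you're using systemd timers instead of cron, the same approach works. Add an ExecStartPost to your service unit that pings on success, and make sure the service is Type=oneshot:

[Service]
Type=oneshot
ExecStart=/app/generate_report.sh
ExecStartPost=/usr/bin/curl -fsS https://app.larm.dev/heartbeat/your-unique-id

With Type=oneshot, ExecStartPost only runs after ExecStart has exited successfully, so you get the same "only ping on success" behavior. With the default Type=simple, ExecStartPost runs as soon as the process has started, which would send the ping before the job has finished.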

Choosing the right interval

The heartbeat interval should match how often the job is supposed to run, plus some grace period. If your job runs every hour, set the expected interval to something like 75 minutes. This gives the job time to run and accounts for minor scheduling drift without triggering false alerts.

For jobs that run less frequently (daily, weekly), a longer grace period makes sense. A daily job with a 25-hour expected interval gives you an hour of buffer before alerting.

The key thing is that the interval is about how long silence is acceptable, not how long the job takes to run. A job that takes 5 minutes to run but is scheduled every hour should have a ~75-minute interval, not a 5-minute one.
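
The arithmetic is just the schedule period plus a grace period. A small Python sketch, with a hypothetical helper name and the grace values from the examples above:

```python
from datetime import timedelta


def expected_interval(schedule_period, grace):
    """Longest acceptable silence: the schedule period plus a grace period."""
    return schedule_period + grace


print(expected_interval(timedelta(hours=1), timedelta(minutes=15)))  # 1:15:00
print(expected_interval(timedelta(days=1), timedelta(hours=1)))      # 1 day, 1:00:00
```

The job's own runtime does not appear as an input. It only matters in that the grace period should comfortably cover it.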

Setting this up in Larm

Create a heartbeat monitor in Larm, set the expected interval, and you get a unique URL. Add the ping to your job. If the ping stops coming, Larm alerts you through your configured channels, same as any other monitor going down.

The heartbeat endpoint is intentionally simple. It accepts GET or POST, ignores the body, and responds with a 200. No authentication, no payload format, no SDK needed. Just hit the URL.