Monitoring fundamentals

Cron heartbeat monitoring: catch failed jobs before customers do

Scheduled jobs fail silently more often than any other part of your infrastructure. Heartbeat monitoring is the simple pattern that surfaces those failures before someone files a ticket.

The silent failure problem

Of all the things that can break in production, scheduled jobs are uniquely bad at telling you they failed.

A web service that's down generates failed user requests, support tickets, and obvious metric blips. Customers notice within minutes. A scheduled job that doesn't run produces nothing — no traffic, no errors, no obvious symptoms. The only thing missing is the result, and the result was something nobody was watching for.

Common silent failures:

  • Nightly backups that didn't run for two weeks. Discovered when someone needed to restore.
  • Customer reports that should have been emailed daily; users complain after a week of no reports.
  • Queue processors that died; tasks pile up but the metric nobody set up doesn't fire.
  • Sync jobs between systems; data drifts silently until someone notices.
  • Certificate renewal automation that hasn't run; cert expires (related: our cert post).

How heartbeat monitoring works

Heartbeat monitoring inverts the normal monitoring pattern. Instead of the monitoring tool checking your service from the outside, your service pings the monitoring tool when it does its job.

The mechanic:

  1. You configure a heartbeat monitor in your tool, set to expect a ping every (say) 24 hours, with a grace period of (say) 1 hour.
  2. Your job, when it completes successfully, sends an HTTP request to a URL the tool gives you.
  3. The tool resets its countdown.
  4. If the countdown ever hits zero (no ping received in the expected window), the tool fires an alert.

It's sometimes called "dead man's switch" monitoring — the alert fires when the heartbeat stops.
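The countdown described above can be sketched as simple state tracking. This is a hypothetical in-memory model for illustration, not any real tool's implementation; production tools persist this state and track many monitors at once:

```python
import time

class HeartbeatMonitor:
    """Minimal dead man's switch: overdue when pings stop arriving."""

    def __init__(self, interval_s, grace_s):
        self.interval_s = interval_s
        self.grace_s = grace_s
        # The job has until interval + grace to send its first ping.
        self.deadline = time.time() + interval_s + grace_s

    def ping(self):
        # Each ping resets the countdown to a fresh deadline.
        self.deadline = time.time() + self.interval_s + self.grace_s

    def is_overdue(self, now=None):
        # Overdue once interval + grace has elapsed with no ping.
        if now is None:
            now = time.time()
        return now > self.deadline
```

A real tool would fire an alert the moment `is_overdue()` flips to true; the point is that the monitor holds a deadline, and only the job's own pings push it forward.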

What jobs deserve heartbeat monitoring

The simple test: does anything bad happen if this job doesn\'t run? If yes, monitor it.

Common candidates:

  • Database backups (the textbook example).
  • Data syncs between systems (CRM ↔ warehouse, billing ↔ accounting).
  • Scheduled email sends (digests, reports, billing notifications).
  • Cleanup jobs that prevent disk-full conditions.
  • Certificate renewal automation.
  • Index rebuilds, cache warmers.
  • Recurring billing cycles.
  • Compliance / audit log archival.

Skip: ad-hoc cleanup of temp files, debug noise generators, anything where "didn\'t run" is harmless.

Implementing heartbeats: the simplest version

The implementation is simple: most monitoring tools give you a unique heartbeat URL, and your job hits it on success.

Cron

0 3 * * * /usr/local/bin/backup.sh && curl -fsS https://anyping.com/heartbeat/abc123 > /dev/null

The && ensures the heartbeat only fires if the backup succeeds. The > /dev/null stops cron from emailing the heartbeat response.

Bash script with start + complete pings

#!/usr/bin/env bash
HEARTBEAT_URL="https://anyping.com/heartbeat/abc123"

# Tell monitoring we started
curl -fsS "$HEARTBEAT_URL/start" > /dev/null

if /usr/local/bin/backup.sh; then
  curl -fsS "$HEARTBEAT_URL" > /dev/null
else
  curl -fsS "$HEARTBEAT_URL/fail" > /dev/null
  exit 1
fi

Application code (Python example)

import requests

HEARTBEAT_URL = "https://anyping.com/heartbeat/abc123"

def daily_report_job():
    try:
        send_daily_reports()
    except Exception:
        # Report the failure, then re-raise so normal error handling runs.
        requests.get(f"{HEARTBEAT_URL}/fail", timeout=5)
        raise
    # Ping success only after the job itself finished. Keeping this outside
    # the try block means a failed ping isn't misreported as a job failure.
    requests.get(HEARTBEAT_URL, timeout=5)

Kubernetes CronJob

Add the heartbeat curl as the last command in the container, or as a sidecar that fires after the main container exits successfully.

Handling real-world noise

Real cron jobs don't run on perfect schedules. Things to plan for:

Jitter in execution time

A "1am" cron job might run at 1:00:30 one night and 1:01:45 the next. Set the grace window 10–20% larger than the longest expected runtime so normal variation doesn\'t trip the alert.
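As a rough sizing helper, the rule above is just arithmetic over observed runtimes. The function name and units here are illustrative, not from any tool:

```python
def grace_window(observed_runtimes_min, headroom=0.2):
    """Size the grace window: longest observed runtime plus 10-20% headroom."""
    return max(observed_runtimes_min) * (1 + headroom)
```

For a backup that has taken 40 to 45 minutes, `grace_window([40, 45, 42])` gives a 54-minute window, comfortably above normal jitter.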

DST transitions

Cron jobs scheduled in local time can skip or double-fire across DST transitions. Either schedule in UTC or accept that twice a year you'll have an oddity.

Network blips on the heartbeat send

The heartbeat ping itself can fail due to network issues, even when the job ran fine. Most monitoring tools allow a few missed pings before alerting; configure 1–2 missed pings as the threshold rather than 1.
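On the sending side, a little retry logic makes a transient blip much less likely to register as a missed ping. A minimal sketch: `send` is any zero-argument callable that raises on failure (for example, a wrapped `requests.get` with a timeout), and delivery failures are swallowed on purpose, since a lost ping should never fail the job itself:

```python
import time

def ping_with_retries(send, attempts=3, backoff_s=5.0):
    """Attempt heartbeat delivery a few times before giving up.

    Returns True if delivery succeeded, False if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            send()
            return True
        except Exception:
            if attempt == attempts:
                return False  # give up quietly; the job already ran
            time.sleep(backoff_s)
```

Pair this with a tool-side threshold of 1–2 missed pings and a single flaky network hop almost never pages anyone.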

Job runtime growing over time

A backup that took 45 minutes when you set up monitoring may take 2 hours after a year of data growth. Set up a separate alert for when actual runtime approaches your grace window, so you can adjust before the heartbeat alert starts firing.
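One way to catch this drift is a thin wrapper that times the job and warns when runtime crosses a fraction of the grace window. The names and threshold here are an illustrative sketch, not a standard API:

```python
import time

def run_with_runtime_check(job, grace_window_s, warn_ratio=0.8, warn=print):
    """Run `job` and warn when its runtime creeps toward the grace window."""
    start = time.monotonic()
    result = job()
    elapsed = time.monotonic() - start
    if elapsed > warn_ratio * grace_window_s:
        warn(f"runtime {elapsed:.0f}s is near the {grace_window_s:.0f}s grace window")
    return result
```

Routing `warn` to a low-urgency channel gives you weeks of lead time to widen the window before the real alert fires.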

Common pitfalls

Sending the heartbeat unconditionally

Bad: backup.sh; curl heartbeat. The semicolon means the heartbeat fires whether the backup succeeded or not, so you'll never know about backup failures. Use &&.

Heartbeat URL hardcoded in source control

Treat the heartbeat URL like a secret. Anyone with the URL can ping your monitor and silence alerts. Pass it via environment variable, not committed to git.
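In application code, that means reading the URL from the environment and failing loudly when it's missing. A small sketch, assuming an environment variable named HEARTBEAT_URL:

```python
import os

def heartbeat_url():
    """Fetch the heartbeat URL from the environment rather than source."""
    url = os.environ.get("HEARTBEAT_URL")
    if not url:
        raise RuntimeError("HEARTBEAT_URL environment variable is not set")
    return url
```

Raising on a missing variable is deliberate: silently skipping the ping would recreate the exact silent-failure problem heartbeats exist to solve.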

Same heartbeat URL for multiple jobs

Each job needs its own heartbeat. Sharing means one job running covers for another that's broken. The whole point is independent verification per job.

No alert routing for heartbeat failures

"Backup didn't run" should page someone on the data team, not the general on-call. Set up routing so heartbeat alerts go to the team that owns the job, not to whoever happens to be on rotation.

Forgetting to monitor the cron daemon itself

If cron itself dies, all your heartbeats fail simultaneously. Worth a separate (rough) check that the cron process is running — or even simpler, have a "cron is alive" job that runs every 5 minutes and pings a heartbeat.

Heartbeat monitoring is one of those patterns that costs almost nothing to implement and prevents a class of incidents that nothing else catches. Set it up for every important scheduled job; thank yourself later.

Frequently asked questions

How is this different from running a regular monitor against the job?

Regular monitors check externally: "is the API responding?" Heartbeat monitoring inverts this: the job tells the monitoring tool "I ran." This catches failure modes regular monitoring can't see — like a job that didn't even start because the cron daemon died, or a job that queued but never executed.

What's a reasonable grace period?

Job runtime + 10–20% jitter is a good starting point. A backup that takes 45 minutes should have a grace window of about 60 minutes. Set it tighter for jobs that need to complete by a deadline (overnight batch processing); looser for jobs that just need to happen eventually.

What if my job legitimately takes variable time?

Use start + complete heartbeats instead of just complete. The "started" ping confirms it kicked off; the "completed" ping confirms it finished. You can alert on either being missing, with different urgency.

Should every cron job have a heartbeat?

Anything that has business consequences if it doesn't run, yes. Cleanup jobs that delete temp files: probably not worth the noise. Backups, syncs, billing runs, scheduled emails, report generation: absolutely.

What about jobs in Kubernetes / queue workers / Lambda?

Same pattern, different transport. K8s CronJobs, Sidekiq schedulers, ECS scheduled tasks, Lambda EventBridge schedules — all benefit from outbound heartbeat pings. The pattern is "if it's scheduled, monitor that it ran."
