DNS monitoring: the layer everyone forgets

You can monitor your application perfectly and still take a 30-minute outage from a single DNS misconfiguration. Here's what to watch and why.

Why DNS gets the blame (and deserves more of it)

"It's always DNS" is a meme for a reason. DNS is the foundational name-resolution layer that everything else depends on; when it's wrong, nothing works downstream. But what makes DNS uniquely problematic for monitoring is that it tends to fail in ways that look like other kinds of failures.

An A record pointing at the wrong IP looks identical to "the application is down" from a customer's browser. An expired domain looks like "the site no longer exists." A bad MX record looks like "you didn't send us an email" rather than "your mail server bounced ours."

Worse, DNS fixes propagate slowly. By the time you notice the problem, correct the record, and wait for caches to expire, a meaningful slice of the internet has been unable to reach you for an hour.

Records worth watching

A reasonable monitoring inventory:

  • A and AAAA records. Make sure your apex and key subdomains point at the IPs you expect.
  • CNAME records. Especially for status pages, marketing landing pages, and CDN-fronted services.
  • MX records. Inbound mail delivery. Misconfigured MX = silent failure.
  • SPF, DKIM, DMARC TXT records. Outbound mail authentication. Drift here causes deliverability issues.
  • Domain expiry. Different from cert expiry; the domain registration itself.
  • NS records. Pointing at your DNS provider. Changes here are a serious security signal.
  • DNSSEC chain. If you've enabled it.

Unexpected-change detection

The most useful DNS monitoring pattern: alert when a record changes from its known-good value, not just when it disappears.

You configure the monitor with the expected current value. If the live record returns something different, alert. This catches:

  • Accidental record deletion or modification by a teammate.
  • DNS hijacking (compromised registrar or DNS account).
  • Provider-side issues that return wrong data.
  • Propagation problems mid-migration (some resolvers see new, some see old).

Update the expected value every time you legitimately change a record. The minor friction is the point — it forces you to acknowledge the change.
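The comparison at the heart of this pattern is small. Here is a minimal Python sketch, assuming the live answer has already been fetched by whatever lookup tool you use; the names `Drift`, `normalize`, and `detect_drift` are illustrative and not taken from any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Drift:
    name: str
    rtype: str
    missing: frozenset     # approved values absent from the live answer
    unexpected: frozenset  # live values nobody approved


def normalize(rtype: str, value: str) -> str:
    value = value.strip()
    if rtype.upper() == "TXT":
        return value  # TXT payloads (DKIM keys, for example) are case-sensitive
    # Hostnames are case-insensitive and the trailing dot is optional.
    return value.lower().rstrip(".")


def detect_drift(name, rtype, expected, observed):
    """Return a Drift if the live record set differs from the known-good set, else None."""
    want = frozenset(normalize(rtype, v) for v in expected)
    got = frozenset(normalize(rtype, v) for v in observed)
    if want == got:
        return None
    return Drift(name, rtype, missing=want - got, unexpected=got - want)


# Hypothetical record: the approved A record vs. what a lookup returned.
drift = detect_drift("example.com", "A", ["192.0.2.10"], ["203.0.113.7"])
if drift:
    print(f"ALERT {drift.name} {drift.rtype}: "
          f"missing={sorted(drift.missing)} unexpected={sorted(drift.unexpected)}")
```

Comparing sets rather than lists keeps round-robin answer ordering from triggering false alerts, and an empty live answer shows up as every expected value being missing.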

Propagation and multi-resolver checks

DNS records are cached at every layer between you and the authoritative servers: client OS, browser, ISP recursive resolver, public resolvers (1.1.1.1, 8.8.8.8), corporate DNS. A change you make takes time to propagate everywhere.

During migrations, this gets messy. Different users see different states for hours. Real DNS monitoring uses multiple resolvers and reports on differences:

  • Authoritative answer (direct from your DNS provider).
  • Public resolver answers (Cloudflare, Google, OpenDNS, Quad9).
  • Regional resolver samples (China, India, Brazil).

If they disagree, you have a propagation issue affecting some fraction of users. If only one resolver disagrees for a long time, that resolver has a bug or stale cache.
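As a rough illustration of the comparison step, the sketch below takes answers that have already been collected from each resolver and reports which ones differ from the authoritative answer. The function name `compare_resolvers` and the result keys are assumptions made for this example; how you query each resolver is left to your tooling.

```python
def compare_resolvers(authoritative, resolver_answers):
    """Compare each resolver's answer set with the authoritative one."""
    truth = frozenset(authoritative)
    agreeing, disagreeing = [], {}
    for resolver, answer in sorted(resolver_answers.items()):
        if frozenset(answer) == truth:
            agreeing.append(resolver)
        else:
            disagreeing[resolver] = sorted(answer)
    total = len(resolver_answers)
    return {
        "consistent": not disagreeing,
        "agreeing": agreeing,
        "disagreeing": disagreeing,
        # Share of sampled resolvers still serving something else.
        "stale_fraction": len(disagreeing) / total if total else 0.0,
    }


# Hypothetical sample taken mid-migration.
report = compare_resolvers(
    ["192.0.2.10"],
    {
        "1.1.1.1": ["192.0.2.10"],
        "8.8.8.8": ["192.0.2.10"],
        "9.9.9.9": ["198.51.100.4"],  # still serving the old address
    },
)
print(report["disagreeing"], report["stale_fraction"])
```

A single sample cannot tell a cache that is merely slow from a resolver that is stuck, so a real monitor would also record how long each resolver stays in the disagreeing set.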

The MX record trap

The single most overlooked DNS issue: MX records.

If your MX records are wrong, inbound email to your domain bounces (or worse, gets delivered to the wrong server). You don't see it; the people trying to email you see the bounce. They might tell you. They might not.

For a SaaS or any business that depends on inbound email (support, sales, signups), this is a slow-motion disaster. A team we know discovered their MX records had been wrong for 11 days only because a customer mentioned bounces during a sales call.

Monitor MX records like you'd monitor any other critical config. Alert on changes. Test deliverability periodically.

TTL strategy and how it interacts with monitoring

TTL (time-to-live) controls how long resolvers cache your records. Long TTLs (24h+) reduce DNS query volume but make changes propagate slowly. Short TTLs (5min) make changes propagate quickly but increase query volume.

Implications for monitoring:

  • Short TTLs let you respond faster to DNS issues by changing records.
  • Long TTLs mean even a quick fix won't propagate for hours.
  • If you're planning a migration, drop the TTL to 5 minutes 24+ hours before the change.
  • Monitor that TTL drops happen before changes (otherwise you're committing to a long propagation window).
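The timing behind the "drop the TTL first" advice is simple arithmetic. The sketch below works it through for a hypothetical record whose TTL is lowered from 86400 to 300 seconds; the function names are illustrative, and the model assumes caches honor TTLs, which not every resolver does.

```python
from datetime import datetime, timedelta, timezone


def earliest_safe_change(ttl_lowered_at: datetime, old_ttl_seconds: int) -> datetime:
    """A resolver that cached the record just before the TTL was lowered
    may keep it for the full old TTL, so wait at least that long."""
    return ttl_lowered_at + timedelta(seconds=old_ttl_seconds)


def worst_case_convergence(changed_at: datetime, ttl_seconds: int) -> datetime:
    """Latest time a well-behaved cache should still hold the pre-change answer."""
    return changed_at + timedelta(seconds=ttl_seconds)


# Hypothetical timeline: TTL lowered from 86400 to 300 at noon UTC on 1 January.
lowered = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
cutover = earliest_safe_change(lowered, old_ttl_seconds=86400)
print("earliest cutover:", cutover.isoformat())
print("caches converge by:", worst_case_convergence(cutover, 300).isoformat())
```

Skipping the waiting period means the record change itself is governed by the old 24-hour TTL, which is the long propagation window the last bullet warns about.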

Common DNS-caused incidents

Real-world DNS incidents we've seen take businesses down:

The "I cleaned up old records" incident

Engineer reviews DNS records, deletes ones that look unused. One was actually critical (used by a third-party integration). The third party's requests start failing; integration breaks; nobody notices for a day.

The TTL-300 mistake

Team migrates DNS providers. The old TTLs were 300; the new provider defaults all records to 86400 (24h). Migration looks complete, but every record is now cached for up to a day: the next change or rollback keeps reaching users from stale caches for as long as 24 hours.

The lapsed domain

Domain registration auto-renewal fails (expired credit card on file). Domain enters grace period; resolvers start failing. By the time the team notices, restoration takes 24–72 hours.
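A domain-expiry check reduces to date arithmetic once you have the expiry date. The sketch below assumes that date has already been obtained from your registrar or a WHOIS/RDAP lookup; the function name and the 60-day and 14-day thresholds are illustrative choices, not recommendations from any registrar.

```python
from datetime import date


def expiry_severity(expires_on: date, today: date,
                    warn_days: int = 60, critical_days: int = 14) -> str:
    """Map the days left on a registration to an alert level."""
    days_left = (expires_on - today).days
    if days_left < 0:
        return "expired"
    if days_left <= critical_days:
        return "critical"
    if days_left <= warn_days:
        return "warning"
    return "ok"


# Hypothetical domain that expires in nine days.
print(expiry_severity(date(2025, 1, 10), today=date(2025, 1, 1)))
```

Alerting well ahead of the deadline leaves time to fix a failed auto-renewal before the registration lapses.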

The compromised registrar account

Attacker phishes the team's DNS provider login. Changes A records to point at attacker-controlled servers. Those servers are then used to intercept email, run phishing, or steal cookies. Detected only because someone notices the site looks slightly off.

The "we forgot SPF"

Team migrates email providers. New provider needs new SPF/DKIM records. Old SPF still in DNS. Mailbox providers reject outbound mail because authentication fails. Customer signup emails stop arriving.
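A check for this case can run against the domain's TXT records before and after the migration. The sketch below assumes the TXT strings have already been fetched and that the new provider asks for an `include:` mechanism; the provider hostnames are made up, and `check_spf` is a hypothetical helper that does not evaluate the full SPF policy.

```python
def check_spf(txt_records, required_include):
    """Report whether the domain's single SPF record authorizes the given include."""
    spf = [r for r in txt_records if r.lower().partition(" ")[0] == "v=spf1"]
    if not spf:
        return "no SPF record"
    if len(spf) > 1:
        # RFC 7208 treats more than one SPF record as a permanent error.
        return "multiple SPF records"
    # Skip the version tag; a leading "+" qualifier means the same as none.
    terms = [t.lstrip("+") for t in spf[0].lower().split()[1:]]
    if f"include:{required_include.lower()}" not in terms:
        return f"missing include:{required_include}"
    return "ok"


# Hypothetical state after the migration: DNS still lists only the old provider.
live_txt = ["v=spf1 include:_spf.old-mail.example ~all"]
print(check_spf(live_txt, "_spf.new-mail.example"))
```

Running the same check for each provider that sends mail on your behalf turns "we forgot SPF" into an alert on migration day rather than a deliverability mystery.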

The pattern across all of these: DNS issues are silent until they're catastrophic. Monitoring is cheap; the alternative is finding out when a customer is already affected.

Frequently asked questions

What's the difference between DNS monitoring and uptime monitoring?

Uptime monitoring resolves your hostname and connects to whatever IP the DNS returns. If DNS is wrong, uptime monitoring connects to the wrong place — and may either fail (looking like an outage) or succeed against a stale or wrong server. DNS monitoring inspects the records themselves to ensure they point where you expect.
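To make the distinction concrete, the sketch below checks where a hostname resolves rather than whether something answers there. It uses Python's standard `socket.getaddrinfo`, which only covers A and AAAA lookups through the system resolver; `check_host`, its result keys, and the example hostname and addresses are assumptions for illustration.

```python
import socket


def resolve_ips(hostname):
    """Resolve A/AAAA addresses through the system resolver."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    return {info[4][0] for info in infos}


def check_host(hostname, expected_ips, resolve=resolve_ips):
    """Fail if the name does not resolve, or resolves to addresses nobody approved."""
    try:
        live = set(resolve(hostname))
    except socket.gaierror as exc:
        return {"ok": False, "reason": f"resolution failed: {exc}", "live": []}
    if live != set(expected_ips):
        return {"ok": False, "reason": "unexpected addresses", "live": sorted(live)}
    return {"ok": True, "reason": "", "live": sorted(live)}


# Stubbed resolver standing in for a hijacked or misconfigured record.
result = check_host("app.example.com", ["192.0.2.10"],
                    resolve=lambda host: ["203.0.113.7"])
print(result)
```

An HTTP probe against 203.0.113.7 could still return a healthy-looking response, which is why the address comparison has to happen separately. Exact set equality is deliberately strict; hosts behind a CDN with rotating addresses would need a looser rule.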

How often should DNS records be checked?

Daily is plenty for most records. The exception is during planned changes — bump the cadence to every 5 minutes during a DNS migration so you can monitor propagation in real time.

What about DNSSEC validation?

If you use DNSSEC, monitor DNSSEC chain validity in addition to record contents. A broken DNSSEC chain causes hard failures for resolvers that validate — some won't resolve your name at all.

Should I monitor my DNS provider's uptime?

Indirectly yes — if your DNS provider goes down, your records can't resolve. The mitigation is multi-provider DNS (using two providers simultaneously). Big providers have had multi-hour outages; the cost of dual-provider DNS is small.

How do I monitor wildcard records?

Pick a representative subdomain (e.g., test.acme.com against a *.acme.com record) and monitor that. If the wildcard breaks, the test subdomain will show it.
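One variation on the fixed test subdomain is to generate a fresh random label for each check, so the probe can never collide with an explicit record or be served from a cache. This is a suggestion layered on the approach above rather than a requirement; `wildcard_probe` is a hypothetical helper.

```python
import secrets


def wildcard_probe(zone: str) -> str:
    """Build a hostname that only a wildcard record could answer for."""
    label = "probe-" + secrets.token_hex(4)  # 8 random hex characters
    return f"{label}.{zone.strip('.')}"


# Resolve this name and compare the answer with the wildcard's expected target.
print(wildcard_probe("acme.com"))
```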
