DNS Monitoring and Debugging

Tools and techniques for diagnosing DNS problems — from command-line queries to visual chain analysis

DNS fails silently

When DNS breaks, it doesn’t throw an error page. It doesn’t return a 500 status code. The browser simply hangs, shows “This site can’t be reached,” and the user assumes the site is down. The site’s servers may be running perfectly — the problem is that nobody can find them.

DNS issues are among the hardest infrastructure problems to diagnose because the failure manifests far from its cause. A misconfigured TTL, a lame delegation, an expired DNSSEC signature, or a missing glue record can make a domain unreachable from some networks while working fine from others. Effective DNS debugging requires understanding the resolution chain and knowing how to inspect each link.

Command-line tools

dig — the DNS Swiss Army knife

dig (Domain Information Groper) is the standard DNS query tool, shipped with BIND and available on nearly every Unix system. It sends a DNS query to a specified server and displays the raw response, including all sections and flags.

Basic query:

dig example.com A

Query a specific nameserver:

dig @8.8.8.8 example.com A

Trace the full resolution chain (queries root → TLD → authoritative, showing each referral):

dig +trace example.com A

Show just the answer, no noise:

dig +short example.com A

Check DNSSEC signatures:

dig +dnssec example.com A

The +trace option is particularly powerful for debugging. It bypasses your local resolver’s cache and walks the delegation chain from the root servers, showing exactly which servers are consulted and what they return at each step. When a domain fails to resolve, +trace usually reveals where the chain breaks.

Key output fields to watch:

| Field | What it tells you |
|---|---|
| status: NOERROR | Query succeeded |
| status: NXDOMAIN | Domain does not exist |
| status: SERVFAIL | Server error, often a DNSSEC validation failure |
| flags: aa | Response is from an authoritative server |
| flags: rd ra | Recursion desired and available |
| TTL values | How long the answer is cached |
| AUTHORITY SECTION | Which nameservers were delegated |
| ADDITIONAL SECTION | Glue records and OPT (EDNS) information |
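
When many queries need checking, the status and flags fields can be read programmatically instead of by eye. The following is a minimal sketch that assumes dig's usual two-line header layout; the function name, the `DigHeader` type, and the sample text are illustrative, not part of dig.

```python
import re
from typing import FrozenSet, NamedTuple


class DigHeader(NamedTuple):
    status: str              # e.g. NOERROR, NXDOMAIN, SERVFAIL
    flags: FrozenSet[str]    # e.g. {"qr", "rd", "ra"}


def parse_dig_header(output: str) -> DigHeader:
    """Extract the response status and header flags from dig's text output."""
    status = re.search(r"status: (\w+)", output)
    flags = re.search(r";; flags: ([a-z ]*);", output)
    if status is None or flags is None:
        raise ValueError("no dig header found in output")
    return DigHeader(status.group(1), frozenset(flags.group(1).split()))


# Illustrative header text in dig's usual layout (not captured from a real query).
SAMPLE = """\
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 4242
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
"""

header = parse_dig_header(SAMPLE)
print(header.status, sorted(header.flags))
print("authoritative answer:", "aa" in header.flags)
```

In this sample the aa flag is absent, which is what you would expect when the answer comes from a recursive resolver rather than from the zone's own nameserver.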

drill

drill is an alternative to dig from NLnet Labs, the developers of Unbound; it ships with their ldns library. It has a cleaner output format and better DNSSEC chain-tracing:

drill -T example.com     # Trace from root
drill -S example.com     # Chase DNSSEC chain

The -S (chase) option is well suited to DNSSEC debugging. It chases the signatures on the queried record upward through each DNSKEY and DS record toward a trusted key, validating each link along the way. When DNSSEC validation fails, drill -S shows which link in the chain is broken.

doggo

A modern, human-friendly DNS client written in Go. Doggo presents DNS responses in a colorized, tabular format that’s easier to scan than dig’s output. It supports DNS over HTTPS, DNS over TLS, and DNS over QUIC natively.

nslookup

One of the oldest DNS query tools, dating back to BIND 4. It still ships with nearly every operating system, including Windows. While dig has largely replaced it for serious debugging, nslookup remains useful for quick checks:

nslookup example.com 8.8.8.8

Diagnostic scenarios

“The site works for me but not for the client”

This is the most common DNS debugging scenario. It usually means one of:

  1. Caching difference. Your resolver has a fresh record; the client’s resolver has a stale or incorrect cached entry. Ask the client to try dig @8.8.8.8 example.com to query a public resolver directly. If that works, the issue is their local resolver’s cache.

  2. Negative caching. The domain was briefly misconfigured, causing NXDOMAIN responses that were cached. Even after the configuration is fixed, resolvers that cached the NXDOMAIN will keep returning it until the negative cache TTL expires. Per RFC 2308, that TTL is the lower of the SOA record's own TTL and its MINIMUM field.

  3. Propagation delay. DNS changes propagate at the speed of TTL expiration, not at the speed of light. If you changed a record with a 3600-second TTL, it could take up to an hour for all resolvers to pick up the new value. Some resolvers honor TTLs strictly; others are known to extend them.

  4. Split-horizon DNS. Different answers are served depending on the querier’s network. Enterprise networks often run split-horizon configurations where internal users see private addresses and external users see public addresses.
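
For the negative caching case, it helps to know how long a cached NXDOMAIN can persist. The sketch below assumes an SOA record in the one-line presentation format that dig prints, and applies the RFC 2308 rule that the negative TTL is the lower of the SOA record's TTL and its MINIMUM field. The function name and the sample record are hypothetical.

```python
def negative_cache_ttl(soa_record: str) -> int:
    """Return the negative-caching TTL, in seconds, implied by an SOA record.

    Expected field order (presentation format):
    name ttl class SOA mname rname serial refresh retry expire minimum
    """
    fields = soa_record.split()
    if len(fields) != 11 or fields[3].upper() != "SOA":
        raise ValueError("expected a single-line SOA record with 11 fields")
    record_ttl = int(fields[1])
    minimum = int(fields[10])
    # RFC 2308: negative answers are cached for the lower of the two values.
    return min(record_ttl, minimum)


# Illustrative record, not taken from a real zone.
soa = ("example.com. 3600 IN SOA ns1.example.com. hostmaster.example.com. "
       "2024010101 7200 3600 1209600 300")
print(negative_cache_ttl(soa))
```

Run this against the SOA served by the authoritative server; a resolver's copy shows a TTL that has already counted down.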

“SERVFAIL from the resolver”

SERVFAIL is the most operationally frustrating response code because it is generic — it means “something went wrong” without specifying what. Common causes:

  • DNSSEC validation failure. The most common modern cause. An expired RRSIG, a DS record that matches no DNSKEY in the zone, or an algorithm mismatch will cause validating resolvers to return SERVFAIL. Use dig +cd example.com (checking disabled) to ask the resolver to skip validation and return the answer anyway. If that works, the issue is DNSSEC.

  • Lame delegation. The NS records for a zone point to servers that don’t actually serve that zone. The resolver follows the delegation, gets a REFUSED or no response, and returns SERVFAIL.

  • Authoritative server timeout. The authoritative server is unreachable or too slow to respond. The resolver retries, times out, and returns SERVFAIL.
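
The +cd comparison described above can be turned into a small triage rule. This is a rough sketch, assuming you have already collected the response status from a normal query and from the same query with +cd; the function name and the returned labels are made up for illustration.

```python
def triage_servfail(status_default: str, status_cd: str) -> str:
    """Classify a SERVFAIL using the status of the same query sent with +cd.

    status_default: status from a normal query (e.g. dig example.com)
    status_cd:      status from the query with checking disabled (dig +cd)
    """
    if status_default != "SERVFAIL":
        return "not-servfail"
    if status_cd in ("NOERROR", "NXDOMAIN"):
        # The resolver can fetch an answer once validation is skipped.
        return "dnssec-validation"
    # Still failing without validation: look at the delegation and at
    # whether the authoritative servers are reachable.
    return "delegation-or-timeout"


print(triage_servfail("SERVFAIL", "NOERROR"))
print(triage_servfail("SERVFAIL", "SERVFAIL"))
```

The second label cannot tell a lame delegation from a timeout; dig +trace is the next step for separating those two.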

“Records changed but the old values persist”

This is a TTL issue. Before making DNS changes that need fast propagation:

  1. Lower TTLs in advance. At least one full old TTL before the change (24–48 hours covers most records), reduce TTLs on the records you plan to modify to 60–300 seconds. By the time you make the change, resolvers that held the old long-TTL record will have expired it, picked up the short TTL, and will re-query quickly.

  2. Make the change.

  3. Verify with multiple resolvers. Query Google (8.8.8.8), Cloudflare (1.1.1.1), and Quad9 (9.9.9.9) to confirm they all return the new value.

  4. Raise TTLs back up once the change is confirmed and stable.
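
Steps 1 and 3 above can be sketched in code. The example assumes you record when the TTL was lowered and that you have already collected each resolver's answers (for instance from dig +short); the function names and the documentation-range addresses are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Set


def earliest_safe_change(ttl_lowered_at: datetime, old_ttl_seconds: int) -> datetime:
    """Time after which no resolver should still hold the record under the old TTL."""
    return ttl_lowered_at + timedelta(seconds=old_ttl_seconds)


def stale_resolvers(answers: Dict[str, Set[str]], expected: Set[str]) -> List[str]:
    """Return the resolvers whose answer set differs from the expected new value."""
    return sorted(r for r, got in answers.items() if got != expected)


# Illustrative observations keyed by resolver address.
observed = {
    "8.8.8.8": {"203.0.113.10"},
    "1.1.1.1": {"203.0.113.10"},
    "9.9.9.9": {"198.51.100.7"},   # still returning the old address
}

print(earliest_safe_change(datetime(2024, 1, 1, 12, 0), 86400))
print(stale_resolvers(observed, {"203.0.113.10"}))
```

An empty list from stale_resolvers is a reasonable signal that the TTLs can be raised again.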

Monitoring services

DNS-specific monitoring

DNSViz (dnsviz.net) — A web-based tool that visualizes the DNSSEC chain of trust for any domain. It shows every DNSKEY, DS, RRSIG, and NSEC record in the chain, highlights validation errors, and provides a graphical representation of the trust path from root to zone. Essential for DNSSEC deployment and troubleshooting.

DNS Checker (dnschecker.org) — Tests DNS propagation from dozens of global locations. Useful for verifying that a DNS change has propagated worldwide or identifying regional resolution inconsistencies.

DNSPerf (dnsperf.com): Benchmarks managed DNS providers and public DNS resolvers by region. Useful for choosing a resolver or verifying that your authoritative DNS provider meets performance expectations.

What to monitor

For authoritative DNS:

| Metric | Why it matters | Alert threshold |
|---|---|---|
| Query latency | User-facing performance | > 50 ms from your key regions |
| SERVFAIL rate | Resolution failures | > 0.1% of total queries |
| NXDOMAIN rate | Misconfiguration or abuse | Sudden increase from baseline |
| DNSSEC signature expiry | Expired signatures cause global SERVFAIL | Under 7 days before expiry |
| Zone serial propagation | Secondaries out of sync | Serial mismatch > 1 hour |
| Nameserver reachability | Delegation health | Any NS unreachable |

For recursive resolvers:

| Metric | Why it matters | Alert threshold |
|---|---|---|
| Cache hit rate | Efficiency | Under 80% (typical is 90%+) |
| Upstream query latency | Dependency on authoritative servers | > 100 ms average |
| SERVFAIL to clients | Failed resolutions | > 1% |
| TCP fallback rate | Potential fragmentation issues | Sudden increase |
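
The fixed thresholds in these tables translate directly into alert rules. Here is a minimal sketch, assuming metrics arrive as a plain dictionary; the metric names are invented for illustration, and the baseline-relative rows ("sudden increase") are left out because they need historical data.

```python
from typing import Dict, List, Tuple

# (metric name, comparison, threshold), taken from the tables above.
Rule = Tuple[str, str, float]

AUTHORITATIVE_RULES: List[Rule] = [
    ("query_latency_ms", "gt", 50.0),
    ("servfail_rate_pct", "gt", 0.1),
]
RECURSIVE_RULES: List[Rule] = [
    ("cache_hit_rate_pct", "lt", 80.0),
    ("upstream_latency_ms", "gt", 100.0),
    ("servfail_rate_pct", "gt", 1.0),
]


def breached(metrics: Dict[str, float], rules: List[Rule]) -> List[str]:
    """Return the names of metrics that cross their alert threshold."""
    alerts = []
    for name, op, threshold in rules:
        value = metrics.get(name)
        if value is None:
            continue  # metric not reported; skip it rather than guess
        if (op == "gt" and value > threshold) or (op == "lt" and value < threshold):
            alerts.append(name)
    return alerts


# Illustrative readings from a recursive resolver.
sample = {"cache_hit_rate_pct": 72.0, "upstream_latency_ms": 40.0, "servfail_rate_pct": 1.5}
print(breached(sample, RECURSIVE_RULES))
```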

Preventive practices

Test changes before applying. Use tools like named-checkzone (BIND) or provider-specific APIs to validate zone files before loading them. A syntax error in a zone file can take an entire zone offline.

Monitor DNSSEC expiry. DNSSEC signatures have a finite validity period. If your signing key rotation or re-signing process fails silently, signatures will expire and every validating resolver will return SERVFAIL for your domain. Automate monitoring of RRSIG expiry dates.
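
One way to automate this check is to read the expiration field out of an RRSIG record, such as one printed by dig +dnssec. The sketch below assumes the standard presentation format, where the expiration is the ninth field, written as YYYYMMDDHHMMSS in UTC. The function name, the sample record, and its placeholder signature are illustrative.

```python
from datetime import datetime, timezone


def rrsig_days_left(rrsig_line: str, now: datetime) -> float:
    """Days until the signature in a one-line RRSIG record expires.

    Expected field order (presentation format):
    name ttl class RRSIG covered alg labels orig_ttl expiration inception ...
    """
    fields = rrsig_line.split()
    if len(fields) < 10 or fields[3].upper() != "RRSIG":
        raise ValueError("expected a single-line RRSIG record")
    expires = datetime.strptime(fields[8], "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return (expires - now).total_seconds() / 86400


# Illustrative record; the trailing signature is a placeholder, not real data.
rrsig = ("example.com. 3600 IN RRSIG A 13 2 3600 "
         "20240215000000 20240201000000 12345 example.com. c2lnbmF0dXJl")
now = datetime(2024, 2, 10, tzinfo=timezone.utc)

days = rrsig_days_left(rrsig, now)
print(days, "ALERT" if days < 7 else "ok")
```

The 7-day cutoff matches the alert threshold suggested earlier for authoritative DNS; adjust it to your own re-signing interval.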

Keep multiple independent nameservers. The DNS specifications require at least two nameservers per zone, and RFC 2182 recommends placing them on topologically and geographically separate networks. If all your nameservers are on the same physical network or provider, a single network outage takes your zone offline.

Log and baseline. Establish baseline metrics for query volume, response times, and error rates. DNS issues often manifest as gradual degradation rather than sudden failure — a slowly increasing SERVFAIL rate or growing query latency may indicate an emerging problem before it becomes an outage.