From grimoire
Systematically isolates network latency sources across DNS, TCP, routing, TLS, and packet loss using ping, traceroute, dig, curl, ss, and iperf3.
npx claudepluginhub jeffreytse/grimoire --plugin grimoire
Slash command: /grimoire:diagnose-network-latency
Systematically isolate the source of network latency across the application stack to identify whether the cause is DNS, TCP, routing, application, or infrastructure.
Adopted by: Google SRE methodology (USE/RED method); Netflix performance engineering; Cloudflare network diagnostics playbook
Impact: Network latency contributes to 47% of web performance issues (Cloudflare 2022); structured diagnosis reduces mean time to identify (MTTI) from hours to minutes
Why best: Latency has many sources; random investigation wastes time; systematic layer-by-layer elimination pinpoints root cause efficiently
Sources: Brendan Gregg "Systems Performance" 2nd ed. (2020); Google SRE Workbook Ch. 4; RFC 6349 (2011)
Quantify and characterize the latency: Collect p50, p95, and p99 latency for the affected request type over the past 24 hours. Determine whether it is consistent or intermittent, whether it affects all users or only specific regions, and whether it correlates with time of day, traffic volume, or deployments. This narrows the hypothesis space before you reach for any tooling.
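As a rough illustration of the first step, the sketch below computes p50, p95, and p99 from a list of latency samples using the nearest-rank method. The function names and the sample data are hypothetical; in practice the samples would come from your own request logs or metrics store.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(latencies_ms):
    """Return the three percentiles named in the step above."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}

# Hypothetical sample: 95 fast requests and 5 slow ones (an intermittent tail).
sample = [20] * 95 + [400] * 5
print(summarize(sample))
```

A p50 that looks healthy next to a much higher p99, as in this made-up sample, suggests intermittent rather than consistent latency.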
Isolate the network layer: Use ping and traceroute to establish a baseline RTT to the target. ping -c 100 <host> reports min/avg/max RTT, jitter (mdev or stddev), and packet loss. traceroute -n <host> shows hop-by-hop latency. A hop that shows high latency while later hops return to normal is a red herring, because routers often deprioritize the ICMP replies they generate themselves; focus on the first hop where latency increases and stays elevated for every subsequent hop.
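The "first hop where latency stays elevated" rule can be sketched as a small function. This is only one way to express it, assuming you have already extracted one representative RTT per hop from traceroute output; the function name and the 20 ms threshold are illustrative choices, not part of any tool.

```python
def first_persistent_jump(hop_rtts_ms, threshold_ms=20.0):
    """Return the 1-based hop where RTT rises by more than threshold_ms and
    never drops back on any later hop, or None if there is no such hop."""
    n = len(hop_rtts_ms)
    # suffix_min[i] is the lowest RTT seen at hop i or any later hop.
    suffix_min = list(hop_rtts_ms)
    for i in range(n - 2, -1, -1):
        suffix_min[i] = min(suffix_min[i], suffix_min[i + 1])
    for i in range(1, n):
        if suffix_min[i] - suffix_min[i - 1] > threshold_ms:
            return i + 1
    return None

# Hypothetical trace: hop 3 spikes to 85 ms but hop 4 is fast again (red herring);
# the increase that persists to the destination starts at hop 5.
print(first_persistent_jump([1.2, 2.0, 85.0, 3.1, 48.0, 49.5, 51.0]))  # 5
```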
Measure DNS resolution time: DNS adds latency to every connection that needs an uncached lookup. dig +stats <hostname> reports the query time for the configured resolver; dig @8.8.8.8 +stats <hostname> tests an external resolver for comparison. TTL values below 60 seconds force frequent re-resolution. Slow DNS (>50 ms) is common and often overlooked.
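A minimal sketch for pulling the query time out of dig output and applying the 50 ms rule of thumb from the step above. The helper names are hypothetical, and the sample text is made up (it uses a documentation IP address); the parser assumes dig's usual ";; Query time: N msec" statistics line.

```python
import re

QUERY_TIME = re.compile(r"Query time:\s+(\d+)\s+msec")

def query_time_ms(dig_output):
    """Extract the query time in milliseconds from dig output, or None if absent."""
    match = QUERY_TIME.search(dig_output)
    return int(match.group(1)) if match else None

def is_slow_dns(dig_output, limit_ms=50):
    """Apply the >50 ms threshold mentioned in the text."""
    elapsed = query_time_ms(dig_output)
    return elapsed is not None and elapsed > limit_ms

# Made-up excerpt of dig statistics output.
sample = ";; Query time: 87 msec\n;; SERVER: 192.0.2.53#53(192.0.2.53)\n"
print(query_time_ms(sample), is_slow_dns(sample))
```

To use it on live data, pass in the captured standard output of a dig run, for example from subprocess.run(["dig", "+stats", hostname], capture_output=True, text=True).stdout.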
Analyze TCP connection setup: The TCP handshake costs 1 RTT. Capture timings with: curl -w "%{time_namelookup} %{time_connect} %{time_starttransfer} %{time_total}\n" -o /dev/null -s <url>. The values are cumulative seconds from the start of the request, so TCP connect time is time_connect minus time_namelookup. A large gap between the two indicates routing or firewall inspection latency, not application latency.
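Because curl's timers are cumulative, it can help to convert one output line into per-phase durations. The sketch below assumes the four-field format printed by the curl command in this step; the function name and the phase labels are illustrative.

```python
def phases_from_curl(line):
    """Split 'namelookup connect starttransfer total' (cumulative seconds)
    into per-phase durations in milliseconds."""
    namelookup, connect, starttransfer, total = (float(x) for x in line.split())
    return {
        "dns_ms": round(namelookup * 1000, 1),
        "tcp_connect_ms": round((connect - namelookup) * 1000, 1),
        # For https URLs this phase also contains the TLS handshake.
        "wait_ms": round((starttransfer - connect) * 1000, 1),
        "download_ms": round((total - starttransfer) * 1000, 1),
    }

# Hypothetical curl output line.
print(phases_from_curl("0.012 0.045 0.210 0.260"))
```

In this made-up example the TCP connect phase is 33 ms while the wait for the first byte is 165 ms, which would point away from the network and toward the server side.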
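Check for packet loss and retransmissions: ss -ti shows per-connection retransmit counters (ss -s prints only a socket summary), and netstat -s | grep -i retrans shows system-wide totals. TCP retransmits cause 200-3000 ms latency spikes (RTO timer). Packet loss of 1% can cause a 10% or greater throughput loss on bulk transfers. Use iperf3 to measure bandwidth; in TCP mode it reports retransmits, and in UDP mode (-u) it reports packet loss and jitter.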
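On Linux, the system-wide counters behind netstat -s are also exposed in /proc/net/snmp, which is easier to parse. The sketch below computes a retransmission ratio from the text of that file; it is Linux-specific, the function name is hypothetical, and the sample values are invented.

```python
def tcp_retransmit_ratio(snmp_text):
    """Return RetransSegs / OutSegs from the contents of /proc/net/snmp.
    OutSegs excludes segments that carry only retransmitted data."""
    rows = [line.split()[1:] for line in snmp_text.splitlines() if line.startswith("Tcp:")]
    if len(rows) < 2:
        raise ValueError("expected a Tcp: header line and a Tcp: value line")
    counters = dict(zip(rows[0], map(int, rows[1])))
    sent = counters["OutSegs"]
    return counters["RetransSegs"] / sent if sent else 0.0

# Invented sample in the two-line header/value layout of /proc/net/snmp.
sample = (
    "Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails "
    "EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts InCsumErrors\n"
    "Tcp: 1 200 120000 -1 500 300 10 5 42 900000 1000000 12000 0 50 0\n"
)
print(f"{tcp_retransmit_ratio(sample):.2%}")
```

The counters are cumulative since boot, so for a current picture take two snapshots a known interval apart and compute the ratio on the differences.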
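Profile TLS handshake overhead: A full TLS handshake adds 1 RTT (TLS 1.3) or 2 RTTs (TLS 1.2) per new connection. openssl s_client -connect <host>:443 -debug dumps the handshake traffic but does not report timing; to time the handshake, add %{time_appconnect} to the curl -w format and subtract time_connect. TLS session resumption reduces, and HTTP/2 connection reuse eliminates, per-request TLS overhead. Check whether clients are reusing connections.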
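To see why connection reuse matters, the following back-of-the-envelope sketch estimates setup cost from the RTT figures in this step (1 RTT for TCP, 1 or 2 RTTs for a full TLS handshake). It is a simplified model that ignores session resumption, 0-RTT, and server processing time; the names are illustrative.

```python
# Round trips for a full handshake, without resumption or 0-RTT.
TLS_HANDSHAKE_RTTS = {"1.2": 2, "1.3": 1}

def new_connection_overhead_ms(rtt_ms, tls_version="1.3"):
    """Estimated setup cost of one new HTTPS connection: TCP (1 RTT) plus TLS."""
    return rtt_ms + TLS_HANDSHAKE_RTTS[tls_version] * rtt_ms

def overhead_per_request_ms(rtt_ms, requests_per_connection, tls_version="1.3"):
    """Setup cost amortized over the requests that share one connection."""
    return new_connection_overhead_ms(rtt_ms, tls_version) / requests_per_connection

# At 50 ms RTT with TLS 1.2: 150 ms per new connection,
# or 15 ms per request if ten requests reuse the connection.
print(new_connection_overhead_ms(50, "1.2"), overhead_per_request_ms(50, 10, "1.2"))
```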
Identify bandwidth saturation: On the server, sar -n DEV 1 10 shows per-interface throughput; nload or iftop shows real-time bandwidth. Sustained utilization near line rate causes queuing delay, which can add 10-100 ms. On cloud instances, also check for network credit exhaustion on burstable instance types (AWS T-series, GCP e2-micro).
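Utilization can also be computed directly from two readings of an interface byte counter. The sketch below shows the arithmetic; the function name is hypothetical, and it assumes you know the link speed in bits per second and have read the counter at the start and end of the interval.

```python
def utilization_pct(bytes_start, bytes_end, interval_s, link_bps):
    """Percentage of link capacity used between two byte-counter readings."""
    bits_transferred = (bytes_end - bytes_start) * 8
    return 100.0 * bits_transferred / (interval_s * link_bps)

# Hypothetical reading: 1,187,500,000 bytes in 10 s on a 1 Gbit/s link.
print(utilization_pct(0, 1_187_500_000, 10, 1_000_000_000))  # 95.0
```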
Examine receive and transmit buffers: Small TCP buffers limit throughput to roughly buffer size divided by RTT. Check the limits with sysctl net.core.rmem_max net.core.wmem_max. On high-bandwidth, long-latency paths (BDP > 4 MB), default buffers of about 256 KB are the bottleneck. Apply RFC 6349 buffer sizing: BDP (bytes) = bandwidth (bits/s) × RTT (s) / 8.
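The buffer-sizing arithmetic can be worked through in a few lines. This sketch computes the bandwidth-delay product and, in the other direction, the throughput ceiling that a given buffer imposes; the function names and the example link (1 Gbit/s, 40 ms RTT) are assumptions chosen for illustration.

```python
def bdp_bytes(bandwidth_bps, rtt_ms):
    """Bandwidth-delay product in bytes: bandwidth (bits/s) x RTT (s) / 8."""
    return bandwidth_bps * rtt_ms / 8000

def max_throughput_bps(buffer_bytes, rtt_ms):
    """Upper bound on throughput when at most one buffer of data is in flight per RTT."""
    return buffer_bytes * 8000 / rtt_ms

# A 1 Gbit/s path with 40 ms RTT needs about 5 MB in flight to stay full.
print(bdp_bytes(1_000_000_000, 40))       # 5000000.0
# A 256 KiB buffer on the same path caps throughput at about 52 Mbit/s.
print(max_throughput_bps(256 * 1024, 40))  # 52428800.0
```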
Correlate with infrastructure metrics: Check CPU steal time (a noisy neighbor on shared hosts), NIC driver errors (ethtool -S <iface>), VPC Flow Log rejections (AWS: REJECT actions in flow logs), and load balancer error rates. Packets dropped in the hypervisor layer appear as jitter in the guest OS.
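For the CPU steal check, one option on Linux is to compare two snapshots of the aggregate "cpu" line in /proc/stat, where steal is the eighth counter. The sketch below is Linux-specific, the function name is hypothetical, and the sample snapshots are invented.

```python
def steal_pct(stat_before, stat_after):
    """Percentage of CPU time stolen by the hypervisor between two /proc/stat snapshots."""
    def cpu_fields(text):
        for line in text.splitlines():
            if line.startswith("cpu "):  # aggregate line, not cpu0, cpu1, ...
                # user nice system idle iowait irq softirq steal
                return [int(v) for v in line.split()[1:9]]
        raise ValueError("no aggregate cpu line found")
    deltas = [a - b for a, b in zip(cpu_fields(stat_after), cpu_fields(stat_before))]
    total = sum(deltas)
    return 100.0 * deltas[7] / total if total else 0.0

# Invented snapshots taken some seconds apart.
before = "cpu  100 0 50 800 10 0 5 35 0 0\n"
after = "cpu  180 0 90 1500 20 0 10 200 0 0\n"
print(steal_pct(before, after))  # 16.5
```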
Test end-to-end with synthetic monitoring: Deploy probes from multiple geographic locations using tools such as Prometheus Blackbox Exporter, Grafana Cloud Synthetic Monitoring, or Catchpoint. Reproduce the problem from outside your network to distinguish client-side from server-side latency, and compare internal with external measurements.