From posthog
Investigates server/infrastructure metric anomalies in PostHog Metrics to find probable causes with evidence. Correlates metrics with logs and traces for root-cause analysis.
How this skill is triggered — by the user, by Claude, or both
Slash command
/posthog:investigating-metric-anomaliesThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
The job: go from a metric symptom ("ingestion lag is rising") to a probable cause with evidence, fast. The metric tells you _what_ and _when_; logs and traces tell you _why_. Follow the loop below — it front-loads the cheap, high-information calls and only fans out when the blast radius is unclear.
The job: go from a metric symptom ("ingestion lag is rising") to a probable cause with evidence, fast. The metric tells you what and when; logs and traces tell you why. Follow the loop below — it front-loads the cheap, high-information calls and only fans out when the blast radius is unclear.
If you have the exact metric name, skip ahead. Otherwise call metric-names-list with a substring from the symptom (lag, error, latency, queue). The returned metric_type decides the lens: counters (sum) are only meaningful as rate/increase, gauges as avg, histograms as histogram_quantile.
Call characterize-metric-anomaly with the metric name and anomalyFrom (the alert fire time, or when the user says it started looking wrong; subtract some margin if unsure). It compares against the preceding window by default and answers:
direction, change_ratio, anomaly_peak vs baseline_mean. If direction is flat, your window or metric is wrong — widen the window, or compare against the same window yesterday via baselineFrom/baselineTo (daily-pattern metrics often look "anomalous" against the immediately-preceding hours).onset_time — treat this timestamp as the pivot for everything that follows.top_movers — label values whose behavior changed. One mover (a single pod, shard, or endpoint) means a localized culprit; everything moving together means a shared cause (an upstream dependency, a deploy, infra).Use query-metrics to test the hypotheses the report raises:
filters pinning the suspicious label value, grouped by a second key, to localize further (pod → container, endpoint → status code).clauses + formula (errors / requests) to separate rate changes from volume changes.interval so the grids align visually.Pivot into logs and traces using the same service and a window bracketing onset_time (a few buckets before, through the peak):
query-logs (follow its own discover-first workflow) filtered to the implicated service.name and window, severity error first, then warn. Restarts, crash loops, connection errors, and deploy markers right before onset are the classic causes. Widen to other services in the request path if the service's own logs are clean.query-apm-spans etc.) for the same service/window — slow or erroring spans show which dependency degraded, and a trace_id from an exemplar or log line links a concrete request across all three signals.State: the symptom (metric, magnitude, onset), the probable cause (what you found in logs/traces and how its timing aligns with the onset), the blast radius (which services/labels are affected, from the movers and grouped queries), and the confidence level. If the cause is still ambiguous, say which hypothesis the evidence favors and what would disambiguate (e.g. "the lag began draining at 20:12 — consistent with a consumer restart; check who restarted it").
metric-names-list with value: "lag" → logs_rate_limiter_message_lag_seconds (histogram) and friends.characterize-metric-anomaly on it with anomalyFrom = alert time → direction up, change ratio 40x, onset_time 20:10, top mover service_name = logs-ingestion (the other services' lag stayed flat) — so the logs consumer specifically is behind, not the whole pipeline.query-metrics: rate of the consumer's throughput counter over the same window → throughput was zero during the gap and spiked after onset: the consumer wasn't slow, it was down, and the "rising lag" is it draining the backlog.query-logs for service.name = logs-ingestion (and its neighbors) around 20:00–20:15 → process exit + restart lines at the gap boundaries.rate/increase already handle this; never eyeball raw cumulative counter values.series with top_movers showing a vanished label value means the emitter died; pivot to logs immediately.avg can hide a screaming p95. For latency-like gauges and histograms, characterize the tail too.npx claudepluginhub anthropics/claude-plugins-official --plugin posthogGuides structured Honeycomb workflows for production issue investigations: orient with context/SLOs/triggers, broad queries/service maps, BubbleUp differentiators, trace analysis to find root causes like latency spikes or error surges.
Guides debugging of Kubernetes applications and alerts using VictoriaMetrics metrics, VictoriaLogs, VictoriaTraces via 4-phase protocol with subagents.
Monitors PostHog log ingestion for volume bursts, severity shifts, service silence, and trace-correlated anomalies. Emits findings only when confidence is high.