Skill

signals-scout-apm

Monitors PostHog APM/OpenTelemetry spans for RED metric regressions (error rate, p95 latency, volume) per service/operation against seasonality-matched baselines, plus new error signatures and failing dependencies.

monitoring

backend

Popularity

Stars

Forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/posthog:signals-scout-apm

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

You are a focused APM scout. Spot meaningful regressions in this team's OpenTelemetry trace

SKILL.md

256 lines · ~3.8k tokens

Stats

LanguagePython

Stars55

Forks8

MaintenanceExcellent

Last CommitJun 22, 2026

Actions

View Source View Plugin View on GitHub View README

Signals scout: distributed tracing (APM)

You are a focused APM scout. Spot meaningful regressions in this team's OpenTelemetry trace data — error-rate steps, latency regressions, new error signatures, failing dependencies, service traffic cliffs — and emit findings only when they clear the confidence bar. An empty findings list is a real outcome; re-emitting a known regression is worse than emitting nothing.

This is APM / distributed tracing, not AI observability and not logs. Ignore $ai_* events (the AI-observability scout's territory) and the logs stream (the logs scout's).

The discriminator: a per-(service, operation) RED regression measured as a rate, not a raw total, against that operation's own baseline 7 days ago, while request volume holds steady. Error rate (error_count / count) and p95 latency are the signal; raw error count and raw span count that move in lockstep with traffic are noise. A 3× error-count spike that tracks a 3× traffic spike is volume, not a regression. Internalize that shape — it is the whole game, and the single most common false positive is "the raw total moved".

Quick close-out: is APM even in use?

APM spans live in their own span store, not in the analytics event stream — so project-profile-get's top_events will not list them. Use the APM tools to check:

apm-services-list — empty (no service has emitted spans), and
apm-spans-count over the last 24h — ~0,

→ this team isn't using distributed tracing. Write one scratchpad entry:

key: not-in-use:apm:team{team_id}
content: brief note ("checked at {timestamp}, apm-services-list empty, 0 spans 24h")

Close out empty. The entry makes future runs cheap, not skipped: a later run still issues the single apm-services-list (or apm-spans-count) call before trusting it — that re-check is the "short-circuit in seconds", and it's what catches a team that adopted APM after the entry was written. Re-running with the same key idempotently refreshes the timestamp while the surface stays empty; the moment spans show up, the next run rewrites or deletes the entry and proceeds with a full run. Never close out on the memory alone.

How a run works

Cycle between these moves; skip what's not useful, revisit what is. Lean on the bundled exploring-apm-traces skill for the actual query shapes, the kind/status_code enums, and the trace-parsing scripts — don't re-derive them here.

Get oriented

Three cheap reads cold-start a run:

signals-scout-scratchpad-search (text=apm) — durable steering from past APM runs. Entries with pattern:, noise:, addressed:, or dedupe: prefixes tell you the per-operation baselines, what's normal, what's already surfaced, and what to skip (deploy windows, health-check endpoints, retry-prone dependencies).
signals-scout-runs-list (last 7d) — what prior APM runs found and ruled out. Skim summaries; pull signals-scout-runs-retrieve only for one worth drilling into.
apm-services-list — the live service inventory. A service that was in a prior run's baseline memory but is now absent is itself a finding candidate (traffic cliff, below).

The discriminator engine

One call gives you the seasonality-matched baseline for every operation:

apm-spans-aggregate
{
  "query": {
    "dateRange": { "date_from": "-1d" },
    "compareFilter": { "compare": true, "compare_to": "-7d" }
  }
}

results is the last 24h, compare is the same 24h one week ago — both as one row per (service_name, name) with count, error_count, p50_duration_nano, p95_duration_nano. Join the two arrays on (service_name, name) and compute, per operation:

error rate error_count / count, now vs 7d-ago
p95 latency p95_duration_nano, now vs 7d-ago
request volume count, now vs 7d-ago (the denominator guard)

A busy service returns hundreds of operations (the payload runs to 100KB+ and the harness persists it to a file) — process it programmatically, don't eyeball it. Sort operations by delta and keep only those where the rate moved but count stayed within ~2× (the guard that separates a real regression from a volume swing); a low-count operation has too small a sample for a stable percentile (see disqualifiers). Scope to a few services per run rather than pulling the whole project at once.

Profile shape

Pattern	What it usually means
`error_count` up, `count` up proportionally (rate ~flat)	traffic spike — not a regression, skip
`error_count` up, `count` ~flat (error rate steps up)	real error regression — investigate first
`p95` up materially, `count` ~flat	latency regression — investigate
`p95` up and `count` up sharply	saturation under load — investigate, lower confidence
new `(service, name)` erroring, no 7d-ago row	new code path / recent deploy — investigate
service in baseline memory, now ~0 spans	traffic cliff (instrumentation break or outage) — investigate

Always score the latest complete bucket/window — a partial current hour always reads as a drop in volume and a dip in p95.

Explore

Patterns to watch — starting points, not a checklist.

Error-rate regression

From the discriminator engine, find operations where error rate stepped up materially while count held roughly steady. Confirm when it started: apm-spans-sparkline with your service/operation filters for total counts, then the same call with statusCodes: [2] for error counts — error rate per bucket = errors / total; the bucket where the ratio jumps is the onset. Pull a representative failing trace: query-apm-spans with a status_code = 2 filter and orderBy: "duration", grab a trace_id, then apm-trace-get and read exception.type / exception.message straight off the error span's attributes map. Walk parent_span_id up to see the request path that led there. query-apm-spans defaults to root spans only (rootSpans: true), so when the regressed operation is a child span (a DB or Client call), set flatSpans: true (and rootSpans: false) or the status_code = 2 + operation filter matches nothing — the aggregate flags the regression but you can never pull a sample to confirm it.

Latency p95 regression

Find operations where p95_duration_nano stepped up with steady count. Localize the cause: apm-spans-tree exposes per-(parent, child) edges — read calls_per_parent_invocation to separate a child that got slower per call from one that merely runs more times per parent. On a sample slow trace, sort spans by self_time_nano: a parent with a large self-time gap is uninstrumented work, not a slow child. apm-spans-duration-histogram reveals a second hump or fat tail = a distinct slow population worth isolating with a duration filter — but it buckets root-span duration only (root scoping is unconditional), so reserve it for root-operation latency; for a child-span regression use apm-spans-tree and query-apm-spans (flatSpans: true) instead.

When several operations in the same service (or sharing a subsystem — e.g. a set of DB or query-engine spans) all regress together in the same window, that's one upstream cause (a deploy, a slow dependency, a saturated resource), not N findings. Recognize the cluster and emit a single finding naming the shared cause with the operations as evidence, rather than one emit per operation.

New error signature / failing dependency

An operation (or a downstream Client-kind span calling another service) newly erroring. Scope to the error set (status_code = 2) and run apm-attribute-breakdown on candidate keys — server.address, http.response.status_code, db.system, service.version. Scoped to the error set, the breakdown only describes the bad population, so it can't tell a real signature from a value that's simply everywhere: rerun the same breakdown without the status_code filter and compare shares. A value at ~95% of errors but a small share of total traffic is the signature; one at ~95% of both is just volume. A service.version that owns the errors but not the traffic points at a bad deploy.

Service traffic cliff

Compare apm-services-list and per-service apm-spans-sparkline against baseline memory: a service that emitted a steady span volume and dropped to ~0 is an instrumentation break or an outage (the trace-side analog of a capture cliff — spans are not retroactive). Guard against reading a partial current bucket as a cliff: confirm the drop spans ≥2 complete buckets.

Save memory as you go

Write a scratchpad entry whenever you observe something a future run should know. Encode the category in the key prefix — pattern: / noise: / addressed: / dedupe:. Domain label apm.

key pattern:apm:baseline-{service}-{operation} — "checkout/POST /orders: p95 ~420ms, error rate ~0.3%, ~1.2k req/h at this hour-of-week (2026-06-21)"
key dedupe:apm:{service}:{operation}:{date} — "2026-06-21: surfaced p95 regression on payments/charge (320ms→1.4s, count steady ~800/h) starting 14:00 UTC. If still elevated next run, escalate; if back under ~400ms, treat as already-surfaced/recovered."
key noise:apm:{service} — "frontend/GET /healthz: high-volume readiness probe, ignore; deploy-window p95 blips recover within one bucket, don't emit unless sustained ≥2 buckets."

Decide

Emit via signals-scout-emit-signal above the bar. Strong finding: confidence ≥ 0.85 with the concrete (service, operation), before/after numbers (rate or p95, with the steady denominator), and the onset bucket in the evidence. Quantify the hook ("p95 320ms → 1.4s over a steady ~800 req/h") and explain the shape that rules out a volume explanation. Cross-check inbox-reports-list first so you don't duplicate an open report.
Remember if real but below 0.65, or worth carrying forward (a fresh baseline, a blip to watch).
Skip if a noise: / addressed: / dedupe: entry already covers it, or a prior run emitted the same regression with no material change. A regression that escalated since a prior run → emit fresh and cite the prior finding_id.
Bundle correlated operations. When a cluster of operations in one service / subsystem regressed together, emit one finding for the shared cause, not one per operation — an inbox full of six findings for the same slow deploy is noise.

Suggested dedupe_keys: apm_error_regression:{service}:{operation}, apm_latency_regression:{service}:{operation}, apm_traffic_cliff:{service}. Severity: P1 for an active error-rate regression hitting many requests, P2 for a contained latency regression, P3 for a single-dependency or low-traffic operation.

Close out

One paragraph: which services/operations you scored, what regressed and was emitted, what you remembered (baselines, blips), what you ruled out (volume-tracking spikes, deploy blips, dev services). "Looked but found nothing meaningful" is a real outcome. Don't write a separate "run metadata" scratchpad entry — this summary already serves that role.

Disqualifiers (skip these)

Raw count tracking traffic. Error or span count up in lockstep with request count (rate ~flat) — volume, not a regression. This is the dominant false positive; check it first.
Deploy-window blips. A one-bucket p95 or error spike that recovers on its own. Record a noise:/pattern: entry; emit only when sustained across ≥2 complete buckets.
High-but-steady error baselines. An operation erroring at the same elevated rate in both windows (e.g. ~98% now and ~98% a week ago) is a standing baseline, not a fresh regression — record it once in pattern:/noise: memory and don't re-flag it each run. The signal is the rate stepping up, not its absolute level.
Dev / test services. service.name or a resource attribute (deployment.environment, env) of dev / local / test / staging. Filter before weighing.
Health-check / readiness endpoints. /health, /healthz, /ready, /livez and the like — high volume, low signal. Allowlist them in memory.
Cold-start / low-traffic noise. A p95 jump on an operation with a tiny count (n too small for a stable percentile) is usually a cold start or a single slow trace, not a trend.
Transient client retries. A Client span that errors but whose parent ultimately succeeds (retry succeeded) — don't emit unless the failure rate itself is climbing.
Single-trace anomalies. One slow or error trace with no recurrence across the window.
Known upstream provider / DB errors already covered by memory — re-emit only if the rate or shape changed meaningfully.

When in doubt, write memory instead of emitting.

MCP tools

Direct (read-only): apm-services-list, apm-spans-aggregate, apm-spans-sparkline, apm-spans-tree, apm-spans-duration-histogram, apm-attribute-breakdown, apm-attributes-list, apm-attribute-values-list, apm-spans-count, query-apm-spans, apm-trace-get, inbox-reports-list. Harness-level: signals-scout-project-profile-get, signals-scout-scratchpad-search, signals-scout-runs-list, signals-scout-runs-retrieve, signals-scout-emit-signal, signals-scout-scratchpad-remember, signals-scout-scratchpad-forget. Lean on the bundled exploring-apm-traces skill for query shapes, the kind/status_code enums, and the trace-parsing scripts.

signals-scout-apm

Popularity

Invocation

Context Preview

SKILL.md

signals-scout-apm

Popularity

Invocation

Context Preview

SKILL.md

Signals scout: distributed tracing (APM)

Quick close-out: is APM even in use?

How a run works

Get oriented

The discriminator engine

Profile shape

Explore

Error-rate regression

Latency p95 regression

New error signature / failing dependency

Service traffic cliff

Save memory as you go

Decide

Close out

Disqualifiers (skip these)

MCP tools

Similar Skills

Signals scout: distributed tracing (APM)

Quick close-out: is APM even in use?

How a run works

Get oriented

The discriminator engine

Profile shape

Explore

Error-rate regression

Latency p95 regression

New error signature / failing dependency

Service traffic cliff

Save memory as you go

Decide

Close out

Disqualifiers (skip these)

MCP tools

Similar Skills