Structured workflows for investigating production issues in Honeycomb — the sequence of tool calls (context priming, broad query, BubbleUp, trace analysis, verification) and how to chain results between steps to reach root causes. Trigger phrases: "investigate production issue", "debug latency spike", "find root cause", "use BubbleUp", "analyze traces", "debug an outage", "why is my API slow", "errors are increasing", "health check", "SLO burning", or any request to investigate or debug production problems.
From honeycomb. Install: `npx claudepluginhub honeycombio/agent-skill --plugin honeycomb`

This skill uses the workspace's default tool permissions.
Structured workflows for debugging production issues. The MCP tools document their own parameters — this skill focuses on the sequence of tool calls and how to interpret results to reach a root cause.
This workflow implements the core analysis loop (Define → Visualize → Investigate → Evaluate) from the observability-fundamentals skill. If BubbleUp returns nothing useful, the issue is often an instrumentation gap — add the missing attributes (see the otel-instrumentation skill) and try again.
- `get_workspace_context` → environments and datasets
- `get_slos` → any SLOs in violation? (frames severity)
- `get_triggers` → any alerts firing? (narrows scope)
- `find_queries` → has anyone investigated this before?

Run a broad query to see the shape of the issue:
Also call get_service_map — it shows P95 durations between services and can immediately reveal which dependency is slow.
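The context-priming sequence above can be sketched as a chain of tool calls. The tool names match this skill; everything else is a stand-in — `call_tool` represents however your MCP client invokes a tool, and the response shapes are invented for illustration.

```python
# Hypothetical sketch of step 1 (context priming). `call_tool` is a
# stand-in for an MCP tool invocation; here it returns canned data.

def call_tool(name, stub_responses, **params):
    """Pretend to invoke a Honeycomb MCP tool; return a stubbed response."""
    return stub_responses[name]

def prime_context(stub_responses):
    ctx = {}
    # Which environments/datasets exist?
    ctx["workspace"] = call_tool("get_workspace_context", stub_responses)
    # SLOs in violation frame severity (assumed shape: negative budget = violated).
    ctx["violated_slos"] = [s for s in call_tool("get_slos", stub_responses)
                            if s["budget_remaining"] < 0]
    # Firing triggers narrow scope.
    ctx["firing_triggers"] = [t for t in call_tool("get_triggers", stub_responses)
                              if t["triggered"]]
    # Prior queries may show someone already investigated this.
    ctx["prior_queries"] = call_tool("find_queries", stub_responses)
    return ctx

stubs = {
    "get_workspace_context": {"environments": ["prod"], "datasets": ["api"]},
    "get_slos": [{"name": "api-latency", "budget_remaining": -12.5}],
    "get_triggers": [{"name": "error-rate", "triggered": True}],
    "find_queries": [],
}
ctx = prime_context(stubs)
```

The point of the chain is that each result narrows the next step: a violated SLO or firing trigger tells you which dataset and signal to query broadly first.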
This is the highest-value step. Once you have a query showing the anomaly:
- `run_bubbleup` on the query result, selecting the outlier region

How to interpret BubbleUp results:
- A value concentrated in the outliers (e.g. `deployment.version=v2.3.1` is 90% of slow requests but only 20% of baseline)
- A numeric attribute that shifts in the outliers (e.g. `db.query_duration` is much higher in outliers)

After BubbleUp identifies suspects:
- `get_trace` to fetch the full trace

What to look for in the trace waterfall:
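Two things usually dominate a waterfall: the single slowest span, and "self time" — a parent duration not accounted for by its children, which often points at un-instrumented work. A minimal sketch (the span shape here is illustrative, not the actual `get_trace` response format):

```python
# Illustrative waterfall analysis over a flat span list.
# Each span: id, parent_id, name, duration_ms (invented shape).

def slowest_span(spans):
    """The span with the largest duration."""
    return max(spans, key=lambda s: s["duration_ms"])

def self_time(span, spans):
    """Parent duration minus time covered by its direct children.
    A large value suggests un-instrumented work inside the parent."""
    child_total = sum(c["duration_ms"] for c in spans
                      if c.get("parent_id") == span["id"])
    return span["duration_ms"] - child_total

spans = [
    {"id": "a", "parent_id": None, "name": "GET /checkout", "duration_ms": 900},
    {"id": "b", "parent_id": "a", "name": "db.query", "duration_ms": 120},
    {"id": "c", "parent_id": "a", "name": "http.call payments", "duration_ms": 650},
]
root = spans[0]
# The payments call dominates the children; 130 ms of the root is
# unaccounted for by any child span.
```

Here the `http.call payments` span is the obvious suspect, and the 130 ms of root self time is a candidate instrumentation gap.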
Form a hypothesis from BubbleUp + trace analysis, then confirm:
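The BubbleUp interpretation above amounts to ranking attribute values by how much more frequent they are in the outlier region than in the baseline. A minimal sketch — the input shape is invented for illustration; `run_bubbleup` returns its own format:

```python
# BubbleUp-style interpretation: rank (attribute, value) pairs by the
# difference between their outlier frequency and baseline frequency.
# Input dicts: {attribute: {value: fraction_of_events}} (assumed shape).

def rank_suspects(baseline, outliers):
    """Return (attribute, value, delta) tuples, largest delta first."""
    suspects = []
    for attr, values in outliers.items():
        for value, out_frac in values.items():
            base_frac = baseline.get(attr, {}).get(value, 0.0)
            suspects.append((attr, value, out_frac - base_frac))
    return sorted(suspects, key=lambda s: s[2], reverse=True)

# Mirrors the example in the text: v2.3.1 is 90% of slow requests
# but only 20% of baseline, so it surfaces as the top suspect.
baseline = {"deployment.version": {"v2.3.0": 0.8, "v2.3.1": 0.2}}
outliers = {"deployment.version": {"v2.3.0": 0.1, "v2.3.1": 0.9}}
top = rank_suspects(baseline, outliers)[0]
```

A large positive delta is a hypothesis, not a root cause — it still needs the trace and verification steps.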
Call create_board with:
- Latency spike: HEATMAP first → BubbleUp the slow region → trace a slow request → verify with filtered queries
- Error surge: COUNT errors grouped by `exception.message` → BubbleUp the error spike → trace an errored request → verify
- Deployment regression: P99 grouped by `deployment.version` → BubbleUp comparing new vs old → trace from new version → verify
- Dependency failure: `get_service_map` → P99 on the slow dependency → relational query (`any.service.name`) to measure user impact → trace an affected request
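The playbook choice above can be sketched as a simple dispatch from symptom to step sequence. The keyword heuristic and step strings are illustrative, not part of the skill:

```python
# Hypothetical symptom → playbook dispatch. Step lists mirror the four
# workflows above; keyword matching is a rough illustrative heuristic.

PLAYBOOKS = {
    "latency": ["HEATMAP", "run_bubbleup on slow region",
                "get_trace slow request", "verify with filtered queries"],
    "errors": ["COUNT by exception.message", "run_bubbleup on spike",
               "get_trace errored request", "verify"],
    "deployment": ["P99 by deployment.version", "run_bubbleup new vs old",
                   "get_trace from new version", "verify"],
    "dependency": ["get_service_map", "P99 on slow dependency",
                   "relational query for user impact", "get_trace"],
}

def choose_playbook(symptom):
    """Map a symptom description to a playbook key (illustrative)."""
    s = symptom.lower()
    if "slow" in s or "latency" in s:
        return "latency"
    if "error" in s or "exception" in s:
        return "errors"
    if "deploy" in s or "release" in s:
        return "deployment"
    return "dependency"
```

Whichever playbook fires, the shape is the same: broad query → BubbleUp → trace → verify.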
If you catch yourself reasoning along any of these lines, follow the workflow anyway:
- `find_columns`, expand time range, verify environment/dataset

References:

- `${CLAUDE_PLUGIN_ROOT}/skills/production-investigation/references/investigation-playbooks.md` — Step-by-step playbooks for latency spikes, error surges, deployment regressions, dependency failures, SLO budget burn, and health checks
- `${CLAUDE_PLUGIN_ROOT}/skills/production-investigation/references/bubbleup-guide.md` — Detailed BubbleUp usage: selection types, time specifications, pagination, result interpretation
- `${CLAUDE_PLUGIN_ROOT}/skills/production-investigation/references/trace-exploration.md` — Trace structure, `get_trace` parameters and view modes, waterfall analysis, span events and links