Structured workflows for investigating production issues in Honeycomb — the sequence of tool calls (context priming, broad query, BubbleUp, trace analysis, verification) and how to chain results between steps to reach root causes. Trigger phrases: "investigate production issue", "debug latency spike", "find root cause", "use BubbleUp", "analyze traces", "debug an outage", "why is my API slow", "errors are increasing", "health check", "SLO burning", or any request to investigate or debug production problems.
From honeycomb. Install: `npx claudepluginhub honeycombio/agent-skill --plugin honeycomb`

This skill uses the workspace's default tool permissions.
Structured workflows for debugging production issues. The MCP tools document their own parameters — this skill focuses on the sequence of tool calls and how to interpret results to reach a root cause.
This workflow implements the core analysis loop (Define → Visualize → Investigate → Evaluate) from the observability-fundamentals skill. If BubbleUp returns nothing useful, the issue is often an instrumentation gap — add the missing attributes (see the otel-instrumentation skill) and try again.
- `get_workspace_context` → environments and datasets
- `get_slos` → any SLOs in violation? (frames severity)
- `get_triggers` → any alerts firing? (narrows scope)
- `find_queries` → has anyone investigated this before?

Run a broad query to see the shape of the issue:
Also call get_service_map — it shows P95 durations between services and can immediately reveal which dependency is slow.
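The context-priming sequence above can be sketched as a chain of tool calls. The tool names match this skill; everything else is a stand-in — `call_tool` represents however your MCP client invokes a tool, and the response shapes are invented for illustration.

```python
# Hypothetical sketch of step 1 (context priming). `call_tool` is a
# stand-in for an MCP tool invocation; here it returns canned data.

def call_tool(name, stub_responses, **params):
    """Pretend to invoke a Honeycomb MCP tool; return a stubbed response."""
    return stub_responses[name]

def prime_context(stub_responses):
    ctx = {}
    # Which environments/datasets exist?
    ctx["workspace"] = call_tool("get_workspace_context", stub_responses)
    # SLOs in violation frame severity (assumed shape: negative budget = violated).
    ctx["violated_slos"] = [s for s in call_tool("get_slos", stub_responses)
                            if s["budget_remaining"] < 0]
    # Firing triggers narrow scope.
    ctx["firing_triggers"] = [t for t in call_tool("get_triggers", stub_responses)
                              if t["triggered"]]
    # Prior queries may show someone already investigated this.
    ctx["prior_queries"] = call_tool("find_queries", stub_responses)
    return ctx

stubs = {
    "get_workspace_context": {"environments": ["prod"], "datasets": ["api"]},
    "get_slos": [{"name": "api-latency", "budget_remaining": -12.5}],
    "get_triggers": [{"name": "error-rate", "triggered": True}],
    "find_queries": [],
}
ctx = prime_context(stubs)
```

The point of the chain is that each result narrows the next step: a violated SLO or firing trigger tells you which dataset and signal to query broadly first.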
This is the highest-value step. Once you have a query showing the anomaly:
- `run_bubbleup` on the query result, selecting the outlier region

How to interpret BubbleUp results:
- A value concentrated in the outliers (e.g. `deployment.version=v2.3.1` is 90% of slow requests but only 20% of baseline)
- A numeric attribute that shifts in the outliers (e.g. `db.query_duration` is much higher in outliers)

After BubbleUp identifies suspects:
- `get_trace` to fetch the full trace

What to look for in the trace waterfall:
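Two things usually dominate a waterfall: the single slowest span, and "self time" — a parent duration not accounted for by its children, which often points at un-instrumented work. A minimal sketch (the span shape here is illustrative, not the actual `get_trace` response format):

```python
# Illustrative waterfall analysis over a flat span list.
# Each span: id, parent_id, name, duration_ms (invented shape).

def slowest_span(spans):
    """The span with the largest duration."""
    return max(spans, key=lambda s: s["duration_ms"])

def self_time(span, spans):
    """Parent duration minus time covered by its direct children.
    A large value suggests un-instrumented work inside the parent."""
    child_total = sum(c["duration_ms"] for c in spans
                      if c.get("parent_id") == span["id"])
    return span["duration_ms"] - child_total

spans = [
    {"id": "a", "parent_id": None, "name": "GET /checkout", "duration_ms": 900},
    {"id": "b", "parent_id": "a", "name": "db.query", "duration_ms": 120},
    {"id": "c", "parent_id": "a", "name": "http.call payments", "duration_ms": 650},
]
root = spans[0]
# The payments call dominates the children; 130 ms of the root is
# unaccounted for by any child span.
```

Here the `http.call payments` span is the obvious suspect, and the 130 ms of root self time is a candidate instrumentation gap.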
Form a hypothesis from BubbleUp + trace analysis, then confirm:
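The BubbleUp interpretation above amounts to ranking attribute values by how much more frequent they are in the outlier region than in the baseline. A minimal sketch — the input shape is invented for illustration; `run_bubbleup` returns its own format:

```python
# BubbleUp-style interpretation: rank (attribute, value) pairs by the
# difference between their outlier frequency and baseline frequency.
# Input dicts: {attribute: {value: fraction_of_events}} (assumed shape).

def rank_suspects(baseline, outliers):
    """Return (attribute, value, delta) tuples, largest delta first."""
    suspects = []
    for attr, values in outliers.items():
        for value, out_frac in values.items():
            base_frac = baseline.get(attr, {}).get(value, 0.0)
            suspects.append((attr, value, out_frac - base_frac))
    return sorted(suspects, key=lambda s: s[2], reverse=True)

# Mirrors the example in the text: v2.3.1 is 90% of slow requests
# but only 20% of baseline, so it surfaces as the top suspect.
baseline = {"deployment.version": {"v2.3.0": 0.8, "v2.3.1": 0.2}}
outliers = {"deployment.version": {"v2.3.0": 0.1, "v2.3.1": 0.9}}
top = rank_suspects(baseline, outliers)[0]
```

A large positive delta is a hypothesis, not a root cause — it still needs the trace and verification steps.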
Call create_board with:
- Latency spike: HEATMAP first → BubbleUp the slow region → trace a slow request → verify with filtered queries
- Error surge: COUNT errors grouped by `exception.message` → BubbleUp the error spike → trace an errored request → verify
- Deployment regression: P99 grouped by `deployment.version` → BubbleUp comparing new vs old → trace from new version → verify
- Dependency failure: `get_service_map` → P99 on the slow dependency → relational query (`any.service.name`) to measure user impact → trace an affected request
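The playbook choice above can be sketched as a simple dispatch from symptom to step sequence. The keyword heuristic and step strings are illustrative, not part of the skill:

```python
# Hypothetical symptom → playbook dispatch. Step lists mirror the four
# workflows above; keyword matching is a rough illustrative heuristic.

PLAYBOOKS = {
    "latency": ["HEATMAP", "run_bubbleup on slow region",
                "get_trace slow request", "verify with filtered queries"],
    "errors": ["COUNT by exception.message", "run_bubbleup on spike",
               "get_trace errored request", "verify"],
    "deployment": ["P99 by deployment.version", "run_bubbleup new vs old",
                   "get_trace from new version", "verify"],
    "dependency": ["get_service_map", "P99 on slow dependency",
                   "relational query for user impact", "get_trace"],
}

def choose_playbook(symptom):
    """Map a symptom description to a playbook key (illustrative)."""
    s = symptom.lower()
    if "slow" in s or "latency" in s:
        return "latency"
    if "error" in s or "exception" in s:
        return "errors"
    if "deploy" in s or "release" in s:
        return "deployment"
    return "dependency"
```

Whichever playbook fires, the shape is the same: broad query → BubbleUp → trace → verify.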
If you catch yourself reasoning along any of these lines, follow the workflow anyway:
- `find_columns`, expand time range, verify environment/dataset

References:

- `${CLAUDE_PLUGIN_ROOT}/skills/production-investigation/references/investigation-playbooks.md` — Step-by-step playbooks for latency spikes, error surges, deployment regressions, dependency failures, SLO budget burn, and health checks
- `${CLAUDE_PLUGIN_ROOT}/skills/production-investigation/references/bubbleup-guide.md` — Detailed BubbleUp usage: selection types, time specifications, pagination, result interpretation
- `${CLAUDE_PLUGIN_ROOT}/skills/production-investigation/references/trace-exploration.md` — Trace structure, `get_trace` parameters and view modes, waterfall analysis, span events and links