From honeycomb
Guides structured Honeycomb workflows for investigating production issues: orient with workspace context, SLOs, and triggers; run broad queries and check the service map; use BubbleUp to find what differs in the outliers; then analyze traces to find the root cause of problems such as latency spikes or error surges.
Slash command: /honeycomb:production-investigation
Structured workflows for debugging production issues. The MCP tools document their own parameters — this skill focuses on the sequence of tool calls and how to interpret results to reach a root cause.
This workflow implements the core analysis loop (Define → Visualize → Investigate → Evaluate) from the observability-fundamentals skill. If BubbleUp returns nothing useful, the issue is often an instrumentation gap — add the missing attributes (see the otel-instrumentation skill) and try again.
get_workspace_context → environments and datasets
get_slos → any SLOs in violation? (frames severity)
get_triggers → any alerts firing? (narrows scope)
find_queries → has anyone investigated this before?
Run a broad query to see the shape of the issue:
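A broad first query could look roughly like the sketch below: no filters and no breakdowns, just overall volume plus the latency distribution. The field names follow the general shape of a Honeycomb query specification, but the `duration_ms` column, the two-hour window, and the `broad_query` helper are assumptions for illustration; the MCP query tool documents its own parameters and may expect a different shape.

```python
import json


def broad_query(duration_column: str = "duration_ms", time_range_s: int = 7200) -> dict:
    """Build a wide, unfiltered query: event count plus a latency heatmap.

    Assumption: the dataset has a numeric duration column named `duration_ms`.
    """
    return {
        "time_range": time_range_s,  # seconds back from now (assumed window)
        "calculations": [
            {"op": "COUNT"},
            {"op": "HEATMAP", "column": duration_column},
        ],
        # Deliberately no "filters" or "breakdowns": the goal is the overall shape.
    }


if __name__ == "__main__":
    print(json.dumps(broad_query(), indent=2))
```

Starting without filters keeps the anomaly visible in context, so the outlier region can be selected for BubbleUp afterwards.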
Also call get_service_map — it shows P95 durations between services and can immediately reveal which dependency is slow.
This is the highest-value step. Once you have a query showing the anomaly:
run_bubbleup on the query result, selecting the outlier region
How to interpret BubbleUp results:
A categorical value that is overrepresented in the outliers (e.g. deployment.version=v2.3.1 is 90% of slow requests but only 20% of baseline)
A numeric field whose values are shifted in the outliers (e.g. db.query_duration is much higher in outliers)
After BubbleUp identifies suspects:
get_trace to fetch the full trace
What to look for in the trace waterfall:
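One common thing to check in a waterfall is where the time is actually spent. The sketch below computes each span's self time (its duration minus the duration of its direct children) and picks the largest. The `Span` fields and the sample trace are hypothetical and do not reflect the actual get_trace output format; with concurrent children the subtraction is only an approximation, so it is clamped at zero.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    duration_ms: float


def self_times(spans: List[Span]) -> Dict[str, float]:
    """Return each span's duration minus the summed duration of its direct children."""
    child_total: Dict[str, float] = {}
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] = child_total.get(s.parent_id, 0.0) + s.duration_ms
    # Clamp at zero: parallel children can sum to more than the parent's duration.
    return {
        s.span_id: max(s.duration_ms - child_total.get(s.span_id, 0.0), 0.0)
        for s in spans
    }


def slowest_by_self_time(spans: List[Span]) -> Span:
    """Pick the span that spends the most time in its own work."""
    times = self_times(spans)
    return max(spans, key=lambda s: times[s.span_id])


# Hypothetical trace: most of the request time sits inside one database span.
example_trace = [
    Span("root", None, "GET /checkout", 1000.0),
    Span("db", "root", "db.query", 800.0),
    Span("render", "root", "render", 100.0),
    Span("fetch", "db", "row_fetch", 50.0),
]

if __name__ == "__main__":
    print(slowest_by_self_time(example_trace).name)
```

Ranking by self time rather than total duration avoids always blaming the root span, which is the longest by construction.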
Form a hypothesis from BubbleUp + trace analysis, then confirm:
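A confirmation step can be sketched as a pair of filtered queries: one restricted to the suspect attribute value and one excluding it, so the two P99 values can be compared side by side. As above, the spec shape follows the general form of a Honeycomb query specification, while the `duration_ms` column, the time window, and both helper functions are assumptions for illustration rather than the MCP tools' documented parameters.

```python
from typing import Tuple


def p99_query(column: str, op: str, value: str,
              duration_column: str = "duration_ms",
              time_range_s: int = 7200) -> dict:
    """Build a P99 + COUNT query restricted by a single filter."""
    return {
        "time_range": time_range_s,
        "calculations": [
            {"op": "P99", "column": duration_column},
            {"op": "COUNT"},
        ],
        "filters": [{"column": column, "op": op, "value": value}],
    }


def confirmation_pair(column: str, value: str) -> Tuple[dict, dict]:
    """Return (suspect-only query, everything-else query) for comparison."""
    return p99_query(column, "=", value), p99_query(column, "!=", value)


if __name__ == "__main__":
    suspect, rest = confirmation_pair("deployment.version", "v2.3.1")
    print(suspect)
    print(rest)
```

If the suspect-only P99 is clearly worse than the rest over the same window, the hypothesis is supported; if the two look alike, the suspect is probably a coincidence and another BubbleUp dimension deserves a look.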
Call create_board with:
Latency spike: HEATMAP first → BubbleUp the slow region → trace a slow request → verify with filtered queries
Error surge: COUNT errors grouped by exception.message → BubbleUp the error spike → trace an errored request → verify
Deployment regression: P99 grouped by deployment.version → BubbleUp comparing new vs old → trace from new version → verify
Dependency failure: get_service_map → P99 on the slow dependency → relational query (any.service.name) to measure user impact → trace an affected request
If you find yourself reasoning any of these, follow the workflow anyway:
find_columns, expand time range, verify environment/dataset
${CLAUDE_PLUGIN_ROOT}/skills/production-investigation/references/investigation-playbooks.md - Step-by-step playbooks for latency spikes, error surges, deployment regressions, dependency failures, SLO budget burn, and health checks
${CLAUDE_PLUGIN_ROOT}/skills/production-investigation/references/bubbleup-guide.md - Detailed BubbleUp usage: selection types, time specifications, pagination, result interpretation
${CLAUDE_PLUGIN_ROOT}/skills/production-investigation/references/trace-exploration.md - Trace structure, get_trace parameters and view modes, waterfall analysis, span events and links
npx claudepluginhub honeycombio/agent-skill --plugin honeycomb