Guides debugging of Kubernetes applications and alerts using VictoriaMetrics, VictoriaLogs, and VictoriaTraces via a 4-phase protocol with subagents.
npx claudepluginhub victoriametrics/skills --plugin diagnostics
Random querying wastes time and produces misleading results. Empty results from wrong metric names look identical to "no problem exists." Jumping between signals without a hypothesis leads to thrashing.
Discover before you query. Hypothesize before you correlate. Confirm before you conclude.
If you haven't completed Phase 1, you cannot propose root causes. If you haven't correlated across at least two signal types, your conclusion is a guess.
Complete each phase before proceeding to the next.
Phase 1: Gather Signals → What's already known? What's alerting?
Phase 2: Discover and Scope → What data exists? What are the real names?
Phase 3: Hypothesize and Test → Form one theory, query to confirm or refute
Phase 4: Correlate and Confirm → Cross-reference across signal types, find root cause
Before writing any query, establish what's already known.
1. Check env var availability — run the gating check from the Subagent Dispatch section.
2. Dispatch signal-gathering subagents in parallel:
| Subagent | Condition | What it does |
|---|---|---|
| AlertManager check | VM_ALERTMANAGER_URL available | Checks VM alerts + AlertManager alerts and silences |
| Metrics discovery (alerts only) | VM_METRICS_URL available AND VM_ALERTMANAGER_URL NOT available | Checks VM alerts as fallback when AlertManager agent can't be dispatched |
If VM_ALERTMANAGER_URL IS available, the AlertManager check agent handles BOTH VM alerts and AlertManager queries — no need to dispatch a separate metrics agent for alerts.
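A minimal sketch of this gating rule in shell, using only the env vars above (the echoed labels stand in for actual Agent dispatches):

if [ -n "${VM_ALERTMANAGER_URL-}" ]; then
  echo "dispatch: AlertManager check (covers VM alerts + AlertManager)"
elif [ -n "${VM_METRICS_URL-}" ]; then
  echo "dispatch: Metrics discovery (VM alerts fallback)"
else
  echo "skip: no alert backend available"
fi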
Read the agent prompt files and dispatch in a single Agent tool call. Include in each subagent's prompt:
3. Synthesize results — once subagents return:
4. Identify which signal type to start with:
| Symptom | Start with | Then correlate with |
|---|---|---|
| Resource/rate issue | Metrics | Logs |
| Errors/crashes | Logs | Metrics |
| Latency/slow requests | Traces | Logs |
| Alert firing | Metrics (alert details) | Logs + Traces |
Never guess metric names, log field names, or service names. Discovery is not optional — it prevents the single most common investigation failure: drawing conclusions from empty results caused by wrong names.
Dispatch discovery subagents in parallel for ALL available backends. Read each agent prompt file and dispatch in a single Agent tool call. Include in each subagent's prompt:
| Subagent | Condition |
|---|---|
| Metrics discovery | VM_METRICS_URL available |
| Logs discovery | VM_LOGS_URL available |
| Traces discovery | VM_TRACES_URL available |
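A sketch of the same availability check applied to Phase 2, assuming the agent files listed in the Subagent Dispatch section:

for pair in "metrics-discovery:${VM_METRICS_URL-}" "logs-discovery:${VM_LOGS_URL-}" "traces-discovery:${VM_TRACES_URL-}"; do
  name="${pair%%:*}"   # agent file stem
  url="${pair#*:}"     # backend URL (empty means unavailable)
  [ -n "$url" ] && echo "dispatch: agents/${name}.md"
done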
Synthesize discovery results:
Consult skill references for complex queries. You do NOT know LogsQL syntax from training data — it is NOT Loki LogQL. For complex queries beyond what the subagents already ran, invoke the corresponding *-query skill or use the LogsQL Quick Reference below.
After discovery, form a specific hypothesis before querying further.
State it clearly: "I think [component X] is [failing/slow/OOM] because [evidence Y from Phase 1]."
Test minimally:
If the hypothesis is wrong:
After 3 failed hypotheses: STOP. Three wrong guesses means you're missing something fundamental. Either:
A single signal type is not proof. Correlate across at least two before concluding.
Dispatch correlation subagents in parallel for the signal types you need. Reuse the same agent prompt files from agents/, but provide specific queries rather than discovery tasks. Include in each subagent's prompt:
Example parallel dispatch for correlation:
- Metrics agent: "rate(http_requests_total{code=~'5..', namespace='myapp'}[5m]) from T1 to T2"
- Logs agent: "{namespace='myapp'} error from T1 to T2, return sample messages"
- Traces agent: "myapp with minDuration=1s from T1 to T2"

Correlation techniques:
- Pivot from a trace to its logs by searching the trace ID: trace_id:"<id>"

Only after correlation: propose root cause and remediation.
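As a sketch, that pivot against the VictoriaLogs query endpoint described later in this document (the trace ID and start values are placeholders):

curl -s "$VM_LOGS_URL/select/logsql/query" \
  --data-urlencode 'query=trace_id:"4bf92f3577b34da6"' \
  --data-urlencode 'start=2026-02-06T09:00:00Z' \
  --data-urlencode 'limit=100'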
If you catch yourself:
All of these mean: STOP. You're guessing, not investigating.
| Excuse | Reality |
|---|---|
| "I know the metric name" | Maybe. Discovery takes 2 seconds and prevents 20 minutes of chasing empty results. |
| "Alerts won't help here" | Alerts are free to check and frequently contain the exact answer. Skip at your peril. |
| "Just need to check logs quickly" | Quick log checks without discovery produce wrong field names and misleading results. |
| "Empty results = no problem" | Empty results more often mean wrong query than absent problem. Verify names first. |
| "I'll correlate later" | Single-signal conclusions are guesses. Correlate before claiming root cause. |
| "LogsQL is like LogQL/Elasticsearch" | It's not. The syntax differences cause silent failures. Consult the reference. |
Environment is controlled by env vars. Check current state:
echo "VM_METRICS_URL: $VM_METRICS_URL"
echo "VM_LOGS_URL: $VM_LOGS_URL"
echo "VM_TRACES_URL: $VM_TRACES_URL"
echo "VM_ALERTMANAGER_URL: $VM_ALERTMANAGER_URL"
if [ -n "${VM_AUTH_HEADER-}" ]; then
echo "VM_AUTH_HEADER: (set)"
else
echo "VM_AUTH_HEADER: (empty - no auth)"
fi
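When VM_AUTH_HEADER is set, include it on every request. A minimal sketch, assuming the variable holds a complete header value such as "Authorization: Bearer <token>":

if [ -n "${VM_AUTH_HEADER-}" ]; then
  curl -s -H "$VM_AUTH_HEADER" "$VM_METRICS_URL/api/v1/alerts"
else
  curl -s "$VM_METRICS_URL/api/v1/alerts"
fi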
If unsure which environment the application runs in, ask the user.
This skill dispatches parallel subagents at phase boundaries to speed up investigations. Each subagent carries embedded API reference and returns structured findings.
Before each dispatch round, check which backends are available:
echo "METRICS:${VM_METRICS_URL:+available}"
echo "LOGS:${VM_LOGS_URL:+available}"
echo "TRACES:${VM_TRACES_URL:+available}"
echo "ALERTMANAGER:${VM_ALERTMANAGER_URL:+available}"
Only dispatch subagents for backends that report available. Do not dispatch a subagent if its env var is empty or unset.
Agent prompt files live in the agents/ directory (relative to this skill's directory). Set allowed-tools: Bash(curl:*) on each subagent.

| Agent | File | Requires | Used in |
|---|---|---|---|
| AlertManager check | agents/alertmanager-check.md | VM_ALERTMANAGER_URL + VM_METRICS_URL | Phase 1 |
| Metrics discovery | agents/metrics-discovery.md | VM_METRICS_URL | Phase 2, 4 |
| Logs discovery | agents/logs-discovery.md | VM_LOGS_URL | Phase 2, 4 |
| Traces discovery | agents/traces-discovery.md | VM_TRACES_URL | Phase 2, 4 |
victoriametrics-query = Metrics only (MetricsQL/PromQL) → $VM_METRICS_URL
victorialogs-query = Logs only (LogsQL) → $VM_LOGS_URL
victoriatraces-query = Traces only (Jaeger API) → $VM_TRACES_URL
alertmanager-query = AlertManager (silences, routing) → $VM_ALERTMANAGER_URL
Never cross APIs between backends. Use the correct env var and endpoint for each data type.
AlertManager provides what VM alerts cannot: silences and inhibition state. But it's in-cluster and may be down — fall back to $VM_METRICS_URL/api/v1/alerts when unavailable.
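A minimal sketch of that fallback; /api/v2/alerts is the standard Alertmanager endpoint, and the VM endpoint is the one named above:

curl -sf "$VM_ALERTMANAGER_URL/api/v2/alerts" \
  || curl -sf "$VM_METRICS_URL/api/v1/alerts"   # best-effort: fall back to VM alerts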
| Backend | Parameter | Format | Example |
|---|---|---|---|
| VictoriaMetrics | start/end | RFC3339 string | 2026-02-06T09:00:00Z |
| VictoriaLogs | start (REQUIRED), end | RFC3339 string | 2026-02-06T09:00:00Z |
| VictoriaTraces | start/end | Unix microseconds NUMBER | 1738836000000000 (16 digits) |
| VictoriaTraces (dependencies) | endTs/lookback | Unix milliseconds NUMBER | 1738836000000 (13 digits) / 3600000 |
VictoriaLogs start is always required — omitting it scans ALL stored data (extremely expensive).
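A sketch of deriving all three formats for a one-hour window, assuming GNU date:

END_S=$(date -u +%s)                           # now, Unix seconds
START_S=$((END_S - 3600))                      # one hour ago
date -u -d "@$START_S" +%Y-%m-%dT%H:%M:%SZ     # RFC3339 for VictoriaMetrics / VictoriaLogs
echo $((START_S * 1000000))                    # Unix microseconds (16 digits) for trace search
echo $((END_S * 1000))                         # Unix milliseconds (13 digits) for dependencies endTs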
Follow this order for each signal type. For full API details and additional endpoints, invoke the corresponding query skill.
Metrics (victoriametrics-query skill):
1. Metadata: $VM_METRICS_URL/api/v1/metadata?metric=<keyword>&limit=10
2. Label values: $VM_METRICS_URL/api/v1/label/<label_name>/values (filter with match[])
3. Series: $VM_METRICS_URL/api/v1/series?limit=20 with match[]={namespace="X"}
4. Query at api/v1/query or range at api/v1/query_range (range requires start, RFC3339)

Logs (victorialogs-query skill): ALL VictoriaLogs endpoints require start (RFC3339). Use --data-urlencode for the query parameter.
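A sketch of that discovery order with curl (the keyword and namespace values are placeholders):

curl -s "$VM_METRICS_URL/api/v1/metadata?metric=http&limit=10"
curl -s "$VM_METRICS_URL/api/v1/label/namespace/values"
curl -s "$VM_METRICS_URL/api/v1/series?limit=20" --data-urlencode 'match[]={namespace="myapp"}'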
1. Stream field names: $VM_LOGS_URL/select/logsql/stream_field_names?start=<RFC3339>
2. Stream field values: $VM_LOGS_URL/select/logsql/stream_field_values?start=<RFC3339>&field=namespace
3. Facets: $VM_LOGS_URL/select/logsql/facets?start=<RFC3339>
4. Field names: $VM_LOGS_URL/select/logsql/field_names?start=<RFC3339>
5. Query: $VM_LOGS_URL/select/logsql/query?start=<RFC3339>&limit=100

Traces (victoriatraces-query skill): Trace discovery endpoints accept NO time-range parameters:
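For example, steps 2 and 3 as curl calls, with a placeholder start value:

curl -s "$VM_LOGS_URL/select/logsql/stream_field_values?start=2026-02-06T09:00:00Z&field=namespace"
curl -s "$VM_LOGS_URL/select/logsql/facets?start=2026-02-06T09:00:00Z"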
1. Services: $VM_TRACES_URL/api/services
2. Operations: $VM_TRACES_URL/api/services/<service>/operations
3. Dependencies: $VM_TRACES_URL/api/dependencies?endTs=<ms>&lookback=3600000
4. Trace search (service required, times in Unix microseconds, 16 digits): $VM_TRACES_URL/api/traces?service=<svc>&start=<µs>&end=<µs>&limit=20

For full LogsQL syntax, invoke the victorialogs-query skill. Key points:
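A sketch of a trace search; the 16-digit values are placeholder microsecond timestamps spanning an example one-hour window:

curl -s "$VM_TRACES_URL/api/traces?service=myapp&start=1738836000000000&end=1738839600000000&limit=20"
curl -s "$VM_TRACES_URL/api/traces?service=myapp&minDuration=1s&start=1738836000000000&end=1738839600000000&limit=20"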
- Filters and pipes are chained with |.
- Stream filter: {namespace="myapp"}
- Word filter: {namespace="myapp"} error
- Boolean: (error OR warning), Regex: ~"err|warn", Field-specific: level:error
- Time: _time:1h (alternative to API start/end params — use one OR the other, never both)
- Exclude: -"expected error"
- Stats: | stats by (level) count() as total

Common mistakes: | grep does NOT exist (use word filters or ~"regex"). | filter is valid ONLY after | stats. Stream field names depend on ingestion config — discover them first.
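Putting those pieces together, a sketch of one combined query (the namespace and start values are placeholders):

curl -s "$VM_LOGS_URL/select/logsql/query" \
  --data-urlencode 'query={namespace="myapp"} (error OR warning) -"expected error" | stats by (level) count() as total' \
  --data-urlencode 'start=2026-02-06T09:00:00Z'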
- Use the minDuration filter to find slow spans
- Use deriv() or increase() to quantify growth rate

| Mistake | Fix |
|---|---|
| Guessing metric names | Use metadata endpoint: $VM_METRICS_URL/api/v1/metadata?metric=keyword |
| Writing LogsQL from memory | Consult LogsQL Quick Reference above or victorialogs-query skill |
| Wrong timestamp format | See Timestamp Formats table above |
| Skipping alerts check | Query $VM_METRICS_URL/api/v1/alerts first — it's free |
| Empty results → "no problem" | Verify metric/field names exist via discovery first |
| Not using facets for log exploration | facets returns field distributions in one call |
| Not URL-encoding queries | Use --data-urlencode 'query=...' for POST requests |
| Missing start on VictoriaLogs | Omitting start scans ALL data (extremely expensive) |
| Forgetting match[] needs [] | match alone won't work — must be match[] |
| Wrong timestamp type for traces | Search uses MICROSECONDS (16 digits), dependencies use MILLISECONDS (13 digits) |
| Confusing stats_query vs stats_query_range | Instant uses time, range uses start/end/step |
| Mixing _time: filter with API start | Use one OR the other, never both |
| Searching "error" catching vmselect noise | Add -"vm_slow_query_stats" to exclude PromQL text |
| Grouping logs by cluster field | Vector logs lack cluster stream field — use kubernetes.pod_namespace |
| Blocking on AlertManager failure | Use VM alerts as primary, AlertManager as best-effort |
| Single-signal conclusion | Correlate across at least two signal types before claiming root cause |