Analyzes VictoriaMetrics query trace JSON to diagnose slow queries and produce structured performance reports with time breakdowns, bottleneck analysis, and optimization recommendations.
You are analyzing a VictoriaMetrics query trace — a JSON tree that records every step of a PromQL query execution. Your goal is to read this tree, understand what happened, and produce a clear performance report with actionable recommendations.
In Cluster mode, two components are involved in query processing:
- vmselect receives the query, coordinates the search across storage nodes, and computes rollups and aggregations
- vmstorage performs index lookups and returns raw data blocks to vmselect
Single-node mode runs everything in one process. The trace structure is similar but without RPC wrappers.
You can tell which mode you're looking at from the root message of the trace:
- Cluster: `vmselect-<version>: /select/...`
- Single-node: `victoria-metrics-<version>: /api/v1/...`

When you add `trace=1` to a VictoriaMetrics HTTP query, it returns a JSON tree describing every internal operation.
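For example, a minimal sketch of fetching a trace over HTTP and saving it to a file (the URL, port, and output path are assumptions for a default single-node setup; a cluster setup would be queried through vmselect instead):

```python
import json
import requests  # assumes the requests package is installed

# Hypothetical single-node endpoint; adjust host/port for your deployment.
resp = requests.get(
    "http://localhost:8428/api/v1/query",
    params={"query": "up", "trace": "1"},
    timeout=30,
)
resp.raise_for_status()
payload = resp.json()
# The trace tree is expected as a top-level "trace" field alongside the result.
with open("trace.json", "w") as f:
    json.dump(payload["trace"], f, indent=2)
```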
Each node looks like this:
```json
{
  "duration_msec": 123.456,
  "message": "description of what happened",
  "children": [ ... ]
}
```
The tree is rooted at the component that served the query (vmselect in Cluster mode, the single binary in Single-node mode). It captures the full query execution pipeline: parsing, series search, data fetch from storage, rollup computation, aggregation, and response generation.
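To make the structure concrete, here is a minimal sketch of walking such a tree by hand (the bundled parse script's `tree` subcommand does this more thoroughly; the file name is an assumption):

```python
import json

def walk(node, depth=0, max_depth=3):
    """Print duration and message for each trace node, down to max_depth."""
    if depth > max_depth:
        return
    indent = "  " * depth
    print(f"{indent}{node['duration_msec']:.1f}ms  {node['message'][:100]}")
    for child in node.get("children", []):
        walk(child, depth + 1, max_depth)

with open("trace.json") as f:
    walk(json.load(f))
```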
Before manually reading the trace file, run the analysis script to extract structured data:
`python3 <skill_base_dir>/scripts/parse_trace.py <trace_file>`
This outputs: root info, trace tree (depth 3), key nodes with durations, per-vmstorage RPC breakdown, and computed totals (bytes, samples, series). Use this output as your primary data source for the report.
Additional subcommands for deeper investigation:
- `python3 <script> <trace> tree --depth N` — print the trace tree to depth N
- `python3 <script> <trace> nodes --pattern "fetch unique"` — find all nodes matching a substring

Only drill deeper if the summary output reveals ambiguities or missing information.
After running the summary, also check for relevant performance improvements in newer VictoriaMetrics versions:
`python3 <skill_base_dir>/scripts/check_changelog.py <version> <mode>`
Where <version> is the semver from the parse script output (e.g., v1.130.0) and <mode> is cluster or single-node. This fetches changelogs from GitHub and shows performance-relevant fixes/features in versions newer than what the trace was captured on. If the fetch fails, skip this section gracefully.
Read the trace JSON file the user provides (or use the script output from Step 0). The root node tells you the big picture. Extract:
- Endpoint: `/api/v1/query` (instant) or `/api/v1/query_range` (range)
- The expression in `query=`
- `start=`, `end=`, `step=` (for range queries)
- Result size: `series=` at the end
- Total `duration_msec`

Walk the top-level children and classify each into one of these phases. Not every trace has all of them — just report what's there.
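A minimal sketch of pulling these root-level figures out of the trace file (the file name is an assumption, and the exact message layout can vary by version; the bundled parse script reports the same data):

```python
import json
import re

with open("trace.json") as f:  # hypothetical file name
    root = json.load(f)

msg = root["message"]
print(f"total duration: {root['duration_msec']:.1f}ms")
print("type:", "range" if "/api/v1/query_range" in msg else "instant")

# series=N is reported at the end of the root message.
m = re.search(r"series=(\d+)", msg)
print("series:", m.group(1) if m else "n/a")

# The full message carries query=, start=, end= and step= for manual inspection.
print("root message:", msg)
```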
For large traces, focus on the top-level children first. Drill into subtrees only when they are relevant to the bottleneck or when durations are surprising.
A query trace typically has these phases, roughly in this order. Not all phases appear in every trace. Identify them by matching the message patterns described here.
**Expression evaluation** — nodes matching: `eval: query=..., timeRange=..., step=..., mayCache=...: series=N, points=N, pointsPerSeries=N`
These trace the recursive evaluation of the PromQL/MetricsQL expression tree.
Each `eval` node may have children for sub-expressions. Key numbers: `series`, `points`, and `pointsPerSeries`.
**Functions and aggregations** — nodes matching:
- `transform <func>(): series=N` — PromQL functions (histogram_quantile, clamp, etc.)
- `aggregate <func>(): series=N` — aggregation operators (sum, avg, max, etc.)
- `binary op "<op>": series=N` — binary operations

**Series search (index lookup)** — where label matchers get resolved to internal series IDs:
- `rpc at vmstorage <addr>` → `rpc call search_v7()`; in Single-node mode these appear directly, without RPC wrappers
- `init series search`
- `search TSIDs`
- `search N indexDBs in parallel` — parallel index database search
- `search indexDB` — individual index partition
- `found N metric ids for filter=...` — a metric ID is the unique time series identifier within vmstorage
- `found N TSIDs for N metricIDs` — a TSID serves the same role as a metric ID
- `sort N TSIDs`
- `search for metricIDs in tag filters cache` followed by `cache miss`, or a cache hit (no `cache miss` child)
- `put N metricIDs in cache` / `stored N metricIDs into cache`

**Data fetch** — getting raw data:
`fetch matching series: ...` wraps RPC calls to each vmstorage node:
- `rpc at vmstorage <addr>` — per-node RPC
- `sent N blocks to vmselect` — amount of raw data transmitted back
- `fetch unique series=N, blocks=N, samples=N, bytes=N` — aggregate summary across all vmstorage nodes
- `search for parts with data for N series` followed by data scan messages
The `bytes` value in `fetch unique series` tells you the total data transferred and is a good indicator of I/O load.

**Rollup computation** — computing `rate()`, `increase()`, `avg_over_time()`, etc.:
- `rollup <func>(): timeRange=..., step=N, window=N`
- `rollup <func>() with incremental aggregation <agg>() over N series` — this is an optimization
- `the rollup evaluation needs an estimated N bytes of RAM for N series and N points per series` — memory estimate
- `parallel process of fetched data: series=N, samples=N` — the actual computation over raw samples
- `series after aggregation with <func>(): N; samplesScanned=N` — post-aggregation result
This phase often dominates execution time for queries that scan large amounts of raw data.

**Response generation** — usually trivial:
- `sort series by metric name and labels`
- `generate /api/v1/query_range response for series=N, points=N`
Usually trivially fast. It can become a bottleneck only if the response is huge (hundreds of series with thousands of datapoints per series) and the client is slow to read it.

For each phase, note the `duration_msec`.
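To make the classification concrete, a minimal sketch that buckets trace nodes by the message patterns listed above and sums their durations (the substring-to-phase mapping is a simplification of the descriptions above, not an exhaustive rule set):

```python
import json
from collections import defaultdict

# Rough substring heuristics derived from the phase descriptions above.
PHASES = [
    ("series search (index)", ("init series search", "search TSIDs", "rpc call search")),
    ("data fetch", ("fetch matching series", "fetch unique series")),
    ("rollup computation", ("rollup ", "parallel process of fetched data")),
    ("aggregation / functions", ("aggregate ", "transform ", "binary op")),
    ("response generation", ("sort series", "generate /api/v1")),
    ("expression evaluation", ("eval:",)),
]

def classify(msg):
    for phase, needles in PHASES:
        if any(n in msg for n in needles):
            return phase
    return "other"

with open("trace.json") as f:  # hypothetical file name
    root = json.load(f)

# Bucket only the direct children of the root so durations are not double-counted.
totals = defaultdict(float)
for child in root.get("children", []):
    totals[classify(child["message"])] += child["duration_msec"]

for phase, ms in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{phase:26s} {ms:9.1f}ms  {100 * ms / root['duration_msec']:5.1f}%")
```

In practice most of the work nests under a top-level `eval` node, so the same bucketing usually has to be applied one or two levels deeper before the breakdown becomes meaningful.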
In Cluster traces, the same phases repeat for each vmstorage node — aggregate for the summary but also track per-node numbers to spot imbalances.
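As a rough illustration of spotting such an imbalance, a small sketch comparing per-node durations (the node names and figures are hypothetical; in a real report they come from the `rpc at vmstorage <addr>` subtrees or the parse script's per-vmstorage breakdown):

```python
# Hypothetical per-node figures extracted from the per-vmstorage RPC subtrees.
nodes = {
    "vmstorage-1": {"series": 1200, "bytes": 48_000_000, "duration_msec": 310.0},
    "vmstorage-2": {"series": 1150, "bytes": 46_500_000, "duration_msec": 1450.0},
}

avg_ms = sum(n["duration_msec"] for n in nodes.values()) / len(nodes)
name, slowest = max(nodes.items(), key=lambda kv: kv[1]["duration_msec"])

# The 2x factor is an arbitrary illustration, not an official threshold.
if slowest["duration_msec"] > 2 * avg_ms:
    print(f"{name} looks imbalanced: {slowest['duration_msec']:.0f}ms vs {avg_ms:.0f}ms average")
```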
Identify which phase consumed the most time and explain why in concrete terms. For instance, "The rollup scanned 212M raw samples" is useful; "the query was slow" is not.
Base recommendations only on what the trace actually shows. If the query is fast and healthy, say so — don't invent problems.
Follow the pattern selection rules at the end of this document to choose recommendations. Structure the report as follows:
## Query Overview
- **Query:** `<the PromQL/MetricsQL expression>`
- **Type:** instant / range query
- **Time range:** <start> to <end> (<duration>)
- **Step:** <step>
- **Result:** <N> series, <N> points
- **Version:** vmselect or VM single-node version
## Performance Summary
- **Total duration:** <N>ms
- **Duration score:** <Fast / Acceptable / Slow / Very Slow>
- **Matched series:** <N> (across all storage nodes)
- **Raw samples scanned:** <N>
- **Bytes transferred:** <N>
"Duration score" thresholds:
- Fast: < 500ms
- Acceptable: 500ms–5s
- Slow: 5s–10s
- Very Slow: > 10s
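These thresholds map directly onto a simple lookup, for example:

```python
def duration_score(total_ms: float) -> str:
    """Map total query duration to the report's qualitative score."""
    if total_ms < 500:
        return "Fast"
    if total_ms < 5_000:
        return "Acceptable"
    if total_ms < 10_000:
        return "Slow"
    return "Very Slow"
```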
## Execution Time Breakdown
| Phase | Duration | % of Total | Notes |
|-------|----------|------------|-------|
| Series search (index) | Xms | X% | |
| Data fetch | Xms | X% | |
| Rollup computation | Xms | X% | |
| Aggregation / functions | Xms | X% | |
| Response generation | Xms | X% | |
(Adapt the phases to what actually appears in the trace.
For cluster traces, break down data fetch per storage node.)
## Storage Node Breakdown (cluster only)
| Node | Series | Bytes sent | Duration |
|------|--------|-------------|----------|
| vmstorage-1 | N | N | Xms |
| vmstorage-2 | N | N | Xms |
## Bottleneck Analysis
Name the single biggest contributor to total query time. Explain why it's slow with specific numbers from the trace.
## Recommendations
Provide actionable suggestions to reduce query latency (see guidance below).
## Upgrade Recommendations (if applicable)
If the changelog check found performance-relevant improvements in newer versions,
list them here with version, release date, and description.
Only include this section if there are concrete relevant entries. Omit entirely otherwise.
- Use `duration_msec` values directly from the trace — don't estimate durations by subtraction.
- Find the children with the largest `duration_msec`, and drill into those.
- Subquery or `@` modifier evaluation — treat these like nested eval phases.
- `mayCache=false` in eval messages is informational, not a problem.

**CRITICAL: Pattern selection rules**
Base recommendations on what the trace actually shows, ordered by priority. Here are common patterns and the corresponding advice:
**High series cardinality** (many matched series)
**Large raw sample scan** (high `samplesScanned`)
Often caused by a long `[window]` or too aggressive subqueries. For `rate()`/`irate()`/`increase()`, suggest a shorter `[window]` if semantically acceptable.

**Slow index lookups** (series search dominates)
**Slow data fetch / high bytes transferred** (cluster)
**Slow data fetch / high bytes transferred** (Single-node)
**Rollup computation dominates** (often caused by scanning millions of raw samples)
Increase `step` to reduce points per series.

**Version upgrade opportunities**
If the check_changelog.py script found relevant performance improvements in newer versions, mention the upgrade as an additional recommendation in the "Upgrade Recommendations" report section. This is the ONE exception to the "single pattern only" rule — upgrade recommendations are supplementary and can be appended regardless of which bottleneck pattern was selected. Only include entries that are directly relevant to the observed bottleneck or the components involved in the trace.