Finds unused and rarely-queried metrics in VictoriaMetrics via the metric_names_stats API and suggests drop rules and relabel configs to reduce storage and ingestion waste.
npx claudepluginhub victoriametrics/skills --plugin diagnostics
Identify metrics that are being ingested but never (or rarely) queried, then recommend concrete actions to stop wasting storage and ingestion resources.
Uses the same env vars as the victoriametrics-query skill:
# $VM_METRICS_URL - base URL
# cluster: export VM_METRICS_URL="https://vmselect.example.com/select/0/prometheus"
# single: export VM_METRICS_URL="http://localhost:8428"
# $VM_AUTH_HEADER - auth header (empty if no auth is required)
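Before starting, a quick sanity check confirms the base URL and auth header work. A minimal sketch (any inexpensive query will do):
# Sanity check: a trivial scalar query confirms URL and auth are correct
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
  --data-urlencode 'query=1' \
  "$VM_METRICS_URL/api/v1/query" | jq -r '.status'
# Expected output: success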
VictoriaMetrics tracks how many times each metric name is fetched during queries, and when it was last fetched. This is done via the metric_names_stats API, available since v1.113.0. By comparing what's ingested against what's actually queried, you can find metrics that nobody uses and safely drop them.
The workflow has three phases:
Phase 1: Verify feature is enabled
Phase 2: Collect and analyze usage data
Phase 3: Report findings and suggest fixes
Before doing anything, check that the metric names stats tracker is returning data. The feature is enabled by default since v1.113.0, but may have been disabled or the instance may be too old.
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
"$VM_METRICS_URL/api/v1/status/metric_names_stats?limit=3" | jq .
If the response has an empty records array or returns an error, the feature is not active. Guide the user:
The metric names stats tracker is not returning data. This can happen if:
- VictoriaMetrics version is older than v1.113.0 — upgrade to v1.113.0+
- Stats were recently reset — the tracker needs time to accumulate query data after a restart or reset
- The tracker was disabled: check for the -storage.trackMetricNamesStats=false flag and remove it (the feature is enabled by default). For cluster mode, this flag must be set on vmstorage nodes. See: https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#track-ingested-metrics-usage
Stop here if the feature is not available. The rest of the workflow depends on it.
If records exist, note the statsCollectedSince timestamp — this tells you how long the tracker has been accumulating data. Longer collection periods give more reliable results. Mention the collection window to the user so they can judge confidence. One way to extract it is sketched below.
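This sketch assumes statsCollectedSince is a Unix-seconds field at the top level of the response, next to statsCollectedRecordsTotal and the records array used throughout this skill:
# Humanize the collection start time (jq's todate assumes Unix seconds)
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
  "$VM_METRICS_URL/api/v1/status/metric_names_stats?limit=1" \
  | jq -r '.statsCollectedSince | todate'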
Step 1: Find never-queried metrics

These are metrics with queryRequests: 0 — tracked but never fetched by any query, dashboard, alert rule, or recording rule.
Use a high limit to capture all of them — the default 1000 is often not enough:
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
"$VM_METRICS_URL/api/v1/status/metric_names_stats?le=0&limit=50000" | jq .
The le=0 parameter filters to metrics queried zero times. Save the full list of metric names from the response.
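One way to save them, assuming each record exposes a metricName field alongside the queryRequests and lastRequestTimestamp fields referenced elsewhere in this skill:
# Save never-queried metric names, one per line, for grouping in Step 2
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
  "$VM_METRICS_URL/api/v1/status/metric_names_stats?le=0&limit=50000" \
  | jq -r '.records[].metricName' | sort > never_queried.txt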
Step 2: Count active series for the never-queried names

This is the critical step. The metric_names_stats tracker records ALL metric names it has ever seen, including ones from decommissioned sources that no longer have active time series. Metrics with 0 active series will expire naturally with retention — they are not wasting ingestion resources and need no action.
The goal is to find metrics that ARE being actively ingested but NOT being queried.
Important: A prefix group like container_* or node_* typically contains a mix of queried and never-queried metric names. For example, container_cpu_usage_seconds_total may be heavily queried while container_blkio_device_usage_total is never queried — but both share the container_ prefix. You must count series only for the specific never-queried metric names, not the entire prefix.
Approach: Group the never-queried metric names from Step 1 by prefix, then build a regex of only the never-queried names within each group to count their series:
# WRONG — counts ALL container metrics including queried ones:
# count({__name__=~"container_.+"})
# CORRECT — counts only the specific never-queried metric names:
# Build a regex from the actual never-queried names in that prefix group
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
--data-urlencode 'query=count({__name__=~"container_blkio_device_usage_total|container_file_descriptors|container_last_seen|container_memory_failcnt|container_sockets|container_tasks_state"})' \
"$VM_METRICS_URL/api/v1/query" | jq '.data.result[0].value[1]'
For prefix groups where ALL metric names in the group are never-queried (e.g., all go_godebug_* metrics), you can use the prefix regex since there's nothing queried to accidentally include.
Work through the major prefix groups from the Step 1 results. For each group:
- count() returns null or 0 → these are historical-only metrics; note them but deprioritize
- count() returns a positive number → these are actively ingested waste; prioritize for the report

Do NOT spend time individually querying hundreds of historical metrics with 0 series. Focus your effort on the groups that have active series — those are the ones worth reporting.
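A shell sketch of this loop, assuming never_queried.txt from Step 1; the prefixes here are placeholders, so derive the real list from your Step 1 grouping:
# For each prefix group, count active series among ONLY the never-queried names
for prefix in container_ kube_ node_ go_; do
  # Metric names only contain [a-zA-Z0-9_:], so joining with | is regex-safe
  regex=$(grep "^${prefix}" never_queried.txt | paste -sd'|' -)
  [ -z "$regex" ] && continue
  count=$(curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
    --data-urlencode "query=count({__name__=~\"${regex}\"})" \
    "$VM_METRICS_URL/api/v1/query" | jq -r '.data.result[0].value[1] // 0')
  echo "${prefix}*: ${count} active series among never-queried names"
done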
Step 3: Find rarely-queried metrics

These are metrics queried only a handful of times — possibly from one-off exploration rather than active use.
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
"$VM_METRICS_URL/api/v1/status/metric_names_stats?le=5&limit=50000" | jq .
Subtract the never-queried set to isolate the "queried 1-5 times" group.
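A sketch using comm, reusing the sorted never_queried.txt from Step 1 (same metricName field assumption as above):
# Names queried 1-5 times = (le=5 set) minus (le=0 set); comm needs sorted input
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
  "$VM_METRICS_URL/api/v1/status/metric_names_stats?le=5&limit=50000" \
  | jq -r '.records[].metricName' | sort > le5.txt
comm -23 le5.txt never_queried.txt > rarely_queried.txt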
Step 4: Get totals for context

This gives context for what percentage of metrics are unused and how much impact dropping them would have.
# Total tracked metric names
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
"$VM_METRICS_URL/api/v1/status/metric_names_stats?limit=1" | jq '.statsCollectedRecordsTotal'
# Total active time series
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
--data-urlencode 'query=count({__name__=~".+"})' \
"$VM_METRICS_URL/api/v1/query" | jq '.data.result[0].value[1]'
You should already have these numbers from Step 2. If any prefix groups are missing counts, query them now — remembering to count only the specific never-queried metric names within each group (see the WRONG vs CORRECT pattern in Step 2).
Group the active unused metrics into categories to make the report actionable. Common categories:
- Exporter and infrastructure metrics (kube_*, container_*, node_* prefixes that aren't used in any dashboard or alert)
- Runtime self-monitoring metrics (go_*, process_*, promhttp_*)

Separately note the historical-only metrics (0 active series) in a brief section — these need no action but show the user what will naturally expire.
Present findings in this format:
## Unused Metrics Report
**Collection window**: [statsCollectedSince date] to now ([N] days)
**Total tracked metric names**: [statsCollectedRecordsTotal]
**Total active time series**: [count]
**Never queried (with active series)**: [count] metrics, ~[M] series ([percentage]% of active series)
**Never queried (historical only, 0 series)**: [count] metrics — will expire with retention, no action needed
**Rarely queried (1-5 times)**: [count] metrics ([percentage]%)
**Queried long time ago (30+ days)**: [count] metrics ([percentage]%)
### Never-Queried Active Metrics by Category
#### [Category Name] — [N] metrics, ~[M] active series
| Metric Pattern | Series Count | Recommendation |
|---|---|---|
| `prefix_*` | ~X,000 | Drop via relabel |
### Rarely-Queried Metrics
[Similar table, but note last query time from lastRequestTimestamp]
### Historical-Only Metrics (No Action Needed)
[Brief list of prefix groups with 0 active series — these are from decommissioned sources and will expire with retention]
### Estimated Impact
- Dropping all never-queried active metrics would eliminate ~[N] series ([X]% of total active)
- Top 3 categories alone account for ~[M] series
For each category of unused metrics, suggest one of these actions:
1. Drop at vmagent via metric_relabel_configs (preferred for metrics you're confident are unused):
# Add to vmagent scrape config or VMServiceScrape/VMPodScrape
metric_relabel_configs:
- source_labels: [__name__]
regex: "metric_prefix_.+"
action: drop
2. Drop at vmagent via global -relabelConfig (for broad patterns across all scrape targets):
# vmagent -relabelConfig=relabel.yaml
- source_labels: [__name__]
regex: "go_memstats_.+|process_virtual_memory_.+|promhttp_.+"
action: drop
3. Keep but investigate (for rarely-queried metrics — someone might still need them):
These metrics were queried a few times in the collection window. Before dropping, verify they aren't used in:
- Grafana dashboards (search dashboard JSON for the metric name)
- Alert rules (check VMRule / PrometheusRule CRDs)
- Recording rules
- External systems querying the API
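For the rules check, a sketch against the Prometheus-compatible rules API. Note this endpoint is usually served by vmalert rather than the query endpoint, so $VMALERT_URL below is a placeholder for wherever your rules live:
# List every rule expression, then grep for the metric name in question
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} "$VMALERT_URL/api/v1/rules" \
  | jq -r '.data.groups[].rules[].query' | grep -F 'container_tasks_state'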
Always communicate these caveats:
- Test first: roll out drop rules on a staging vmagent to verify nothing breaks.
- Low counts can mislead: some metrics show low queryRequests counts but are still essential. Cross-reference with api/v1/rules output.
- up or meta-metrics: Metrics like up, scrape_duration_seconds, scrape_samples_scraped are essential for monitoring health even if rarely queried directly.