From grafana-app-sdk
Write, validate, and optimise PromQL queries for Prometheus and Grafana Cloud Metrics. Covers rates, aggregations, histogram quantiles, recording rules, and query debugging.
Install: npx claudepluginhub grafana/skills --plugin grafana-core
Slash command: /grafana-app-sdk:promql
PromQL is a functional query language for time series data. Every query returns either an instant vector (one value per label set at a point in time), a range vector (a sliding window of samples), or a scalar.
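Golden rule: rate() and increase() always require a range vector, and they need at least two samples inside the window to return a value. Make the range at least 4x the scrape interval so that a missed scrape or two does not leave gaps. For a 60s scrape interval that means at least [4m]; [5m] is a convenient default.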
Rate (per-second average over a window):
rate(http_requests_total[5m])
Rate with label aggregation — "sum then rate" is wrong, always rate then sum:
# CORRECT: rate first, then aggregate
sum(rate(http_requests_total{job="api"}[5m])) by (status_code)
# WRONG: sum first destroys the counter monotonicity
sum(http_requests_total) by (status_code)  # do NOT then rate() this
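To see why the order matters, here is a minimal Python sketch of counter-reset handling. It is a simplified model, not Prometheus's actual implementation (which also extrapolates to the window edges), and the two instances and their sample values are made up for illustration.

```python
def increase(samples):
    """Counter increase over a window, treating any drop as a reset to zero."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # A decrease means the counter restarted, so the new value is all new increase.
        total += cur - prev if cur >= prev else cur
    return total

# Two hypothetical instances scraped at the same five timestamps.
instance_a = [100, 110, 120, 130, 140]   # steady, +10 per scrape
instance_b = [500, 510, 5, 15, 25]       # restarts after the second scrape

# Rate-then-sum: each series handles its own reset.
per_series = increase(instance_a) + increase(instance_b)

# Sum-then-rate: the summed series drops at the reset, and the whole
# post-reset sum (including instance_a's value) is counted as new increase.
summed = [a + b for a, b in zip(instance_a, instance_b)]
after_sum = increase(summed)

print(per_series)  # 75.0
print(after_sum)   # 185.0
```

The per-series result matches what the two instances really served; the summed version overcounts because reset detection only works on an individual counter.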
Increase (total count over a window, not per-second):
increase(http_requests_total[1h])
irate vs rate:
rate() - smooth average over the full window. Use for dashboards and alerts.
irate() - instantaneous rate from the last two samples. Use only when you need to capture spikes that rate() would average away. Never use for alerting.
# Exact match
http_requests_total{job="api", status_code="200"}
# Regex match (anchored automatically)
http_requests_total{status_code=~"5.."}
# Negative regex
http_requests_total{status_code!~"2.."}
# Multiple values with regex OR
http_requests_total{env=~"staging|production"}
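The "anchored automatically" behaviour can be illustrated in Python, where `re.fullmatch` plays the role of a PromQL `=~` matcher. This is only an analogy: Prometheus uses RE2 syntax, which differs from Python's `re` for advanced features, though not for the simple patterns shown here. The `matches` helper is a hypothetical name.

```python
import re

def matches(pattern, value):
    """Mimic a PromQL =~ matcher: the pattern must match the entire label value."""
    return re.fullmatch(pattern, value) is not None

print(matches("5..", "500"))     # True
print(matches("5..", "5000"))    # False: four characters, pattern allows three
print(matches("5..", "1500"))    # False: no implicit leading wildcard
print(matches("staging|production", "production"))      # True
print(matches("staging|production", "pre-production"))  # False
```

To match a substring, add the wildcards yourself, for example `.*production.*`.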
Always aggregate after rate():
# Sum across all instances, keep service label
sum(rate(http_requests_total[5m])) by (service)
# Average idle CPU fraction per instance, drop all other labels
# (node_cpu_seconds_total is a counter, so take rate() first)
avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)
# 95th percentile request duration
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)
# Top 5 services by request rate
topk(5, sum(rate(http_requests_total[5m])) by (service))
# Count of distinct values of the job label
count(count(up) by (job))
without vs by:
# Keep only the labels listed
sum(rate(http_requests_total[5m])) by (service, status_code)
# Drop only the labels listed, keep everything else
sum(rate(http_requests_total[5m])) without (instance, pod)
Native histograms (Prometheus 2.40+) and classic histograms use different syntax.
Classic histogram (bucket metrics with _bucket suffix):
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le)
)
Multi-service comparison:
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)
Common mistake: forgetting by (le) in the inner aggregation drops the bucket boundaries,
making histogram_quantile produce wrong results or NaN.
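The following Python sketch shows roughly how a quantile is estimated from classic histogram buckets, and why the bucket bounds (the `le` label) must survive aggregation. It is a simplified model that assumes the lowest bucket starts at 0 and that 0 < q <= 1; the function name and bucket values are hypothetical.

```python
import math

def bucket_quantile(q, buckets):
    """Estimate a quantile from cumulative (le, count) buckets by linear interpolation."""
    buckets = sorted(buckets)              # ascending by upper bound; +Inf sorts last
    total = buckets[-1][1]                 # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0    # assumption: first bucket starts at 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan

# Cumulative per-second rates for le="0.1", "0.5", "1", "+Inf" (made-up values).
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(bucket_quantile(0.95, buckets))   # roughly 0.78
```

Without the `le` values there is nothing to interpolate between, which is why dropping `le` in the inner `sum` breaks the result.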
Native histograms (simpler syntax):
histogram_quantile(0.95, sum(rate(http_request_duration_seconds[5m])))
# Error ratio (errors / total)
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# Success rate as percentage
(1 -
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
) * 100
# Avoid division by zero: the > 0 filter drops a zero denominator, so the result is empty instead of NaN or +Inf
sum(rate(errors_total[5m]))
/
(sum(rate(requests_total[5m])) > 0)
# Alert when a metric disappears (e.g. a job stops reporting)
absent(up{job="api"})
# Alert when a metric value hasn't changed (potential stale exporter).
# Note: up is normally constant, so apply this pattern to a metric that is expected to change.
changes(up{job="api"}[5m]) == 0
# Check if a metric has been present in the last window
count_over_time(up{job="api"}[5m]) > 0
# Compare current value to 1 hour ago
rate(http_requests_total[5m])
-
rate(http_requests_total[5m] offset 1h)
# Day-over-day comparison
rate(http_requests_total[5m])
/
rate(http_requests_total[5m] offset 1d)
# Predict value in 2 hours based on current trend (linear regression)
predict_linear(node_filesystem_avail_bytes[1h], 2 * 3600)
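As a rough illustration of what predict_linear computes, the sketch below fits a least-squares line through the samples and extrapolates it. It is a simplification: it treats the last sample's timestamp as "now", whereas Prometheus extrapolates from the query evaluation time. The sample data is invented.

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares line through (timestamp, value) samples, extrapolated past the last one."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var                      # units per second
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return intercept + slope * (last_t + seconds_ahead)

# One hour of samples, 60s apart: free space shrinking by 1 unit per second.
samples = [(t, 10_000 - t) for t in range(0, 3601, 60)]
predicted = predict_linear(samples, 2 * 3600)
print(predicted)   # close to -800: the volume would be full before the 2 hours are up
```

A negative prediction for available bytes is the signal that the resource is expected to run out within the horizon.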
Recording rules pre-compute expensive queries, improving dashboard load time and reducing Prometheus query load. Store them in a rules file loaded by Prometheus or Grafana Mimir.
groups:
  - name: http_request_rates
    interval: 1m
    rules:
      # Pre-compute per-service request rate
      - record: job:http_requests_total:rate5m
        expr: |
          sum(rate(http_requests_total[5m])) by (job)
      # Pre-compute error ratio per service
      - record: job:http_errors:ratio5m
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)
      # Pre-compute p95 latency per service (avoids expensive histogram_quantile on dashboards)
      - record: job:http_request_duration_p95:rate5m
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
          )
Naming convention: <aggregation_level>:<metric_name>:<operation_and_window>
# Availability SLO: fraction of successful requests over 30 days
1 - (
sum(increase(http_requests_total{status_code=~"5.."}[30d]))
/
sum(increase(http_requests_total[30d]))
)
# Error budget burn rate over a 1h window (alert when the result exceeds 14.4, i.e. 14.4x the allowed rate)
(
sum(rate(http_requests_total{status_code=~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
)
/
(1 - 0.999)  # replace 0.999 with your SLO target
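The arithmetic behind the burn rate and the 14.4 threshold can be sketched in Python. The threshold is commonly derived as "2% of a 30-day error budget consumed in 1 hour"; that derivation, the function names, and the example error ratio are assumptions added here for illustration.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' the error budget is being spent."""
    return error_ratio / (1 - slo_target)

def burn_rate_threshold(budget_fraction, window_hours, slo_period_hours=30 * 24):
    """Burn rate that consumes budget_fraction of the budget within window_hours."""
    return budget_fraction * slo_period_hours / window_hours

threshold = burn_rate_threshold(0.02, 1)   # 2% of the 30-day budget in 1 hour
current = burn_rate(0.0144, 0.999)         # 1.44% of requests failing, 99.9% SLO

print(threshold)   # approximately 14.4
print(current)     # approximately 14.4
```

A burn rate of 1 means the budget lasts exactly the SLO period; at 14.4 a 30-day budget is gone in about 50 hours.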
High cardinality label values (UUIDs, user IDs, URLs) make queries slow and storage expensive.
# Find the metrics with the most series (run in Grafana Explore; this touches every series, so it is expensive)
topk(10, count by (__name__)({__name__=~".+"}))
# Find series count for a specific metric
count(http_requests_total)
# Check label value cardinality
count(count by (user_id)(http_requests_total))
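Rules for keeping cardinality under control:
Normalize URL paths: /api/users/123 → /api/users/{id}
Use relabel rules (metric_relabel_configs in Prometheus) to drop labels before ingestion
# Drop a high-cardinality label before ingestion (Alloy shown; in Prometheus use a labeldrop rule under metric_relabel_configs)
prometheus.scrape "api" {
  targets    = [...]
  forward_to = [prometheus.relabel.drop_user_id.receiver]
}
prometheus.relabel "drop_user_id" {
  forward_to = [...]
  rule {
    action = "labeldrop"
    regex  = "user_id"   // labeldrop matches the regex against label names
  }
}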
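The path normalization rule above is usually applied in application code before the value is used as a label. A minimal Python sketch, assuming that purely numeric segments and canonical UUIDs are the only ID-like segments; the helper name and patterns are hypothetical and would need adapting to real routes.

```python
import re

# Assumption: numeric segments and canonical UUIDs count as IDs.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)

def normalize_path(path):
    """Replace ID-like path segments with a placeholder so the label stays low-cardinality."""
    segments = path.split("/")
    return "/".join("{id}" if _ID_SEGMENT.match(s) else s for s in segments)

print(normalize_path("/api/users/123"))          # /api/users/{id}
print(normalize_path("/api/users/123/orders"))   # /api/users/{id}/orders
print(normalize_path("/api/health"))             # /api/health
```

Where the web framework exposes the matched route template, using that template directly is typically more robust than pattern matching.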
Service availability (for use in alert rules):
avg_over_time(up{job="api"}[5m]) < 0.9
Saturation (resource near-full):
# Disk filling up (predict full in < 4h based on 1h trend)
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 4 * 3600) < 0
Throughput spike:
# Current rate > 3x the 1-hour average
rate(http_requests_total[5m])
>
3 * avg_over_time(rate(http_requests_total[5m])[1h:5m])