Writes, validates, and optimizes PromQL queries for Prometheus and Grafana Cloud Metrics. Use for metric queries, rate calculations, label aggregations, histogram quantiles, recording rules, and performance debugging.
PromQL is a functional query language for time series data. Every query returns either an instant vector (one value per label set at a point in time), a range vector (a sliding window of samples), or a scalar.
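A quick sketch of the three result types on the same metric:
# Instant vector: one current sample per matching series
http_requests_total{job="api"}
# Range vector: all samples from the last 5 minutes, per series
http_requests_total{job="api"}[5m]
# Scalar: a single number with no labels
scalar(sum(rate(http_requests_total{job="api"}[5m])))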
Golden rule: rate() and increase() always require a range vector. The range must be at
least 4x the scrape interval to avoid gaps. For a 60s scrape interval, use [5m] minimum.
Rate (per-second average over a window):
rate(http_requests_total[5m])
Rate with label aggregation ("sum then rate" is wrong; always rate, then sum):
# CORRECT: rate first, then aggregate
sum(rate(http_requests_total{job="api"}[5m])) by (status_code)
# WRONG: sum first destroys the counter monotonicity
sum(http_requests_total) by (status_code)  # do NOT then rate() this
Increase (total count over a window, not per-second):
increase(http_requests_total[1h])
irate vs rate:
rate() - smooth average over the full window. Use for dashboards and alerts.
irate() - instantaneous rate from the last two samples. Use only when you need to capture spikes that rate() would average away. Never use for alerting.
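Side by side on the same counter, only the function differs:
# Average slope over the whole 5m window (smooth)
rate(http_requests_total[5m])
# Slope of just the last two samples in the window (spiky)
irate(http_requests_total[5m])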
Label matchers:
# Exact match
http_requests_total{job="api", status_code="200"}
# Regex match (anchored automatically)
http_requests_total{status_code=~"5.."}
# Negative regex
http_requests_total{status_code!~"2.."}
# Multiple values with regex OR
http_requests_total{env=~"staging|production"}
Always aggregate after rate():
# Sum across all instances, keep service label
sum(rate(http_requests_total[5m])) by (service)
# Average idle CPU rate per node, drop all other labels
avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)
# 95th percentile request duration
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)
# Top 5 services by request rate
topk(5, sum(rate(http_requests_total[5m])) by (service))
# Count of distinct values of the job label
count(count(up) by (job))
without vs by:
# Keep only the labels listed
sum(rate(http_requests_total[5m])) by (service, status_code)
# Drop only the labels listed, keep everything else
sum(rate(http_requests_total[5m])) without (instance, pod)
Native histograms (Prometheus 2.40+) and classic histograms use different syntax.
Classic histogram (bucket metrics with _bucket suffix):
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le)
)
Multi-service comparison:
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)
Common mistake: forgetting by (le) in the inner aggregation drops the bucket boundaries,
making histogram_quantile produce wrong results or NaN.
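For contrast, the broken form looks deceptively similar:
# WRONG: le has been aggregated away, so the bucket boundaries are lost
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{job="api"}[5m]))
)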
Native histograms (simpler syntax):
histogram_quantile(0.95, sum(rate(http_request_duration_seconds[5m])))
# Error ratio (errors / total)
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# Success rate as percentage
(1 -
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
) * 100
# Avoid division by zero by filtering the denominator (returns no data instead of NaN)
sum(rate(errors_total[5m]))
/
(sum(rate(requests_total[5m])) > 0)
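An alternative is or vector(0), which substitutes a literal 0 when the expression returns nothing, so dashboards show 0 instead of no data:
# Report 0 when there are currently no error series
sum(rate(errors_total[5m])) or vector(0)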
# Alert when a metric disappears (e.g. a job stops reporting)
absent(up{job="api"})
# Alert when a metric value hasn't changed (potential stale exporter)
changes(up{job="api"}[5m]) == 0
# Check if a metric has been present in the last window
count_over_time(up{job="api"}[5m]) > 0
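absent_over_time() (available since Prometheus 2.16) is the windowed variant, less flappy than absent() when scrapes are intermittent:
# Fire only if the series has been missing for the entire 10m window
absent_over_time(up{job="api"}[10m])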
# Compare current value to 1 hour ago
rate(http_requests_total[5m])
-
rate(http_requests_total[5m] offset 1h)
# Day-over-day comparison
rate(http_requests_total[5m])
/
rate(http_requests_total[5m] offset 1d)
# Predict value in 2 hours based on current trend (linear regression)
predict_linear(node_filesystem_avail_bytes[1h], 2 * 3600)
Recording rules pre-compute expensive queries, improving dashboard load time and reducing Prometheus query load. Store them in a rules file loaded by Prometheus or Grafana Mimir.
groups:
  - name: http_request_rates
    interval: 1m
    rules:
      # Pre-compute per-job request rate
      - record: job:http_requests_total:rate5m
        expr: |
          sum(rate(http_requests_total[5m])) by (job)
      # Pre-compute error ratio per job
      - record: job:http_errors:ratio5m
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)
      # Pre-compute p95 latency per job (avoids expensive histogram_quantile on dashboards)
      - record: job:http_request_duration_p95:rate5m
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
          )
Naming convention: <aggregation_level>:<metric_name>:<operation_and_window>
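Dashboards and alerts then query the recorded series directly, which is what makes them cheap:
# Instant lookup of the pre-computed rate instead of re-running the aggregation
topk(5, job:http_requests_total:rate5m)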
# Availability SLO: fraction of successful requests over 30 days
1 - (
sum(increase(http_requests_total{status_code=~"5.."}[30d]))
/
sum(increase(http_requests_total[30d]))
)
# Error budget burn rate (1h window, alerting when burning > 14.4x the allowed rate)
(
sum(rate(http_requests_total{status_code=~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
)
/
(1 - 0.999)  # replace 0.999 with your SLO target
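To turn the burn rate into an alert condition, compare it to the threshold; a sustained 14.4x burn exhausts a 30-day budget in roughly two days, the standard fast-burn threshold in the multiwindow pattern:
# Fast-burn alert for a 99.9% SLO target
(
  sum(rate(http_requests_total{status_code=~"5.."}[1h]))
  /
  sum(rate(http_requests_total[1h]))
) / (1 - 0.999) > 14.4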
High cardinality label values (UUIDs, user IDs, URLs) make queries slow and storage expensive.
# Find metrics with the most label combinations (run in Grafana Explore)
topk(10, count by (__name__)({__name__=~".+"}))
# Find series count for a specific metric
count(http_requests_total)
# Check label value cardinality
count(count by (user_id)(http_requests_total))
Rules for controlling cardinality:
Normalize dynamic URL path segments into templates: /api/users/123 → /api/users/{id}
Use relabel_configs (or Alloy relabel rules) to drop labels before ingestion, as in the sketches below
# Drop a high-cardinality label before it is stored (Grafana Alloy; Prometheus equivalent below)
prometheus.relabel "drop_user_id" {
  forward_to = [...]

  rule {
    action = "labeldrop"
    regex  = "user_id"
  }
}
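For plain Prometheus, the equivalent lives under metric_relabel_configs in the scrape config; a minimal sketch, with job name and target as illustrative placeholders:
scrape_configs:
  - job_name: api
    static_configs:
      - targets: ["api.example.com:9090"]
    metric_relabel_configs:
      # labeldrop matches label names against regex and removes them
      - action: labeldrop
        regex: user_id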
Service availability (for use in alert rules):
avg_over_time(up{job="api"}[5m]) < 0.9
Saturation (resource near-full):
# Disk filling up (predict full in < 4h based on 1h trend)
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 4 * 3600) < 0
Throughput spike:
# Current rate > 3x the 1-hour average
rate(http_requests_total[5m])
>
3 * avg_over_time(rate(http_requests_total[5m])[1h:5m])