Writes, validates, and optimizes PromQL queries for Prometheus and Grafana Cloud Metrics. Use for metric queries, rate calculations, label aggregations, histogram quantiles, recording rules, and performance debugging.
PromQL is a functional query language for time series data. Every query returns either an instant vector (one value per label set at a point in time), a range vector (a sliding window of samples), or a scalar.
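A quick sketch of the three result types on the same metric:
# Instant vector: one current sample per matching series
http_requests_total{job="api"}
# Range vector: all samples from the last 5 minutes, per series
http_requests_total{job="api"}[5m]
# Scalar: a single number with no labels
scalar(sum(rate(http_requests_total{job="api"}[5m])))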
Golden rule: rate() and increase() always require a range vector. The range must be at
least 4x the scrape interval to avoid gaps. For a 60s scrape interval, use [5m] minimum.
Rate (per-second average over a window):
rate(http_requests_total[5m])
Rate with label aggregation ("sum then rate" is wrong; always rate, then sum):
# CORRECT: rate first, then aggregate
sum(rate(http_requests_total{job="api"}[5m])) by (status_code)
# WRONG: sum first destroys the counter monotonicity
sum(http_requests_total) by (status_code)  # do NOT then rate() this
Increase (total count over a window, not per-second):
increase(http_requests_total[1h])
irate vs rate:
rate() - smooth average over the full window. Use for dashboards and alerts.
irate() - instantaneous rate from the last two samples. Use only when you need to capture spikes that rate() would average away. Never use for alerting.
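Side by side on the same counter, only the function differs:
# Average slope over the whole 5m window (smooth)
rate(http_requests_total[5m])
# Slope of just the last two samples in the window (spiky)
irate(http_requests_total[5m])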
Label matchers:
# Exact match
http_requests_total{job="api", status_code="200"}
# Regex match (anchored automatically)
http_requests_total{status_code=~"5.."}
# Negative regex
http_requests_total{status_code!~"2.."}
# Multiple values with regex OR
http_requests_total{env=~"staging|production"}
Always aggregate after rate():
# Sum across all instances, keep service label
sum(rate(http_requests_total[5m])) by (service)
# Average idle CPU rate per node, drop all other labels
avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)
# 95th percentile request duration
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)
# Top 5 services by request rate
topk(5, sum(rate(http_requests_total[5m])) by (service))
# Count of distinct values of the job label
count(count(up) by (job))
without vs by:
# Keep only the labels listed
sum(rate(http_requests_total[5m])) by (service, status_code)
# Drop only the labels listed, keep everything else
sum(rate(http_requests_total[5m])) without (instance, pod)
Native histograms (Prometheus 2.40+) and classic histograms use different syntax.
Classic histogram (bucket metrics with _bucket suffix):
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le)
)
Multi-service comparison:
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)
Common mistake: forgetting by (le) in the inner aggregation drops the bucket boundaries,
making histogram_quantile produce wrong results or NaN.
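For contrast, the broken form looks deceptively similar:
# WRONG: le has been aggregated away, so the bucket boundaries are lost
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{job="api"}[5m]))
)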
Native histograms (simpler syntax):
histogram_quantile(0.95, sum(rate(http_request_duration_seconds[5m])))
# Error ratio (errors / total)
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# Success rate as percentage
(1 -
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
) * 100
# Avoid division by zero by filtering the denominator (returns no data instead of NaN)
sum(rate(errors_total[5m]))
/
(sum(rate(requests_total[5m])) > 0)
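An alternative is or vector(0), which substitutes a literal 0 when the expression returns nothing, so dashboards show 0 instead of no data:
# Report 0 when there are currently no error series
sum(rate(errors_total[5m])) or vector(0)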
# Alert when a metric disappears (e.g. a job stops reporting)
absent(up{job="api"})
# Alert when a metric value hasn't changed (potential stale exporter)
changes(up{job="api"}[5m]) == 0
# Check if a metric has been present in the last window
count_over_time(up{job="api"}[5m]) > 0
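absent_over_time() (available since Prometheus 2.16) is the windowed variant, less flappy than absent() when scrapes are intermittent:
# Fire only if the series has been missing for the entire 10m window
absent_over_time(up{job="api"}[10m])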
# Compare current value to 1 hour ago
rate(http_requests_total[5m])
-
rate(http_requests_total[5m] offset 1h)
# Day-over-day comparison
rate(http_requests_total[5m])
/
rate(http_requests_total[5m] offset 1d)
# Predict value in 2 hours based on current trend (linear regression)
predict_linear(node_filesystem_avail_bytes[1h], 2 * 3600)
Recording rules pre-compute expensive queries, improving dashboard load time and reducing Prometheus query load. Store them in a rules file loaded by Prometheus or Grafana Mimir.
groups:
  - name: http_request_rates
    interval: 1m
    rules:
      # Pre-compute per-job request rate
      - record: job:http_requests_total:rate5m
        expr: |
          sum(rate(http_requests_total[5m])) by (job)
      # Pre-compute error ratio per job
      - record: job:http_errors:ratio5m
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)
      # Pre-compute p95 latency per job (avoids expensive histogram_quantile on dashboards)
      - record: job:http_request_duration_p95:rate5m
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
          )
Naming convention: <aggregation_level>:<metric_name>:<operation_and_window>
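Dashboards and alerts then query the recorded series directly, which is what makes them cheap:
# Instant lookup of the pre-computed rate instead of re-running the aggregation
topk(5, job:http_requests_total:rate5m)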
# Availability SLO: fraction of successful requests over 30 days
1 - (
sum(increase(http_requests_total{status_code=~"5.."}[30d]))
/
sum(increase(http_requests_total[30d]))
)
# Error budget burn rate (1h window, alerting when burning > 14.4x the allowed rate)
(
sum(rate(http_requests_total{status_code=~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
)
/
(1 - 0.999)  # replace 0.999 with your SLO target
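To turn the burn rate into an alert condition, compare it to the threshold; a sustained 14.4x burn exhausts a 30-day budget in roughly two days, the standard fast-burn threshold in the multiwindow pattern:
# Fast-burn alert for a 99.9% SLO target
(
  sum(rate(http_requests_total{status_code=~"5.."}[1h]))
  /
  sum(rate(http_requests_total[1h]))
) / (1 - 0.999) > 14.4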
High cardinality label values (UUIDs, user IDs, URLs) make queries slow and storage expensive.
# Find metrics with the most label combinations (run in Grafana Explore)
topk(10, count by (__name__)({__name__=~".+"}))
# Find series count for a specific metric
count(http_requests_total)
# Check label value cardinality
count(count by (user_id)(http_requests_total))
Rules for controlling cardinality:
Normalize dynamic URL path segments into templates: /api/users/123 → /api/users/{id}
Use relabel_configs (or Alloy relabel rules) to drop labels before ingestion, as in the sketches below
# Drop a high-cardinality label before it is stored (Grafana Alloy; Prometheus equivalent below)
prometheus.relabel "drop_user_id" {
  forward_to = [...]

  rule {
    action = "labeldrop"
    regex  = "user_id"
  }
}
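For plain Prometheus, the equivalent lives under metric_relabel_configs in the scrape config; a minimal sketch, with job name and target as illustrative placeholders:
scrape_configs:
  - job_name: api
    static_configs:
      - targets: ["api.example.com:9090"]
    metric_relabel_configs:
      # labeldrop matches label names against regex and removes them
      - action: labeldrop
        regex: user_id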
Service availability (for use in alert rules):
avg_over_time(up{job="api"}[5m]) < 0.9
Saturation (resource near-full):
# Disk filling up (predict full in < 4h based on 1h trend)
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 4 * 3600) < 0
Throughput spike:
# Current rate > 3x the 1-hour average
rate(http_requests_total[5m])
>
3 * avg_over_time(rate(http_requests_total[5m])[1h:5m])