Prometheus instrumentation discipline: right metric type, right name, right labels. Invoke whenever task involves any interaction with Prometheus metrics — instrumenting application code, writing PromQL queries, defining alerting or recording rules, choosing metric types, managing label cardinality, building exporters, or reviewing monitoring configuration.
Guides Prometheus instrumentation, metric design, and query writing following best practices for type selection, naming, and labeling.
Choose the right metric type, name it clearly, label it sparingly. Prometheus is a pull-based monitoring system built on a dimensional data model — every metric is a time series identified by a name and key-value label pairs. Getting this right at instrumentation time prevents expensive rework later.
| Topic | Reference | Contents |
|---|---|---|
| Metric types | references/metric-types.md | Extended type comparison, histogram bucket tuning, summary configuration |
| Naming | references/naming.md | Full naming examples, base units table, character rules, label best practices |
| Instrumentation | references/instrumentation.md | Code patterns per system type, library instrumentation, performance tuning |
| PromQL | references/promql.md | Full operator catalog, vector matching, over-time aggregation, operator precedence |
| Alerting and rules | references/alerting-and-rules.md | Alert design, recording rule naming, aggregation patterns, anti-patterns |
| Exporters | references/exporters.md | Exporter architecture, collectors, help strings, push-based sources |
Choose correctly at instrumentation time — changing later requires migration of dashboards, alerts, and recording rules.
| Question | Answer | Type |
|---|---|---|
| Can the value decrease? | No | Counter |
| Is it a snapshot of current state? | Yes | Gauge |
| Observing a distribution needing cross-instance aggregation? | Yes | Histogram |
| Need accurate quantiles from a single instance, known at instrumentation time? | Yes | Summary |
| None of the above | — | Gauge |
**Counter** — monotonically increasing value; resets to zero only on restart.
- Methods: `inc()`, `inc(v)` where `v >= 0`
- Suffix `_total`: `http_requests_total`
- Use `rate()` or `increase()` in queries — raw values are meaningless

**Gauge** — value that goes up and down arbitrarily.
- Methods: `inc()`, `dec()`, `set(v)`, `set_to_current_time()`
- No `_total` suffix
- Never apply `rate()` to a gauge — use `deriv()` or `delta()`
- Timestamps: `myapp_last_success_timestamp_seconds`; compute elapsed time with `time() - metric` in PromQL
- Info metrics: `myapp_build_info{version="1.2.3", commit="abc"} 1` — metadata as labels with a constant value of 1
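As a sketch of counter and gauge instrumentation with the Python `prometheus_client` library (metric names, label names, and values here are illustrative, not prescribed by this skill):

```python
from prometheus_client import Counter, Gauge, CollectorRegistry

registry = CollectorRegistry()

# Counter: monotonically increasing, _total suffix, queried with rate()
http_requests_total = Counter(
    "http_requests_total", "Total HTTP requests served.",
    ["method", "status"], registry=registry,
)

# Gauges: snapshots of current state, may go up or down
queue_depth = Gauge(
    "myapp_queue_depth", "Items currently waiting in the work queue.",
    registry=registry,
)
last_success = Gauge(
    "myapp_last_success_timestamp_seconds",
    "Unix time of the last successful run.", registry=registry,
)

http_requests_total.labels(method="GET", status="200").inc()
queue_depth.set(7)
last_success.set_to_current_time()
```

Note the timestamp gauge stores an absolute Unix time; elapsed time is computed at query time with `time() - myapp_last_success_timestamp_seconds`.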
**Histogram** — samples observations into configurable buckets. Produces `_bucket{le="..."}`, `_sum`, `_count`.
- Method: `observe(v)`
- Buckets are cumulative: `le="0.5"` includes all observations <= 0.5
- Always includes a `+Inf` bucket (equal to `_count`)
- Use `histogram_quantile()` in PromQL to calculate percentiles

**Summary** — calculates streaming quantiles on the client side. Produces `{quantile="..."}`, `_sum`, `_count`.
- Quantiles cannot be aggregated across instances — `avg(x{quantile="0.95"})` is statistically invalid
- `_sum` and `_count` without quantiles is a valid and useful configuration
- Use summary over histogram only when all of these are true: you need accurate quantiles from a single instance, the quantiles are known at instrumentation time, and cross-instance aggregation is not required.
Default choice: histogram. See references/metric-types.md for detailed comparison.
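To make the cumulative-bucket semantics concrete, here is a minimal pure-Python model of what a histogram records on each `observe(v)` — a sketch of the bucket math only, not a real client library:

```python
import math

BUCKETS = [0.1, 0.5, 1.0, 5.0, math.inf]  # the +Inf bucket is always present

def observe(state, v):
    """Record one observation the way a Prometheus histogram does."""
    state["sum"] += v
    state["count"] += 1
    for le in BUCKETS:
        if v <= le:
            state["bucket"][le] += 1  # cumulative: increments every bucket with le >= v

state = {"sum": 0.0, "count": 0, "bucket": {le: 0 for le in BUCKETS}}
for v in (0.05, 0.3, 0.3, 2.0):
    observe(state, v)

# le="0.5" counts all observations <= 0.5 (here: 0.05, 0.3, 0.3),
# and the +Inf bucket always equals _count.
```

This is why `histogram_quantile()` can merge histograms from many instances: summing per-bucket counts preserves the cumulative structure.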
Format: `<namespace>_<subsystem>_<name>_<unit>_<suffix>`. Not all parts required — minimum is namespace + meaningful name + unit/suffix.
- snake_case — lowercase with underscores, matching `[a-zA-Z_:][a-zA-Z0-9_:]*`
- Colons (`:`) are reserved for recording rules — never use in direct instrumentation
- Double underscore (`__`) is reserved for Prometheus internals
- Unit comes before the suffix: `http_request_duration_seconds_total`; name a counter-with-unit as `_<unit>_total` (e.g., `process_cpu_seconds_total`)
- Info metrics end in `_info`, timestamps in `_timestamp_seconds`
- `sum()` or `avg()` across all dimensions should be meaningful. If nonsensical, split into separate metrics.

See references/naming.md for base units table, component ordering, and full examples.
Use labels for dimensions you will filter or aggregate by:
http_requests_total{method="GET", status="200"} — not separate metrics per status.
Do not put label names in metric names.
Every unique label combination is a new time series. Each costs RAM, CPU, disk, and network. Cardinality math: total series = metric cardinality x number of targets.
| Metric cardinality | Guidance |
|---|---|
| < 10 | Safe for most metrics |
| 10-100 | Acceptable, monitor growth |
| 100-1000 | Investigate alternatives |
| > 1000 | Move analysis out of Prometheus |
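A worked example of the cardinality math (the numbers are illustrative):

```python
# http_requests_total{method, status} scraped from many targets
methods = 5        # GET, POST, PUT, DELETE, PATCH
statuses = 10      # distinct status codes actually emitted
targets = 100      # instances being scraped

metric_cardinality = methods * statuses       # series per target
total_series = metric_cardinality * targets   # series Prometheus must store

# 50 label combinations per target becomes 5,000 stored series overall;
# one unbounded label (user ID, raw request path) multiplies this without limit.
```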
Unused dimensions can be aggregated away at query time with `sum()`, but every series still costs storage.

**Online-serving systems** — key metrics: request rate (`_total`), error rate, latency (histogram), in-progress requests (gauge).

**Offline processing pipelines** — key metrics per stage: items in (`_total`), items out (`_total`), in progress (gauge), last processed timestamp (gauge), processing duration (histogram).

**Batch jobs** — key metrics (push to Pushgateway): last success timestamp (gauge), last completion timestamp (gauge), duration (gauge — single run, not a distribution).
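A sketch of batch-job instrumentation with the Python `prometheus_client`; the job name, metric names, and Pushgateway address are assumptions for illustration:

```python
import time
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
last_success = Gauge(
    "myjob_last_success_timestamp_seconds",
    "Unix time the job last finished successfully.", registry=registry,
)
duration = Gauge(
    "myjob_duration_seconds",
    "Duration of the last run (a gauge: one run, not a distribution).",
    registry=registry,
)

start = time.time()
run_ok = True  # placeholder for the job's real work
duration.set(time.time() - start)
if run_ok:
    last_success.set_to_current_time()

# Push once at the end of the run (hypothetical gateway address):
# push_to_gateway("pushgateway.example:9091", job="myjob", registry=registry)
```

The "job has not succeeded recently enough" alert in the table below then compares `time()` against the pushed success timestamp.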
**Libraries** — instrument transparently so users get metrics without configuration. Minimum for external resource access: request count (counter), error count (counter), latency (histogram).

**Logging** — `log_messages_total{level="..."}` counter per log level.

**Caches** — `cache_requests_total{result="hit|miss"}`, evictions (counter), size (gauge), lookup latency (histogram). Also instrument the downstream system.

See references/instrumentation.md for threadpool patterns, custom collectors, and performance tuning in hot paths.
- `rate(counter[5m])` — per-second rate. Use for alerts and dashboards.
- `increase(counter[5m])` — total increase over the range. Sugar for `rate() * range_seconds`.
- `irate(counter[5m])` — instant rate from the last two samples. Only for graphing volatile counters.
- Apply `rate()` first, then aggregate: `sum(rate(x[5m]))`, never `rate(sum(x)[5m])`. Rate must see individual counter resets.
- Never `rate()` a gauge — use `deriv()` or `delta()`.
- `histogram_quantile(0.95, rate(metric_bucket[5m]))` — percentile from a single histogram.
- To aggregate across instances, keep `le` in the `by` clause: `histogram_quantile(0.95, sum by (job, le) (rate(metric_bucket[5m])))`
- Average value: `rate(metric_sum[5m]) / rate(metric_count[5m])`
- `rate()` needs at least 2 samples. The range should be at least 4x the scrape interval. With a 15s scrape, use `rate(x[5m])` or wider.

See references/promql.md for data types, selectors, aggregation operators, vector matching, over-time aggregation, binary operators, and operator precedence.
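Putting these rules together, some example queries (the metric names are placeholders):

```promql
# Per-job request rate: rate() first, then aggregate
sum by (job) (rate(http_requests_total[5m]))

# Error ratio across all instances
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# p95 latency aggregated across instances (le must survive the aggregation)
histogram_quantile(0.95, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
```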
Alert on symptoms (user-visible impact), not causes. Use dashboards to pinpoint causes after an alert fires.
| System Type | Alert On |
|---|---|
| Online-serving | High latency, high error rate (user-facing, high in the stack) |
| Offline processing | Data taking too long to get through the system |
| Batch jobs | Job has not succeeded recently enough (>= 2x normal cycle) |
| Capacity | Approaching resource limits that will cause outage without intervention |
Only page on latency at one point in the stack — if overall user latency is fine, don't page on a slow sub-component. Avoid noisy alerts — if an alert fires and there's nothing to do, remove it.
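A sketch of a symptom-based alerting rule; the threshold, metric name, and labels are illustrative:

```yaml
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        # Symptom: user-visible error ratio, not an internal cause
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Error ratio above 5% for 10 minutes"
```

The `for` clause keeps a transient spike from paging; the expression rates before aggregating, per the PromQL rules above.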
See references/alerting-and-rules.md for alert design, naming conventions, and
recording rule details.
Pre-compute frequently used or expensive expressions. Format: `level:metric:operations`.
- Use `without` for aggregation — it preserves all labels except those being removed. Prefer it over `by`.

See references/alerting-and-rules.md for full naming convention and recording rule anti-patterns.
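A sketch of a recording rule following the `level:metric:operations` convention (the names are illustrative):

```yaml
groups:
  - name: http-aggregations
    rules:
      # level = job, metric = http_requests, operations = rate5m
      - record: job:http_requests:rate5m
        expr: sum without (instance) (rate(http_requests_total[5m]))
```

`without (instance)` drops only the per-instance dimension while preserving every other label, which is why it is preferred over `by`.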
Write an exporter when the target system does not expose Prometheus metrics natively. For your own code, use a client library directly.
- Prefix metric names with the target system: `haproxy_up`, `mysql_global_status_threads_connected`
- Export raw counters and let `rate()` handle the rest.

See references/exporters.md for architecture, collectors, help strings, label rules, and push-based sources.
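A minimal custom-collector sketch in Python (`prometheus_client`), fetching stats from the target system at scrape time; the stats source here is a stand-in, not a real HAProxy client:

```python
from prometheus_client import CollectorRegistry
from prometheus_client.core import CounterMetricFamily, GaugeMetricFamily

def fetch_haproxy_stats():
    """Stand-in for querying the target system at scrape time."""
    return {"up": 1, "connections_total": 12345}

class HAProxyCollector:
    def collect(self):
        stats = fetch_haproxy_stats()
        yield GaugeMetricFamily(
            "haproxy_up", "Whether the HAProxy scrape succeeded.",
            value=stats["up"],
        )
        # Export the raw counter; consumers apply rate() themselves.
        conns = CounterMetricFamily(
            "haproxy_connections_total", "Total connections handled.",
        )
        conns.add_metric([], stats["connections_total"])
        yield conns

registry = CollectorRegistry()
registry.register(HAProxyCollector())
```

Because `collect()` runs on every scrape, the exporter stays stateless and the `_up` metric reflects whether the most recent fetch succeeded.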
When writing Prometheus instrumentation:
- Pick the metric type with the decision table above; default to histogram for distributions.
- Follow the naming format and keep label cardinality within the guidance table.

When writing PromQL queries:
- Apply `rate()` or `increase()` to counters before further operations.
- Prefer `without` over `by` for aggregation.

When writing alerting or recording rules:
- Alert on symptoms, not causes; follow `level:metric:operations` naming for recording rules.

When reviewing Prometheus code:
- Check type selection, naming, labels, and query patterns against the rules above.

The coding skill governs workflow; this skill governs Prometheus implementation choices.