Prometheus instrumentation discipline: right metric type, right name, right labels. Invoke whenever task involves any interaction with Prometheus metrics — instrumenting application code, writing PromQL queries, defining alerting or recording rules, choosing metric types, managing label cardinality, building exporters, or reviewing monitoring configuration.
Guides Prometheus instrumentation, metric design, and query writing following best practices for type selection, naming, and labeling.
Choose the right metric type, name it clearly, label it sparingly. Prometheus is a pull-based monitoring system built on a dimensional data model — every metric is a time series identified by a name and key-value label pairs. Getting this right at instrumentation time prevents expensive rework later.
| Topic | Reference | Contents |
|---|---|---|
| Metric types | references/metric-types.md | Extended type comparison, histogram bucket tuning, summary configuration |
| Naming | references/naming.md | Full naming examples, base units table, character rules, label best practices |
| Instrumentation | references/instrumentation.md | Code patterns per system type, library instrumentation, performance tuning |
| PromQL | references/promql.md | Full operator catalog, vector matching, over-time aggregation, operator precedence |
| Alerting and rules | references/alerting-and-rules.md | Alert design, recording rule naming, aggregation patterns, anti-patterns |
| Exporters | references/exporters.md | Exporter architecture, collectors, help strings, push-based sources |
Choose correctly at instrumentation time — changing later requires migration of dashboards, alerts, and recording rules.
| Question | Answer | Type |
|---|---|---|
| Can the value decrease? | No | Counter |
| Is it a snapshot of current state? | Yes | Gauge |
| Observing a distribution needing cross-instance aggregation? | Yes | Histogram |
| Need accurate quantiles from a single instance, known at instrumentation time? | Yes | Summary |
| None of the above | — | Gauge |
**Counter** — monotonically increasing value; resets to zero only on restart.
- Methods: `inc()`, `inc(v)` where `v >= 0`
- Suffix `_total`: `http_requests_total`
- Use `rate()` or `increase()` in queries — raw values are meaningless

**Gauge** — value that goes up and down arbitrarily.
- Methods: `inc()`, `dec()`, `set(v)`, `set_to_current_time()`
- No `_total` suffix
- Never apply `rate()` to a gauge — use `deriv()` or `delta()`
- Timestamps: `myapp_last_success_timestamp_seconds`; compute elapsed time with `time() - metric` in PromQL
- Info metrics: `myapp_build_info{version="1.2.3", commit="abc"} 1` — metadata as labels with a constant value of 1
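As a sketch of counter and gauge instrumentation with the Python `prometheus_client` library (metric names, label names, and values here are illustrative, not prescribed by this skill):

```python
from prometheus_client import Counter, Gauge, CollectorRegistry

registry = CollectorRegistry()

# Counter: monotonically increasing, _total suffix, queried with rate()
http_requests_total = Counter(
    "http_requests_total", "Total HTTP requests served.",
    ["method", "status"], registry=registry,
)

# Gauges: snapshots of current state, may go up or down
queue_depth = Gauge(
    "myapp_queue_depth", "Items currently waiting in the work queue.",
    registry=registry,
)
last_success = Gauge(
    "myapp_last_success_timestamp_seconds",
    "Unix time of the last successful run.", registry=registry,
)

http_requests_total.labels(method="GET", status="200").inc()
queue_depth.set(7)
last_success.set_to_current_time()
```

Note the timestamp gauge stores an absolute Unix time; elapsed time is computed at query time with `time() - myapp_last_success_timestamp_seconds`.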
**Histogram** — samples observations into configurable buckets. Produces `_bucket{le="..."}`, `_sum`, `_count`.
- Method: `observe(v)`
- Buckets are cumulative: `le="0.5"` includes all observations <= 0.5
- Always includes a `+Inf` bucket (equal to `_count`)
- Use `histogram_quantile()` in PromQL to calculate percentiles

**Summary** — calculates streaming quantiles on the client side. Produces `{quantile="..."}`, `_sum`, `_count`.
- Quantiles cannot be aggregated across instances — `avg(x{quantile="0.95"})` is statistically invalid
- `_sum` and `_count` without quantiles is a valid and useful configuration
- Use summary over histogram only when all of these are true: you need accurate quantiles from a single instance, the quantiles are known at instrumentation time, and cross-instance aggregation is not required.
Default choice: histogram. See references/metric-types.md for detailed comparison.
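To make the cumulative-bucket semantics concrete, here is a minimal pure-Python model of what a histogram records on each `observe(v)` — a sketch of the bucket math only, not a real client library:

```python
import math

BUCKETS = [0.1, 0.5, 1.0, 5.0, math.inf]  # the +Inf bucket is always present

def observe(state, v):
    """Record one observation the way a Prometheus histogram does."""
    state["sum"] += v
    state["count"] += 1
    for le in BUCKETS:
        if v <= le:
            state["bucket"][le] += 1  # cumulative: increments every bucket with le >= v

state = {"sum": 0.0, "count": 0, "bucket": {le: 0 for le in BUCKETS}}
for v in (0.05, 0.3, 0.3, 2.0):
    observe(state, v)

# le="0.5" counts all observations <= 0.5 (here: 0.05, 0.3, 0.3),
# and the +Inf bucket always equals _count.
```

This is why `histogram_quantile()` can merge histograms from many instances: summing per-bucket counts preserves the cumulative structure.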
Format: `<namespace>_<subsystem>_<name>_<unit>_<suffix>`. Not all parts required — minimum is namespace + meaningful name + unit/suffix.
- snake_case — lowercase with underscores, matching `[a-zA-Z_:][a-zA-Z0-9_:]*`
- Colons (`:`) are reserved for recording rules — never use in direct instrumentation
- Double underscore (`__`) is reserved for Prometheus internals
- Unit comes before the suffix: `http_request_duration_seconds_total`; name a counter-with-unit as `_<unit>_total` (e.g., `process_cpu_seconds_total`)
- Info metrics end in `_info`, timestamps in `_timestamp_seconds`
- `sum()` or `avg()` across all dimensions should be meaningful. If nonsensical, split into separate metrics.

See references/naming.md for base units table, component ordering, and full examples.
Use labels for dimensions you will filter or aggregate by:
http_requests_total{method="GET", status="200"} — not separate metrics per status.
Do not put label names in metric names.
Every unique label combination is a new time series. Each costs RAM, CPU, disk, and network. Cardinality math: total series = metric cardinality x number of targets.
| Metric cardinality | Guidance |
|---|---|
| < 10 | Safe for most metrics |
| 10-100 | Acceptable, monitor growth |
| 100-1000 | Investigate alternatives |
| > 1000 | Move analysis out of Prometheus |
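A worked example of the cardinality math (the numbers are illustrative):

```python
# http_requests_total{method, status} scraped from many targets
methods = 5        # GET, POST, PUT, DELETE, PATCH
statuses = 10      # distinct status codes actually emitted
targets = 100      # instances being scraped

metric_cardinality = methods * statuses       # series per target
total_series = metric_cardinality * targets   # series Prometheus must store

# 50 label combinations per target becomes 5,000 stored series overall;
# one unbounded label (user ID, raw request path) multiplies this without limit.
```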
Unused dimensions can be aggregated away at query time with `sum()`, but every series still costs storage.

**Online-serving systems** — key metrics: request rate (`_total`), error rate, latency (histogram), in-progress requests (gauge).

**Offline processing pipelines** — key metrics per stage: items in (`_total`), items out (`_total`), in progress (gauge), last processed timestamp (gauge), processing duration (histogram).

**Batch jobs** — key metrics (push to Pushgateway): last success timestamp (gauge), last completion timestamp (gauge), duration (gauge — single run, not a distribution).
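A sketch of batch-job instrumentation with the Python `prometheus_client`; the job name, metric names, and Pushgateway address are assumptions for illustration:

```python
import time
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
last_success = Gauge(
    "myjob_last_success_timestamp_seconds",
    "Unix time the job last finished successfully.", registry=registry,
)
duration = Gauge(
    "myjob_duration_seconds",
    "Duration of the last run (a gauge: one run, not a distribution).",
    registry=registry,
)

start = time.time()
run_ok = True  # placeholder for the job's real work
duration.set(time.time() - start)
if run_ok:
    last_success.set_to_current_time()

# Push once at the end of the run (hypothetical gateway address):
# push_to_gateway("pushgateway.example:9091", job="myjob", registry=registry)
```

The "job has not succeeded recently enough" alert in the table below then compares `time()` against the pushed success timestamp.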
**Libraries** — instrument transparently so users get metrics without configuration. Minimum for external resource access: request count (counter), error count (counter), latency (histogram).

**Logging** — `log_messages_total{level="..."}` counter per log level.

**Caches** — `cache_requests_total{result="hit|miss"}`, evictions (counter), size (gauge), lookup latency (histogram). Also instrument the downstream system.

See references/instrumentation.md for threadpool patterns, custom collectors, and performance tuning in hot paths.
- `rate(counter[5m])` — per-second rate. Use for alerts and dashboards.
- `increase(counter[5m])` — total increase over the range. Sugar for `rate() * range_seconds`.
- `irate(counter[5m])` — instant rate from the last two samples. Only for graphing volatile counters.
- Apply `rate()` first, then aggregate: `sum(rate(x[5m]))`, never `rate(sum(x)[5m])`. Rate must see individual counter resets.
- Never `rate()` a gauge — use `deriv()` or `delta()`.
- `histogram_quantile(0.95, rate(metric_bucket[5m]))` — percentile from a single histogram.
- To aggregate across instances, keep `le` in the `by` clause: `histogram_quantile(0.95, sum by (job, le) (rate(metric_bucket[5m])))`
- Average value: `rate(metric_sum[5m]) / rate(metric_count[5m])`
- `rate()` needs at least 2 samples. The range should be at least 4x the scrape interval. With a 15s scrape, use `rate(x[5m])` or wider.

See references/promql.md for data types, selectors, aggregation operators, vector matching, over-time aggregation, binary operators, and operator precedence.
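Putting these rules together, some example queries (the metric names are placeholders):

```promql
# Per-job request rate: rate() first, then aggregate
sum by (job) (rate(http_requests_total[5m]))

# Error ratio across all instances
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# p95 latency aggregated across instances (le must survive the aggregation)
histogram_quantile(0.95, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
```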
Alert on symptoms (user-visible impact), not causes. Use dashboards to pinpoint causes after an alert fires.
| System Type | Alert On |
|---|---|
| Online-serving | High latency, high error rate (user-facing, high in the stack) |
| Offline processing | Data taking too long to get through the system |
| Batch jobs | Job has not succeeded recently enough (>= 2x normal cycle) |
| Capacity | Approaching resource limits that will cause outage without intervention |
Only page on latency at one point in the stack — if overall user latency is fine, don't page on a slow sub-component. Avoid noisy alerts — if an alert fires and there's nothing to do, remove it.
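A sketch of a symptom-based alerting rule; the threshold, metric name, and labels are illustrative:

```yaml
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        # Symptom: user-visible error ratio, not an internal cause
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Error ratio above 5% for 10 minutes"
```

The `for` clause keeps a transient spike from paging; the expression rates before aggregating, per the PromQL rules above.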
See references/alerting-and-rules.md for alert design, naming conventions, and
recording rule details.
Pre-compute frequently used or expensive expressions. Format: `level:metric:operations`.
- Use `without` for aggregation — it preserves all labels except those being removed. Prefer it over `by`.

See references/alerting-and-rules.md for full naming convention and recording rule anti-patterns.
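A sketch of a recording rule following the `level:metric:operations` convention (the names are illustrative):

```yaml
groups:
  - name: http-aggregations
    rules:
      # level = job, metric = http_requests, operations = rate5m
      - record: job:http_requests:rate5m
        expr: sum without (instance) (rate(http_requests_total[5m]))
```

`without (instance)` drops only the per-instance dimension while preserving every other label, which is why it is preferred over `by`.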
Write an exporter when the target system does not expose Prometheus metrics natively. For your own code, use a client library directly.
- Prefix metric names with the target system: `haproxy_up`, `mysql_global_status_threads_connected`
- Export raw counters and let `rate()` handle the rest.

See references/exporters.md for architecture, collectors, help strings, label rules, and push-based sources.
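A minimal custom-collector sketch in Python (`prometheus_client`), fetching stats from the target system at scrape time; the stats source here is a stand-in, not a real HAProxy client:

```python
from prometheus_client import CollectorRegistry
from prometheus_client.core import CounterMetricFamily, GaugeMetricFamily

def fetch_haproxy_stats():
    """Stand-in for querying the target system at scrape time."""
    return {"up": 1, "connections_total": 12345}

class HAProxyCollector:
    def collect(self):
        stats = fetch_haproxy_stats()
        yield GaugeMetricFamily(
            "haproxy_up", "Whether the HAProxy scrape succeeded.",
            value=stats["up"],
        )
        # Export the raw counter; consumers apply rate() themselves.
        conns = CounterMetricFamily(
            "haproxy_connections_total", "Total connections handled.",
        )
        conns.add_metric([], stats["connections_total"])
        yield conns

registry = CollectorRegistry()
registry.register(HAProxyCollector())
```

Because `collect()` runs on every scrape, the exporter stays stateless and the `_up` metric reflects whether the most recent fetch succeeded.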
When writing Prometheus instrumentation:
- Pick the metric type with the decision table above; default to histogram for distributions.
- Follow the naming format and keep label cardinality within the guidance table.

When writing PromQL queries:
- Apply `rate()` or `increase()` to counters before further operations.
- Prefer `without` over `by` for aggregation.

When writing alerting or recording rules:
- Alert on symptoms, not causes; follow `level:metric:operations` naming for recording rules.

When reviewing Prometheus code:
- Check type selection, naming, labels, and query patterns against the rules above.

The coding skill governs workflow; this skill governs Prometheus implementation choices.