Author Prometheus alert and recording rules with promtool-verified YAML, Go-templated annotations, and intentional noise control. Use this skill when writing Prometheus alert or recording rules, authoring rule YAML files, running promtool check rules, tuning alert noise with for/keep_firing_for, using Go templates in annotations, understanding alert state transitions, or needing guidance on PromQL expressions within alert-rule context.
```
npx claudepluginhub ririnto/sinon --plugin observability-assets
```
This skill uses the workspace's default tool permissions.
Design and review Prometheus alert and recording rules around real operator symptoms, validate them with `promtool`, and keep rule definitions stable in version-controlled files. The common case is one rule group with a deliberate evaluation interval, one alert tied to a meaningful symptom, one explicit `for` window that avoids flapping, one clear alert name, one bounded label contract for Alertmanager, and one validation path that proves the shipped rule file is sane before it lands.
Working principles:
- Choose an explicit `for` window, and add `keep_firing_for` only when brief recoveries or scrape gaps would otherwise cause noisy false resolution and the deployed Prometheus version supports it.
- Keep link annotations such as `runbook_url` literal and trusted; use lightweight Go templates such as `{{ $labels.service }}` or `{{ $value }}` only in human-readable annotations where they improve operator clarity.
- Validate every edit with `promtool check rules`, and hand off deeper regression-fixture work to the adjacent testing path when dedicated tests are needed.

Every Prometheus rules file follows this hierarchy:
```yaml
groups:
  - name: <group-name>              # required
    interval: <duration>            # optional, default = global eval interval
    limit: <int>                    # optional, caps series/alerts per rule
    concurrency: <int>              # optional, experimental, parallelism
    rules:
      - alert: <name> | record: <name>
        expr: <promql>              # required
        for: <duration>             # optional (alerts only)
        keep_firing_for: <duration> # optional (alerts only)
        labels: { ... }             # optional
        annotations: { ... }        # optional
```
The top-level key is always `groups`. It holds an ordered list of groups. Each group has a `name` and a list of `rules`. Each rule is either an `alert:` rule or a `record:` rule.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `name` | string | yes | -- | Unique identifier for this group. Used in logs and UI. |
| `interval` | duration | no | global `evaluation_interval` | How often this group's rules are evaluated. |
| `limit` | integer | no | 0 (no limit) | Maximum number of alerts an alerting rule, or series a recording rule, may produce per evaluation. Exceeding it fails that rule's evaluation. |
| `concurrency` | integer | no | 1 | Experimental. Number of goroutines evaluating this group in parallel. |
| `rules` | list | yes | -- | List of alert or recording rules in this group. |
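A sketch of the group-level knobs in use, assuming the standard Go client metric `process_cpu_seconds_total`; the `limit` value here is illustrative:

```yaml
groups:
  - name: heavy-recording
    interval: 30s   # evaluate this group more often than the global default
    limit: 500      # fail a rule's evaluation if it emits more than 500 series
    rules:
      - record: job:process_cpu:rate5m
        expr: sum by (job) (rate(process_cpu_seconds_total[5m]))
```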
Group naming convention:
```yaml
groups:
  - name: api-alerts        # domain + purpose
    rules: [...]
  - name: infra-recording   # domain + type
    rules: [...]
```
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `alert` | string | yes | -- | Alert name, exposed as the `alertname` label. Keep it unique across all loaded rules files so routing and deduplication stay unambiguous. |
| `expr` | PromQL | yes | -- | Boolean expression. Each series it returns becomes an active alert instance. |
| `for` | duration | no | 0s | Time the expression must hold true before transitioning from pending to firing. |
| `keep_firing_for` | duration | no | 0s | Time to keep firing after the expression becomes false. Requires Prometheus >= 2.42.0. |
| `labels` | map | no | {} | Extra labels attached to the firing alert. Merged with `$labels` from the expression result. |
| `annotations` | map | no | {} | Human-readable text attached to each firing alert instance. Supports Go templating. |
Complete alert rule example:
```yaml
groups:
  - name: api-latency
    interval: 1m
    rules:
      - alert: ApiP95LatencyAbove750ms
        expr: >-
          0.75 < round(
            histogram_quantile(
              0.95,
              sum by (le) (rate(http_request_duration_seconds_bucket{job="api"}[5m]))
            ),
            0.001
          )
        for: 10m
        keep_firing_for: 5m
        labels:
          severity: page
          service: api
          team: edge
        annotations:
          summary: API p95 latency is high
          description: |-
            API p95 latency stayed above 750ms for 10 minutes.
            Current value: {{ $value }}s.
            Service: {{ $labels.job }}
          runbook_url: https://runbooks.example.com/api-high-latency
```
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `record` | string | yes | -- | Metric name for the recorded output. Must follow Prometheus metric naming conventions. |
| `expr` | PromQL | yes | -- | Expression evaluated at each group interval. Result becomes the recorded metric. |
| `labels` | map | no | {} | Extra labels attached to every sample of the recorded series. |
Complete recording rule example:
```yaml
groups:
  - name: api-recording
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: >-
          round(sum by (job) (rate(http_requests_total[5m])), 0.001)
        labels:
          environment: production
```
Recording-rule naming convention:
- Use the `level:metric:operations` shape, such as `job:http_requests:rate5m`.
- Let `level` describe the aggregation level after labels are removed or collapsed.
- Name the operation after what was computed, such as `rate5m` or `avg_rate5m`.
- When the expression applies `rate()` or `irate()`, drop `_total` from the recorded metric name.
- Prefer `without (...)` when the main review need is making removed labels explicit, but use `by (...)` when the retained label set is the clearer part of the contract (see the sketch after this list).
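A minimal naming sketch under these conventions, assuming the standard node_exporter counter `node_cpu_seconds_total` (labeled by `cpu`, `mode`, and `instance`):

```yaml
groups:
  - name: node-recording
    rules:
      # level "instance_mode" = labels kept after collapsing cpu;
      # operation "avg_rate5m" = average of the per-cpu 5m rate;
      # _total dropped because rate() turns the counter into a rate
      - record: instance_mode:node_cpu:avg_rate5m
        expr: >-
          avg without (cpu) (rate(node_cpu_seconds_total[5m]))
```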
Prometheus alerts transition through three states:

```
inactive --> pending --> firing
    ^                       |
    |_______________________|
            resolves
```
Inactive: The expression evaluates to false (or produces no series). No alert state exists.
Pending: The expression first evaluates to true, and a timer starts for the `for` duration. During pending:
- if the expression goes false before `for` elapses, the alert reverts to inactive
- if the expression stays true for the full `for` duration, the alert transitions to firing

Firing: The `for` duration has been satisfied while the expression remains true. When the expression later goes false:
- without `keep_firing_for`: the alert immediately returns to inactive
- with `keep_firing_for`: the alert stays in firing for the configured hold-open duration, then returns to inactive

Flapping behavior: An alert that oscillates around the threshold resets its pending timer each time it drops back to inactive. Only a continuous `for` window of true evaluations causes a transition to firing.
Timing diagram for `for: 10m` with `keep_firing_for: 5m`:

```
Time   0m    5m    10m    15m    20m    25m
Expr   T     T     T      T      F      F
State  pen   pen   fire   fire   fire   inactive
                                 ^-- keep_firing_for holds here
```
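These transitions can be pinned down with `promtool test rules`. A minimal sketch, assuming a hypothetical `demo.rules.yaml` that defines an `InstanceDown` alert with `expr: up == 0`, `for: 10m`, and `keep_firing_for: 5m`, on a promtool version that supports `keep_firing_for`:

```yaml
# demo-tests.yaml -- run with: promtool test rules demo-tests.yaml
rule_files:
  - demo.rules.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # up is 0 at 0m..20m, then 1 at 21m..31m
      - series: 'up{job="api", instance="a"}'
        values: '0x20 1x10'
    alert_rule_test:
      - eval_time: 5m     # still pending: the 10m for window is not yet satisfied
        alertname: InstanceDown
        exp_alerts: []
      - eval_time: 15m    # firing: the expression held through the full for window
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              job: api
              instance: a
      - eval_time: 23m    # expression is false again, but keep_firing_for holds it open
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              job: api
              instance: a
```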
These variables are available inside `{{ }}` template expressions in annotation values:
| Variable | Type | Description |
|---|---|---|
| `$labels` | map | Labels of the alert instance (from the expression result). Access fields as `$labels.label_name`. |
| `$value` | number | Evaluated value of the alert expression at the time of firing; interpolates as a plain float. |
| `$externalLabels` | map | External labels configured on the Prometheus server (`prometheus.yml` -> `global.external_labels`). |
| `$externalURL` | string | External URL of the Prometheus server, set with the `--web.external-url` flag. |
Template syntax reference:
```yaml
annotations:
  # Simple variable interpolation
  summary: 'High error rate on {{ $labels.job }}'
  # Value formatting
  description: 'Error rate is {{ $value }} over the last 5 minutes.'
  # External URL for dashboard links
  dashboard: '{{ $externalURL }}/d/my-dashboard?var-job={{ $labels.job }}'

# Conditional rendering ($value is a float, so gt compares it directly)
annotations:
  description: >-
    {{ if gt $value 50.0 }}Critical{{ else }}Warning{{ end }}:
    value is {{ $value }}
```
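Two value-formatting helpers from Prometheus's template function library are also worth a sketch; this assumes `$value` is a raw per-second rate in the first annotation and a 0-1 ratio in the second:

```yaml
annotations:
  # humanize renders the raw float compactly (for example 1.234k)
  summary: 'Request rate on {{ $labels.job }}: {{ humanize $value }} req/s'
  # humanizePercentage formats a 0-1 ratio as a percentage (for example 12.3%)
  description: 'Error ratio is {{ humanizePercentage $value }} over the last 5m.'
```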
Template rules:
- Use `{{ $labels.name }}` dot notation for ordinary label names; for names dot notation cannot express, use `{{ index $labels "name" }}` -- Go templates have no bracket indexing.
- `$value` interpolates as a plain number; compare it numerically with template functions such as `gt` and `lt`.
- Avoid templating `labels` values unless you have a specific reason -- labels should be stable and low-cardinality.
- Keep link annotations such as `runbook_url` as literal strings without template variables.

Prometheus rules belong in a file loaded by the server or rule-evaluation stack, and `promtool` should run against the exact file shape that will ship.
`promtool` must already be installed and available in `PATH` before you treat a rule edit as ready. If it is unavailable, stop at a blocked validation state instead of claiming the rule is ready.
Minimal alerting file:
```yaml
groups:
  - name: api-latency
    interval: 1m
    rules:
      - alert: ApiP95LatencyAbove750ms
        expr: >-
          0.75 < round(
            histogram_quantile(
              0.95,
              sum by (le) (rate(http_request_duration_seconds_bucket{job="api"}[5m]))
            ),
            0.001
          )
        for: 10m
        labels:
          severity: page
          service: api
          team: edge
        annotations:
          summary: API p95 latency is high
          description: |-
            API p95 latency stayed above 750ms for 10 minutes.
            Current value: {{ $value }}s.
          runbook_url: https://runbooks.example.com/api-high-latency
```
Use when: you need one minimal valid Prometheus rules file with a complete alert definition.
Start with syntax validation against the actual rule file:
```
promtool check rules alerts/api-latency.rules.yaml
```
Use when: the rule is newly added or edited, promtool is available in PATH, and you need the first safe correctness check before deeper review.
Validate multiple files at once:
```
promtool check rules rules/*.yaml
```
Strict mode (treat lint warnings as fatal):
```
promtool check rules --lint-fatal alerts/api-latency.rules.yaml
```
Successful output:
```
rules/api-latency.rules.yaml SUCCESS: 1 rules found
```
Error output examples:
```
# Syntax error in PromQL expression
rules/api-latency.rules.yaml FAILED: parsing YAML file rules/api-latency.rules.yaml: error parsing rules: 1:13: parse error

# Duplicate alert name across files
rules/api-latency.rules.yaml FAILED: alert "ApiP95LatencyAbove750ms" is defined twice

# Invalid duration format
rules/api-latency.rules.yaml FAILED: error parsing rules: invalid duration "10"

# Missing required field
rules/api-latency.rules.yaml FAILED: error parsing rules: missing required field 'expr'
```
YAML scalar rules:
- Use `|-` for multiline PromQL strings.
- Use `>-` for one-line expressions when plain scalars would become fragile or need escaping.
- Apply `round(expr, 0.001)` or an equally explicit precision when `rate()`, division, or quantile evaluation is expected to produce decimal values.
- Write comparisons as `threshold < expr` or `threshold <= expr` so the smaller value stays on the left.
Rule lifecycle rules:
- Without `for`, a matching series becomes active (firing) immediately on evaluation.
- With `for`, a matching series stays pending until the duration is satisfied.
- `keep_firing_for` keeps the alert firing briefly after the condition clears, so transient gaps do not create noisy resolve/re-fire churn.

High error-rate alert -- page on sustained user-visible failures instead of one short spike:
```yaml
groups:
  - name: api-errors
    rules:
      - alert: Api5xxRatioAbove5Percent
        expr: |-
          5 < round(
            100 * sum(rate(http_requests_total{job="api",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{job="api"}[5m])),
            0.001
          )
        for: 10m
        labels:
          severity: page
          service: api
        annotations:
          summary: API 5xx ratio is high
          description: |-
            API 5xx ratio stayed above 5% for 10 minutes.
            Current value: {{ $value }}%.
          runbook_url: https://runbooks.example.com/api-high-error-rate
```
Use when: user-visible failures matter more than one host-level saturation signal.
Recording-rule-assisted alert -- reduce repeated query cost and stabilize the alert expression:
```yaml
groups:
  - name: api-recording
    rules:
      - record: job:http_requests:rate5m
        expr: >-
          round(sum by (job) (rate(http_requests_total[5m])), 0.001)
  - name: api-alerts
    rules:
      - alert: Api5xxRatioAbove5Percent
        expr: |-
          5 < round(
            100 * sum(rate(http_requests_total{job="api",status=~"5.."}[5m]))
            /
            sum(job:http_requests:rate5m{job="api"}),
            0.001
          )
        for: 10m
        labels:
          severity: page
          service: api
        annotations:
          summary: API 5xx ratio is high
          description: |-
            API 5xx ratio stayed above 5% for 10 minutes.
            Current value: {{ $value }}%.
```
Use when: the alert expression is reused or expensive enough to deserve one stable recording layer.
Multi-threshold alert with severity tiering via separate rules:
```yaml
groups:
  - name: disk-usage
    interval: 1m
    rules:
      - alert: DiskSpaceWarning
        expr: >-
          80 < round(
            100 * (
              1 - node_filesystem_avail_bytes{fstype!="tmpfs"}
                  /
                  node_filesystem_size_bytes{fstype!="tmpfs"}
            ),
            0.001
          )
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: 'Low disk space on {{ $labels.instance }}'
          description: >-
            Disk usage is above 80% on {{ $labels.instance }} (device {{ $labels.device }}).
            Current: {{ $value }}%.
      - alert: DiskSpaceCritical
        expr: >-
          95 < round(
            100 * (
              1 - node_filesystem_avail_bytes{fstype!="tmpfs"}
                  /
                  node_filesystem_size_bytes{fstype!="tmpfs"}
            ),
            0.001
          )
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: 'Critical disk space on {{ $labels.instance }}'
          description: >-
            Disk usage is above 95% on {{ $labels.instance }} (device {{ $labels.device }}).
            Current: {{ $value }}%.
```
Use when: the same symptom needs different urgency levels at different thresholds.
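This tiering only pays off if the severity values stay bounded, because Alertmanager routes match on them directly. A minimal routing sketch (receiver names are assumed placeholders):

```yaml
# alertmanager.yml fragment -- severity is the rule-side label contract
route:
  receiver: default-queue
  routes:
    - matchers: [ 'severity="critical"' ]
      receiver: oncall-pager
    - matchers: [ 'severity="warning"' ]
      receiver: team-tickets
receivers:
  - name: default-queue
  - name: oncall-pager
  - name: team-tickets
```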
Validate the common case with these checks:
- `for` is explicit and long enough to avoid obvious flapping
- `keep_firing_for` is present only when it reduces noisy false resolution
- link annotations such as `runbook_url` stay literal and trusted
- `promtool check rules` passes on the actual shipped file

Return:
| If the blocker is... | Read... |
|---|---|
| deciding whether an alert needs dedicated regression coverage, recording-rule settle-time checks, or `for`/`keep_firing_for` lifecycle protection | `./references/rule-testing.md` |
| choosing between direct alerts, recording rules, or reusable low-noise alert patterns | `./references/alert-patterns.md` |
- Run `promtool check rules` before claiming rules are ready.
- Keep `for` and `keep_firing_for` choices deliberate.
- Write thresholds with `<` or `<=` so the smaller value stays on the left.
- Add `promtool test rules` coverage for important or subtle alerts.

| Anti-pattern | Why it fails | Correct move |
|---|---|---|
| alerting on one short spike with no `for` window | the rule flaps and burns operator trust | add a deliberate `for` window aligned with the symptom |
| using high-cardinality labels such as pod UID in alert labels | routing and deduplication become noisy and unstable | keep labels bounded and move volatile detail into annotations or investigation steps |
| writing only the alert and never validating it | syntax and behavior drift are caught too late | run `promtool check rules` immediately and add tests when the rule matters |
| alerting on a low-level cause with no user-facing impact | operators get pages with weak actionability | page on the symptom and keep lower-level metrics as supporting signals |
| putting template variables in `runbook_url` or other link annotations | broken links when template data contains URL-unsafe characters | keep link-like annotations as literal trusted strings |
| duplicate `alert:` names across rule files | routing and deduplication become ambiguous, and promtool's duplicate-rules lint flags them | ensure each alert name is globally unique across all loaded rule files |
| setting `for: 0s` explicitly instead of omitting it | signals intent to fire immediately, which is almost never what you want | either omit `for` entirely or set a meaningful duration based on the symptom's natural timescale |
This skill covers:
- `for`, `keep_firing_for`, `labels`, `annotations`, templating, and recording-rule-aware alert design
- `promtool check rules` validation and the rule-side label contract for Alertmanager handoff