Design effective alerting strategies that catch real issues without causing alert fatigue. Use this skill when setting up alerts, reducing noise, or improving on-call experience. Activate when: alerting, alerts, pagerduty, on-call, alert fatigue, too many alerts, missed alerts, monitoring thresholds, alert tuning.
npx claudepluginhub latestaiagents/agent-skills --plugin devops-sre

This skill uses the workspace's default tool permissions.
**Get paged for real problems, not noise.**
Writes SLO-based alert rules with burn-rate thresholds and paired runbooks, outputting configs for Prometheus/Grafana, Datadog, and CloudWatch. Also configures Prometheus Alertmanager routing trees, receivers (Slack, PagerDuty, email), inhibition rules, silences, and notification templates, with thresholds, routing, and escalation. Use it for setting up alerts, defining SLOs, writing runbooks, or refining performance monitoring.
"Every alert should be actionable, and every action should have a runbook."
| Level | Response | Time to Ack | Example |
|---|---|---|---|
| P1/Critical | Page immediately | 5 min | Service down, data loss |
| P2/High | Page during hours | 30 min | Degraded performance |
| P3/Medium | Ticket | Next day | Non-critical feature broken |
| P4/Low | Review weekly | N/A | Cleanup tasks, warnings |
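These tiers surface in the rule examples below as a `severity` label that Alertmanager routes on; the label values are a convention for this doc, not anything Prometheus enforces. A minimal sketch:

```yaml
# Severity label convention: P1 -> critical (page now),
# P2 -> warning (page during business hours), P3/P4 -> low (ticket)
labels:
  severity: critical
```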
Alert on what users experience:
# Good: Users are experiencing errors
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
> 0.01
for: 5m
labels:
severity: critical
annotations:
summary: "Error rate above 1%"
runbook: "https://wiki/runbooks/high-error-rate"
Alert on infrastructure issues that will cause symptoms:
# Acceptable: Will cause problems soon
- alert: DiskSpaceLow
expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.1
for: 15m
labels:
severity: warning
annotations:
summary: "Disk space below 10%"
Alert on error budget consumption:
# Excellent: Based on SLO burn rate
- alert: SLOBurnRateHigh
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
) > (14.4 * 0.001)
for: 5m
labels:
severity: critical
annotations:
summary: "Burning error budget 14x faster than sustainable"
Google SRE's recommended approach:
# Fast burn (page immediately)
- alert: SLOBurnRateFast
expr: |
(
job:slo_errors_per_request:ratio_rate1h > (14.4 * 0.001)
and
job:slo_errors_per_request:ratio_rate5m > (14.4 * 0.001)
)
for: 2m
labels:
severity: critical
# Slow burn (page during business hours)
- alert: SLOBurnRateSlow
expr: |
(
job:slo_errors_per_request:ratio_rate6h > (6 * 0.001)
and
job:slo_errors_per_request:ratio_rate30m > (6 * 0.001)
)
for: 15m
labels:
severity: warning
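The `job:slo_errors_per_request:*` series above come from recording rules that are assumed but never shown. A sketch consistent with the raw queries earlier (add the 30m and 6h variants the same way):

```yaml
groups:
  - name: slo-recording-rules
    rules:
      # precomputed error ratio per window, used by the burn-rate alerts
      - record: job:slo_errors_per_request:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
      - record: job:slo_errors_per_request:ratio_rate1h
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[1h]))
          /
          sum(rate(http_requests_total[1h]))
```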
# Alert when error rate exceeds normal baseline
- alert: ErrorRateAnomaly
expr: |
(
sum(rate(http_errors_total[5m]))
/
sum(rate(http_requests_total[5m]))
)
>
(
sum(rate(http_errors_total[1d] offset 1d))
/
sum(rate(http_requests_total[1d] offset 1d))
) * 2
# Alert when service stops reporting
- alert: ServiceDown
expr: absent(up{job="api-service"} == 1)
for: 5m
# Alert on rapid latency change (deriv() requires a gauge, so this assumes a
# recording rule: job:http_request_duration_seconds:mean5m =
#   rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
- alert: LatencySpike
  expr: deriv(job:http_request_duration_seconds:mean5m[10m]) > 0.1
  for: 2m
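Whichever patterns you adopt, lint the rule file before deploying; `promtool` ships with Prometheus (the filename here is illustrative):

```
promtool check rules alerts.yml
```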
Before adding a new alert, ask:
□ Is this actionable? What should the responder do?
□ Does a runbook exist?
□ Is this a symptom or a cause?
□ What's the false positive rate likely to be?
□ Can this be a ticket instead of a page?
□ Is the threshold based on data, not gut feel?
□ Does it have appropriate for/pending duration?
| Problem | Solution |
|---|---|
| Too many pages | Increase threshold or duration |
| Flapping alerts | Add hysteresis (different up/down thresholds) |
| Duplicate alerts | Use alert grouping/inhibition |
| Low-signal alerts | Convert to ticket or remove |
| Night pages for non-urgent | Route to next business day |
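For the flapping row, Prometheus 2.42+ offers built-in hysteresis: `for:` delays the alert on the way up, and `keep_firing_for:` holds it on the way down. A sketch reusing the recording rule from earlier:

```yaml
- alert: HighErrorRate
  expr: job:slo_errors_per_request:ratio_rate5m > 0.01
  for: 5m               # expression must hold for 5m before the alert fires
  keep_firing_for: 10m  # alert stays firing for 10m after the expression clears
```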
Weekly:
- Review all alerts that fired
- Tag: actionable / noise / duplicate
- Fix or remove noisy alerts
Monthly:
- Review alert coverage vs incidents
- Identify incidents with no alerts (gaps)
- Identify alerts that never fired (remove?)
Quarterly:
- Full alert audit
- Update thresholds based on SLO performance
- Review on-call burden metrics
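The monthly gap review can start from Prometheus itself: every alerting rule exports a built-in ALERTS series, so one query shows which alerts actually fired (assuming your retention covers the window):

```promql
# Firing samples per alert over 30 days; names missing from the result
# never fired and are removal candidates
sum by (alertname) (count_over_time(ALERTS{alertstate="firing"}[30d]))
```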
# Route based on service and severity
receivers:
- name: 'platform-critical'
pagerduty_configs:
- service_key: '<platform-team-key>'
severity: critical
- name: 'platform-warning'
pagerduty_configs:
- service_key: '<platform-team-key>'
severity: warning
- name: 'tickets'
webhook_configs:
- url: 'https://jira.company.com/webhook'
route:
group_by: ['alertname', 'service']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: 'platform-warning'
routes:
- match:
severity: critical
receiver: 'platform-critical'
- match:
severity: low
receiver: 'tickets'
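Inhibition (the duplicate-alert fix from the table above) belongs in this same config. A sketch in the same match-style syntax: while a critical alert is firing for a service, warning alerts for that service are suppressed:

```yaml
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['service']   # only inhibit when both alerts share the same service label
```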
Every alert should have:
# Alert: HighErrorRate
## What This Means
Error rate has exceeded 1% for the past 5 minutes.
Users are experiencing failures.
## Impact
- Users see error pages
- API consumers get 500 responses
- Potential revenue impact
## First Response
1. Check deployment timeline - recent deploy?
2. Check dependency status (database, external APIs)
3. Look at error logs for specific error messages
## Runbook
[Link to detailed runbook]
## Escalation
If unresolved after 15 minutes, page @platform-lead
## Historical Context
- Normal error rate: 0.01-0.05%
- Common causes: bad deploys, DB issues, traffic spikes
Track these to improve your alerting:
| Metric | Target | Why |
|---|---|---|
| MTTA (Mean Time to Acknowledge) | <5 min | Are pages noticed? |
| Pages per week per engineer | <10 | Alert fatigue risk |
| % actionable pages | >80% | Signal vs noise |
| Incidents with no alerts | <10% | Coverage gaps |
| False positive rate | <20% | Trust in alerts |