alerting-strategies

Install
Install the plugin:

$ npx claudepluginhub latestaiagents/agent-skills --plugin devops-sre

Want just this skill? Install it on its own: npx claudepluginhub u/[userId]/[slug]

Description

Design effective alerting strategies that catch real issues without causing alert fatigue. Use this skill when setting up alerts, reducing noise, or improving on-call experience. Activate when: alerting, alerts, pagerduty, on-call, alert fatigue, too many alerts, missed alerts, monitoring thresholds, alert tuning.

Tool Access

This skill uses the workspace's default tool permissions.

Skill Content

Alerting Strategies

Get paged for real problems, not noise.

Alerting Philosophy

"Every alert should be actionable, and every action should have a runbook."

The Goal

  • Page for symptoms (user impact), not causes (internal metrics)
  • Every page should require human judgment
  • False positives erode trust; false negatives cause outages

Alert Severity Levels

Level       | Response          | Time to Ack | Example
------------|-------------------|-------------|-----------------------------
P1/Critical | Page immediately  | 5 min       | Service down, data loss
P2/High     | Page during hours | 30 min      | Degraded performance
P3/Medium   | Ticket            | Next day    | Non-critical feature broken
P4/Low      | Review weekly     | N/A         | Cleanup tasks, warnings
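
A sketch of how a level becomes machine-readable, using the same Prometheus rule format as the examples below (ApiServiceDown and the team label are hypothetical names):

# P1 example: carry the level as a severity label so the router can page on it
- alert: ApiServiceDown
  expr: up{job="api-service"} == 0
  for: 2m
  labels:
    severity: critical    # P1/Critical: page immediately, ack within 5 min
    team: platform        # hypothetical ownership label used by the router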

Alert Types

1. Symptom-Based (Recommended)

Alert on what users experience:

# Good: Users are experiencing errors
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m]))
    > 0.01
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Error rate above 1%"
    runbook: "https://wiki/runbooks/high-error-rate"

2. Cause-Based (Use Sparingly)

Alert on infrastructure issues that will cause symptoms:

# Acceptable: Will cause problems soon
- alert: DiskSpaceLow
  expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.1
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Disk space below 10%"

3. SLO-Based (Best Practice)

Alert on error budget consumption; at a 14.4x burn rate, a 30-day error budget is exhausted in roughly two days:

# Excellent: Based on SLO burn rate
- alert: SLOBurnRateHigh
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)  # 14.4x the 0.1% error budget of a 99.9% SLO
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Burning error budget 14.4x faster than sustainable"

Multi-Window, Multi-Burn-Rate Alerts

Google SRE's recommended approach:

# Fast burn (page immediately)
- alert: SLOBurnRateFast
  expr: |
    (
      job:slo_errors_per_request:ratio_rate1h > (14.4 * 0.001)
      and
      job:slo_errors_per_request:ratio_rate5m > (14.4 * 0.001)
    )
  for: 2m
  labels:
    severity: critical

# Slow burn (page during business hours)
- alert: SLOBurnRateSlow
  expr: |
    (
      job:slo_errors_per_request:ratio_rate6h > (1 * 0.001)
      and
      job:slo_errors_per_request:ratio_rate30m > (1 * 0.001)
    )
  for: 15m
  labels:
    severity: warning
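
The job:slo_errors_per_request:ratio_rate* series above are recording rules rather than raw metrics. A minimal sketch of one window, assuming the same http_requests_total metric used earlier (the 5m, 30m, and 6h windows follow the same shape):

groups:
  - name: slo-recording-rules
    rules:
      # Error ratio per job over a 1h window
      - record: job:slo_errors_per_request:ratio_rate1h
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[1h]))
          /
          sum by (job) (rate(http_requests_total[1h]))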

Alert Design Patterns

Pattern 1: Percentage-Based Thresholds

# Alert when error rate exceeds normal baseline
- alert: ErrorRateAnomaly
  expr: |
    (
      sum(rate(http_errors_total[5m]))
      /
      sum(rate(http_requests_total[5m]))
    )
    >
    (
      sum(rate(http_errors_total[1d] offset 1d))
      /
      sum(rate(http_requests_total[1d] offset 1d))
    ) * 2

Pattern 2: Absence Detection

# Alert when service stops reporting
- alert: ServiceDown
  expr: absent(up{job="api-service"} == 1)
  for: 5m

Pattern 3: Derivative-Based

# Alert on rapid change in average latency; deriv() expects a gauge, so
# derive a latency gauge with a subquery rather than using the raw counter sum
- alert: LatencySpike
  expr: |
    deriv((rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m]))[10m:1m]) > 0.1
  for: 2m

Alert Fatigue Prevention

Checklist Before Creating Alert

□ Is this actionable? What should responder do?
□ Does a runbook exist?
□ Is this a symptom or a cause?
□ What's the false positive rate likely to be?
□ Can this be a ticket instead of a page?
□ Is the threshold based on data, not gut feel?
□ Does it have appropriate for/pending duration?

Noisy Alert Remediation

Problem                    | Solution
---------------------------|-----------------------------------------------
Too many pages             | Increase threshold or duration
Flapping alerts            | Add hysteresis (different up/down thresholds)
Duplicate alerts           | Use alert grouping/inhibition
Low-signal alerts          | Convert to ticket or remove
Night pages for non-urgent | Route to next business day
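
For flapping alerts specifically, one option is keep_firing_for (Prometheus 2.42+), which debounces the clear side of an alert; a sketch with a hypothetical queue_depth gauge:

- alert: HighQueueDepth
  expr: queue_depth > 1000      # hypothetical gauge and threshold
  for: 10m                      # must stay above the threshold for 10m to fire
  keep_firing_for: 15m          # keeps firing 15m after it clears, damping flaps
  labels:
    severity: warning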

Alert Hygiene Process

Weekly:
- Review all alerts that fired (see the query sketch below)
- Tag: actionable / noise / duplicate
- Fix or remove noisy alerts

Monthly:
- Review alert coverage vs incidents
- Identify incidents with no alerts (gaps)
- Identify alerts that never fired (remove?)

Quarterly:
- Full alert audit
- Update thresholds based on SLO performance
- Review on-call burden metrics
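
For the weekly review, the built-in ALERTS series gives a rough noisiness ranking (assuming a Prometheus backend; the count is in rule-evaluation samples, so it approximates time spent firing rather than number of pages):

# Evaluation samples each alert spent firing over the past 7 days
sum by (alertname) (count_over_time(ALERTS{alertstate="firing"}[7d]))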

Alert Routing

Example PagerDuty Integration

# Route based on service and severity
receivers:
  - name: 'platform-critical'
    pagerduty_configs:
      - service_key: '<platform-team-key>'
        severity: critical

  - name: 'platform-warning'
    pagerduty_configs:
      - service_key: '<platform-team-key>'
        severity: warning

  - name: 'tickets'
    webhook_configs:
      - url: 'https://jira.company.com/webhook'

route:
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'platform-warning'
  routes:
    - match:
        severity: critical
      receiver: 'platform-critical'
    - match:
        severity: low
      receiver: 'tickets'
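
The grouping/inhibition remedy from the noisy-alert table lives in this same Alertmanager config; a minimal sketch using the 0.22+ matcher syntax and the label names above:

inhibit_rules:
  # Mute warning-level alerts for a service while a critical alert
  # for that same service is already firing
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    equal: ['service']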

Alert Documentation Template

Every alert should have:

# Alert: HighErrorRate

## What This Means
Error rate has exceeded 1% for the past 5 minutes.
Users are experiencing failures.

## Impact
- Users see error pages
- API consumers get 500 responses
- Potential revenue impact

## First Response
1. Check deployment timeline - recent deploy?
2. Check dependency status (database, external APIs)
3. Look at error logs for specific error messages

## Runbook
[Link to detailed runbook]

## Escalation
If unresolved after 15 minutes, page @platform-lead

## Historical Context
- Normal error rate: 0.01-0.05%
- Common causes: bad deploys, DB issues, traffic spikes

Metrics for Alerting Health

Track these to improve your alerting:

Metric                          | Target | Why
--------------------------------|--------|--------------------
MTTA (Mean Time to Acknowledge) | <5 min | Are pages noticed?
Pages per week per engineer     | <10    | Alert fatigue risk
% actionable pages              | >80%   | Signal vs noise
Incidents with no alerts        | <10%   | Coverage gaps
False positive rate             | <20%   | Trust in alerts