From rampstack-skills
Designs monitoring systems: SLOs, uptime checks, error tracking, alert routing, on-call rotations. Use when setting up or fixing monitoring, alert fatigue, or incident gaps.
npx claudepluginhub rampstackco/claude-skills --plugin rampstack-skills
Slash command: /rampstack-skills:monitoring-and-alerting
Decide what to watch, what to alert on, and how to make sure the right person finds out when things break.
Related skills: incident-response, after-action-report, analytics-strategy, performance-optimization.
Monitoring works in layers. Skip a layer and you'll miss a class of problems.
Is the site up? The simplest, most important layer.
Threshold: any sustained downtime (more than 2 consecutive failed checks) pages.
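A minimal sketch of the "more than 2 consecutive failed checks" rule, assuming check results arrive as a list of booleans ordered oldest to newest. The function names and the input shape are illustrative, not part of any particular uptime tool.

```python
from typing import Sequence


def trailing_failures(results: Sequence[bool]) -> int:
    """Count how many of the most recent checks failed in a row."""
    count = 0
    for ok in reversed(results):
        if ok:
            break
        count += 1
    return count


def should_page(results: Sequence[bool], max_consecutive_failures: int = 2) -> bool:
    """Page only when the current failure streak exceeds the threshold."""
    return trailing_failures(results) > max_consecutive_failures
```

Counting only the trailing streak means a single blip that has already recovered never pages.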
The site is up, but is it serving the right thing?
Threshold: failures of critical-path synthetics page. Non-critical page-level synthetics alert during business hours only.
The site is up and correct, but is it fast enough?
Threshold: regressions from baseline (e.g., p95 latency doubling within 5 minutes). Don't alert on absolute thresholds without baselines.
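One way to express "p95 doubled against baseline" is sketched below. It assumes you can pull two lists of latency samples (a baseline window and the current window); the nearest-rank percentile and the 2x factor are illustrative choices, not a prescribed method.

```python
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[max(rank, 1) - 1]


def is_regression(baseline_ms: Sequence[float], current_ms: Sequence[float],
                  factor: float = 2.0, pct: int = 95) -> bool:
    """True when the current percentile is at least `factor` times the baseline."""
    return percentile(current_ms, pct) >= factor * percentile(baseline_ms, pct)
```

Comparing against a baseline window rather than a fixed number lets the same rule work for a fast endpoint and a slow one.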
The site is up, correct, and fast for most, but errors are happening.
Threshold: rate-based, not count-based. "Error rate above 1% for 5 minutes" beats "more than 100 errors per minute."
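The difference between the two rules can be seen in a short sketch. It assumes per-minute buckets of `(errors, total_requests)`; the bucket shape, function names, and sample traffic numbers are hypothetical.

```python
from typing import Sequence, Tuple

MinuteBucket = Tuple[int, int]  # (errors, total_requests) for one minute


def error_rate(errors: int, total: int) -> float:
    return errors / total if total else 0.0


def rate_alert(buckets: Sequence[MinuteBucket], threshold: float = 0.01,
               minutes: int = 5) -> bool:
    """Fire only if each of the last `minutes` buckets exceeds the rate threshold."""
    recent = buckets[-minutes:]
    if len(recent) < minutes:
        return False  # not enough data to call it sustained
    return all(error_rate(e, t) > threshold for e, t in recent)


def count_alert(buckets: Sequence[MinuteBucket], max_errors: int = 100) -> bool:
    """The count-based rule, for comparison: looks only at the latest error count."""
    return buckets[-1][0] > max_errors


busy_day = [(150, 100_000)] * 5   # 0.15% error rate at high traffic
quiet_outage = [(40, 1_000)] * 5  # 4% error rate at low traffic
```

With these sample buckets, the count-based rule fires on the busy day and misses the quiet outage, while the rate-based rule does the opposite.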
A Service Level Objective is the target for reliability. Common form: "99.9% of homepage requests succeed in under 2 seconds, measured over 30 days."
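The components: the indicator being measured (homepage requests that succeed in under 2 seconds), the target (99.9%), and the measurement window (30 days).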
The error budget is the complement of the target: with a 99.9% target, 0.1% of requests can fail. If you've used the whole budget, slow down on risky changes.
Don't aim for 100%. Don't aim for "five nines" (99.999%) unless you really need it. Each nine costs an order of magnitude more.
| SLO | Allowed downtime per month |
|---|---|
| 99% | 7 hours, 18 minutes |
| 99.9% | 43 minutes |
| 99.95% | 21 minutes |
| 99.99% | 4 minutes, 22 seconds |
| 99.999% | 26 seconds |
For most marketing sites, 99.9% is plenty. For SaaS, 99.95% is reasonable. Anything higher needs significant infrastructure investment.
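The downtime figures can be reproduced with a few lines of arithmetic. The table's values are consistent with an average month of about 30.44 days (365.25 / 12), truncated rather than rounded; that window is an assumption in this sketch, and `allowed_downtime` is an illustrative name.

```python
from datetime import timedelta

# Assumed window: an average month, 365.25 / 12 = 30.4375 days.
AVERAGE_MONTH = timedelta(days=365.25 / 12)


def allowed_downtime(slo_percent: float, window: timedelta = AVERAGE_MONTH) -> timedelta:
    """Downtime the SLO permits over the window."""
    return window * (1 - slo_percent / 100)


for slo in (99, 99.9, 99.95, 99.99, 99.999):
    print(f"{slo}% -> {allowed_downtime(slo)}")
```

Passing a different `window` (for example `timedelta(days=30)`) shows how much the budget shifts with the measurement period.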
When the budget is healthy, ship aggressively. When the budget is half-spent, slow down. When the budget is exhausted, freeze risky changes until reliability recovers.
This is what makes SLOs useful: they create a feedback loop between reliability and velocity.
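The budget policy above can be sketched as a small function. The request counts, the 50% cutoff for "half-spent", and the returned labels are assumptions made for illustration; a real policy would be agreed on by the team.

```python
def budget_consumed(total_requests: int, failed_requests: int,
                    slo_target: float = 0.999) -> float:
    """Fraction of the error budget used so far in the window (1.0 = exhausted)."""
    budget = total_requests * (1 - slo_target)  # failures the SLO allows
    if budget <= 0:
        return 0.0 if failed_requests == 0 else float("inf")
    return failed_requests / budget


def release_policy(consumed: float) -> str:
    """Map budget consumption to a shipping posture."""
    if consumed >= 1.0:
        return "freeze risky changes"
    if consumed >= 0.5:
        return "slow down"
    return "ship aggressively"
```

For example, 200 failures out of 1,000,000 requests at a 99.9% target uses about 20% of the budget, so the posture stays at "ship aggressively".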
What tools are in place? What checks exist? What dashboards? What alerts?
Many teams have a tangle of half-configured tools. The first job is the inventory.
Draw the architecture. Front-end, back-end, database, third-party APIs, queues, workers. Each box is a candidate for monitoring.
For each box, ask:
Pick 3-5 SLOs. They should be:
For each box, configure checks at each layer. Some boxes won't have all four; that's fine.
| Box | Availability | Correctness | Performance | Errors |
|---|---|---|---|---|
| Homepage | HTTP check | Synthetic | LCP/INP | JS errors |
| Login API | HTTP check | Synthetic flow | p95 latency | 5xx rate |
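Keeping the matrix as data makes gaps easy to list during review. In this sketch the first two rows come from the table above; the "Email worker" row and the `missing_layers` helper are hypothetical.

```python
from typing import Dict, List

LAYERS = ("availability", "correctness", "performance", "errors")

coverage: Dict[str, Dict[str, str]] = {
    "Homepage": {"availability": "HTTP check", "correctness": "Synthetic",
                 "performance": "LCP/INP", "errors": "JS errors"},
    "Login API": {"availability": "HTTP check", "correctness": "Synthetic flow",
                  "performance": "p95 latency", "errors": "5xx rate"},
    # Hypothetical box that only has error tracking so far.
    "Email worker": {"errors": "failed job rate"},
}


def missing_layers(matrix: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Return, per box, the layers that have no check configured."""
    gaps: Dict[str, List[str]] = {}
    for box, checks in matrix.items():
        missing = [layer for layer in LAYERS if layer not in checks]
        if missing:
            gaps[box] = missing
    return gaps
```

A gap is a prompt for a decision, not an error: some boxes legitimately skip a layer.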
Three tiers: page, notify, log.
Anything in tier 1 must be:
If tier 1 alerts fire frequently, alert fatigue sets in. People stop responding.
Where do alerts go?
Each tier should have a documented escalation path. If the on-call doesn't ack within 5-15 minutes, escalate.
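A time-based escalation path can be sketched as follows. The chain of responders and the 10-minute timeout are placeholders (the timeout sits inside the 5-15 minute range mentioned above); paging services typically provide this behavior as configuration rather than code.

```python
from typing import Sequence

# Hypothetical escalation chain, first responder first.
CHAIN = ["primary on-call", "secondary on-call", "engineering manager"]


def responder_to_notify(minutes_unacked: float, chain: Sequence[str],
                        ack_timeout_minutes: float = 10) -> str:
    """Pick who should be paged, moving one step up the chain per missed ack window."""
    if not chain:
        raise ValueError("escalation chain must not be empty")
    level = int(minutes_unacked // ack_timeout_minutes)
    return chain[min(level, len(chain) - 1)]  # stay on the last entry once reached
```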
One dashboard per audience:
Dashboards are different from alerts. Alerts say "look now." Dashboards say "here's what's happening."
Every quarter, audit:
Tune the system. Monitoring drifts without active maintenance.
Alerting on cause, not symptom. "CPU is high" is a cause. "Users are slow" is a symptom. Alert on symptoms; investigate causes.
Alerts without a runbook. If the on-call doesn't know what to do, the alert is useless. Every paging alert needs a runbook (even a one-line one).
No baselines for "normal." Alerting on "more than 100 errors per minute" sounds reasonable but a busy day might exceed that without anything being wrong. Use rate-based and anomaly-based alerts.
Single-region monitoring. If your monitoring service runs in the same region as your site, a regional outage can take out both at once, and you'll get woken up when the monitoring itself has issues.
Monitoring the monitoring. Or rather, not. If your alerting platform is down, who tells you? Most paging services offer their own status feeds. Subscribe.
Too many tiers of severity. P0/P1/P2/P3/P4 with different SLAs becomes a sorting exercise. Three tiers (page, notify, log) is plenty.
Synthetics that don't match reality. A synthetic that hits the homepage every minute tests "is the homepage up." It doesn't test "is the actual user flow working." Build synthetics for the journeys that matter.
Static thresholds that never get tuned. Traffic grows, behavior changes, thresholds set last year are wrong. Review thresholds quarterly.
On-call rotation with no handoffs. Each new on-call has to figure out the system. Document. Run weekly handoff meetings or async updates.
Pager fatigue. If on-call is paged more than once or twice a week, something is wrong. Audit the alerts. Reduce, tune, or fix the underlying issues.
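A quarterly audit for pager fatigue can start from page timestamps alone. This sketch assumes you can export page times as `datetime` objects; the grouping by ISO week and the limit of two pages per week are illustrative choices based on the rule of thumb above.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable, Tuple


def noisy_weeks(page_times: Iterable[datetime],
                max_pages_per_week: int = 2) -> Dict[Tuple[int, int], int]:
    """Return {(iso_year, iso_week): page_count} for weeks over the limit."""
    counts = Counter(tuple(t.isocalendar()[:2]) for t in page_times)
    return {week: n for week, n in sorted(counts.items()) if n > max_pages_per_week}


# Hypothetical export: three pages in the first week of 2024, one in the second.
pages = [
    datetime(2024, 1, 1, 3, 15),
    datetime(2024, 1, 2, 14, 0),
    datetime(2024, 1, 3, 23, 40),
    datetime(2024, 1, 10, 9, 5),
]
```

Weeks that show up here are the ones to audit first: reduce, tune, or fix the underlying issue.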
A monitoring plan includes:
references/slo-design-guide.md: Detailed walkthrough of writing SLOs, error budget policies, and common SLO mistakes for web services.