From pm-engineering
Defines Service Level Objectives (SLOs) and error budget policies for services. Creates documents with SLIs, targets, burn rate alerts, and review cadences.
Slash command: /pm-engineering:slo-error-budget
Produce a complete, implementable SLO document for a service — covering what to measure, what target to set, how to calculate the error budget, and what to do when it burns.
A good SLO is not a target to hit. It is an agreement about what reliability means for your users — and a framework for making principled trade-offs between reliability and velocity.
Ask for these if not already provided:
Always establish these before writing the SLO:
| Term | Definition |
|---|---|
| SLI (Service Level Indicator) | The metric being measured — e.g. "% of requests completing successfully in <500ms" |
| SLO (Service Level Objective) | The target for that metric — e.g. "99.5% of requests" |
| SLA (Service Level Agreement) | The contractual commitment to customers — must be looser than the SLO |
| Error budget | The allowed headroom below 100% — the budget for planned and unplanned downtime |
| Burn rate | How fast the error budget is being consumed |
Service: [Name] | Team: [Team name]
Owner: [Name / role] | Approved by: [Name]
Effective date: [Date] | Review date: [Date + 3 months]
Version: [1.0]
[2–3 sentences. What reliability problem are we solving? What was happening before this SLO that made us need it? What decision-making does this SLO enable?]
What this service does: [One sentence]
Who depends on it: [Internal teams / external customers / both (describe)]
Critical user journeys protected by this SLO:
Define one SLI per user journey or reliability dimension. Keep it to 3–5 SLIs maximum.
| Field | Detail |
|---|---|
| What it measures | [e.g. "% of API requests that return a non-5xx response"] |
| Good event definition | [e.g. "HTTP response with status 2xx or 4xx, completed within 500ms"] |
| Bad event definition | [e.g. "HTTP response with status 5xx, or any response taking >500ms"] |
| Measurement source | [e.g. "Application load balancer access logs / Datadog APM / Prometheus"] |
| Measured over | Rolling 28-day window |
| Exclusions | [e.g. "Health check endpoints excluded / Requests during planned maintenance excluded"] |
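As a rough illustration of how the good/bad event definitions and exclusions above turn into a number, here is a minimal Python sketch. The `Request` type, the `/healthz` path, and the sample data are hypothetical; the 500ms threshold and the 5xx rule come from the example definitions in the table. The template does not say how 3xx responses are classified, so this sketch assumes anything that is not a bad event counts as good.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time taken to complete the request
    path: str          # request path, used to apply exclusions


EXCLUDED_PATHS = {"/healthz"}  # hypothetical health check endpoint
LATENCY_THRESHOLD_MS = 500     # from the example definitions above


def is_bad(req: Request) -> bool:
    """Bad event: a 5xx response, or any response slower than the threshold."""
    return req.status >= 500 or req.latency_ms > LATENCY_THRESHOLD_MS


def availability_sli(requests: List[Request]) -> float:
    """Return the percentage of eligible requests that were good events."""
    eligible = [r for r in requests if r.path not in EXCLUDED_PATHS]
    if not eligible:
        return 100.0  # assumption: no traffic means no bad events
    good = sum(1 for r in eligible if not is_bad(r))
    return 100.0 * good / len(eligible)


sample = [
    Request(200, 120, "/checkout"),  # good
    Request(404, 80, "/checkout"),   # good: 4xx is a client error
    Request(500, 90, "/checkout"),   # bad: 5xx
    Request(200, 750, "/checkout"),  # bad: too slow
    Request(200, 5, "/healthz"),     # excluded from the SLI
]
print(f"{availability_sli(sample):.1f}%")  # 2 good of 4 eligible -> 50.0%
```

In practice this calculation would usually live in the measurement source itself (for example a monitoring query) rather than in application code.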
| Field | Detail |
|---|---|
| What it measures | [e.g. "P99 response time for the /checkout endpoint"] |
| Good event definition | [e.g. "Request completes in ≤500ms at P99"] |
| Bad event definition | [e.g. "Request takes >500ms at P99"] |
| Measurement source | [Source] |
| Measured over | Rolling 28-day window |
| Exclusions | [Any exclusions] |
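For a latency SLI like the P99 example above, one possible way to check the threshold is a nearest-rank percentile over the observed latencies. This is only a sketch: the function name and the sample data are hypothetical, and monitoring tools often use histogram-based approximations instead of exact percentiles.

```python
import math
from typing import List


def percentile_nearest_rank(values: List[float], pct: int) -> float:
    """Return the pct-th percentile using the nearest-rank method."""
    if not values:
        raise ValueError("no latency samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


THRESHOLD_MS = 500  # from the example definition above

# Hypothetical sample: 98 fast requests, one slower, one very slow.
latencies_ms = [100.0] * 98 + [450.0, 900.0]

p99 = percentile_nearest_rank(latencies_ms, 99)
print(p99, p99 <= THRESHOLD_MS)  # 450.0 True
```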
[Same structure]
| SLI | Target | Window | Error Budget |
|---|---|---|---|
| [SLI 1 name] | [X]% | 28-day rolling | [100 - X]% = [Y minutes per 28 days] |
| [SLI 2 name] | [X]% | 28-day rolling | [100 - X]% = [Y minutes per 28 days] |
| [SLI 3 name] | [X]% | 28-day rolling | [100 - X]% = [Y minutes per 28 days] |
How targets were set:
Why 100% is NOT the target: [Brief explanation of why targeting 100% is counterproductive: it discourages feature development and doesn't reflect user reality]
For SLI 1 ([Name]), at [X]% target:
Error budget = (100% - SLO target) × measurement window
= (100% - [X]%) × 28 days × 24 hours × 60 minutes
= [Y]% × [Z total minutes]
= [N] minutes of allowed failure per 28-day window
In plain terms: We can afford [N] minutes of [bad events] in any rolling 28-day window before we breach the SLO.
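The formula above can be sketched in Python as follows. The function name and the sample targets are illustrative, not part of the template. Expressing the budget in minutes treats the SLI as time-based; for request-based SLIs the same percentage applies to the request count instead.

```python
WINDOW_DAYS = 28  # rolling window used throughout this document


def error_budget_minutes(slo_target_pct: float, window_days: int = WINDOW_DAYS) -> float:
    """Minutes of allowed failure in the window for a given SLO target."""
    if not 0 < slo_target_pct < 100:
        raise ValueError("SLO target must be between 0 and 100, exclusive")
    total_minutes = window_days * 24 * 60  # 40,320 minutes in 28 days
    return (100 - slo_target_pct) / 100 * total_minutes


for target in (99.0, 99.5, 99.9):
    print(f"{target}% -> {error_budget_minutes(target):.1f} min")
# 99.0% -> 403.2 min
# 99.5% -> 201.6 min
# 99.9% -> 40.3 min
```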
Burn rate = how fast the error budget is being consumed relative to the budget window. A burn rate of 1 means the budget is being consumed at exactly the rate that would exhaust it at the end of the 28-day window; a burn rate of 2 would exhaust it in 14 days.
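A minimal sketch of the burn rate calculation, assuming it is computed as the observed error rate divided by the error rate the SLO allows. The function names and the sample numbers are hypothetical.

```python
def burn_rate(bad_events: int, total_events: int, slo_target_pct: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total_events == 0:
        return 0.0  # assumption: no traffic burns no budget
    observed = bad_events / total_events
    allowed = (100 - slo_target_pct) / 100
    return observed / allowed


def days_to_exhaustion(rate: float, window_days: int = 28) -> float:
    """Days until a full budget is gone if the burn rate stays constant."""
    return float("inf") if rate <= 0 else window_days / rate


# 70 bad requests out of 10,000 against a 99.5% target:
rate = burn_rate(70, 10_000, 99.5)
print(f"{rate:.2f}x, budget gone in {days_to_exhaustion(rate):.1f} days")
# 1.40x, budget gone in 20.0 days
```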
| Alert | Burn rate | Window | Severity | Response |
|---|---|---|---|---|
| Page (critical) | >14× | 1 hour | P1 | Page on-call immediately; at this rate the full budget is exhausted in <2 days |
| Page (high) | >6× | 6 hours | P2 | Page on-call — budget exhausted in <5 days |
| Ticket (warning) | >3× | 3 days | P3 | Create ticket — review at next team meeting |
| Info | >1× | 28 days | Info | Log only; budget on track to exhaust before the end of the window |
Alert implementation: [Link to alert config in monitoring tool — e.g. Datadog, Prometheus/Alertmanager, Grafana]
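The alert tiers in the table can be sketched as a simple lookup. This is not a monitoring-tool configuration: the window keys (`"1h"`, `"6h"`, `"3d"`, `"28d"`) and the `firing_alert` function are hypothetical, and it assumes burn rates per window have already been computed elsewhere.

```python
from typing import Dict, Optional

# (severity, burn-rate threshold, evaluation window), most severe first.
ALERT_TIERS = [
    ("P1", 14.0, "1h"),
    ("P2", 6.0, "6h"),
    ("P3", 3.0, "3d"),
    ("Info", 1.0, "28d"),
]


def firing_alert(burn_rates: Dict[str, float]) -> Optional[str]:
    """Return the most severe tier whose threshold is exceeded, if any."""
    for severity, threshold, window in ALERT_TIERS:
        if burn_rates.get(window, 0.0) > threshold:
            return severity
    return None


print(firing_alert({"1h": 15.2, "6h": 7.0}))  # P1
print(firing_alert({"1h": 2.0, "6h": 6.5}))   # P2
print(firing_alert({"28d": 0.8}))             # None
```

Checking the most severe tier first means a fast burn pages on-call even when slower-window alerts would also fire.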
This policy defines what to do with the error budget — both when it's healthy and when it's burning.
SLO dashboard: [Link to Datadog / Grafana / etc. dashboard]
Metrics exposed:
Reporting cadence:
| Audience | Frequency | Format |
|---|---|---|
| Engineering team | Weekly | Slack summary — #[service]-slo |
| Engineering manager | Monthly | SLO review meeting |
| Stakeholders / customers | Quarterly | SLO compliance summary |
Planned maintenance: Error budget is not consumed during pre-announced maintenance windows. Maintenance must be communicated [X hours] in advance via [channel].
Dependency failures: If SLO breach is caused by an upstream dependency outside our control, document it — but it still counts against our error budget (our users don't distinguish between our failures and our dependencies' failures).
Force majeure: [Policy for cloud provider outages, major infrastructure events]
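One way to apply the planned maintenance and dependency rules above when tallying bad events is sketched below. The `BadEvent` type, the cause labels, and the dates are hypothetical. The force majeure policy is left out because the template leaves it to be defined.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple

Window = Tuple[datetime, datetime]  # (start, end) of a pre-announced window


@dataclass
class BadEvent:
    at: datetime
    cause: str  # hypothetical labels: "own-service" or "upstream-dependency"


def counts_against_budget(event: BadEvent, maintenance: List[Window]) -> bool:
    """Only pre-announced maintenance windows are exempt.

    The cause is deliberately ignored: dependency failures are documented
    but still consume the error budget.
    """
    in_maintenance = any(start <= event.at < end for start, end in maintenance)
    return not in_maintenance


maintenance = [(datetime(2024, 3, 2, 1, 0), datetime(2024, 3, 2, 3, 0))]
events = [
    BadEvent(datetime(2024, 3, 2, 2, 15), "own-service"),          # exempt
    BadEvent(datetime(2024, 3, 5, 14, 0), "upstream-dependency"),  # counts
    BadEvent(datetime(2024, 3, 6, 9, 30), "own-service"),          # counts
]
charged = [e for e in events if counts_against_budget(e, maintenance)]
print(len(charged))  # 2
```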
| Review | When | Who | Output |
|---|---|---|---|
| Error budget review | Weekly | Team | Budget health check — adjust if burning fast |
| SLO target review | Quarterly | Team + EM | Adjust targets if baseline has shifted significantly |
| Annual SLO audit | Annually | Team + Stakeholders | Review SLIs — are we measuring the right things? |
When to change the SLO target:
npx claudepluginhub mohitagw15856/pm-claude-skills --plugin pm-engineering