SLI/SLO/SLA and error budget workflow: define service level indicators, set objectives, calculate error budgets, implement burn rate alerting, and use error budgets to gate risky deployments. Covers Prometheus, Datadog, and Google SRE methodology.
From clarcnpx claudepluginhub marvinrichter/clarc --plugin clarcThis skill uses the workspace's default tool permissions.
Designs and optimizes AI agent action spaces, tool definitions, observation formats, error recovery, and context for higher task completion rates.
Enables AI agents to execute x402 payments with per-task budgets, spending controls, and non-custodial wallets via MCP tools. Use when agents pay for APIs, services, or other agents.
Compares coding agents like Claude Code and Aider on custom YAML-defined codebase tasks using git worktrees, measuring pass rate, cost, time, and consistency.
Scope: Reliability engineering — defining what "good" looks like and alerting when you're heading toward bad. For implementing the monitoring infrastructure, see observability. For incident response when SLOs are breached, see incident-response.
| Term | Definition | Example |
|---|---|---|
| SLI | Service Level Indicator — the measured metric | 99.2% of requests completed in < 500ms over last 28 days |
| SLO | Service Level Objective — the internal reliability target | 99.5% availability |
| SLA | Service Level Agreement — the customer-facing contractual commitment | 99.9% uptime, compensated if breached |
| Error Budget | The allowed amount of unreliability: 1 - SLO | 0.5% = 3.65 hours of downtime per month |
Rule: SLA < SLO < actual performance. Buffer between SLO and SLA absorbs measurement lag.
SLIs should measure user pain, not infrastructure health. Prefer request-level metrics over host metrics.
| Service type | Recommended SLIs |
|---|---|
| Request/response (API) | Availability (% success), Latency (p95/p99) |
| Data pipeline | Freshness (time since last update), Correctness |
| Storage | Durability (data loss rate), Throughput |
| Streaming | Throughput (events/sec), Consumer lag |
# Prometheus: fraction of successful requests
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# Fraction of requests completing in < 300ms
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) < 0.3
| SLO | Max downtime/month | Max downtime/year |
|---|---|---|
| 99% | 7.2 hours | 3.65 days |
| 99.5% | 3.6 hours | 1.83 days |
| 99.9% | 43.8 minutes | 8.76 hours |
| 99.95% | 21.9 minutes | 4.38 hours |
| 99.99% | 4.4 minutes | 52.6 minutes |
def error_budget(slo_percent, window_days=28):
"""Calculate error budget for a given SLO and window."""
error_rate = 1 - (slo_percent / 100)
total_minutes = window_days * 24 * 60
budget_minutes = total_minutes * error_rate
budget_requests_per_million = error_rate * 1_000_000
return {
'budget_minutes': budget_minutes,
'budget_hours': budget_minutes / 60,
'budget_rpm': budget_requests_per_million,
}
# SLO 99.5%, 28-day window
budget = error_budget(99.5)
# → {'budget_minutes': 201.6, 'budget_hours': 3.36, 'budget_rpm': 5000}
# Availability error budget consumed in 28 days (%)
(
1 - (
sum(increase(http_requests_total{status!~"5.."}[28d]))
/
sum(increase(http_requests_total[28d]))
)
) / 0.005 # divide by (1 - SLO) = error budget
Paging on SLO breach (threshold alerts) is too slow. Use burn rate to alert early.
A burn rate of 1 = consuming error budget at exactly the rate that would exhaust it by the end of the window. A burn rate of 14.4 over 1 hour = will exhaust the entire 28-day budget in 2 hours.
| Alert | Short window | Long window | Burn rate | Severity |
|---|---|---|---|---|
| Page | 5 min | 1 hour | 14.4x | Critical |
| Page | 30 min | 6 hours | 6x | Critical |
| Ticket | 6 hours | 3 days | 3x | Warning |
| Notify | 3 days | — | 1x | Info |
# Prometheus alerting rules (99.5% SLO, 28-day window)
groups:
- name: slo_burn_rate
rules:
- alert: ErrorBudgetCritical
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
) > 14.4 * 0.005 # 14.4x burn rate * (1 - SLO)
and
(
sum(rate(http_requests_total{status=~"5.."}[1h]))
/ sum(rate(http_requests_total[1h]))
) > 14.4 * 0.005
for: 0m
labels:
severity: page
annotations:
summary: "Error budget burning at 14.4x — exhausted in 2 hours"
- alert: ErrorBudgetWarning
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[30m]))
/ sum(rate(http_requests_total[30m]))
) > 6 * 0.005
and
(
sum(rate(http_requests_total{status=~"5.."}[6h]))
/ sum(rate(http_requests_total[6h]))
) > 6 * 0.005
for: 0m
labels:
severity: ticket
annotations:
summary: "Error budget burning at 6x — exhausted in ~2 days"
Define what happens at each budget threshold:
| Budget remaining | Policy |
|---|---|
| > 50% | Normal velocity: ship features, run experiments |
| 25–50% | Review risky changes before shipping |
| 10–25% | Freeze non-critical deployments; focus on reliability |
| < 10% | Freeze all deployments; all-hands reliability sprint |
| 0% | SLA breach risk; escalate to VP/CTO |
#!/bin/bash
# check-error-budget.sh — fail CI if budget < 10%
BUDGET_PCT=$(curl -s "http://prometheus:9090/api/v1/query" \
--data-urlencode "query=error_budget_remaining_pct" \
| jq -r '.data.result[0].value[1]')
if (( $(echo "$BUDGET_PCT < 10" | bc -l) )); then
echo "ERROR: Error budget at ${BUDGET_PCT}% — deployment blocked"
exit 1
fi
echo "Error budget at ${BUDGET_PCT}% — deployment allowed"
## SLO: [Service Name] — [SLI Type]
**Owner**: [team]
**SLI**: [exact metric definition]
**SLO**: [X]% over a 28-day rolling window
**SLA** (if applicable): [Y]% — breach triggers [compensation/escalation]
### Error budget
- 28-day budget: [N minutes / N requests per million]
- Budget policy: [link to policy doc]
### Alert thresholds
| Alert | Condition | Severity | Response |
|-------|-----------|----------|---------|
| Page | 14.4x burn (5m + 1h windows) | Critical | On-call response < 15min |
| Ticket | 6x burn (30m + 6h windows) | Warning | Next business day |
### Measurement
- Query: [Prometheus/Datadog query]
- Dashboard: [link]
- Report cadence: Weekly in engineering all-hands
### History
| Month | SLO met? | Budget consumed | Incidents |
|-------|----------|----------------|-----------|
This example traces a new API service from zero to fully alerting SLOs.
1. Define SLI — measure user-facing success rate in Prometheus:
# availability-sli.promql
sum(rate(http_requests_total{service="payment-api",status!~"5.."}[5m]))
/ sum(rate(http_requests_total{service="payment-api"}[5m]))
2. Set SLO — target 99.5% availability over a 28-day rolling window.
Budget: (1 - 0.995) * 28 * 24 * 60 = 201.6 minutes of acceptable downtime.
3. Create burn rate alert — page on-call if budget burns at 14.4× (exhausted in 2 h):
# prometheus/rules/payment-api-slo.yml
- alert: PaymentAPIErrorBudgetCritical
expr: |
(sum(rate(http_requests_total{service="payment-api",status=~"5.."}[5m]))
/ sum(rate(http_requests_total{service="payment-api"}[5m])))
> 14.4 * 0.005
labels:
severity: page
annotations:
summary: "payment-api burning error budget at 14.4× — deploy freeze"
4. Enforce budget policy — gate CI deploys when budget < 10%:
# .github/workflows/deploy.yml (excerpt)
- name: Check error budget
run: bash scripts/check-error-budget.sh payment-api 10
# → exits 1 and blocks deploy if budget < 10%
5. Trigger postmortem — when budget hits 0%, auto-open a Jira ticket:
BUDGET=$(curl -s "$PROMETHEUS/api/v1/query?query=error_budget_remaining_pct{service='payment-api'}" \
| jq -r '.data.result[0].value[1]')
if (( $(echo "$BUDGET <= 0" | bc -l) )); then
gh issue create --repo org/postmortems \
--title "SLO breach: payment-api $(date +%Y-%m-%d)" \
--body "Error budget exhausted. See Grafana dashboard: $DASHBOARD_URL"
fi
/slo command — define and document SLOs for a service