Design Service Level Objectives, Indicators, and error budgets
Designs SLOs, SLIs, and error budgets based on user journeys. Triggers when you need to create reliability targets, define metrics, or establish error budget policies for services.
/plugin marketplace add melodic-software/claude-code-plugins
/plugin install observability-planning@melodic-software

This skill is limited to using the following tools:
Use this skill when:
- Designing Service Level Objectives, Indicators, and error budget policies
- Creating reliability targets, defining SLI metrics, or establishing error budget policies for services
Before designing SLOs:
- Consult the docs-management skill for SLO/SLI patterns

SLO/SLI/SLA RELATIONSHIP:
┌─────────────────────────────────────────────────────────────────┐
│ │
│ SLA (Service Level Agreement) │
│ ├── External promise to customers │
│ ├── Legal/contractual implications │
│ └── Example: "99.9% monthly uptime" │
│ │
│ ▲ │
│ │ Buffer (SLO should be tighter) │
│ │ │
│ SLO (Service Level Objective) │
│ ├── Internal reliability target │
│ ├── Tighter than SLA (headroom) │
│ └── Example: "99.95% monthly availability" │
│ │
│ ▲ │
│ │ Measured by │
│ │ │
│ SLI (Service Level Indicator) │
│ ├── Actual measurement │
│ ├── Quantitative metric │
│ └── Example: "successful_requests / total_requests" │
│ │
│ ▲ │
│ │ Derived from │
│ │ │
│ Error Budget │
│ ├── Allowable unreliability: 100% - SLO │
│ ├── Example: 0.05% = 21.6 minutes/month │
│ └── Spent on: releases, incidents, maintenance │
│ │
└─────────────────────────────────────────────────────────────────┘
SLI CATEGORIES:
AVAILABILITY SLI:
"The proportion of requests that are served successfully"
Formula: successful_requests / total_requests × 100%
Good Events: HTTP 2xx, 3xx, and 4xx (client errors are not service failures)
Bad Events: HTTP 5xx, timeouts, connection failures
Example Prometheus query:
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
───────────────────────────────────────────────────────────────
LATENCY SLI:
"The proportion of requests that are served within threshold"
Formula: requests_below_threshold / total_requests × 100%
Thresholds (example):
- P50: 100ms (median experience)
- P95: 500ms (95th percentile)
- P99: 1000ms (tail latency)
Example Prometheus query:
sum(rate(http_request_duration_bucket{le="0.5"}[5m]))
/
sum(rate(http_request_duration_count[5m]))
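Note: the `le="0.5"` bucket bound is in seconds, matching the 500ms P95 threshold; the query returns the proportion of requests served within that threshold.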
───────────────────────────────────────────────────────────────
QUALITY/CORRECTNESS SLI:
"The proportion of requests that return correct results"
Formula: correct_responses / total_responses × 100%
Good Events: Valid data, expected format
Bad Events: Data corruption, stale data, wrong results
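A correctness SLI generally needs application-level validation to classify each response. A minimal PromQL sketch, assuming a hypothetical `responses_total` counter with a `correct` label set by that validation:

```promql
# Hypothetical counter: responses_total{correct="true"|"false"},
# where `correct` is set by application-level response validation.
sum(rate(responses_total{correct="true"}[5m]))
/
sum(rate(responses_total[5m]))
```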
───────────────────────────────────────────────────────────────
FRESHNESS SLI:
"The proportion of data that is updated within threshold"
Formula: fresh_records / total_records × 100%
Example: "95% of records updated within 5 minutes"
───────────────────────────────────────────────────────────────
THROUGHPUT SLI:
"The proportion of time system handles expected load"
Formula: time_at_capacity / total_time × 100%
Example: "System handles 1000 req/s 99% of the time"
ERROR BUDGET MATH:
Monthly Error Budget (average month ≈ 30.44 days, 43,830 minutes):
SLO Target │ Error Budget │ Allowed Downtime
────────────┼──────────────┼──────────────────
99% │ 1% │ 7h 18m
99.5% │ 0.5% │ 3h 39m
99.9% │ 0.1% │ 43m 50s
99.95% │ 0.05% │ 21m 55s
99.99% │ 0.01% │ 4m 23s
99.999% │ 0.001% │ 26s
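Each downtime figure is (1 − SLO) × minutes in the window: for a 99.9% SLO, 0.001 × 43,830 min ≈ 43.8 min ≈ 43m 50s.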
Error Budget Consumption:
┌─────────────────────────────────────────────────────────────────┐
│ │
│ Monthly Budget: 21m 55s (99.95% SLO) │
│ │
│ ████████████░░░░░░░░░░░░░░░░░░░░ Used: 8m (36%) │
│ │
│ Incidents: │
│ - Jan 5: Database failover - 5m │
│ - Jan 12: Deployment rollback - 3m │
│ │
│ Remaining: 13m 55s (64%) │
│ │
│ Status: ✓ HEALTHY │
│ │
└─────────────────────────────────────────────────────────────────┘
SLO DESIGN WORKFLOW:
Step 1: IDENTIFY USER JOURNEYS
┌─────────────────────────────────────────────────────────────────┐
│ What do users care about? │
│ │
│ Critical User Journeys (CUJs): │
│ - Login and authentication │
│ - Search and browse products │
│ - Add to cart and checkout │
│ - View order status │
│ │
│ For each journey: │
│ - What constitutes success? │
│ - What latency is acceptable? │
│ - What's the business impact of failure? │
└─────────────────────────────────────────────────────────────────┘
Step 2: DEFINE SLIs
┌─────────────────────────────────────────────────────────────────┐
│ What can we measure that represents user happiness? │
│ │
│ For "Checkout" journey: │
│ - Availability: checkout completes without error │
│ - Latency: checkout completes within 3 seconds │
│ - Correctness: order total matches cart │
│ │
│ SLI Specification: │
│ - What events are we measuring? │
│ - What's a "good" event vs "bad" event? │
│ - Where do we measure? (server, client, synthetic) │
└─────────────────────────────────────────────────────────────────┘
Step 3: SET SLO TARGETS
┌─────────────────────────────────────────────────────────────────┐
│ What reliability level should we target? │
│ │
│ Consider: │
│ - Current baseline (what are we achieving now?) │
│ - User expectations (what do users tolerate?) │
│ - Business requirements (any SLAs?) │
│ - Cost vs reliability trade-off │
│ │
│ Start achievable, improve iteratively │
│ SLO = Current baseline - small margin │
└─────────────────────────────────────────────────────────────────┘
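Measuring the current baseline is typically a one-off query over a long window. A PromQL sketch, using the same SLO counters queried later in this document:

```promql
# Baseline availability over the last 90 days; set the initial SLO
# slightly below this value so the target is achievable from day one.
1 - (
  sum(increase(slo_requests_failed_total[90d]))
  /
  sum(increase(slo_requests_total[90d]))
)
```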
Step 4: DEFINE ERROR BUDGET POLICY
┌─────────────────────────────────────────────────────────────────┐
│ What happens when budget is exhausted? │
│ │
│ Error Budget Policy: │
│ - Budget > 50%: Normal operations │
│ - Budget 25-50%: Slow down risky changes │
│ - Budget < 25%: Focus on reliability │
│ - Budget = 0%: Feature freeze, reliability only │
│ │
│ Escalation: │
│ - Who gets notified at each threshold? │
│ - What actions are required? │
└─────────────────────────────────────────────────────────────────┘
# SLO: {Service Name} - {Journey/Feature}
## Service Overview
| Attribute | Value |
|-----------|-------|
| Service | [Service name] |
| Owner | [Team name] |
| Criticality | [Critical/High/Medium/Low] |
| User Journey | [Journey name] |
## SLI Specification
### Availability SLI
**Definition:** The proportion of [event type] that [success criteria].
**Good Event:** [What counts as success]
**Bad Event:** [What counts as failure]
**Measurement:**
- Source: [Prometheus/Azure Monitor/etc.]
- Query:
```promql
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
```

### Latency SLI

**Definition:** The proportion of requests served within [threshold].

**Thresholds:**
| Percentile | Threshold |
|---|---|
| P50 | [X]ms |
| P95 | [X]ms |
| P99 | [X]ms |
**Measurement:**

```promql
histogram_quantile(0.95,
  rate(http_request_duration_bucket[5m]))
```
## SLO Targets

| SLI | Target | Window |
|---|---|---|
| Availability | [99.9%] | 30 days rolling |
| Latency (P95) | [99%] below 500ms | 30 days rolling |
## Error Budget

| SLO | Error Budget | Allowed Downtime (avg. month) |
|---|---|---|
| 99.9% availability | 0.1% | 43m 50s |
| 99% latency | 1% | 7h 18m |
## Error Budget Policy

| Budget Remaining | Status | Actions |
|---|---|---|
| > 50% | 🟢 Healthy | Normal operations |
| 25-50% | 🟡 Caution | Review recent changes |
| 10-25% | 🟠 Warning | Slow deployments, reliability focus |
| < 10% | 🔴 Critical | Feature freeze |
| Exhausted | ⛔ Frozen | Reliability-only work |
### Escalation

| Threshold | Notify | Action Required |
|---|---|---|
| < 50% | Team lead | Awareness |
| < 25% | Engineering manager | Review deployment pace |
| < 10% | Director | Feature freeze decision |
| Exhausted | VP Engineering | Incident response mode |
## Burn Rate Alerts

| Severity | Burn Rate | Time Window | Example |
|---|---|---|---|
| Critical | 14.4x | 1h | Budget exhausted in ~2 days |
| Warning | 6x | 6h | Budget exhausted in ~5 days |
| Info | 1x | 3d | Budget on track to exhaust |
```yaml
- alert: SLOHighBurnRate
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)  # 14.4x burn rate for 99.9% SLO
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "High error budget burn rate"
    description: "Error budget burning at 14.4x rate"
```
## Baseline

[Include baseline measurements and trends]
```csharp
// SLO metric implementation in .NET
// Infrastructure/Telemetry/SloMetrics.cs
using System.Diagnostics;
using System.Diagnostics.Metrics;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;

public class SloMetrics
{
    private readonly Counter<long> _totalRequests;
    private readonly Counter<long> _successfulRequests;
    private readonly Counter<long> _failedRequests;
    private readonly Histogram<double> _requestDuration;

    public SloMetrics(IMeterFactory meterFactory)
    {
        var meter = meterFactory.Create("OrdersApi.SLO");

        _totalRequests = meter.CreateCounter<long>(
            "slo.requests.total",
            "{request}",
            "Total requests for SLO calculation");

        _successfulRequests = meter.CreateCounter<long>(
            "slo.requests.successful",
            "{request}",
            "Successful requests (good events)");

        _failedRequests = meter.CreateCounter<long>(
            "slo.requests.failed",
            "{request}",
            "Failed requests (bad events)");

        _requestDuration = meter.CreateHistogram<double>(
            "slo.request.duration",
            "ms",
            "Request duration for latency SLI");
    }

    public void RecordRequest(
        string endpoint,
        int statusCode,
        double durationMs)
    {
        var tags = new TagList
        {
            { "endpoint", endpoint },
            { "status_code", statusCode.ToString() }
        };

        _totalRequests.Add(1, tags);

        // Availability SLI: 5xx = bad, everything else = good
        if (statusCode >= 500)
        {
            _failedRequests.Add(1, tags);
        }
        else
        {
            _successfulRequests.Add(1, tags);
        }

        // Latency SLI
        _requestDuration.Record(durationMs, tags);
    }
}

// Middleware to capture SLO metrics for every request
public class SloMetricsMiddleware
{
    private readonly RequestDelegate _next;
    private readonly SloMetrics _sloMetrics;

    public SloMetricsMiddleware(RequestDelegate next, SloMetrics sloMetrics)
    {
        _next = next;
        _sloMetrics = sloMetrics;
    }

    public async Task InvokeAsync(HttpContext context)
    {
        var stopwatch = Stopwatch.StartNew();
        try
        {
            await _next(context);
        }
        finally
        {
            // Record in finally so requests that throw still
            // count against the SLO.
            stopwatch.Stop();
            var endpoint = context.GetEndpoint()?.DisplayName ?? "unknown";
            var statusCode = context.Response.StatusCode;
            var durationMs = stopwatch.Elapsed.TotalMilliseconds;
            _sloMetrics.RecordRequest(endpoint, statusCode, durationMs);
        }
    }
}
```
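For this middleware to see every request, register `SloMetrics` as a singleton (`builder.Services.AddSingleton<SloMetrics>()`) and add the middleware early in the pipeline with `app.UseMiddleware<SloMetricsMiddleware>()`.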
```promql
# Availability SLI (30-day rolling)
1 - (
  sum(increase(slo_requests_failed_total[30d]))
  /
  sum(increase(slo_requests_total[30d]))
)

# Latency SLI (proportion of requests served within 500ms, 30-day)
sum(increase(slo_request_duration_bucket{le="500"}[30d]))
/
sum(increase(slo_request_duration_count[30d]))

# Error Budget Remaining (availability), as a fraction of the budget:
# (budget - actual error rate) / budget
(
  (1 - 0.999)                            # error budget (0.1%)
  -
  (
    sum(increase(slo_requests_failed_total[30d]))
    /
    sum(increase(slo_requests_total[30d]))
  )                                      # actual error rate
) / (1 - 0.999)

# Error Budget Burn Rate (1h): a burn rate of 1 consumes the whole
# budget exactly over the 30-day SLO window
(
  sum(rate(slo_requests_failed_total[1h]))
  /
  sum(rate(slo_requests_total[1h]))
) / (1 - 0.999)  # divide by error budget (0.1%)
```
When designing SLOs, start from critical user journeys and the measured baseline, and keep targets achievable. For detailed guidance and SLO/SLI patterns, consult the docs-management skill.
Last Updated: 2025-12-26