Skill

slo-designer

Designs Service Level Objectives (SLOs) with SLIs, targets, alerting thresholds, and error budget calculations per Google SRE best practices. Use for defining reliability targets or service indicators.

devops

monitoring

Install

npx claudepluginhub rjmurillo/ai-agents --plugin project-toolkit

Tool Access

This skill uses the workspace's default tool permissions.

Preview

When this skill activates, you guide users through designing production-ready Service Level Objectives. Your role is to help identify critical user journeys, define measurable SLIs, set appropriate targets, and calculate error budgets.

SKILL.md

Stats

Parent Repo Stars: 15

Parent Repo Forks: 4

Last Commit: Feb 8, 2026

SLO Designer

Triggers

Activate when the user says something like:

"Design SLOs for my service"
"Define reliability targets"
"Calculate error budget"
"Define SLIs for this system"
"What should my availability target be?"

When to Use

Use this skill when:

Defining reliability targets for a new or existing service
Calculating error budgets for capacity planning
Establishing SLIs and alerting thresholds
Stakeholders ask "what should our availability target be?"

Use chaos-experiment instead when:

Testing whether the system meets its SLOs through failure injection
Validating resilience mechanisms work as designed

Use threat-modeling instead when:

Analyzing security threats, not reliability targets

Core Concepts

Term	Definition	Example
SLI	Service Level Indicator. Metric measuring service quality.	p99 latency, availability %
SLO	Service Level Objective. Target value for an SLI.	p99 < 200ms, 99.9% availability
SLA	Service Level Agreement. Contract with consequences.	99.95% uptime or credits issued
Error Budget	Allowed failures before SLO breach.	0.1% = 43 min/month downtime
Burn Rate	Speed of error budget consumption.	2x burn = budget exhausted in 15 days

Common SLI Categories

Availability

Percentage of successful requests.

availability_sli = (successful_requests / total_requests) * 100

Good for: APIs, web services, databases.
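
The availability formula above can be sketched in Python as follows. This is a minimal illustration, not part of the skill's scripts; the function name and the choice to treat a window with no traffic as fully available are assumptions.

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Return availability as the percentage of successful requests."""
    if successful_requests < 0 or total_requests < 0:
        raise ValueError("request counts must be non-negative")
    if successful_requests > total_requests:
        raise ValueError("successful_requests cannot exceed total_requests")
    if total_requests == 0:
        # Assumption: a window with no traffic counts as fully available.
        return 100.0
    return successful_requests / total_requests * 100


# Hypothetical counts for one measurement window.
print(f"{availability_sli(999_500, 1_000_000):.2f}%")
```

The zero-traffic case is worth deciding explicitly, since dividing by zero or reporting 0% for a quiet window would both distort the SLI.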

Latency

Response time percentiles (p50, p95, p99).

latency_sli = percentile(response_times, 99)

Good for: User-facing endpoints, real-time systems.
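
The `percentile` call above is pseudocode. One possible reading is the nearest-rank method, sketched below; the function and the sample data are illustrative assumptions, and production systems usually compute percentiles from histograms rather than raw samples.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical response times in milliseconds.
response_times_ms = [120, 95, 180, 210, 101, 99, 450, 130, 110, 105]
latency_sli = percentile(response_times_ms, 99)
print(latency_sli)
```

With only ten samples the p99 is simply the slowest request, which shows why high percentiles need a large sample to be meaningful.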

Throughput

Requests per second (RPS) or transactions.

throughput_sli = requests_per_second / expected_baseline

Good for: Batch processing, high-volume systems.

Error Rate

Percentage of 5xx responses.

error_rate_sli = (error_responses / total_responses) * 100

Good for: APIs, microservices.

Correctness

Percentage of correct results.

correctness_sli = (correct_results / total_results) * 100

Good for: Data pipelines, ML inference, calculations.

Process

1. DISCOVERY          Identify critical user journeys
        |             What matters to users?
        v
2. SLI DEFINITION     Select measurable indicators
        |             How do we measure success?
        v
3. SLO TARGETS        Set achievable targets
        |             What should we promise?
        v
4. ERROR BUDGET       Calculate allowed failures
        |             How much can we fail?
        v
5. ALERTING           Define burn rate alerts
        |             When do we intervene?
        v
6. DOCUMENTATION      Generate SLO document

Scripts

calculate_error_budget.py

Calculate error budget for a given SLO target:

python3 .claude/skills/slo-designer/scripts/calculate_error_budget.py \
  --target 99.9 \
  --period monthly

Arguments:

Argument	Required	Description
`--target`	Yes	SLO target percentage (e.g., 99.9)
`--period`	No	Time period: monthly, weekly, daily, quarterly (default: monthly)
`--format`	No	Output format: text, json, markdown (default: text)

Exit Codes:

0: Success
1: Invalid arguments
2: Calculation error

generate_slo_document.py

Generate a complete SLO document from configuration:

python3 .claude/skills/slo-designer/scripts/generate_slo_document.py \
  --config path/to/slo-config.yaml \
  --output docs/slo-document.md

Question Framework

Use these questions to gather requirements:

1. Service Context

What is the service name and purpose?
Who are the primary users (internal, external, both)?
What is the business criticality (revenue impact)?
Are there existing SLAs or customer expectations?

2. User Journeys

What are the 3-5 most critical user journeys?
What actions must succeed for users to be satisfied?
What is the acceptable response time for each journey?
Which failures are most impactful?

3. Current State

What metrics are already collected?
What is the current availability (if known)?
What are the current p50, p95, p99 latencies?
What is the current error rate?

4. Infrastructure

What is the deployment architecture?
Are there external dependencies?
What is the disaster recovery capability?
What maintenance windows exist?

5. Targets

What availability level is appropriate?
What latency targets align with user expectations?
How much error budget can the team responsibly manage?
What alerting thresholds make sense?

SLO Target Guidelines

Service Type	Typical Availability	Latency (p99)	Error Rate
Consumer Web	99.9% (43 min/month)	< 500ms	< 1%
Internal API	99.5% (3.6 hr/month)	< 1s	< 2%
B2B Critical	99.95% (22 min/month)	< 200ms	< 0.1%
Batch Jobs	99% (7.3 hr/month)	N/A	< 5%
Real-time	99.99% (4 min/month)	< 100ms	< 0.01%

Choosing a target:

Start conservative (lower targets are easier to meet)
Consider external dependencies (chain reliability)
Account for maintenance windows
Leave margin from SLA (SLO should be tighter)
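
To make the "chain reliability" point concrete, here is a small sketch of how serial dependencies cap achievable availability. It assumes every dependency must succeed for a request to succeed and that failures are independent; the function name and the example figures are hypothetical.

```python
from math import prod


def serial_availability(availabilities_pct):
    """Availability ceiling (in %) when every listed component must succeed.

    Assumes independent failures, which is a simplification.
    """
    return prod(a / 100 for a in availabilities_pct) * 100


# Hypothetical chain: load balancer, application, database.
ceiling = serial_availability([99.95, 99.9, 99.99])
print(f"{ceiling:.3f}%")
```

Under these assumptions the chain cannot sustain a 99.9% target even though each component meets or exceeds it, so the service target has to sit below the product of its dependencies' availabilities.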

Error Budget Table

SLO Target	Error Budget	Monthly Downtime	Weekly Downtime
99%	1%	7h 18m	1h 41m
99.5%	0.5%	3h 39m	50m
99.9%	0.1%	43m 50s	10m
99.95%	0.05%	21m 55s	5m
99.99%	0.01%	4m 23s	1m
99.999%	0.001%	26s	6s
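
The table rows can be reproduced with a few lines of arithmetic. The sketch below is not the skill's `calculate_error_budget.py`; it is an independent illustration, and the period lengths (in particular an average month of 365.25 / 12 days) are assumptions chosen because they match the figures in the table.

```python
from datetime import timedelta

# Assumed period lengths in hours; "monthly" is an average month (365.25 / 12 days).
PERIOD_HOURS = {"daily": 24, "weekly": 168, "monthly": 730.5, "quarterly": 2191.5}


def allowed_downtime(target_pct: float, period: str = "monthly") -> timedelta:
    """Downtime permitted by an availability target over the given period."""
    if not 0 < target_pct < 100:
        raise ValueError("target_pct must be between 0 and 100 (exclusive)")
    budget_fraction = (100 - target_pct) / 100
    return timedelta(hours=PERIOD_HOURS[period] * budget_fraction)


for target in (99, 99.5, 99.9, 99.95, 99.99, 99.999):
    monthly = allowed_downtime(target, "monthly")
    weekly = allowed_downtime(target, "weekly")
    print(target, monthly, weekly)
```

Using a fixed 30-day month instead would give slightly smaller budgets (for example 43m 12s at 99.9%), so it helps to state which convention a document uses.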

Burn Rate Alerting

Configure alerts based on budget consumption rate:

Alert Severity	Burn Rate	Time to Exhaust	Action
Warning	1x	30 days	Monitor closely
Elevated	2x	15 days	Investigate
Urgent	6x	5 days	Prioritize fix
Critical	14.4x	2 days	Immediate action
Emergency	36x	20 hours	All hands
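
The "Time to Exhaust" column follows directly from the burn rate: a full budget lasts the SLO window divided by the burn rate. A minimal sketch, assuming a 30-day window as in the table (the function name is illustrative):

```python
def hours_to_exhaust(burn_rate: float, window_days: float = 30) -> float:
    """Hours until a full error budget is spent at a constant burn rate."""
    if burn_rate <= 0:
        raise ValueError("burn_rate must be positive")
    return window_days * 24 / burn_rate


for rate in (1, 2, 6, 14.4, 36):
    hours = hours_to_exhaust(rate)
    print(f"{rate}x: {hours:.1f} h ({hours / 24:.2f} days)")
```

This assumes the budget is untouched when the burn starts; if part of it is already spent, the remaining time shrinks proportionally.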

Multi-window alerting:

Alert if:
  (burn_rate_1h > 14.4 AND burn_rate_6h > 6)
  OR
  (burn_rate_6h > 6 AND burn_rate_24h > 2)
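
The multi-window condition above can be expressed as a small Python check. This is a sketch under stated assumptions: burn rate is taken as the observed error ratio divided by the error ratio the SLO allows, and the function names are hypothetical rather than part of any monitoring tool's API.

```python
def burn_rate(error_ratio: float, slo_target_pct: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    allowed = (100 - slo_target_pct) / 100
    if allowed <= 0:
        raise ValueError("slo_target_pct must be below 100")
    return error_ratio / allowed


def should_alert(burn_1h: float, burn_6h: float, burn_24h: float) -> bool:
    """Fire only when a short window and a longer window both show a high burn."""
    fast_burn = burn_1h > 14.4 and burn_6h > 6
    slow_burn = burn_6h > 6 and burn_24h > 2
    return fast_burn or slow_burn


# Hypothetical error ratios per window for a 99.9% SLO.
rates = [burn_rate(r, 99.9) for r in (0.02, 0.007, 0.001)]
print(should_alert(*rates))
```

Requiring two windows to agree is what keeps a brief spike from paging anyone: the short window reacts quickly, and the longer window confirms the burn is sustained.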

Output Template

Generate this structure:

# SLO Document: [Service Name]

## Service Overview
- **Name**: [Service name]
- **Owner**: [Team name]
- **Description**: [What the service does]
- **Business Criticality**: [Low/Medium/High/Critical]

## Critical User Journeys
1. [Journey 1]: [Description]
2. [Journey 2]: [Description]
3. [Journey 3]: [Description]

## Service Level Indicators

### SLI 1: Availability
- **Definition**: Percentage of successful HTTP requests
- **Measurement**: `sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))`
- **Data Source**: Prometheus metrics

### SLI 2: Latency
- **Definition**: 99th percentile response time
- **Measurement**: `histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))`
- **Data Source**: Prometheus metrics

## Service Level Objectives

| SLI | Target | Measurement Window | Rationale |
|-----|--------|-------------------|-----------|
| Availability | 99.9% | 30-day rolling | Industry standard for user-facing APIs |
| Latency (p99) | < 200ms | 30-day rolling | User research shows frustration above 200ms |

## Error Budgets

| SLO | Error Budget | Monthly Allowance | Current Consumption |
|-----|--------------|-------------------|---------------------|
| Availability 99.9% | 0.1% | 43 minutes | [Current] |
| Latency p99 < 200ms | 1% of requests | Up to 1% of requests slower than 200ms | [Current] |

## Alerting Strategy

### Page-worthy Alerts (Critical)
- Burn rate > 14.4x for 1 hour AND > 6x for 6 hours
- Action: Immediate response required

### Ticket-worthy Alerts (Warning)
- Burn rate > 2x for 24 hours
- Action: Investigate within 1 business day

## Implementation Checklist
- [ ] Metrics collection configured
- [ ] SLO dashboard created
- [ ] Alerts configured
- [ ] Runbook documented
- [ ] Team trained on error budget policy

Related Concepts

SLO vs SLA

SLO: Internal target. No contractual penalty.
SLA: External contract. Financial consequences.
Best Practice: SLO should be stricter than SLA to provide buffer.

Error Budget Policy

When error budget is exhausted:

Freeze non-critical feature work
Prioritize reliability improvements
Conduct incident reviews
Address technical debt

When error budget is healthy:

Invest in new features
Accept more risk
Run experiments
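
Deciding whether the budget is "exhausted" or "healthy" requires tracking consumption. The sketch below shows one way to compute it from event counts; the function names and the simple two-state split are illustrative assumptions, and a real policy may add intermediate thresholds.

```python
def budget_consumed(bad_events: int, total_events: int, slo_target_pct: float) -> float:
    """Fraction of the error budget used so far (1.0 means fully spent)."""
    allowed_bad = total_events * (100 - slo_target_pct) / 100
    if allowed_bad == 0:
        # No traffic or a 100% target: any failure exceeds the budget.
        return 0.0 if bad_events == 0 else float("inf")
    return bad_events / allowed_bad


def budget_state(consumed: float) -> str:
    """Map consumption onto the two policy states described above."""
    return "exhausted" if consumed >= 1.0 else "healthy"


# Hypothetical counts for the current 30-day window against a 99.9% SLO.
consumed = budget_consumed(500, 1_000_000, 99.9)
print(f"{consumed:.0%} consumed -> {budget_state(consumed)}")
```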

Anti-Patterns

Avoid	Why	Instead
Setting SLO equal to SLA	No buffer for error budget	SLO should be stricter than SLA
Targeting 100% availability	Impossible and prevents feature velocity	Choose a target below 100% that matches the service type (see SLO Target Guidelines)
Internal metrics as SLIs	Do not reflect user experience	Measure from user perspective (latency, errors)
No error budget policy	SLOs become meaningless targets	Define actions when budget is exhausted
Same SLO for all services	Different services have different needs	Match target to business criticality

Verification

After designing SLOs:

SLIs defined with specific measurement queries
SLO targets set with rationale documented
Error budgets calculated for each SLO
Alerting thresholds defined (burn rate based)
Error budget policy documented (what happens when exhausted)
SLO document generated and reviewed by stakeholders
