slo-workshop
Interactive SLO definition workshop - guides through defining SLIs, setting SLO targets, and establishing error budget policies for a service
From systems-design
npx claudepluginhub melodic-software/claude-code-plugins --plugin systems-design
SLO Workshop Command
This command runs an interactive workshop to help define SLOs (Service Level Objectives) for a service.
Purpose
Guide teams through the complete SLO definition process:
- Identifying critical user journeys
- Selecting appropriate SLIs (Service Level Indicators)
- Setting realistic SLO targets
- Establishing error budget policies
- Designing alerting strategies
Workflow
Phase 1: Service Understanding
First, understand the service context:
If a service name or file is provided:
- Search the codebase for the service
- Identify endpoints, dependencies, and user-facing functionality
- Look for existing metrics, SLOs, or monitoring configuration
Gather context through questions:
- What does this service do for users?
- Who are the primary users (internal/external)?
- What are the critical user journeys?
- What does "working correctly" mean for users?
Phase 2: SLI Selection
Guide through selecting meaningful SLIs:
Present SLI categories:
Common SLI Types:
1. Availability
"Can users access the service?"
Measurement: Successful requests / Total requests
2. Latency
"How fast does the service respond?"
Measurement: Request duration at percentile (p50, p90, p99)
3. Correctness
"Does the service return correct results?"
Measurement: Correct responses / Total responses
4. Throughput
"Can the service handle the load?"
Measurement: Requests processed per time unit
5. Freshness
"How current is the data?"
Measurement: Age of data served to users
For each relevant SLI type, define:
- What counts as a "good" event
- What counts as a "valid" event (denominator)
- How it will be measured (metrics, logs, synthetic)
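The good/valid split above can be sketched in a few lines of Python. This is a minimal illustration, not part of the command: the `Request` record, the choice to exclude 4xx responses from the denominator, and the 300 ms latency threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Request:
    """Hypothetical request record; real data would come from metrics or logs."""
    status: int          # HTTP status code
    duration_ms: float   # server-side latency in milliseconds


def is_valid(r: Request) -> bool:
    # Assumption: client errors (4xx) are not the service's fault,
    # so they are excluded from the denominator.
    return not (400 <= r.status < 500)


def availability_sli(requests: Iterable[Request]) -> Optional[float]:
    """Good events / valid events, where 'good' means no server error."""
    valid = [r for r in requests if is_valid(r)]
    if not valid:
        return None  # no valid events, so the SLI is undefined
    good = sum(1 for r in valid if r.status < 500)
    return good / len(valid)


def latency_sli(requests: Iterable[Request], threshold_ms: float = 300.0) -> Optional[float]:
    """Share of valid requests served within the latency threshold."""
    valid = [r for r in requests if is_valid(r)]
    if not valid:
        return None
    good = sum(1 for r in valid if r.duration_ms <= threshold_ms)
    return good / len(valid)
```

Writing the SLI as a ratio of good to valid events keeps every SLI type on the same 0 to 1 scale, which makes the later target and error budget arithmetic uniform.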
Phase 3: SLO Target Setting
Help set appropriate targets:
Consider factors:
- Current baseline (what are we achieving today?)
- User expectations (what do users need?)
- Engineering capacity (what can we sustain?)
- Business requirements (what's contractually required?)
Provide guidance:
SLO Target Guidance:
Starting Point Recommendations:
- Availability: Start 0.1 percentage points below the current baseline
- Latency: Start at the current p99 plus a 20% buffer
Common Targets (30-day month):
- 99.9% = 43 minutes downtime/month
- 99.5% = 3.6 hours downtime/month
- 99% = 7.2 hours downtime/month
Tips:
- Don't start at 100% (impossible to maintain)
- Don't set targets you can't measure
- Conservative targets are easier to achieve
- You can tighten targets over time
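The arithmetic behind the guidance above can be checked with a short Python sketch. The function names are hypothetical, and a 30-day window is assumed, matching the 43.2 minutes/month figure used for the error budget.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full downtime permitted by an availability target."""
    budget_fraction = (100.0 - slo_percent) / 100.0
    return budget_fraction * window_days * 24 * 60


def starting_availability_target(baseline_percent: float, margin_points: float = 0.1) -> float:
    """Baseline minus a small margin, in percentage points."""
    return baseline_percent - margin_points


def starting_latency_target_ms(current_p99_ms: float, buffer: float = 0.20) -> float:
    """Current p99 plus a relative buffer (20% by default)."""
    return current_p99_ms * (1 + buffer)


if __name__ == "__main__":
    for target in (99.9, 99.5, 99.0):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.1f} min ({minutes / 60:.1f} h) per 30 days")
```

A different window length (for example 28 days or a calendar quarter) changes the figures proportionally, so the window should be stated next to every target.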
Phase 4: Error Budget Policy
Define what happens when the error budget is consumed:
Error budget calculation:
Error Budget = 100% - SLO Target
Example:
SLO = 99.9% availability
Error Budget = 0.1% = 43.2 minutes per 30-day month
Policy framework:
Error Budget Policy Template:
Budget > 50%:
- Normal development velocity
- Standard change process
Budget 25-50%:
- Increased review for risky changes
- Prioritize reliability improvements
Budget < 25%:
- Pause non-critical feature work
- Focus on reliability improvements
Budget exhausted:
- Stop all non-critical deployments
- All hands on reliability
- Postmortem for budget-burning incidents
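One way to connect the policy template to measured data is sketched below in Python. It assumes an event-based SLI (good and valid event counts over the SLO window); the function names and tier labels are illustrative, not defined by the command.

```python
def budget_remaining(slo_percent: float, good_events: int, valid_events: int) -> float:
    """Fraction of the error budget still unspent (1.0 = untouched, <= 0 = exhausted)."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be strictly between 0 and 100")
    if valid_events == 0:
        return 1.0  # nothing measured yet, so nothing has been spent
    allowed_bad = (1 - slo_percent / 100) * valid_events
    actual_bad = valid_events - good_events
    return 1.0 - actual_bad / allowed_bad


def policy_tier(remaining: float) -> str:
    """Map remaining budget to the four tiers of the policy template."""
    if remaining <= 0:
        return "exhausted"
    if remaining < 0.25:
        return "below_25"
    if remaining <= 0.50:
        return "25_to_50"
    return "above_50"


if __name__ == "__main__":
    # 600 bad events against a budget of about 1,000 leaves roughly 40%.
    left = budget_remaining(99.9, good_events=999_400, valid_events=1_000_000)
    print(f"{left:.0%} remaining -> {policy_tier(left)}")
```

Keeping the tier boundaries in one function makes it easier to agree on them during the workshop and to change them later without touching the measurement code.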
Phase 5: Alerting Strategy
Design multi-window burn rate alerting:
Explain burn rate concept:
Burn Rate Alerting:
Burn rate = Rate of consuming error budget, relative to the rate that would spend it exactly over the SLO window (30 days here)
1x burn rate = Budget lasts exactly 30 days
2x burn rate = Will exhaust budget in 15 days
10x burn rate = Will exhaust budget in 3 days
Multi-window alerts:
- Fast burn: 14.4x rate over 1 hour (page)
- Slow burn: 3x rate over 3 days (ticket)
Define alert thresholds based on SLO targets
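The burn rate figures above follow from simple ratios, sketched here in Python. The 30-day window, the `ALERTS` table, and the function names are assumptions for illustration; the thresholds are the two listed above, and observed burn rates would come from whatever monitoring system is in use.

```python
def burn_rate(error_rate: float, slo_percent: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    budget = 1 - slo_percent / 100
    return error_rate / budget


def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """How long a full budget lasts at a constant burn rate."""
    return window_days / rate


def budget_consumed(rate: float, alert_window_hours: float, window_days: int = 30) -> float:
    """Fraction of the budget spent if `rate` holds for the whole alert window."""
    return rate * alert_window_hours / (window_days * 24)


# (name, burn-rate threshold, alert window in hours, severity)
ALERTS = [
    ("fast_burn", 14.4, 1, "page"),
    ("slow_burn", 3.0, 72, "ticket"),
]


def firing_alerts(observed: dict) -> list:
    """Names of alerts whose observed burn rate meets the threshold.

    `observed` maps an alert window in hours to the burn rate measured over it.
    """
    return [
        name
        for name, threshold, window_hours, _severity in ALERTS
        if observed.get(window_hours, 0.0) >= threshold
    ]


if __name__ == "__main__":
    for name, threshold, window_hours, severity in ALERTS:
        spent = budget_consumed(threshold, window_hours)
        print(f"{name}: {threshold}x over {window_hours}h spends {spent:.0%} of budget ({severity})")
```

Computing the budget share each alert represents is a useful sanity check when choosing thresholds: an alert that only fires after a large share of the budget is gone may need a lower threshold or a shorter window.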
Phase 6: Documentation
Generate SLO documentation:
# [Service Name] SLO Definition
## Service Overview
[Description from workshop]
## Critical User Journeys
1. [Journey 1]
2. [Journey 2]
## SLIs
### [SLI Name]
- Type: [Availability/Latency/etc.]
- Definition: [How measured]
- Good event: [What counts as good]
- Valid event: [What counts as valid]
## SLO Targets
| SLI | Target | Window | Error Budget |
|-----|--------|--------|--------------|
| [SLI 1] | [%] | [days] | [time] |
## Error Budget Policy
### Budget > 50%
[Actions]
### Budget 25-50%
[Actions]
### Budget < 25%
[Actions]
### Budget Exhausted
[Actions]
## Alerting
| Alert | Burn Rate | Window | Severity |
|-------|-----------|--------|----------|
| [Name] | [rate]x | [time] | [Page/Ticket] |
## Review Schedule
- Quarterly SLO review
- Monthly error budget review
- After significant incidents
Usage Examples
# Start workshop for a specific service
/sd:slo-workshop order-service
# Start workshop with context file
/sd:slo-workshop @docs/services/payment-api.md
# Start general workshop
/sd:slo-workshop
Interactive Elements
Throughout the workshop, use AskUserQuestion to:
- Gather service context
- Validate SLI selections
- Confirm target appropriateness
- Review error budget policies
Output
The workshop produces:
- SLO Definition Document - Complete SLO specification
- Implementation Checklist - Steps to implement the SLOs
- Review Schedule - When to revisit and adjust
Related Skills
This command leverages:
- slo-sli-error-budget - SLO methodology details
- observability-patterns - Measurement approaches
- distributed-tracing - Trace-based SLIs
Related Agent
For SLO consultation without the interactive workshop:
- observability-consultant - General observability guidance