SLO Workshop Command

This command runs an interactive workshop to help define SLOs (Service Level Objectives) for a service.

Purpose

Guide teams through the complete SLO definition process:

Identifying critical user journeys
Selecting appropriate SLIs (Service Level Indicators)
Setting realistic SLO targets
Establishing error budget policies
Designing alerting strategies

Workflow

Phase 1: Service Understanding

First, understand the service context:

If a service name or file is provided:

Search the codebase for the service
Identify endpoints, dependencies, and user-facing functionality
Look for existing metrics, SLOs, or monitoring configuration

Gather context through questions:

What does this service do for users?
Who are the primary users (internal/external)?
What are the critical user journeys?
What does "working correctly" mean for users?

Phase 2: SLI Selection

Guide through selecting meaningful SLIs:

Present SLI categories:

Common SLI Types:

1. Availability
   "Can users access the service?"
   Measurement: Successful requests / Total requests

2. Latency
   "How fast does the service respond?"
   Measurement: Request duration at a chosen percentile (p50, p90, p99)

3. Correctness
   "Does the service return correct results?"
   Measurement: Correct responses / Total responses

4. Throughput
   "Can the service handle the load?"
   Measurement: Requests processed per time unit

5. Freshness
   "How current is the data?"
   Measurement: Age of data served to users

For each relevant SLI type, define:

What counts as a "good" event
What counts as a "valid" event (denominator)
How it will be measured (metrics, logs, synthetic)
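
The good/valid split above can be sketched in a few lines of Python. This is a minimal illustration, not the workshop's prescribed method: the `Request` record, the choice to exclude 4xx responses from the valid set, and the 300 ms threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int         # HTTP status code (hypothetical field)
    duration_ms: float  # server-side latency in milliseconds (hypothetical field)


def is_valid(req: Request) -> bool:
    # Assumption: client errors (4xx) are not the service's fault,
    # so they are excluded from the denominator.
    return not (400 <= req.status < 500)


def availability_sli(requests: list[Request]) -> float:
    """Good events (non-5xx) divided by valid events."""
    valid = [r for r in requests if is_valid(r)]
    if not valid:
        return 1.0  # assumption: no valid traffic counts as meeting the SLI
    good = [r for r in valid if r.status < 500]
    return len(good) / len(valid)


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Valid events served within threshold_ms divided by valid events."""
    valid = [r for r in requests if is_valid(r)]
    if not valid:
        return 1.0
    good = [r for r in valid if r.duration_ms <= threshold_ms]
    return len(good) / len(valid)


sample = [Request(200, 120.0), Request(200, 480.0),
          Request(500, 30.0), Request(404, 10.0)]
print(f"availability: {availability_sli(sample):.3f}")
print(f"latency (<= 300 ms): {latency_sli(sample, 300.0):.3f}")
```

Expressing every SLI as a good/valid ratio keeps availability and latency on the same 0 to 1 scale, which makes the later error budget arithmetic uniform.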

Phase 3: SLO Target Setting

Help set appropriate targets:

Consider factors:

Current baseline (what are we achieving today?)
User expectations (what do users need?)
Engineering capacity (what can we sustain?)
Business requirements (what's contractually required?)

Provide guidance:

SLO Target Guidance:

Starting Point Recommendations:
- Availability: Start at the current baseline minus 0.1 percentage points
- Latency: Start at the current p99 plus a 20% buffer

Common Targets (30-day month):
- 99.9% = 43.2 minutes downtime/month
- 99.5% = 3.6 hours downtime/month
- 99% = 7.2 hours downtime/month

Tips:
- Don't start at 100% (impossible to maintain)
- Don't set targets you can't measure
- Conservative (looser) targets are easier to achieve
- You can tighten targets over time
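
The guidance above reduces to simple arithmetic, sketched here in Python. The function names are hypothetical, and the 30-day window is an assumption; a calendar-month or 28-day window gives slightly different figures.

```python
def allowed_downtime_minutes(target_pct: float, window_days: int = 30) -> float:
    """Minutes of full downtime a target permits over the window."""
    budget_fraction = (100.0 - target_pct) / 100.0
    return budget_fraction * window_days * 24 * 60


def starting_availability_target(baseline_pct: float) -> float:
    """Starting point: current baseline minus 0.1 percentage points."""
    return round(baseline_pct - 0.1, 3)


def starting_latency_target_ms(current_p99_ms: float) -> float:
    """Starting point: current p99 plus a 20% buffer."""
    return current_p99_ms * 1.2


for target in (99.9, 99.5, 99.0):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% allows {minutes:.1f} min ({minutes / 60:.1f} h) per 30 days")
```

For the three common targets this yields roughly 43.2, 216, and 432 minutes per 30 days.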

Phase 4: Error Budget Policy

Define what happens as the error budget is consumed. The thresholds below refer to the share of budget remaining in the current SLO window:

Error budget calculation:

Error Budget = 100% - SLO Target

Example:
SLO = 99.9% availability
Error Budget = 0.1% = 43.2 minutes per 30-day month
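
For request-based SLIs the same budget can be counted in events rather than minutes. The following Python sketch assumes hypothetical function names and made-up traffic numbers; it only illustrates the calculation.

```python
def error_budget_fraction(slo_target_pct: float) -> float:
    """Error Budget = 100% - SLO Target, expressed as a fraction."""
    return (100.0 - slo_target_pct) / 100.0


def allowed_bad_events(slo_target_pct: float, valid_events: int) -> float:
    """How many valid events may be bad before the budget is spent."""
    return error_budget_fraction(slo_target_pct) * valid_events


def budget_remaining_fraction(slo_target_pct: float,
                              valid_events: int,
                              bad_events: int) -> float:
    """Share of the budget still unspent; negative when overspent."""
    allowed = allowed_bad_events(slo_target_pct, valid_events)
    if allowed == 0:
        return 0.0  # a 100% target (or no traffic) leaves no budget
    return 1.0 - bad_events / allowed


# Illustrative numbers: 1,000,000 valid requests and 400 failures so far.
remaining = budget_remaining_fraction(99.9, 1_000_000, 400)
print(f"allowed bad events: {allowed_bad_events(99.9, 1_000_000):.0f}")
print(f"budget remaining: {remaining:.0%}")
```

With a 99.9% target and one million valid requests, about 1,000 failures are permitted, so 400 failures leave about 60% of the budget.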

Policy framework:

Error Budget Policy Template:

Budget > 50%:
- Normal development velocity
- Standard change process

Budget 25-50%:
- Increased review for risky changes
- Prioritize reliability improvements

Budget < 25%:
- Pause non-critical feature work
- Focus on reliability improvements

Budget exhausted:
- Stop all non-critical deployments
- All hands on reliability
- Postmortem for budget-burning incidents
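
The policy template above can be encoded as a lookup from remaining budget to tier. This is a sketch only: the tier names are hypothetical, and placing exactly 25% and exactly 50% in the middle tier is an assumption, since the template leaves those boundaries open.

```python
POLICY_ACTIONS = {
    "above_50": ["Normal development velocity", "Standard change process"],
    "25_to_50": ["Increased review for risky changes",
                 "Prioritize reliability improvements"],
    "below_25": ["Pause non-critical feature work",
                 "Focus on reliability improvements"],
    "exhausted": ["Stop all non-critical deployments",
                  "All hands on reliability",
                  "Postmortem for budget-burning incidents"],
}


def policy_tier(budget_remaining: float) -> str:
    """Map the remaining budget (1.0 = untouched, 0.0 = spent) to a tier."""
    if budget_remaining <= 0.0:
        return "exhausted"
    if budget_remaining < 0.25:
        return "below_25"
    if budget_remaining <= 0.50:
        # Assumption: the 25% and 50% boundaries belong to the middle tier.
        return "25_to_50"
    return "above_50"


for remaining in (0.80, 0.40, 0.10, 0.0):
    tier = policy_tier(remaining)
    print(f"{remaining:.0%} remaining -> {tier}: {'; '.join(POLICY_ACTIONS[tier])}")
```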

Phase 5: Alerting Strategy

Design multi-window burn rate alerting:

Explain burn rate concept:

Burn Rate Alerting:

Burn rate = How fast the error budget is being consumed, relative to the rate that would spend it exactly over the SLO window (30 days here)

1x burn rate = Budget lasts exactly 30 days
2x burn rate = Will exhaust budget in 15 days
10x burn rate = Will exhaust budget in 3 days

Multi-window alerts:
- Fast burn: 14.4x rate over 1 hour (page), about 2% of a 30-day budget
- Slow burn: 3x rate over 3 days (ticket), about 30% of a 30-day budget

Define alert thresholds based on SLO targets
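
The burn rate figures above follow from a few ratios, sketched here in Python under a 30-day SLO window. The function names are hypothetical. The `should_alert` check, which fires only when both a long and a short window exceed the threshold, reflects a common multi-window practice and is an assumption rather than something this command prescribes.

```python
def burn_rate(observed_error_rate: float, slo_target_pct: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    allowed_error_rate = (100.0 - slo_target_pct) / 100.0
    return observed_error_rate / allowed_error_rate


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until a full budget is spent at a constant burn rate."""
    return window_days / rate


def budget_consumed(rate: float, alert_window_hours: float,
                    window_days: float = 30.0) -> float:
    """Fraction of the budget spent if `rate` holds for the alert window."""
    return rate * alert_window_hours / (window_days * 24.0)


def should_alert(long_window_rate: float, short_window_rate: float,
                 threshold: float) -> bool:
    # Assumption: requiring both windows to exceed the threshold lets the
    # alert clear soon after the burn stops.
    return long_window_rate >= threshold and short_window_rate >= threshold


print(days_to_exhaustion(2.0), days_to_exhaustion(10.0))
print(f"{budget_consumed(14.4, 1.0):.1%} of budget spent at 14.4x for 1 hour")
print(should_alert(long_window_rate=15.0, short_window_rate=16.2, threshold=14.4))
```

Working backwards from `budget_consumed` is one way to derive thresholds: pick the share of budget you are willing to lose before someone is notified, then solve for the burn rate over the chosen window.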

Phase 6: Documentation

Generate SLO documentation:

# [Service Name] SLO Definition

## Service Overview
[Description from workshop]

## Critical User Journeys
1. [Journey 1]
2. [Journey 2]

## SLIs

### [SLI Name]
- Type: [Availability/Latency/etc.]
- Definition: [How measured]
- Good event: [What counts as good]
- Valid event: [What counts as valid]

## SLO Targets

| SLI | Target | Window | Error Budget |
|-----|--------|--------|--------------|
| [SLI 1] | [%] | [days] | [time] |

## Error Budget Policy

### Budget > 50%
[Actions]

### Budget 25-50%
[Actions]

### Budget < 25%
[Actions]

### Budget Exhausted
[Actions]

## Alerting

| Alert | Burn Rate | Window | Severity |
|-------|-----------|--------|----------|
| [Name] | [rate]x | [time] | [Page/Ticket] |

## Review Schedule
- Quarterly SLO review
- Monthly error budget review
- After significant incidents

Usage Examples

# Start workshop for a specific service
/sd:slo-workshop order-service

# Start workshop with context file
/sd:slo-workshop @docs/services/payment-api.md

# Start general workshop
/sd:slo-workshop

Interactive Elements

Throughout the workshop, use AskUserQuestion to:

Gather service context
Validate SLI selections
Confirm target appropriateness
Review error budget policies

Output

The workshop produces:

SLO Definition Document - Complete SLO specification
Implementation Checklist - Steps to implement the SLOs
Review Schedule - When to revisit and adjust

Related Skills

This command leverages:

slo-sli-error-budget - SLO methodology details
observability-patterns - Measurement approaches
distributed-tracing - Trace-based SLIs

Related Agent

For SLO consultation without the interactive workshop:

observability-consultant - General observability guidance