Skill

slo-workshop

Interactive SLO definition workshop - guides through defining SLIs, setting SLO targets, and establishing error budget policies for a service

From systems-design
Install
1
Run in your terminal
$
npx claudepluginhub melodic-software/claude-code-plugins --plugin systems-design
Tool Access

This skill is limited to using the following tools:

ReadGlobGrepTaskAskUserQuestion
Skill Content

SLO Workshop Command

This command runs an interactive workshop to help define SLOs (Service Level Objectives) for a service.

Purpose

Guide teams through the complete SLO definition process:

  1. Identifying critical user journeys
  2. Selecting appropriate SLIs (Service Level Indicators)
  3. Setting realistic SLO targets
  4. Establishing error budget policies
  5. Designing alerting strategies

Workflow

Phase 1: Service Understanding

First, understand the service context:

If a service name or file is provided:

  • Search the codebase for the service
  • Identify endpoints, dependencies, and user-facing functionality
  • Look for existing metrics, SLOs, or monitoring configuration

Gather context through questions:

  1. What does this service do for users?
  2. Who are the primary users (internal/external)?
  3. What are the critical user journeys?
  4. What does "working correctly" mean for users?

Phase 2: SLI Selection

Guide through selecting meaningful SLIs:

Present SLI categories:

Common SLI Types:

1. Availability
   "Can users access the service?"
   Measurement: Successful requests / Total requests

2. Latency
   "How fast does the service respond?"
   Measurement: Request duration at percentile (p50, p90, p99)

3. Correctness
   "Does the service return correct results?"
   Measurement: Correct responses / Total responses

4. Throughput
   "Can the service handle the load?"
   Measurement: Requests processed per time unit

5. Freshness
   "How current is the data?"
   Measurement: Age of data served to users

For each relevant SLI type, define:

  • What counts as a "good" event
  • What counts as a "valid" event (denominator)
  • How it will be measured (metrics, logs, synthetic)

Phase 3: SLO Target Setting

Help set appropriate targets:

Consider factors:

  • Current baseline (what are we achieving today?)
  • User expectations (what do users need?)
  • Engineering capacity (what can we sustain?)
  • Business requirements (what's contractually required?)

Provide guidance:

SLO Target Guidance:

Starting Point Recommendations:
- Availability: Start at current baseline - 0.1%
- Latency: Start at current p99 + 20% buffer

Common Targets:
- 99.9% = 43 minutes downtime/month
- 99.5% = 3.6 hours downtime/month
- 99% = 7.3 hours downtime/month

Tips:
- Don't start at 100% (impossible to maintain)
- Don't set targets you can't measure
- Conservative targets are easier to achieve
- You can tighten targets over time

Phase 4: Error Budget Policy

Define what happens when the error budget is consumed:

Error budget calculation:

Error Budget = 100% - SLO Target

Example:
SLO = 99.9% availability
Error Budget = 0.1% = 43.2 minutes/month

Policy framework:

Error Budget Policy Template:

Budget > 50%:
- Normal development velocity
- Standard change process

Budget 25-50%:
- Increased review for risky changes
- Prioritize reliability improvements

Budget < 25%:
- Pause non-critical feature work
- Focus on reliability improvements

Budget exhausted:
- Stop all non-critical deployments
- All hands on reliability
- Postmortem for budget-burning incidents

Phase 5: Alerting Strategy

Design multi-window burn rate alerting:

Explain burn rate concept:

Burn Rate Alerting:

Burn rate = Rate of consuming error budget

1x burn rate = Exactly consuming monthly budget
2x burn rate = Will exhaust budget in 15 days
10x burn rate = Will exhaust budget in 3 days

Multi-window alerts:
- Fast burn: 14.4x rate over 1 hour (page)
- Slow burn: 3x rate over 3 days (ticket)

Define alert thresholds based on SLO targets

Phase 6: Documentation

Generate SLO documentation:

# [Service Name] SLO Definition

## Service Overview
[Description from workshop]

## Critical User Journeys
1. [Journey 1]
2. [Journey 2]

## SLIs

### [SLI Name]
- Type: [Availability/Latency/etc.]
- Definition: [How measured]
- Good event: [What counts as good]
- Valid event: [What counts as valid]

## SLO Targets

| SLI | Target | Window | Error Budget |
|-----|--------|--------|--------------|
| [SLI 1] | [%] | [days] | [time] |

## Error Budget Policy

### Budget > 50%
[Actions]

### Budget 25-50%
[Actions]

### Budget < 25%
[Actions]

### Budget Exhausted
[Actions]

## Alerting

| Alert | Burn Rate | Window | Severity |
|-------|-----------|--------|----------|
| [Name] | [rate]x | [time] | [Page/Ticket] |

## Review Schedule
- Quarterly SLO review
- Monthly error budget review
- After significant incidents

Usage Examples

# Start workshop for a specific service
/sd:slo-workshop order-service

# Start workshop with context file
/sd:slo-workshop @docs/services/payment-api.md

# Start general workshop
/sd:slo-workshop

Interactive Elements

Throughout the workshop, use AskUserQuestion to:

  • Gather service context
  • Validate SLI selections
  • Confirm target appropriateness
  • Review error budget policies

Output

The workshop produces:

  1. SLO Definition Document - Complete SLO specification
  2. Implementation Checklist - Steps to implement the SLOs
  3. Review Schedule - When to revisit and adjust

Related Skills

This command leverages:

  • slo-sli-error-budget - SLO methodology details
  • observability-patterns - Measurement approaches
  • distributed-tracing - Trace-based SLIs

Related Agent

For SLO consultation without interactive workshop:

  • observability-consultant - General observability guidance
Stats
Parent Repo Stars40
Parent Repo Forks6
Last CommitFeb 15, 2026