Skill

slo

Defines SLOs, SLIs, error budgets, burn rates, reliability targets, and alerting policies for APIs, pipelines, and services.

Prometheus

Bash

monitoring

devops

npx claudepluginhub arbazkhan971/godmode

Stats

Language: Shell

Stars: 18

Forks: 8

Maintenance: Excellent

Last Commit: Apr 25, 2026

SLO -- Service Level Objectives

Activate When

/godmode:slo, "SLO", "SLI", "SLA", "error budget"
"reliability target", "uptime", "burn rate"
Defining reliability contracts for services
Incident reveals SLO violations went undetected

Workflow

Step 1: Service Context

Service: <name and purpose>
Type: User-facing API | Internal API | Pipeline | Batch
Criticality: Tier 1 (revenue) | Tier 2 | Tier 3
Current Reliability: <observed uptime, error rate>
Traffic: <steady | diurnal | spiky | seasonal>
Request Volume: <RPS or per day>

Step 2: SLA vs SLO vs SLI

SLA: External contract. Breach = legal/financial.
  Example: "99.5% uptime/month or credit issued."
SLO: Internal target, stricter than SLA.
  Drives priorities via error budgets.
SLI: Measured indicator.
  Example: successful requests / total requests.

Step 3: SLI Selection

USER-FACING API:
  Availability: successful (non-5xx) responses / total
  Latency: requests below threshold / total
  Error rate: (5xx + timeouts) / total

PIPELINE / BATCH:
  Freshness: data age < threshold
  Completeness: records processed / expected
  Correctness: valid outputs / total outputs

ERROR CLASSIFICATION (counts against the SLO?):
  5xx = YES | Timeout = YES | 4xx = NO
  429 rate-limited = DOCUMENT CHOICE
  Dependency failure = YES (user doesn't care why)

# Measure the current error-rate SLI (5xx ratio over the last 5 minutes)
curl -sf localhost:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))' \
  | jq '.data.result[0].value[1]'
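
The classification rules above can be turned into a small availability calculation. This is a minimal sketch, not the skill's own implementation: the names `counts_against_slo` and `availability`, the `(status, timed_out)` event shape, and the default of excluding 429 are all illustrative assumptions.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical event shape: (HTTP status, or None if no response; timed_out flag).
Event = Tuple[Optional[int], bool]


def counts_against_slo(status: Optional[int], timed_out: bool,
                       count_429: bool = False) -> bool:
    """Apply the classification above: 5xx and timeouts are bad, 4xx is not."""
    if timed_out or status is None:
        return True           # timeout, or no response at all
    if status == 429:
        return count_429      # rate limiting: a choice that must be documented
    return status >= 500      # 5xx is bad; everything else counts as good


def availability(events: Iterable[Event], count_429: bool = False) -> float:
    """Availability SLI as a proportion: good events / total events."""
    total = bad = 0
    for status, timed_out in events:
        total += 1
        if counts_against_slo(status, timed_out, count_429):
            bad += 1
    if total == 0:
        return 1.0            # assumption: no traffic means nothing failed
    return (total - bad) / total


# 997 successes, one 500, one 404 (not counted), one timeout.
sample = [(200, False)] * 997 + [(500, False), (404, False), (200, True)]
print(f"{availability(sample):.3%}")  # prints 99.800%
```

Passing `count_429=True` is one way to record the opposite choice for rate-limited requests.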

Step 4: SLO Targets

| SLI          | SLO Target | Window | SLA      |
|--------------|------------|--------|----------|
| Availability | 99.9%      | 30d    | 99.5%    |
| Latency p50  | < 100ms    | 30d    | < 200ms  |
| Latency p99  | < 500ms    | 30d    | < 1000ms |
| Error rate   | < 0.1%     | 30d    | < 0.5%   |
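
The latency rows can be measured as proportions instead of computed percentiles: "p99 < 500ms" is roughly equivalent to "at least 99% of requests finish in under 500 ms". A minimal sketch, assuming per-request latencies are available as a list of milliseconds; `fraction_below` and `meets_latency_slo` are hypothetical helper names.

```python
from typing import Sequence


def fraction_below(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Latency SLI: requests faster than the threshold / total requests."""
    if not latencies_ms:
        raise ValueError("no requests observed in the window")
    fast = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return fast / len(latencies_ms)


def meets_latency_slo(latencies_ms: Sequence[float], threshold_ms: float,
                      percentile: float) -> bool:
    """True when at least `percentile` percent of requests beat the threshold."""
    return fraction_below(latencies_ms, threshold_ms) >= percentile / 100


# 990 fast requests and 10 slow ones: exactly 99% are under 500 ms.
window = [120.0] * 990 + [800.0] * 10
print(meets_latency_slo(window, 500, 99))  # True
print(meets_latency_slo(window, 100, 50))  # False: no request is under 100 ms
```

Counting requests this way keeps the latency SLI in the same good/total form as availability, so it yields an error budget the same way.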

Step 5: Error Budget Calculation

Error budget = 1 - SLO target

EXAMPLE: 99.9% availability, 30 days
  Budget: 43,200 min * 0.001 = 43.2 min/month

REFERENCE TABLE (30-day window):
| SLO    | Budget | Min/month | Sec/day  |
|--------|--------|-----------|----------|
| 99%    | 1%     | 432 min   | 864 sec  |
| 99.5%  | 0.5%   | 216 min   | 432 sec  |
| 99.9%  | 0.1%   | 43.2 min  | 86.4 sec |
| 99.95% | 0.05%  | 21.6 min  | 43.2 sec |
| 99.99% | 0.01%  | 4.32 min  | 8.64 sec |
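
The budget arithmetic above is easy to reproduce for any target or window. A minimal sketch; the function name `error_budget_minutes` and the 30-day default are assumptions chosen to match the example.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of total outage the window tolerates: (1 - SLO) * window length."""
    budget_fraction = 1 - slo_percent / 100
    return window_days * 24 * 60 * budget_fraction


# Recompute the reference rows: minutes per 30 days and seconds per day.
for slo in (99.0, 99.5, 99.9, 99.95, 99.99):
    per_month = error_budget_minutes(slo)
    per_day = error_budget_minutes(slo, window_days=1) * 60
    print(f"{slo}%  {per_month:8.2f} min/30d  {per_day:8.2f} sec/day")
```

A different rolling window (for example 7 or 28 days) only changes `window_days`; the budget fraction stays the same.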

Step 6: Burn Rate Alerts

Burn rate = observed error rate / max error rate the SLO allows (1 - SLO)
  Over a 30d window: 1x = budget lasts exactly 30d | 2x = exhausted in 15d
  10x = exhausted in 3d | 36x = exhausted in 20h

MULTI-WINDOW ALERTS (Google SRE Workbook):
| Severity | Burn | Long Win | Short Win |
|----------|------|----------|-----------|
| Critical | 14.4 | 1h       | 5min      |
| High     | 6    | 6h       | 30min     |
| Medium   | 3    | 1d       | 2h        |
| Low      | 1    | 3d       | 6h        |
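
The table above can be evaluated mechanically: a tier fires only when both its long and its short window burn faster than the tier's factor. A minimal sketch, assuming error ratios per window have already been measured; the window labels (`"5m"`, `"1h"`, ...) and function names are hypothetical.

```python
from typing import Dict, List

# (severity, burn-rate factor, long window, short window), as in the table above.
ALERT_TIERS = [
    ("critical", 14.4, "1h", "5m"),
    ("high", 6.0, "6h", "30m"),
    ("medium", 3.0, "1d", "2h"),
    ("low", 1.0, "3d", "6h"),
]


def burn_rate(observed_error_rate: float, slo_percent: float) -> float:
    """How many times faster than allowed the budget is being spent."""
    allowed = 1 - slo_percent / 100
    return observed_error_rate / allowed


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Time until a full budget is gone at a constant burn rate."""
    return window_days * 24 / rate


def firing_severities(error_rates: Dict[str, float], slo_percent: float) -> List[str]:
    """Return every tier whose long AND short windows both exceed its factor."""
    fired = []
    for severity, factor, long_w, short_w in ALERT_TIERS:
        long_burn = burn_rate(error_rates[long_w], slo_percent)
        short_burn = burn_rate(error_rates[short_w], slo_percent)
        if long_burn > factor and short_burn > factor:
            fired.append(severity)
    return fired


# Illustrative measurements for a 99.9% SLO during a sharp, recent incident.
rates = {"5m": 0.02, "30m": 0.02, "1h": 0.02, "2h": 0.004,
         "6h": 0.007, "1d": 0.002, "3d": 0.0005}
print(hours_to_exhaustion(36))         # 20.0
print(firing_severities(rates, 99.9))  # ['critical', 'high']
```

The short window is what lets an alert clear quickly once the problem stops, since the long window alone stays elevated well after recovery.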

Step 7: Error Budget Policy

> 75% remaining: Full velocity. Ship freely.
50-75%: Normal velocity, review risky changes.
25-50%: Slow down. Extra review for critical paths.
10-25%: Freeze non-critical. Reliability only.
< 10%: Full freeze. All effort to reliability.
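
The policy tiers above amount to a lookup on remaining budget. A minimal sketch; the tier boundaries in the source overlap at exactly 75/50/25/10, so this version assumes each lower bound is inclusive, and the function names are hypothetical.

```python
def remaining_budget_pct(consumed_minutes: float, budget_minutes: float) -> float:
    """Share of the error budget still unspent, as a percentage."""
    return 100 * (1 - consumed_minutes / budget_minutes)


def budget_policy(remaining_pct: float) -> str:
    """Map remaining budget to the policy tier (lower bounds assumed inclusive)."""
    if remaining_pct > 75:
        return "full velocity"
    if remaining_pct >= 50:
        return "normal velocity, review risky changes"
    if remaining_pct >= 25:
        return "slow down, extra review for critical paths"
    if remaining_pct >= 10:
        return "freeze non-critical, reliability only"
    return "full freeze"


# 30 of 43.2 budget minutes spent leaves about 30.6% of the budget.
left = remaining_budget_pct(consumed_minutes=30, budget_minutes=43.2)
print(f"{left:.1f}% -> {budget_policy(left)}")
```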

Step 8: Release Gating

IF budget < 10%: block deployment
IF canary error rate > 2x baseline: auto-rollback
IF budget dropped > 5% in 1h: alert + investigate
IF progressive rollout: pause on budget spike
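
The gating rules above could be wired into a deploy pipeline as a single check. A minimal sketch under stated assumptions: the function name and parameters are hypothetical, and a "budget spike" is interpreted here as the same "> 5% of budget in 1h" condition, which the source does not define.

```python
from typing import List


def release_actions(budget_remaining_pct: float,
                    canary_error_rate: float,
                    baseline_error_rate: float,
                    budget_drop_1h_pct: float,
                    progressive_rollout: bool = False) -> List[str]:
    """Return the gating actions triggered by the current measurements."""
    actions = []
    if budget_remaining_pct < 10:
        actions.append("block deployment")
    if canary_error_rate > 2 * baseline_error_rate:
        actions.append("auto-rollback")
    if budget_drop_1h_pct > 5:
        actions.append("alert + investigate")
        if progressive_rollout:
            # Assumption: a budget spike means the same >5%-in-1h drop.
            actions.append("pause rollout")
    return actions


print(release_actions(8, 0.001, 0.001, 0))
# ['block deployment']
print(release_actions(60, 0.005, 0.001, 6, progressive_rollout=True))
# ['auto-rollback', 'alert + investigate', 'pause rollout']
```

An empty list means no rule fired and the release may proceed.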

Step 9: Composite SLOs

User journey "Complete Purchase":
  Product API 99.9% * Cart 99.9% * Checkout 99.95%
  * Payment 99.99% * Notification 99.5%
  = 99.24% composite availability
Optimize the weakest link, not the strongest.
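
The composite figure above is the product of the per-service targets, which assumes the services sit in series and fail independently. A minimal sketch that reproduces the "Complete Purchase" number; the function names are hypothetical.

```python
import math
from typing import Dict

# Per-service SLOs for the "Complete Purchase" journey, in percent.
journey = {
    "Product API": 99.9,
    "Cart": 99.9,
    "Checkout": 99.95,
    "Payment": 99.99,
    "Notification": 99.5,
}


def composite_availability(slos_percent: Dict[str, float]) -> float:
    """Serial composition: multiply availabilities (independence assumed)."""
    return 100 * math.prod(p / 100 for p in slos_percent.values())


def weakest_link(slos_percent: Dict[str, float]) -> str:
    """The service with the lowest target caps the journey the most."""
    return min(slos_percent, key=slos_percent.get)


print(f"{composite_availability(journey):.2f}%")  # 99.24%
print(weakest_link(journey))                      # Notification
```

Raising Notification from 99.5% to 99.9% would lift the composite far more than adding another nine to Payment.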

Step 10: Dashboard & Review

Panels: SLO summary, budget time series (30d with threshold lines at 75%/50%/25%/10%), burn rate.

Review: Weekly (team), Monthly (team + mgmt), Quarterly (leadership, targets + SLA alignment).

Key Behaviors

SLOs make reliability measurable.
Error budgets balance velocity and reliability.
Burn rate alerts replace threshold alerts.
Composite SLOs reflect user experience.
Start conservative, tighten incrementally.
Never ask to continue. Loop autonomously.

HARD RULES

NEVER set SLO = 100%. Zero budget = zero deploys.
NEVER set SLO = SLA. Gap is your safety margin.
NEVER use averages as SLIs. Use proportions (good events / total events).
NEVER define SLOs without error budget policy.
ALWAYS require both long AND short alert windows.
ALWAYS measure SLIs as close to the user as possible.
ALWAYS start with current measured reliability.

Auto-Detection

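# List YAML configs that mention a known monitoring stack
grep -rlE "prometheus|datadog|grafana|newrelic" \
  --include="*.yml" --include="*.yaml" . 2>/dev/null
# Find existing SLO definitions (first five matches)
find . \( -name "*slo*" -o -name "*service-level*" \) \
  2>/dev/null | head -5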

TSV Logging

Log to .godmode/slo-results.tsv: timestamp\tservice\tslis\tslos\tbudget_pct\tstatus
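
Appending one row per run in the column order above might look like the following. This is a sketch under assumptions the source does not state: the helper name `log_slo_result`, an ISO 8601 UTC timestamp, and a header row written when the file is first created are all illustrative choices.

```python
import csv
import os
from datetime import datetime, timezone

COLUMNS = ["timestamp", "service", "slis", "slos", "budget_pct", "status"]


def log_slo_result(path, service, slis, slos, budget_pct, status, now=None):
    """Append one tab-separated result row, creating the file if needed."""
    now = now or datetime.now(timezone.utc)
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    # Assumption: write a header only when the file is new or empty.
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        if write_header:
            writer.writerow(COLUMNS)
        writer.writerow([now.isoformat(timespec="seconds"),
                         service, slis, slos, budget_pct, status])


# Example call (hypothetical service name and values):
# log_slo_result(".godmode/slo-results.tsv", "checkout-api", 3, 3, 62.5, "SLO READY")
```

Using `csv.writer` rather than joining strings by hand keeps a stray tab or newline in a service name from corrupting the row.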

Output Format

SLO: {service} | SLIs: {N} | SLOs: {N}
Budget: {N} min/month ({N}% remaining)
Alerts: {N} configured | Policy: {status}
Verdict: SLO READY | NEEDS WORK

Keep/Discard Discipline

KEEP if: SLO < current reliability AND SLO > SLA
  AND error budget policy documented
DISCARD if: SLO = 100% OR SLO = SLA
  OR burn rate alerts fire on resolved problems
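
The keep/discard rule above reduces to one boolean check per candidate SLO. A minimal sketch; `keep_slo` and its parameters are hypothetical names, and all targets are percentages.

```python
def keep_slo(slo: float, current_reliability: float, sla: float,
             policy_documented: bool,
             alerts_fire_on_resolved: bool = False) -> bool:
    """Apply the KEEP / DISCARD conditions to one candidate SLO."""
    # DISCARD conditions take priority.
    if slo >= 100 or slo == sla or alerts_fire_on_resolved:
        return False
    # KEEP: achievable today, stricter than the SLA, and backed by a policy.
    return sla < slo < current_reliability and policy_documented


print(keep_slo(99.9, current_reliability=99.95, sla=99.5, policy_documented=True))  # True
print(keep_slo(99.9, current_reliability=99.8, sla=99.5, policy_documented=True))   # False
```

The second call fails because the target is above what the service currently delivers, so it would start with a blown budget.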

Stop Conditions

STOP when:
  - SLIs defined, SLOs set, budgets calculated
  - Burn rate alerts configured, policy documented
  - Dashboard live AND release gating configured
  - User requests stop OR max 10 iterations
