Site reliability engineering -- SLO/SLI/SLA, error budgets, toil, on-call, runbooks, incidents.
From arbazkhan971/godmode (claudepluginhub). This skill uses the workspace's default tool permissions.
Triggers: /godmode:reliability, "SRE", "SLO", "error budget"
Detect existing health checks and alerting config:
grep -r "healthcheck\|health-check\|/health" \
  --include="*.ts" --include="*.py" -l 2>/dev/null
grep -r "pagerduty\|opsgenie\|alertmanager" \
  --include="*.yaml" --include="*.yml" -l 2>/dev/null
Service: <name> | Criticality: Tier 1/2/3
Current state: monitoring, alerting, on-call, runbooks
Dependencies: <upstream and downstream>
Hierarchy: SLA (external) -> SLO (internal, stricter) -> SLI (measured metric).
SLI categories: availability (success/total), latency (requests < threshold / total), throughput, correctness, freshness, durability.
Error budget = 1 - SLO.
Errors: HTTP 5xx, timeouts, circuit breaker rejections. NOT 4xx or 429.
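A minimal sketch of the availability SLI and error-budget math above (request counts are hypothetical):

```python
def availability_sli(success: int, total: int) -> float:
    """Availability SLI: successful requests / total requests."""
    return success / total if total else 1.0

def error_budget(slo: float) -> float:
    """Error budget = 1 - SLO (fraction of requests allowed to fail)."""
    return 1.0 - slo

def budget_remaining(slo: float, success: int, total: int) -> float:
    """Fraction of the error budget still unspent in this window."""
    budget = error_budget(slo)
    spent = 1.0 - availability_sli(success, total)
    return max(0.0, 1.0 - spent / budget) if budget else 0.0

# Hypothetical window: 999,000 of 1,000,000 requests succeeded vs a 99.9% SLO
print(budget_remaining(0.999, 999_000, 1_000_000))  # 0.0 -- budget fully spent
```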
IF error rate >0.1%: investigate top 3 error classes. WHEN SLO budget <10% remaining: freeze deploys.
Policy:
- >50% budget remaining: normal operations
- <10% budget remaining: freeze deploys
Multi-window burn rate alerts (standard pairs: 14.4x over 1h + 5m for fast burn, 6x over 6h + 30m for slow burn):
Both windows must trigger (reduces false positives).
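A sketch of the multi-window check (window error rates are hypothetical; 14.4x is the standard fast-burn threshold for paging on 2% budget spend in 1h):

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly exhausting the budget over the
    SLO period' the budget is currently being spent."""
    budget = 1.0 - slo
    return error_rate / budget if budget else float("inf")

def should_page(long_window_err: float, short_window_err: float,
                slo: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only if BOTH windows exceed the burn-rate threshold;
    the short window confirms the burn is still ongoing."""
    return (burn_rate(long_window_err, slo) >= threshold
            and burn_rate(short_window_err, slo) >= threshold)

# 2% errors in both the 1h and 5m windows vs a 99.9% SLO -> ~20x burn -> page
print(should_page(0.02, 0.02))  # True
```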
Toil = manual, repetitive, automatable, tactical, no enduring value. Inventory with frequency + hours. Target: <50% of team capacity. Automate top 3.
IF toil > 50%: stop feature work, automate.
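A sketch of the inventory math (task names, frequencies, and team size are hypothetical):

```python
# Each entry: (task, occurrences per week, hours per occurrence)
toil_inventory = [
    ("rotate certs manually", 1, 2.0),
    ("restart stuck workers", 10, 0.25),
    ("ack noisy disk alerts", 20, 0.1),
]

team_capacity_hours = 5 * 40  # hypothetical: 5 engineers, 40h/week each

toil_hours = sum(freq * hrs for _, freq, hrs in toil_inventory)
toil_pct = toil_hours / team_capacity_hours

# Automate the biggest weekly time sinks first
top3 = sorted(toil_inventory, key=lambda t: t[1] * t[2], reverse=True)[:3]
print(f"toil: {toil_hours:.1f}h/week ({toil_pct:.0%} of capacity)")
print("automate first:", [t[0] for t in top3])
```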
On-call:
- Minimum 5 engineers. Primary + secondary.
- Escalation: L1 (0m) -> L2 (15m) -> L3 (30m) -> L4 (60m).
- Health: <5 pages/shift, <1 during sleep, MTTA <5min, MTTR <30min, false positive <20%.
- Max 1 week in 5. Day off after off-hours SEV1.
Every pageable alert needs: what is happening, user impact, diagnostic steps (commands), mitigation options (commands), escalation, post-incident actions. Levels: L0 Manual -> L1 Assisted -> L2 Semi-auto -> L3 Full auto.
Lifecycle: Detection -> Triage -> Mitigation -> Resolution -> Post-mortem -> Prevention. Severity: SEV1 (<15min), SEV2 (<30min), SEV3 (<2h), SEV4 (next day). Roles: IC, Tech Lead, Comms, Scribe.
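A sketch of a triage timing check, using the severity response targets above:

```python
# Response targets in minutes, per the severity levels above
RESPONSE_TARGET_MIN = {"SEV1": 15, "SEV2": 30, "SEV3": 120, "SEV4": 24 * 60}

def breached_response_target(severity: str, minutes_to_ack: float) -> bool:
    """True if acknowledgement took longer than the severity's target."""
    return minutes_to_ack > RESPONSE_TARGET_MIN[severity]

print(breached_response_target("SEV1", 20))  # True -- 20min > 15min target
```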
Checklist: SLOs, error budget alerts, dashboards, logging, tracing, alerts, runbooks, on-call, circuit breakers, timeouts, auto-scaling, canary deploys, rollback.
Append to .godmode/reliability-results.tsv:
timestamp service slo_count budget_remaining_pct alerts runbooks status
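A sketch of the append step (the service name and column values are hypothetical):

```python
from datetime import datetime, timezone
from pathlib import Path

def log_result(service: str, slo_count: int, budget_pct: float,
               alerts: int, runbooks: int, status: str,
               path: str = ".godmode/reliability-results.tsv") -> str:
    """Append one tab-separated row matching the schema above."""
    row = "\t".join([
        datetime.now(timezone.utc).isoformat(timespec="seconds"),
        service, str(slo_count), f"{budget_pct:.1f}",
        str(alerts), str(runbooks), status,
    ])
    p = Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)
    with p.open("a") as f:
        f.write(row + "\n")
    return row

# Hypothetical entry for a tier-1 service
print(log_result("checkout-api", 3, 42.0, 5, 5, "KEEP"))
```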
KEEP if: SLO measurement works AND alerts fire correctly AND runbook is actionable.
DISCARD if: false positives OR measurement broken OR runbook is vague.
STOP when ALL of:
- SLOs defined and measurable (tier-1 services)
- Burn rate alerts configured and tested
- On-call rotation active with escalation
- Runbooks exist for all critical alerts
On failure: git reset --hard HEAD~1. Never pause.
| Failure | Action |
|---|---|
| SLO always breached | Set at current p95, improve later |
| Alert fatigue | Increase threshold, multi-signal alerts |
| Runbook outdated | Add to deploy checklist, test quarterly |
| Budget depleted fast | Freeze deploys, fix top errors |