Defines SLOs/SLIs, error budgets, incident response procedures, capacity models, monitoring configs, and automation scripts for production systems. Use for reliability at scale, chaos engineering, toil reduction, and capacity planning.

```
npx claudepluginhub codeape-7/ai-agent-workflowgroup
```

This skill uses the workspace's default tool permissions.

1. **Assess reliability** - Review architecture, SLOs, incidents, and toil levels
Load detailed guidance based on context:
| Topic | Reference | Load When |
|---|---|---|
| SLO/SLI | references/slo-sli-management.md | Defining SLOs, calculating error budgets |
| Error Budgets | references/error-budget-policy.md | Managing budgets, burn rates, policies |
| Monitoring | references/monitoring-alerting.md | Golden signals, alert design, dashboards |
| Automation | references/automation-toil.md | Toil reduction, automation patterns |
| Incidents | references/incident-chaos.md | Incident response, chaos engineering |
When implementing SRE practices, provide concrete artifacts such as the examples below.
```
# 99.9% availability SLO over a 30-day window
# Allowed downtime: (1 - 0.999) * 30 * 24 * 60 = 43.2 minutes/month
# Error budget (request-based): 0.001 * total_requests
# Example: 10M requests/month → 10,000 error budget requests
# If 5,000 errors are consumed in week 1 → 50% of the budget burned in ~25% of the window
# → Trigger the error budget policy: freeze non-critical releases
```
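The arithmetic above can be sanity-checked in a few lines of Python (the 10M requests/month figure is the example's assumption):

```python
# Error-budget arithmetic for a 99.9% SLO over a 30-day window.
SLO = 0.999
WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes

allowed_downtime = (1 - SLO) * WINDOW_MINUTES
print(f"Allowed downtime: {allowed_downtime:.1f} minutes/month")  # 43.2

# Request-based budget for 10M requests/month
total_requests = 10_000_000
budget = (1 - SLO) * total_requests
print(f"Error budget: {budget:,.0f} requests")  # 10,000

# 5,000 errors in week 1: fraction of budget burned vs. window elapsed
burned = 5_000 / budget
elapsed = 7 / 30
print(f"Budget burned: {burned:.0%} in {elapsed:.0%} of the window")
```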
```yaml
groups:
  - name: slo_availability
    rules:
      # Fast burn: 2% budget in 1h (14.4x burn rate)
      - alert: HighErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > 0.0144
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > 0.0144
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error budget burn rate detected"
          runbook: "https://wiki.internal/runbooks/high-error-burn"

      # Slow burn: 5% budget in 6h (6x burn rate)
      - alert: SlowErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          ) > 0.006
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Sustained error budget consumption"
          runbook: "https://wiki.internal/runbooks/slow-error-burn"
```
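These alert thresholds follow the standard multiwindow burn-rate formula from the Google SRE Workbook: threshold = burn_rate * (1 - SLO), where burn_rate = budget_fraction * (window_total / alert_window). A quick derivation for a 99.9% SLO:

```python
# Derive burn-rate alert thresholds for a 99.9% SLO over a 30-day window.
SLO = 0.999
WINDOW_HOURS = 30 * 24  # 720h

def threshold(budget_fraction: float, alert_window_hours: float) -> float:
    """Error-rate threshold that burns `budget_fraction` of the
    30-day budget within `alert_window_hours`."""
    burn_rate = budget_fraction * WINDOW_HOURS / alert_window_hours
    return burn_rate * (1 - SLO)

fast = threshold(0.02, 1)  # 2% of budget in 1h -> 14.4x burn rate
slow = threshold(0.05, 6)  # 5% of budget in 6h -> 6x burn rate
print(f"fast-burn threshold: {fast:.4f}")  # 0.0144
print(f"slow-burn threshold: {slow:.4f}")  # 0.0060
```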
```
# Latency - 99th percentile request duration
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))

# Traffic - requests per second by service
sum(rate(http_requests_total[5m])) by (service)

# Errors - error rate ratio
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)

# Saturation - CPU throttling ratio
sum(rate(container_cpu_cfs_throttled_seconds_total[5m])) by (pod)
/
sum(rate(container_cpu_cfs_periods_total[5m])) by (pod)
```
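These queries can also be precomputed as Prometheus recording rules so dashboards and alerts share a single definition. A sketch, assuming the conventional `level:metric:operation` naming pattern:

```yaml
groups:
  - name: golden_signals
    interval: 30s
    rules:
      - record: service:http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
      - record: service:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (service)
      - record: service:http_requests_errors:ratio5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
```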
```python
#!/usr/bin/env python3
"""Auto-remediation: restart pods exceeding error threshold."""
import json
import subprocess
import sys
import urllib.parse
import urllib.request

ERROR_THRESHOLD = 0.05  # 5% error rate triggers restart

def get_error_rate(service: str) -> float:
    """Query Prometheus for the service's current 5m error-rate ratio."""
    query = (
        f'sum(rate(http_requests_total{{status=~"5..",service="{service}"}}[5m]))'
        f' / sum(rate(http_requests_total{{service="{service}"}}[5m]))'
    )
    url = f"http://prometheus:9090/api/v1/query?query={urllib.parse.quote(query)}"
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    results = data["data"]["result"]
    return float(results[0]["value"][1]) if results else 0.0

def restart_deployment(namespace: str, deployment: str) -> None:
    subprocess.run(
        ["kubectl", "rollout", "restart", f"deployment/{deployment}", "-n", namespace],
        check=True,
    )
    print(f"Restarted {namespace}/{deployment}")

if __name__ == "__main__":
    service, namespace, deployment = sys.argv[1], sys.argv[2], sys.argv[3]
    rate = get_error_rate(service)
    print(f"Error rate for {service}: {rate:.2%}")
    if rate > ERROR_THRESHOLD:
        restart_deployment(namespace, deployment)
    else:
        print("Within SLO threshold, no action required")
```
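One way to run this check on a schedule is a Kubernetes CronJob. This is a sketch: the image, schedule, service account, and service/deployment names are placeholders, and a real deployment would need RBAC allowing the pod to patch deployments.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: auto-remediate-checkout
spec:
  schedule: "*/5 * * * *"  # every 5 minutes
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: auto-remediator  # needs patch on deployments
          containers:
            - name: remediate
              image: internal/sre-tools:latest  # placeholder image with the script + kubectl
              command: ["python3", "/scripts/auto_remediate.py",
                        "checkout", "prod", "checkout-api"]
          restartPolicy: Never
```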