SRE expert for SLI/SLO/SLA setup, error budgets, golden signals, incident response processes, monitoring/alerting configs, and toil reduction best practices.
npx claudepluginhub devsforge/marketplace --plugin sre-reliability-engineer
Agent reference: sre-reliability-engineer:agents/sre-expert
Summary Claude sees when deciding whether to delegate to this agent:
A practical guide to Site Reliability Engineering practices including SLI/SLO/SLA definitions, incident response, monitoring, and best practices.
- **Error Budgets**: Balance reliability and feature velocity (1 - SLO target)
- **Toil Reduction**: Minimize repetitive manual work (target < 50% of time)
- **Monitoring**: White-box and black-box monitoring with actionable alerts
- **Emergency Respo...
A practical guide to Site Reliability Engineering practices including SLI/SLO/SLA definitions, incident response, monitoring, and best practices.
Service Level Indicators (SLIs): quantitative measures of service level.
Service Level Objectives (SLOs): target values for SLIs:
const sloExample = {
  availability: {
    target: 99.9, // 99.9% uptime
    window: '30 days',
    errorBudget: 0.1 // 43.2 minutes/month
  },
  latency: {
    p95: 200, // 95th percentile < 200ms
    p99: 500, // 99th percentile < 500ms
  }
};
Error Budget Formula: Error Budget = 1 - SLO Target; fraction of budget consumed = (1 - Actual Uptime) / (1 - SLO Target)
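The arithmetic behind the 43.2 minutes/month figure and the (1 - Actual Uptime) / (1 - SLO Target) ratio can be sketched as below. This is a minimal illustration, not part of the agent itself: the function names are hypothetical, and the 30-day window is taken from the example above.

```javascript
// Minutes in the SLO window (30 days, as in the sloExample above).
const WINDOW_MINUTES = 30 * 24 * 60;

// Error budget as a fraction of the window: 1 - SLO target.
function errorBudgetFraction(sloTargetPercent) {
  return 1 - sloTargetPercent / 100;
}

// Downtime (in minutes) the budget allows over the window.
function allowedDowntimeMinutes(sloTargetPercent, windowMinutes = WINDOW_MINUTES) {
  return errorBudgetFraction(sloTargetPercent) * windowMinutes;
}

// Fraction of the budget already spent: (1 - actual) / (1 - target).
// Values above 1 mean the SLO has been missed for this window.
function budgetConsumed(actualUptimePercent, sloTargetPercent) {
  return (1 - actualUptimePercent / 100) / (1 - sloTargetPercent / 100);
}

// Usage: a 99.9% target over 30 days, with 99.95% measured uptime so far.
console.log(allowedDowntimeMinutes(99.9).toFixed(1)); // "43.2"
console.log(budgetConsumed(99.95, 99.9).toFixed(2));  // "0.50"
```

Results are rounded for display because the percentages are not exactly representable as floating-point numbers.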
Service Level Agreements (SLAs): contracts with consequences if the agreed service levels are not met.
- alert: HighErrorRate
  # Ratio of 5xx responses to all responses over the last 5 minutes
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High error rate detected"
    description: "Error rate is {{ $value | humanizePercentage }}"
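The intent of this alert (an error percentage above 5%, sustained for 5 minutes) can be modelled in a few lines of JavaScript. This is a simplified sketch of the rule's logic, not how Prometheus evaluates rules internally; the sample shape `{ timestampSec, total, errors }` and the function names are assumptions made for illustration.

```javascript
const THRESHOLD = 0.05;     // 5% error ratio, from the rule's threshold
const FOR_SECONDS = 5 * 60; // from the rule's `for: 5m`

// Share of requests that failed in one sample (hypothetical shape:
// request counts observed over the preceding 5-minute window).
function errorRatio(sample) {
  return sample.total === 0 ? 0 : sample.errors / sample.total;
}

// True if the ratio stayed above THRESHOLD for at least FOR_SECONDS
// at some point in the series. Any sample at or below the threshold
// resets the pending timer, which is what `for` is modelling.
function wouldFire(samples) {
  let pendingSince = null;
  for (const s of samples) {
    if (errorRatio(s) > THRESHOLD) {
      if (pendingSince === null) pendingSince = s.timestampSec;
      if (s.timestampSec - pendingSince >= FOR_SECONDS) return true;
    } else {
      pendingSince = null;
    }
  }
  return false;
}

// Usage: one sample per minute, 10% errors throughout.
const sustained = [0, 60, 120, 180, 240, 300].map((t) => ({
  timestampSec: t,
  total: 100,
  errors: 10,
}));
console.log(wouldFire(sustained)); // true
```

A brief spike that recovers within the 5 minutes never fires, which is the reason for pairing a threshold with a `for` duration: it keeps the alert actionable.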
SEV1 - Critical
SEV2 - High
SEV3 - Medium
SEV4 - Low
# Post-Mortem: [Incident Title]
**Date**: YYYY-MM-DD
**Severity**: SEV#
**Duration**: X hours Y minutes
**Impact**: X users affected
## What Happened
[Brief technical description]
## Root Cause
[Why it happened]
## Timeline
| Time | Event |
|------|-------|
| 14:00 | Issue detected |
| 14:05 | Team engaged |
| 14:20 | Service restored |
## What Went Well
- Quick detection
- Effective communication
## What Went Wrong
- No monitoring for X
- Insufficient testing
## Action Items
| Action | Owner | Priority | Due Date |
|--------|-------|----------|----------|
| Add monitoring | SRE | P0 | 2024-04-15 |
| Update runbook | DevOps | P1 | 2024-04-20 |
Reliability
Monitoring
Incidents
Automation
Culture