SRE expert for SLI/SLO/SLA setup, error budgets, golden signals, incident response processes, monitoring/alerting configs, and toil reduction best practices.
npx claudepluginhub devsforge/marketplace --plugin sre-reliability-engineerA practical guide to Site Reliability Engineering practices including SLI/SLO/SLA definitions, incident response, monitoring, and best practices. - **Error Budgets**: Balance reliability and feature velocity (1 - SLO target) - **Toil Reduction**: Minimize repetitive manual work (target < 50% of time) - **Monitoring**: White-box and black-box monitoring with actionable alerts - **Emergency Respo...
Manages AI prompt library on prompts.chat: search by keyword/tag/category, retrieve/fill variables, save with metadata, AI-improve for structure.
Manages AI Agent Skills on prompts.chat: search by keyword/tag, retrieve skills with files, create multi-file skills (SKILL.md required), add/update/remove files for Claude Code.
Reviews Claude Code skills for structure, description triggering/specificity, content quality, progressive disclosure, and best practices. Provides targeted improvements. Trigger proactively after skill creation/modification.
A practical guide to Site Reliability Engineering practices including SLI/SLO/SLA definitions, incident response, monitoring, and best practices.
Quantitative measures of service level:
Target values for SLIs:
const sloExample = {
availability: {
target: 99.9, // 99.9% uptime
window: '30 days',
errorBudget: 0.1 // 43.2 minutes/month
},
latency: {
p95: 200, // 95th percentile < 200ms
p99: 500, // 99th percentile < 500ms
}
};
Error Budget Formula: (1 - Actual Uptime) / (1 - SLO Target)
Contracts with consequences:
- alert: HighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate detected"
description: "Error rate is {{ $value | humanizePercentage }}"
SEV1 - Critical
SEV2 - High
SEV3 - Medium
SEV4 - Low
# Post-Mortem: [Incident Title]
**Date**: YYYY-MM-DD
**Severity**: SEV#
**Duration**: X hours Y minutes
**Impact**: X users affected
## What Happened
[Brief technical description]
## Root Cause
[Why it happened]
## Timeline
| Time | Event |
|------|-------|
| 14:00 | Issue detected |
| 14:05 | Team engaged |
| 14:20 | Service restored |
## What Went Well
- Quick detection
- Effective communication
## What Went Wrong
- No monitoring for X
- Insufficient testing
## Action Items
| Action | Owner | Priority | Due Date |
|--------|-------|----------|----------|
| Add monitoring | SRE | P0 | 2024-04-15 |
| Update runbook | DevOps | P1 | 2024-04-20 |
Reliability
Monitoring
Incidents
Automation
Culture