From atum-stack-backend
SRE philosophy, SLO/SLI definition, error budget management, blameless postmortems, toil reduction, and capacity planning. Scope: reliability engineering principles ONLY. Does NOT cover Prometheus/Grafana setup or monitoring tool configuration (use devops-expert agent for that).
npx claudepluginhub arnwaldn/atum-plugins-collection --plugin atum-stack-backend

This skill uses the workspace's default tool permissions.
Senior Site Reliability Engineer with expertise in building highly reliable, scalable systems through SLI/SLO management, error budgets, capacity planning, and automation.
IN SCOPE: SRE philosophy, SLO/SLI definition, error budget policies, blameless postmortems, toil measurement and reduction, capacity planning models, incident management processes, on-call best practices, reliability trade-offs.
OUT OF SCOPE: Prometheus/Grafana setup, monitoring tool configuration, alerting rule syntax, dashboard creation. For those, use the devops-expert agent instead.
| Topic | Reference | Load When |
|---|---|---|
| SLO/SLI Framework | references/slo-framework.md | Defining SLIs, setting SLOs, error budget calculation and policies |
| Incident Management | references/incident-management.md | Postmortem templates, severity levels, on-call, MTTR |
| Toil Reduction | references/toil-reduction.md | Measuring toil, automation priorities, tracking reduction |
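
The error budget calculation and policy mentioned in the SLO/SLI Framework row can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the content of references/slo-framework.md: the `ErrorBudget` class, the `policy_action` function, and the 75% and 100% thresholds are hypothetical names and values chosen for the example, and a request-based availability SLI is assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Request-based error budget for one SLO window (hypothetical model)."""

    slo_target: float      # e.g. 0.999 for a 99.9% availability SLO
    total_requests: int    # valid requests observed in the window
    failed_requests: int   # requests that violated the SLI

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def consumed_fraction(self) -> float:
        allowed = self.allowed_failures
        if allowed == 0:
            # No traffic (or a 100% target): any failure exhausts the budget.
            return 0.0 if self.failed_requests == 0 else float("inf")
        return self.failed_requests / allowed

    @property
    def remaining_fraction(self) -> float:
        return 1.0 - self.consumed_fraction


def policy_action(budget: ErrorBudget,
                  caution_at: float = 0.75,
                  freeze_at: float = 1.0) -> str:
    """Map budget consumption to an action. Thresholds are assumptions."""
    consumed = budget.consumed_fraction
    if consumed >= freeze_at:
        return "freeze"    # budget spent: prioritize reliability work
    if consumed >= caution_at:
        return "caution"   # budget nearly spent: slow risky releases
    return "ship"          # budget healthy: release normally


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999,
                         total_requests=1_000_000,
                         failed_requests=250)
    print(f"consumed: {budget.consumed_fraction:.0%}, "
          f"action: {policy_action(budget)}")
```

With a 99.9% target over one million requests, roughly 1,000 failures are permitted, so 250 failures consume about a quarter of the budget. Real policies usually also consider burn rate over shorter windows, which this sketch leaves out.
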
| Signal | What to Measure |
|---|---|
| Latency | Request duration (distinguish success vs error latency) |
| Traffic | Requests/sec, sessions, transactions |
| Errors | Rate of failed requests (5xx, timeout, incorrect response) |
| Saturation | Resource utilization approaching limits (CPU, memory, queue depth) |
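
To make the four signals above concrete, here is a small sketch that derives them from a batch of request records. Everything named here is an assumption for illustration: the `Request` record, the `golden_signals` function, the nearest-rank p95, and the use of queue depth over queue capacity as the saturation measure. Failures are limited to 5xx responses and timeouts; detecting incorrect responses would need application-specific checks.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Request:
    """One observed request (hypothetical record shape)."""

    duration_ms: float
    status: int              # HTTP status code
    timed_out: bool = False

    @property
    def failed(self) -> bool:
        return self.status >= 500 or self.timed_out


def percentile(values: Sequence[float], pct: float) -> Optional[float]:
    """Nearest-rank percentile; returns None for an empty sample."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests: Sequence[Request],
                   window_seconds: float,
                   queue_depth: int,
                   queue_capacity: int) -> dict:
    ok = [r.duration_ms for r in requests if not r.failed]
    err = [r.duration_ms for r in requests if r.failed]
    total = len(requests)
    return {
        # Latency: success and error durations are reported separately,
        # because fast failures would otherwise flatter the numbers.
        "latency_ok_p95_ms": percentile(ok, 95),
        "latency_err_p95_ms": percentile(err, 95),
        # Traffic: requests per second over the window.
        "traffic_rps": total / window_seconds,
        # Errors: share of requests that failed.
        "error_rate": len(err) / total if total else 0.0,
        # Saturation: how close the queue is to its limit.
        "saturation": queue_depth / queue_capacity,
    }


if __name__ == "__main__":
    sample = [Request(100, 200), Request(200, 200),
              Request(300, 200), Request(50, 503)]
    print(golden_signals(sample, window_seconds=2,
                         queue_depth=30, queue_capacity=120))
```

In practice these values would come from a metrics system rather than in-memory lists; the sketch only shows how each signal is defined.
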