SRE philosophy, SLO/SLI definition, error budget management, blameless postmortems, toil reduction, and capacity planning. Scope: reliability engineering principles ONLY. Does NOT cover Prometheus/Grafana setup or monitoring tool configuration (use devops-expert agent for that).
From atum-system

`npx claudepluginhub arnwaldn/atum-system --plugin atum-system`

This skill uses the workspace's default tool permissions.
references/incident-management.md
references/slo-framework.md
references/toil-reduction.md
Senior Site Reliability Engineer with expertise in building highly reliable, scalable systems through SLI/SLO management, error budgets, capacity planning, and automation.
IN SCOPE: SRE philosophy, SLO/SLI definition, error budget policies, blameless postmortems, toil measurement and reduction, capacity planning models, incident management processes, on-call best practices, reliability trade-offs.
OUT OF SCOPE: Prometheus/Grafana setup, monitoring tool configuration, alerting rule syntax, dashboard creation. For those, use the devops-expert agent instead.
| Topic | Reference | Load When |
|---|---|---|
| SLO/SLI Framework | references/slo-framework.md | Defining SLIs, setting SLOs, error budget calculation and policies |
| Incident Management | references/incident-management.md | Postmortem templates, severity levels, on-call, MTTR |
| Toil Reduction | references/toil-reduction.md | Measuring toil, automation priorities, tracking reduction |
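
The error budget calculation mentioned in the SLO/SLI row can be sketched as follows. This is a minimal illustration, not the method from `references/slo-framework.md`: the names `ErrorBudget`, `error_budget`, and `allowed_downtime_minutes` are hypothetical, and the sketch assumes a request-based SLI (good requests divided by total requests) and a 30-day window by default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical container for a request-based error budget."""
    allowed_failures: float    # failures the SLO tolerates in the window
    consumed_fraction: float   # share of the budget already spent (may exceed 1.0)
    remaining_failures: float  # failures left before breach (negative once breached)


def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> ErrorBudget:
    """Compute the error budget for a request-based SLI.

    Assumption: slo_target is a fraction such as 0.999, not a percentage.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests < 0 or not 0 <= failed_requests <= total_requests:
        raise ValueError("request counts are inconsistent")

    # The budget is the complement of the SLO applied to the observed traffic.
    allowed = (1.0 - slo_target) * total_requests
    if allowed > 0:
        consumed = failed_requests / allowed
    else:
        # No traffic means no budget and, by the check above, no failures.
        consumed = 0.0
    return ErrorBudget(allowed, consumed, allowed - failed_requests)


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Time-based view of the same budget: minutes of full outage the SLO tolerates."""
    return (1.0 - slo_target) * window_days * 24 * 60


if __name__ == "__main__":
    budget = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"allowed failures: {budget.allowed_failures:.0f}")
    print(f"budget consumed:  {budget.consumed_fraction:.0%}")
    print(f"downtime budget:  {allowed_downtime_minutes(0.999):.1f} min per 30 days")
```

With a 99.9% target over 1,000,000 requests, the budget is roughly 1,000 failed requests, so 250 failures consume about a quarter of it. An error budget policy would then attach actions to thresholds on `consumed_fraction`; those thresholds are a team decision and are not defined here.
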

The four golden signals and what to measure for each:

| Signal | What to Measure |
|---|---|
| Latency | Request duration (distinguish success vs error latency) |
| Traffic | Requests/sec, sessions, transactions |
| Errors | Rate of failed requests (5xx, timeout, incorrect response) |
| Saturation | Resource utilization approaching limits (CPU, memory, queue depth) |
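
The signals in the table above can be derived from raw request records, as in the rough sketch below. Everything in it is an assumption made for illustration: the `Request` record, the `golden_signals` function, the nearest-rank percentile, and the choice to treat only 5xx responses as errors (the table also counts timeouts and incorrect responses, which would need extra fields). Saturation is passed in as an already-measured utilization value, since it comes from the resource rather than from the request log.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Request:
    """Hypothetical request record used only for this sketch."""
    duration_ms: float
    status: int  # HTTP status code; 5xx is treated as a failure here


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sequence."""
    if not values:
        raise ValueError("percentile of empty sequence")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests: Sequence[Request], window_seconds: float,
                   cpu_utilization: float) -> dict:
    """Summarize latency, traffic, errors, and saturation for one time window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")

    # Keep success and error latency separate, as the Latency row recommends:
    # fast failures would otherwise make overall latency look better than it is.
    ok = [r.duration_ms for r in requests if r.status < 500]
    failed = [r.duration_ms for r in requests if r.status >= 500]

    p99_ok: Optional[float] = percentile(ok, 99) if ok else None
    p99_failed: Optional[float] = percentile(failed, 99) if failed else None

    return {
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": len(failed) / len(requests) if requests else 0.0,
        "p99_success_ms": p99_ok,
        "p99_error_ms": p99_failed,
        "saturation": cpu_utilization,  # supplied by the caller, not derived
    }


if __name__ == "__main__":
    sample = [Request(d, 200) for d in (10, 20, 30, 40, 50, 60, 70, 80)]
    sample += [Request(900, 503), Request(1000, 500)]
    print(golden_signals(sample, window_seconds=10, cpu_utilization=0.72))
```

In practice these aggregates usually come from a metrics system rather than from in-memory lists; the sketch only shows what each signal means in terms of the underlying data.
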