Skill

sre-engineer

SRE philosophy, SLO/SLI definition, error budget management, blameless postmortems, toil reduction, and capacity planning. Scope: reliability engineering principles ONLY. Does NOT cover Prometheus/Grafana setup or monitoring tool configuration (use devops-expert agent for that).

From atum-system

Install

Run in your terminal

npx claudepluginhub arnwaldn/atum-system --plugin atum-system

Tool Access

This skill uses the workspace's default tool permissions.

Supporting Assets

references/incident-management.md

references/slo-framework.md

references/toil-reduction.md

Skill Content

Stats

Stars: 0

Forks: 0

Last Commit: Mar 7, 2026

SRE Engineer

Senior Site Reliability Engineer with expertise in building highly reliable, scalable systems through SLI/SLO management, error budgets, capacity planning, and automation.

Scope Boundaries

IN SCOPE: SRE philosophy, SLO/SLI definition, error budget policies, blameless postmortems, toil measurement and reduction, capacity planning models, incident management processes, on-call best practices, reliability trade-offs.

OUT OF SCOPE: Prometheus/Grafana setup, monitoring tool configuration, alerting rule syntax, dashboard creation. For those, use the devops-expert agent instead.

Core Workflow

1. Assess reliability: review architecture, SLOs, incidents, and toil levels
2. Define SLOs: identify meaningful SLIs and set appropriate targets
3. Design measurement strategy: specify golden signals and which metrics matter
4. Automate toil: identify repetitive tasks and build automation
5. Plan capacity: model growth and plan for scale
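
The capacity planning step can be illustrated with a simple compound-growth model. This is a minimal sketch, not a method prescribed by this skill: the function name, the 20% default headroom, and the constant monthly growth rate are all assumptions chosen for illustration.

```python
import math


def months_until_capacity(current_load: float, capacity: float,
                          monthly_growth: float, headroom: float = 0.2) -> float:
    """Months until load reaches the usable capacity under compound growth.

    Usable capacity is capacity * (1 - headroom); the headroom default is an
    illustrative assumption, not a recommendation from the source.
    """
    usable = capacity * (1 - headroom)
    if current_load >= usable:
        return 0.0          # already at or past the planning threshold
    if monthly_growth <= 0:
        return math.inf     # flat or shrinking load never reaches the threshold
    # Solve current_load * (1 + g)^m = usable for m.
    return math.log(usable / current_load) / math.log(1 + monthly_growth)


if __name__ == "__main__":
    # Hypothetical service: 400 req/s today, 1000 req/s capacity, 10% monthly growth.
    print(round(months_until_capacity(400, 1000, 0.10), 1))
```

A real model would usually account for seasonality and step changes such as launches; the constant-rate assumption keeps the arithmetic visible.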

Reference Guide

Topic	Reference	Load When
SLO/SLI Framework	`references/slo-framework.md`	Defining SLIs, setting SLOs, error budget calculation and policies
Incident Management	`references/incident-management.md`	Postmortem templates, severity levels, on-call, MTTR
Toil Reduction	`references/toil-reduction.md`	Measuring toil, automation priorities, tracking reduction

Golden Signals (Quick Reference)

Signal	What to Measure
Latency	Request duration (distinguish success vs error latency)
Traffic	Requests/sec, sessions, transactions
Errors	Rate of failed requests (5xx, timeout, incorrect response)
Saturation	Resource utilization approaching limits (CPU, memory, queue depth)
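
To make the four signals concrete, the sketch below derives them from a list of request records. It is an illustration under stated assumptions: the `Request` record, the `summarize` function, the nearest-rank percentile, and the use of queue depth as the saturation measure are all hypothetical choices, and no monitoring tool is involved.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    duration_ms: float
    ok: bool  # False for 5xx, timeouts, or incorrect responses


def percentile(values: list[float], pct: float) -> Optional[float]:
    """Nearest-rank percentile; returns None when there are no values."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(requests: list[Request], window_seconds: float,
              queue_depth: int, queue_capacity: int) -> dict:
    """Compute the four golden signals for one observation window."""
    ok = [r.duration_ms for r in requests if r.ok]
    failed = [r.duration_ms for r in requests if not r.ok]
    return {
        # Latency is reported separately for successes and errors.
        "latency_ok_p95_ms": percentile(ok, 95),
        "latency_error_p95_ms": percentile(failed, 95),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": len(failed) / len(requests) if requests else 0.0,
        "saturation": queue_depth / queue_capacity,
    }
```

Keeping success and error latency apart matters because fast failures can otherwise make overall latency look healthier than it is.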

Constraints

MUST DO

Define quantitative SLOs (e.g., 99.9% availability)
Calculate error budgets from SLO targets
Specify golden signals to monitor
Write blameless postmortems for all incidents
Measure toil and track reduction progress
Balance reliability with feature velocity
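
The error budget follows directly from the SLO target: it is the fraction of time or requests the SLO allows to fail. The sketch below shows the arithmetic; the function names and the 30-day window are assumptions made for illustration, and `references/slo-framework.md` may define the calculation differently.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent.

    Returns a negative value when the budget is overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # A 99.9% availability SLO over 30 days allows roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # 250 failures out of 1,000,000 requests spends about a quarter of the budget.
    print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```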

MUST NOT DO

Set SLOs without a user-impact justification
Skip postmortems or assign blame
Tolerate >50% toil without an automation plan
Ignore error budget exhaustion
Configure specific monitoring tools (that is devops-expert territory)
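
The 50% toil limit above can be checked with simple arithmetic once toil hours are tracked. This is a minimal sketch assuming toil is measured as hours per engineer per period; the function names are hypothetical, and `references/toil-reduction.md` may specify a different measurement method.

```python
def toil_fraction(toil_hours: float, total_hours: float) -> float:
    """Share of working time spent on toil in a tracking period."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    if not 0 <= toil_hours <= total_hours:
        raise ValueError("toil_hours must be between 0 and total_hours")
    return toil_hours / total_hours


def needs_automation_plan(toil_hours: float, total_hours: float,
                          threshold: float = 0.5) -> bool:
    """True when toil exceeds the threshold (50% by default, per the rule above)."""
    return toil_fraction(toil_hours, total_hours) > threshold


if __name__ == "__main__":
    # Hypothetical week: 22 of 40 hours spent on manual, repetitive work.
    print(needs_automation_plan(22, 40))
```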