Skill

sre-practices

Site Reliability Engineering practices from Google - the company that invented SRE. Master SLOs, error budgets, incident response, and toil elimination. Use when designing reliable systems, implementing SRE practices, or improving operational excellence. Learn from the team that runs Google Search, Gmail, and YouTube at billion-user scale.

npx claudepluginhub duylinhdang1998/claude-template-agent --plugin vfm-agent-company

Tool Access

This skill uses the workspace's default tool permissions.

Preview

**Expert**: Alex Kim (Google SRE, 11 years)

Supporting Assets

references/error-budgets.md
references/incident-response.md
references/slos.md

SKILL.md

Stats

Parent Repo Stars: 0

Parent Repo Forks: 0

Last Commit: Apr 16, 2026

SRE Practices - Google Site Reliability Engineering

Expert: Alex Kim (Google SRE, 11 years)
Level: 10/10 - Google invented SRE

Overview

Site Reliability Engineering from Google - what happens when you ask a software engineer to design an operations team. It is not traditional ops or DevOps; it is the application of software engineering to infrastructure and operations.

Google runs services for billions (Search, Gmail, YouTube, Maps) with 99.99%+ uptime. These practices made that possible.

Core SRE Principles

1. Embrace Risk

100% uptime is the wrong target. Use error budgets to balance reliability vs velocity.
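As a rough illustration of how an error budget follows from an SLO, the sketch below treats the budget as the number of requests allowed to fail in a window. The function names, the request-based accounting, and the sample figures are assumptions made for this example, not something the skill prescribes.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Requests that may fail in the window without missing the SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent; negative once overspent."""
    return 1.0 - failed_requests / error_budget(slo_target, total_requests)


# Hypothetical month: 1,000,000 requests against a 99.9% SLO, 250 failures so far.
print(round(error_budget(0.999, 1_000_000)))              # about 1000 failures allowed
print(round(budget_remaining(0.999, 1_000_000, 250), 2))  # about 0.75 of the budget left
```

A remaining fraction near 1.0 suggests room for riskier launches, while a value at or below zero means the budget for the window has been spent.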

2. Service Level Objectives (SLOs)

Define and measure service quality with SLIs, SLOs, SLAs.
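One way to picture how the three terms relate: the SLI is a measured ratio, the SLO is the internal target for it, and the SLA is the looser external promise. The sketch below assumes an availability SLI defined as good events over valid events; the `ServiceLevel` type, the status strings, and the targets are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceLevel:
    sla_target: float  # externally promised level (hypothetical value below)
    slo_target: float  # stricter internal objective (hypothetical value below)


def availability_sli(good_events: int, valid_events: int) -> float:
    """Measured service level: share of valid events that were good."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    return good_events / valid_events


def evaluate(sli: float, level: ServiceLevel) -> str:
    """Compare a measured SLI with the SLA first, then with the SLO."""
    if sli < level.sla_target:
        return "sla_breached"
    if sli < level.slo_target:
        return "slo_missed"
    return "ok"


level = ServiceLevel(sla_target=0.999, slo_target=0.9995)
print(evaluate(availability_sli(99_920, 100_000), level))  # slo_missed
```

Keeping the SLO stricter than the SLA leaves a warning zone ("slo_missed") in which the team can react before any external commitment is broken.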

3. Eliminate Toil

Automate manual, repetitive work. Keep toil below 50% of each engineer's time.
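A team can only hold toil under a cap if it measures it. The sketch below assumes time is logged in hours per category and that the team has agreed which categories count as toil; the category names and hours are made up for illustration.

```python
def toil_fraction(hours_by_category: dict[str, float], toil_categories: set[str]) -> float:
    """Share of logged hours that fall into categories classified as toil."""
    total = sum(hours_by_category.values())
    if total <= 0:
        raise ValueError("no hours logged")
    toil = sum(h for category, h in hours_by_category.items() if category in toil_categories)
    return toil / total


# Hypothetical week for one engineer.
week = {"tickets": 12, "manual_deploys": 6, "project_work": 18, "design_review": 4}
fraction = toil_fraction(week, toil_categories={"tickets", "manual_deploys"})
print(f"toil: {fraction:.0%}, within 50% cap: {fraction < 0.5}")
```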

4. Monitoring & Alerting

Alert on symptoms (user-facing), not causes. Use golden signals.
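A symptom-based alert can be expressed as an error-budget burn rate: how many times faster than "exactly on budget" the service is failing user requests. The sketch below pages only when both a long and a short window exceed the threshold, which helps the alert reset quickly once the problem stops. The window choice and the default threshold of 14.4 are assumptions for this example; 14.4 corresponds to spending 2% of a 30-day budget in one hour (14.4 × 1 h / 720 h = 0.02).

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return error_ratio / (1.0 - slo_target)


def should_page(
    long_window_error_ratio: float,   # e.g. errors/requests over the last hour
    short_window_error_ratio: float,  # e.g. errors/requests over the last 5 minutes
    slo_target: float,
    threshold: float = 14.4,          # assumed paging threshold, see lead-in
) -> bool:
    """Page only if the budget is burning fast in both windows."""
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )


# Hypothetical readings for a 99.9% SLO.
print(should_page(0.02, 0.03, slo_target=0.999))   # True: still burning fast
print(should_page(0.02, 0.001, slo_target=0.999))  # False: already recovered
```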

5. Incident Response

Blameless postmortems, clear escalation, reduce MTTR.
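Reducing MTTR starts with computing it consistently. The sketch below assumes each incident is recorded as a (detected, resolved) pair of timestamps and takes MTTR to mean "mean time to restore"; the record shape and the sample incidents are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_restore(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - detected) across incidents."""
    if not incidents:
        raise ValueError("no incidents recorded")
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)


# Two made-up incidents lasting 30 and 90 minutes.
incidents = [
    (datetime(2026, 1, 5, 10, 0), datetime(2026, 1, 5, 10, 30)),
    (datetime(2026, 1, 19, 22, 15), datetime(2026, 1, 19, 23, 45)),
]
print(mean_time_to_restore(incidents))  # 1:00:00
```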

6. Capacity Planning

Plan for growth, forecast demand, optimize resource usage.
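A very simple form of demand forecasting is a straight-line fit over recent peaks plus a safety margin. The sketch below uses ordinary least squares over evenly spaced periods; the linear-growth assumption, the 30% headroom, and the sample numbers are all illustrative, and real demand often needs seasonality and launch events factored in.

```python
def linear_forecast(history: list[float], periods_ahead: int) -> float:
    """Least-squares line through evenly spaced observations, extrapolated forward."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history)) / sum(
        (x - mean_x) ** 2 for x in range(n)
    )
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + periods_ahead)


def required_capacity(forecast_peak: float, headroom: float = 0.3) -> float:
    """Capacity to provision: forecast peak plus a fractional safety margin."""
    return forecast_peak * (1.0 + headroom)


# Hypothetical monthly peak load in requests per second.
peaks = [100.0, 110.0, 120.0, 130.0]
forecast = linear_forecast(peaks, periods_ahead=2)
print(round(forecast), round(required_capacity(forecast)))  # 150 195
```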

SRE Workflow

1. Define SLOs - What reliability do users need?
2. Measure SLIs - Track service quality metrics
3. Monitor error budget - How much budget consumed?
4. Respond to incidents - Restore service quickly
5. Conduct postmortems - Learn from failures
6. Automate toil - Reduce manual work
7. Plan capacity - Scale for growth

Google's Production Scale

SRE practices power:

Google Search: 8.5 billion searches/day
Gmail: 1.8 billion users
YouTube: 2 billion users, 1 billion hours/day
Google Maps: 1 billion users
99.99%+ uptime across all services
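To make an uptime figure such as 99.99% concrete, it helps to convert it into allowed downtime. The helper below is a small arithmetic sketch with a hypothetical name; it assumes downtime is counted in wall-clock minutes over a fixed window.

```python
def allowed_downtime_minutes(availability: float, window_days: float) -> float:
    """Minutes of full outage a window can absorb at a given availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * window_days * 24 * 60


for target in (0.999, 0.9999):
    print(
        f"{target:.2%}: "
        f"{allowed_downtime_minutes(target, 30):.2f} min per 30 days, "
        f"{allowed_downtime_minutes(target, 365):.2f} min per year"
    )
```

At 99.99% this works out to about 4.32 minutes per 30 days, or roughly 52.56 minutes per year.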

Golden Signals (Google's 4 Key Metrics)

Latency - Time to serve requests
Traffic - Demand on system
Errors - Failed requests
Saturation - Resource utilization
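The four signals can be derived from fairly little raw data. The sketch below assumes a list of request records for one time window plus a CPU reading; the `Request` type, the choice of p99 for latency, the treatment of HTTP 5xx as errors, and CPU as the saturation resource are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    duration_ms: float
    status: int  # HTTP status code


def golden_signals(
    requests: list[Request],
    window_seconds: float,
    cpu_used_cores: float,
    cpu_total_cores: float,
) -> dict[str, float]:
    """Summarize latency, traffic, errors, and saturation for one window."""
    if not requests:
        raise ValueError("no requests in window")
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 99th percentile: rank = ceil(0.99 * n), in integer arithmetic.
    rank = max(1, -(-99 * len(durations) // 100))
    return {
        "latency_p99_ms": durations[rank - 1],
        "traffic_rps": len(requests) / window_seconds,
        "error_ratio": sum(r.status >= 500 for r in requests) / len(requests),
        "saturation": cpu_used_cores / cpu_total_cores,
    }


# Hypothetical 10-second window: 100 requests, two of them server errors.
sample = [Request(float(i), 500 if i <= 2 else 200) for i in range(1, 101)]
print(golden_signals(sample, window_seconds=10.0, cpu_used_cores=6.0, cpu_total_cores=8.0))
```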

Best Practices

SLOs over SLAs - Internal targets stricter than external
Error budget policy - Define consequences when budget exhausted
Blameless culture - Learn from failures, don't blame
Toil automation - Invest in eliminating repetitive work
On-call sustainability - Cap on-call at 25% of an engineer's time and all operational work (on-call, tickets, other toil) at 50%
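An error budget policy is easier to enforce when its consequences are written down as explicit rules. The sketch below maps the remaining budget fraction to a release decision; the three outcomes and the 25% warning threshold are hypothetical and would be agreed between the SRE and product teams in practice.

```python
def release_decision(budget_remaining: float, warn_below: float = 0.25) -> str:
    """Map the unspent fraction of the error budget to a release policy outcome."""
    if budget_remaining <= 0.0:
        return "freeze"    # budget exhausted: only reliability fixes ship
    if budget_remaining < warn_below:
        return "restrict"  # budget nearly spent: slow down risky changes
    return "ship"          # healthy budget: normal release cadence


for remaining in (0.75, 0.10, -0.20):
    print(remaining, release_decision(remaining))
```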

Related Skills

kubernetes-expert - Infrastructure platform
observability - Monitoring & tracing
chaos-engineering - Resilience testing

Last Updated: 2026-02-03
Expert: Alex Kim (Google SRE, 11 years) - Runs billion-user services