Skill

write-ops-runbook

Generates operational runbooks for repeatable incident procedures that any engineer can execute under pressure. Follows Google SRE and PagerDuty best practices.

devops

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/grimoire:write-ops-runbook

User invocable

Model invocable

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Create an operational runbook that any qualified engineer can execute correctly under pressure without prior context.

SKILL.md

57 lines · ~1.1k tokens

Stats

Language: Shell

Stars: 12

Forks: 1

Maintenance: Excellent

Last Commit: Jun 17, 2026

Write Runbook

Create an operational runbook that any qualified engineer can execute correctly under pressure without prior context.

Why This Is Best Practice

Adopted by: Google SRE teams, AWS operations, Cloudflare, PagerDuty. All treat runbooks as living operational documentation.

Impact: Google SRE reports that documented runbooks reduce mean time to resolution (MTTR) by 40–60% for known failure modes; they also reduce on-call burnout by eliminating repeated tribal-knowledge lookups.

Why best: A good runbook transfers expert knowledge into a form that a competent non-expert can execute at 2 AM during an incident. Every ambiguous step in a runbook multiplies the risk of error under stress.

Steps

1. Write the header block: Include the alert name or trigger condition, severity level, service owner, last reviewed date, and a link to the service dashboard. Someone opening this runbook needs orientation in under 10 seconds.
2. State the symptoms: List the exact alert text, error messages, or observable behaviors that bring someone to this runbook. Include a screenshot or log snippet if available.
3. Define the impact: Who is affected? What is degraded? What is the customer-visible symptom? One paragraph, no jargon.
4. List prerequisites: Tools, permissions, VPN access, environment variables, or credentials required before starting. Do not assume the reader has these set up.
5. Write numbered steps: Each step is one action. Include the exact command to run (copy-pasteable), the expected output, and what to do if the output differs. Steps must be in order; no "see also" references mid-procedure.
6. Add decision branches: At key points, add "If X, go to Step 8. If Y, go to Step 12." Avoid branching more than twice in a single runbook; split into separate runbooks instead.
7. Document escalation: At what point does the responder escalate? To whom? With what information? Include the Slack channel, PagerDuty escalation policy, and on-call contact.
8. Include rollback steps: For any action that modifies state (restart, deploy, config change), provide the exact rollback command. Never leave the operator without an undo.
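
The "command, expected output, what to do if the output differs" pattern for a single step could be sketched in shell roughly as follows. `run_step` is a hypothetical helper, not part of this skill, and the messages it prints are illustrative only; a real runbook would name the specific branch or escalation target.

```shell
# Hypothetical helper (an assumption, not part of the skill): run one runbook
# step and compare its output against the expected pattern the runbook documents.
# Usage: run_step "description" "expected grep -E pattern" command [args...]
run_step() {
    step_desc=$1
    step_expect=$2
    shift 2
    printf 'STEP: %s\n' "$step_desc"
    # Capture stdout and stderr so the operator sees exactly what happened.
    step_out=$("$@" 2>&1)
    step_rc=$?
    printf '%s\n' "$step_out"
    if [ "$step_rc" -eq 0 ] && printf '%s\n' "$step_out" | grep -Eq -- "$step_expect"; then
        printf 'OK: output matches expected pattern\n'
        return 0
    fi
    printf 'MISMATCH: expected pattern "%s" (exit code %s)\n' "$step_expect" "$step_rc"
    printf 'Stop here and follow the documented branch or escalation path.\n'
    return 1
}

# Example: run_step "Check disk headroom" '^Filesystem' df -h /
```

Returning non-zero on a mismatch lets a wrapper script halt instead of continuing into steps whose preconditions no longer hold.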

Rules

Every command must be copy-pasteable with no placeholder editing required during execution.
Runbooks are reviewed and tested at least quarterly; stale runbooks are worse than no runbooks.
Each runbook covers exactly one failure mode — do not bundle multiple procedures.
Avoid prose explanations mid-procedure; move background to an appendix so operators skip to steps.
Link to the postmortem that created this runbook so the reader understands the failure context.
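
One way to enforce the copy-pasteable rule is a small lint that runs before a runbook is merged. The sketch below is a heuristic under stated assumptions: `lint_placeholders` is a hypothetical name, and the placeholder patterns (`<ANGLE_BRACKETS>`, `TODO`, `CHANGEME`, `FIXME`) are guesses at common conventions rather than anything this skill defines. Environment variables such as `$REDIS_HOST` are deliberately not flagged, since they are set up in the prerequisites.

```shell
# Hypothetical lint (an assumption, not part of the skill): flag lines in a
# runbook file that still contain placeholders an operator would have to edit.
# The patterns below are guesses; adjust them to your team's conventions.
lint_placeholders() {
    runbook_file=$1
    # -n prints line numbers so the author can jump straight to each offender.
    if grep -nE '<[A-Za-z_][A-Za-z0-9_-]*>|TODO|CHANGEME|FIXME' "$runbook_file"; then
        printf 'FAIL: placeholders found in %s\n' "$runbook_file" >&2
        return 1
    fi
    printf 'PASS: no placeholders in %s\n' "$runbook_file"
    return 0
}

# Example: lint_placeholders runbooks/redis-memory.md
```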

Examples

A Redis memory runbook might include:

Step 3: Check memory usage: redis-cli -h "$REDIS_HOST" info memory | grep used_memory_human
Expected output: used_memory_human:1.50G
If output shows >4G → proceed to Step 4 (eviction policy check). If Redis is unreachable → go to Step 9 (failover).
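
The branch in this example could also be made mechanical, so the operator does not compare human-readable sizes by eye. The sketch below assumes the numeric `used_memory` field (bytes) from `INFO memory` is used instead of `used_memory_human`, and that an empty result means Redis is unreachable. `next_step_for_memory` is a hypothetical name, and the under-threshold message is an assumption because the example does not say what happens in that case.

```shell
# 4 GiB in bytes, matching the 4G threshold in the example above.
THRESHOLD_BYTES=4294967296

# Hypothetical helper: read `redis-cli info memory` output on stdin and
# print which step to take next.
next_step_for_memory() {
    # INFO lines end in \r\n, so strip the carriage return before comparing.
    used_bytes=$(tr -d '\r' | awk -F: '$1 == "used_memory" { print $2 }')
    if [ -z "$used_bytes" ]; then
        echo "No used_memory field: treat Redis as unreachable, go to Step 9 (failover)"
        return 2
    fi
    if [ "$used_bytes" -gt "$THRESHOLD_BYTES" ]; then
        echo "used_memory=${used_bytes} bytes exceeds 4 GiB: proceed to Step 4 (eviction policy check)"
        return 0
    fi
    # Assumption: the example does not define this branch.
    echo "used_memory=${used_bytes} bytes is under 4 GiB: continue with the next documented step"
    return 1
}

# Usage during an incident (REDIS_HOST comes from the prerequisites):
#   redis-cli -h "$REDIS_HOST" info memory | next_step_for_memory
```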

Common Mistakes

Prose runbooks: Paragraphs of explanation without numbered steps force the operator to parse meaning under stress, which invites skipped or misordered actions.
Stale commands: Commands referencing deprecated flags or old hostnames fail, or silently target the wrong system; operators lose time debugging the runbook instead of the incident.
Missing rollback: An operator who cannot undo a step will hesitate or skip it; hesitation extends incidents.
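
Part of the stale-command problem can be caught ahead of time with a pre-flight check run on a schedule or in CI. The sketch below is a partial guard only: `check_tools` is a hypothetical helper that confirms the binaries a runbook invokes are on `PATH`. It does not detect deprecated flags or retired hostnames, which still need the quarterly test run.

```shell
# Hypothetical pre-flight check (an assumption, not part of the skill):
# confirm every tool the runbook invokes is on PATH before an incident
# forces the discovery.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf 'ok       %s\n' "$tool"
        else
            printf 'MISSING  %s\n' "$tool"
            missing=$((missing + 1))
        fi
    done
    # Non-zero exit lets a scheduled job or CI flag the runbook for review.
    [ "$missing" -eq 0 ]
}

# Example: check_tools redis-cli grep awk
```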

When NOT to Use

Do not write a runbook for a novel incident type that has never recurred — invest that effort in a postmortem and write the runbook only after the second occurrence confirms it is a repeatable failure mode.
Do not use a runbook to document architectural decisions or system design rationale; those belong in an ADR or design doc, and mixing them into runbooks forces operators to parse explanation when they need action.
Do not create runbooks for procedures that require real-time expert judgment on ambiguous system state — if the steps cannot be deterministic, a runbook creates false confidence and should be replaced with an escalation path to the subject matter expert.
