From internal-docs-writer
Write an operational runbook for a service, deployment, or incident response procedure.
npx claudepluginhub hpsgd/turtlestack --plugin internal-docs-writer
Write a runbook for $ARGUMENTS using the mandatory structure below.
Core principle: This runbook will be used by someone who has never done this procedure before, at 2am, while stressed. Every decision must serve that reader. No assumed knowledge. No missing steps. No ambiguity.
Before writing, gather information about the procedure using Grep and Glob.
Use this exact structure. Every section is mandatory.
[Procedure name] — Runbook
| Field | Value |
|---|---|
| What this covers | [one sentence] |
| When to use | [trigger conditions — alert name, symptom, or scheduled occasion] |
| Estimated duration | [time range for the full procedure] |
| Risk level | Low / Medium / High / Critical |
| Last tested | [date] |
| Owner | [team or person responsible for maintaining this runbook] |
A checklist of everything needed BEFORE starting. Be exhaustive:
- [ ] Access to [system] — how to get it: [link or instructions]
- [ ] CLI tool [name] installed — version [X]+ — install: `[command]`
- [ ] Environment variable [NAME] set — get it from: [location]
- [ ] VPN connected to [network] — connect: [instructions]
- [ ] Permissions: [specific role/group] in [system] — request via: [link]
- [ ] Communication channel open: [Slack channel / war room] — notify: [who]
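
The tool and environment-variable items in the checklist above can also be checked mechanically before Step 1. The following is a minimal sketch, not part of the template: the function names `require_tool`, `require_env`, and `preflight` are hypothetical, and `kubectl` and `SERVICE_NAME` stand in for whatever the actual procedure needs.

```bash
#!/usr/bin/env bash
# Preflight sketch: confirm prerequisites before starting the procedure.
# All function names here are illustrative assumptions, not an existing tool.

# require_tool NAME: succeed if NAME is on PATH, otherwise explain and fail.
require_tool() {
  local tool="$1"
  if command -v "$tool" >/dev/null 2>&1; then
    echo "OK: $tool found"
  else
    echo "MISSING: $tool is not installed (see the install command in the prerequisites)" >&2
    return 1
  fi
}

# require_env NAME: succeed if the environment variable NAME is set and non-empty.
require_env() {
  local name="$1"
  if [ -n "${!name:-}" ]; then
    echo "OK: $name is set"
  else
    echo "MISSING: $name is not set (see the prerequisites for where to get it)" >&2
    return 1
  fi
}

# preflight: run every check so the operator sees all gaps at once,
# then fail if any single check failed.
preflight() {
  local failed=0
  require_tool kubectl || failed=1       # assumed CLI for this example
  require_env SERVICE_NAME || failed=1   # assumed variable for this example
  return "$failed"
}
```

Running `preflight` and stopping when it returns non-zero keeps a stressed operator from discovering a missing tool halfway through the procedure. Reporting every gap rather than exiting on the first one is a design choice: it avoids repeated fix-and-retry cycles.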
Rules for prerequisites:
Numbered steps. Each step MUST follow this format:
#### Step N: [what you're doing and why]
**Action:**
```bash
[exact command to run: copy-pasteable, no placeholders without explanation]
```
**Expected output:**
```
[what you should see if this worked correctly]
```
**If this fails:**
- Symptom: [what you might see instead]
- Likely cause: [why this happens]
- Fix: [what to do about it]
- If the fix doesn't work: [escalate to whom, or which step to go to]
**Checkpoint:** [how to confirm this step succeeded before moving on]
Rules for steps:
- Never use a `<placeholder>` without explaining what to substitute and how to find the value. Use this pattern: `export SERVICE_NAME=my-service  # Replace with the actual service name from: kubectl get services`
- Mark risky steps before the action with: `⚠ WARNING: This step [modifies production data / causes downtime / is irreversible]. Double-check [what] before proceeding.`
After the procedure is complete, verify the system is healthy:
#### Verification checklist
- [ ] Service is responding: `curl -s https://[endpoint]/health | jq .status` → should return `"ok"`
- [ ] Logs are clean: `[log command]` → no errors in the last 5 minutes
- [ ] Metrics are normal: [dashboard link] → [what to look for]
- [ ] Dependent services unaffected: [how to check]
- [ ] Users can perform [core action]: [how to test]
Every verification check must include the exact command to run and the expected result.
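
One way to pair each command with its expected result is a small helper that runs the command and compares its output. This is a sketch under assumptions: the name `check` is hypothetical, and it does an exact string comparison on standard output, which suits checks like the `jq .status` example but not free-form log output.

```bash
#!/usr/bin/env bash
# Verification sketch: run a command and compare its stdout to an expected value.
# `check` is an illustrative helper name, not part of any existing tool.

# check LABEL EXPECTED COMMAND [ARGS...]
check() {
  local label="$1" expected="$2"
  shift 2
  local actual
  # A failing command yields empty or partial output, which then fails the comparison.
  actual="$("$@" 2>/dev/null)" || true
  if [ "$actual" = "$expected" ]; then
    echo "PASS: $label"
  else
    echo "FAIL: $label (expected '$expected', got '$actual')"
    return 1
  fi
}

# Example call, with the endpoint left as a placeholder to fill in:
#   check "Service is responding" '"ok"' \
#     sh -c 'curl -s https://[endpoint]/health | jq .status'
```

Printing both the expected and the actual value on failure gives the operator something concrete to paste into the escalation channel.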
If the procedure fails or causes issues, how to undo it. This section is not optional — every runbook needs a rollback plan.
#### Rollback procedure
**When to rollback:** [specific conditions that trigger a rollback decision]
**Rollback window:** [how long after the procedure can you still roll back?]
**Data implications:** [will rollback cause data loss? What data?]
1. [Rollback step with exact command]
Expected result: [what you should see]
2. [Next rollback step]
Expected result: [what you should see]
#### After rollback
- [ ] Verify rollback succeeded: [how]
- [ ] Notify: [who needs to know]
- [ ] Create incident ticket: [where, with what information]
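
The rollback window can be turned into a check the operator runs before starting the rollback steps. The sketch below is an illustration only: `within_rollback_window` is a hypothetical helper, and it assumes the procedure's start time was recorded as a Unix timestamp (for example with `date +%s`) and that the window is expressed in seconds.

```bash
#!/usr/bin/env bash
# Rollback-window sketch: decide whether a rollback is still allowed.
# The helper name and its arguments are assumptions made for this example.

# within_rollback_window START_EPOCH WINDOW_SECONDS [NOW_EPOCH]
# NOW_EPOCH defaults to the current time; passing it explicitly makes dry runs repeatable.
within_rollback_window() {
  local start="$1" window="$2" now="${3:-$(date +%s)}"
  local elapsed=$(( now - start ))
  if [ "$elapsed" -le "$window" ]; then
    echo "Rollback still possible: $(( window - elapsed ))s remaining"
  else
    echo "Rollback window closed $(( elapsed - window ))s ago; escalate instead"
    return 1
  fi
}
```

For example, `within_rollback_window 1000 3600 1600` reports 3000 seconds remaining, while a current time of 5000 with the same start and window returns non-zero.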
Common issues that arise during or after this procedure, even if not directly caused by the steps:
#### [Problem description]
**Symptom:** [what you see]
**Cause:** [why it happens]
**Solution:**
```bash
[fix command]
```
**Prevention:** [how to avoid this in the future]
Include at minimum:
| Condition | Escalate to | Contact | Expected response time |
|---|---|---|---|
| Procedure fails after troubleshooting | [team/person] | [Slack/phone/page] | [time] |
| Data loss suspected | [team/person] | [Slack/phone/page] | Immediate |
| Customer impact detected | [team/person] | [Slack/phone/page] | Immediate |
| Unsure whether to proceed | [team/person] | [Slack/phone/page] | [time] |
Include both primary and backup contacts. If the primary is unavailable, who is next?
Before finalising, verify against every rule:
| Check | Requirement |
|---|---|
| Copy-paste test | Could someone paste every command and have it work? |
| Failure coverage | Does every step have a "what if this fails" section? |
| No assumed knowledge | Could a new team member follow this on their first week? |
| Rollback exists | Is there a complete rollback procedure? |
| Escalation contacts | Are real people or teams listed, with contact methods? |
| Timing estimates | Does the reader know how long things should take? |
| Warnings present | Are destructive or irreversible steps clearly marked? |
| Verification complete | Can the operator confirm success at the end? |
- `kubectl delete pod my-pod-name -n production` is better than a bash one-liner that constructs the pod name dynamically.
- Mark any command that has not been tested with `# UNTESTED — verify in staging first.`