Skill

anth-incident-runbook

Executes incident response for Anthropic Claude API outages, errors, rate limits, and latency with triage scripts, status checks, and mitigation steps.

Anthropic

Popularity

Parent stars

2,203

Parent forks

296

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/anthropic-pack:anth-incident-runbook

User invocable

Model invocable

Inline context

Default effort

Tool Access

This skill is limited to the following tools:

Read, Bash(curl:*), Grep

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

| Severity | Condition | Response Time |

SKILL.md

132 lines · ~1.1k tokens

Stats

Language: Python

Parent stars: 2,203

Parent forks: 296

Maintenance: Good

Last Commit: Mar 22, 2026


Anthropic Incident Runbook

Severity Classification

Severity	Condition	Response Time
P1	API returning 500/529 for all requests	Immediate
P2	Rate limiting (429) or high latency (>10s p99)	15 minutes
P3	Intermittent errors (<5% error rate)	1 hour
P4	Degraded quality (not errors)	Next business day
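The table can be turned into a small classifier for alerting. The following is a minimal sketch, not part of the original runbook: `HealthSnapshot` and its field names are hypothetical, and the table does not say how to treat error rates of 5% or more that fall short of a full outage, so the sketch escalates those to P2 as an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HealthSnapshot:
    """Hypothetical rollup of recent API traffic; field names are illustrative."""
    total_requests: int
    failed_requests: int          # responses with status 500 or 529
    rate_limited: int             # responses with status 429
    p99_latency_s: float
    quality_complaints: bool = False


def classify_severity(s: HealthSnapshot) -> Optional[str]:
    """Map a snapshot onto the P1-P4 table; None means nothing matched."""
    if s.total_requests == 0:
        return None
    error_rate = s.failed_requests / s.total_requests
    if s.failed_requests == s.total_requests:
        return "P1"               # every request is failing
    # Assumption: error rates >= 5% (not covered by the table) escalate to P2.
    if s.rate_limited > 0 or s.p99_latency_s > 10 or error_rate >= 0.05:
        return "P2"
    if error_rate > 0:
        return "P3"               # intermittent errors below 5%
    if s.quality_complaints:
        return "P4"
    return None


print(classify_severity(HealthSnapshot(100, 2, 0, 1.4)))
```

Checking the conditions from most to least severe means a snapshot that matches several rows is reported at the highest applicable severity.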

Immediate Triage (First 5 Minutes)

# 1. Check Anthropic status page
curl -sL https://status.anthropic.com/api/v2/status.json | python3 -c \
  "import sys,json; d=json.load(sys.stdin); print(d['status']['indicator'], '-', d['status']['description'])"

# 2. Test API connectivity (substitute any model ID your key can access)
curl -s -w "\nHTTP %{http_code} | Time: %{time_total}s\n" \
  https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model":"claude-haiku-4-5-20251001","max_tokens":8,"messages":[{"role":"user","content":"1"}]}'

# 3. Check rate limit headers (response body is discarded; only headers are printed)
curl -s -D - -o /dev/null https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model":"claude-haiku-4-5-20251001","max_tokens":8,"messages":[{"role":"user","content":"1"}]}' \
  2>/dev/null | grep -iE "ratelimit|retry-after|request-id"
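If the triage output needs to feed a script rather than a terminal, the same two extractions can be done in Python. This is a sketch under assumptions: the helper names `status_line` and `limit_headers` are hypothetical, and the sample payload and header values are made up for illustration rather than captured from a real response.

```python
from typing import Dict, Mapping


def status_line(payload: Mapping) -> str:
    """Format a status.json payload the way step 1 prints it."""
    status = payload["status"]
    return f"{status['indicator']} - {status['description']}"


def limit_headers(headers: Mapping[str, str]) -> Dict[str, str]:
    """Keep only the headers step 3 greps for, matching case-insensitively."""
    wanted = ("ratelimit", "retry-after", "request-id")
    return {
        name.lower(): value
        for name, value in headers.items()
        if any(token in name.lower() for token in wanted)
    }


# Sample inputs are illustrative only; real values come from the curl calls.
sample_status = {"status": {"indicator": "none", "description": "All Systems Operational"}}
sample_headers = {
    "Content-Type": "application/json",
    "anthropic-ratelimit-requests-remaining": "0",
    "Retry-After": "12",
    "request-id": "req_example",
}

print(status_line(sample_status))
print(limit_headers(sample_headers))
```

Lower-casing the header names mirrors `grep -i` and makes the result easier to look up later, for example when copying the `request-id` into a postmortem.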

Decision Tree

API returning errors?
├── 401/403 → Key issue → Check ANTHROPIC_API_KEY is set and valid
├── 429 → Rate limited → Check headers, reduce traffic, wait for retry-after
├── 500 → Server error → Check status.anthropic.com, retry with backoff
├── 529 → Overloaded → Temporary, retry after 30-60s
└── Timeouts → Network or long generation → Increase timeout, check max_tokens
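The retry guidance in the tree can be expressed as a single delay function. The sketch below is one possible reading, not the runbook's own implementation: the function name `retry_delay`, the 1-second base, the 60-second cap, and the added jitter are all assumptions.

```python
import random
from typing import Optional


def retry_delay(status_code: int, attempt: int,
                retry_after: Optional[float] = None) -> Optional[float]:
    """Seconds to wait before retry number `attempt` (0-based).

    Returns None when the decision tree says retrying will not help.
    """
    backoff = min(60.0, 2.0 ** attempt)   # assumed: 1s base, doubling, 60s cap
    if status_code in (401, 403):
        return None                       # key issue: fix the key, do not retry
    if status_code == 429:
        # Prefer the server's retry-after value; fall back to backoff without it.
        return retry_after if retry_after is not None else backoff
    if status_code == 529:
        return random.uniform(30.0, 60.0)  # overloaded: wait 30-60s
    if status_code >= 500:
        return backoff + random.uniform(0.0, 1.0)  # jitter spreads out retries
    return None                           # other codes are not in the tree


print(retry_delay(429, attempt=0, retry_after=12.0))
```

A caller would sleep for the returned number of seconds and stop retrying on `None` or after a fixed number of attempts.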

Mitigation Actions

Rate Limiting (429)

# Immediate: reduce traffic
# 1. Enable circuit breaker
# 2. Queue non-critical requests
# 3. Switch to Message Batches for bulk work
# 4. Reduce max_tokens so each request counts fewer output tokens against the rate limit
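Step 1 does not specify a circuit breaker design, so the following is only a minimal sketch of one. The class name, the thresholds, and the injectable `clock` parameter are assumptions chosen to keep the example small and testable.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after N consecutive failures; allows traffic again after a cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        """True while closed, or once the cooldown has elapsed since opening."""
        if self._opened_at is None:
            return True
        return self._clock() - self._opened_at >= self.cooldown_s

    def record_success(self) -> None:
        """A successful call closes the breaker and clears the failure count."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        """Count a failure (for example a 429) and (re)open at the threshold."""
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


breaker = CircuitBreaker(failure_threshold=2, cooldown_s=10.0)
breaker.record_failure()
breaker.record_failure()
print(breaker.allow_request())
```

Callers check `allow_request()` before each API call, queue or drop the request when it returns `False`, and report the outcome with `record_success()` or `record_failure()`. A failure after the cooldown restarts the cooldown, since the failure count is only cleared by a success.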

API Outage (500/529)

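# Graceful degradation
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

FALLBACK = "Our AI assistant is temporarily unavailable. Please try again shortly."

def get_response_with_fallback(prompt: str) -> str:
    try:
        msg = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}]
        )
        return msg.content[0].text
    except anthropic.APIConnectionError:
        return FALLBACK  # network failure or timeout
    except anthropic.APIStatusError as err:
        if err.status_code >= 500:  # 500 and 529 both count as an outage
            return FALLBACK
        raise  # 4xx (bad key, rate limit, bad request) needs its own handling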

Key Compromise

# 1. Immediately revoke key at console.anthropic.com
# 2. Generate new key
# 3. Deploy new key to all environments
# 4. Audit recent usage for unauthorized calls
# 5. File incident report

Postmortem Template

## Incident: [Title]
- **Duration:** [start] to [end]
- **Severity:** P[1-4]
- **Impact:** [what users experienced]
- **Root Cause:** [what went wrong]
- **Detection:** [how we found out]
- **Mitigation:** [what we did to fix it]
- **Request IDs:** [from debug logs]
- **Action Items:**
  - [ ] [preventive measure 1]
  - [ ] [preventive measure 2]
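For teams that generate postmortems from incident tooling, the template can be filled programmatically. This is a rough sketch: the `Postmortem` class and its field names are hypothetical, and the values in the usage example are placeholders, not data from a real incident.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Postmortem:
    """Hypothetical container mirroring the fields of the template above."""
    title: str
    start: str
    end: str
    severity: int                       # 1-4, rendered as P1-P4
    impact: str
    root_cause: str
    detection: str
    mitigation: str
    request_ids: List[str] = field(default_factory=list)
    action_items: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Return the filled-in template as Markdown text."""
        lines = [
            f"## Incident: {self.title}",
            f"- **Duration:** {self.start} to {self.end}",
            f"- **Severity:** P{self.severity}",
            f"- **Impact:** {self.impact}",
            f"- **Root Cause:** {self.root_cause}",
            f"- **Detection:** {self.detection}",
            f"- **Mitigation:** {self.mitigation}",
            f"- **Request IDs:** {', '.join(self.request_ids)}",
            "- **Action Items:**",
        ]
        lines += [f"  - [ ] {item}" for item in self.action_items]
        return "\n".join(lines)


# Placeholder values for illustration only.
report = Postmortem(
    title="Example 429 spike", start="10:00", end="10:20", severity=2,
    impact="Requests were delayed", root_cause="Traffic burst",
    detection="Alert on 429 rate", mitigation="Queued non-critical requests",
    request_ids=["req_example"], action_items=["Add a circuit breaker"],
)
print(report.render())
```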

Error Handling

Symptom	Likely Cause	Quick Fix
All requests fail 401	Key rotated/expired	Check Console for active keys
Sudden 429 spike	Traffic burst or tier change	Check rate limit headers
Slow responses (>10s)	Large max_tokens or complex prompt	Reduce max_tokens, use Haiku
Intermittent 500s	Upstream API issue	Check status.anthropic.com

Resources

Next Steps

For data compliance, see anth-data-handling.
