Creates structured incident runbooks with diagnostic steps, resolution procedures, escalation paths, and communication templates. Useful for documenting recurring alerts, standardizing on-call responses, and reducing MTTR.
npx claudepluginhub pjt222/agent-almanac
Create actionable runbooks that guide responders through incident diagnosis and resolution.
See Extended Examples for complete template files.
Select an appropriate template based on incident type and complexity.
Basic runbook template structure:
# [Alert/Incident Name] Runbook
## Overview | Severity | Symptoms
## Diagnostic Steps | Resolution Steps
## Escalation | Communication | Prevention | Related
Advanced SRE runbook template (excerpt):
# [Service Name] - [Incident Type] Runbook
## Metadata
- Service, Owner, Severity, On-Call, Last Updated
## Diagnostic Phase
### Quick Health Check (< 5 min): Dashboard, error rate, deployments
### Detailed Investigation (5-20 min): Metrics, logs, traces, failure patterns
# ... (see EXAMPLES.md for complete template)
Key template components:
Expected: Selected template matches the incident's complexity, with sections appropriate for the service type.
On failure:
See Extended Examples for complete diagnostic queries and decision trees.
Create step-by-step investigation procedures with specific queries.
Six-step diagnostic checklist:
Verify Service Health: Health endpoint checks and uptime metrics
curl -I https://api.example.com/health # Expected: HTTP 200 OK
up{job="api-service"} # Expected: 1 for all instances
Check Error Rate: Current error percentage and breakdown by endpoint
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100 # Expected: < 1%
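For the per-endpoint breakdown, a PromQL sketch (the `handler` route label is an assumption; substitute your service's label):
topk(5, sum by (handler) (rate(http_requests_total{status=~"5.."}[5m])))
# Top 5 endpoints by 5xx rate; `handler` is an assumed label name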
Analyze Logs: Recent errors and top error messages from Loki
{job="api-service"} |= "error" | json | level="error"
Check Resource Utilization: CPU, memory, and connection pool status
avg(rate(container_cpu_usage_seconds_total{pod=~"api-service.*"}[5m])) * 100
# Expected: < 70%
Review Recent Changes: Deployments, git commits, infrastructure changes
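For example (assumes a Kubernetes deployment and a checkout of the service repo):
kubectl rollout history deployment/api-service   # Recent deployment revisions
git log --oneline --since="2 hours ago"          # Recent commits; run inside the service repo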
Examine Dependencies: Downstream service health, database/API latency
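A p95 latency sketch for this check, assuming the standard http_request_duration_seconds histogram is exported:
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{job="api-service"}[5m])))
# Expected: p95 near baseline; a jump while pods are healthy points at a downstream dependency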
Failure pattern decision tree (excerpt):
Expected: Diagnostic procedures are specific, include expected vs. actual values, and guide the responder through the investigation.
On failure:
See Extended Examples for all 5 resolution options with full commands and rollback procedures.
Document step-by-step remediation with rollback options.
Five resolution options (brief summary):
Rollback Deployment (fastest): For post-deployment errors
kubectl rollout undo deployment/api-service
Verify → Monitor → Confirm resolution (error rate < 1%, latency normal, no alerts)
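A sketch of the verify step (assumes kubectl access and an app=api-service pod label):
kubectl rollout status deployment/api-service --timeout=120s   # Wait for the rollback to complete
kubectl get pods -l app=api-service                            # Expected: all pods Running/Ready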
Scale Up Resources: For high CPU/memory, connection pool exhaustion
current=$(kubectl get deployment/api-service -o jsonpath='{.spec.replicas}')
kubectl scale deployment/api-service --replicas=$((current * 3 / 2))  # Scale up ~1.5x
Restart Service: For memory leaks, stuck connections, cache corruption
kubectl rollout restart deployment/api-service
Feature Flag / Circuit Breaker: For specific feature errors or external dependency failures
kubectl set env deployment/api-service FEATURE_NAME=false
Database Remediation: For database connection issues, slow queries, or pool exhaustion
-- Kill long-running queries, restart connection pool, increase pool size
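A hedged example of the first action, assuming a PostgreSQL backend (the 5-minute threshold is illustrative):
-- Terminate queries that have been running longer than 5 minutes
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '5 minutes';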
Universal verification checklist:
Rollback procedure: If resolution worsens situation → pause/cancel → revert → reassess
Expected: Resolution steps are clear, include verification checks, and provide rollback options for each action.
On failure:
See Extended Examples for full escalation levels and contact directory template.
Define when and how to escalate incidents.
When to escalate immediately:
Five escalation levels:
Escalation process:
Contact directory: Maintain a table with role, Slack, phone, and PagerDuty for:
Expected: Clear criteria for escalation, contact information readily accessible, escalation paths aligned with organizational structure.
On failure:
See Extended Examples for all internal and external templates with full formatting.
Provide pre-written messages for incident updates.
Internal templates (Slack #incident-response):
Initial Declaration:
🚨 INCIDENT: [Title] | Severity: [Critical/High/Medium]
Impact: [users/services] | Owner: @username | Dashboard: [link]
Quick Summary: [1-2 sentences] | Next update: 15 min
Progress Update (every 15-30 min):
📊 UPDATE #N | Status: [Investigating/Mitigating/Monitoring]
Actions: [what we tried and outcomes]
Theory: [what we think is happening]
Next: [planned actions]
Mitigation Complete:
✅ MITIGATION | Metrics: Error [before→after], Latency [before→after]
Root Cause: [brief or "investigating"] | Monitoring for 30 min before marking resolved
Resolution:
🎉 RESOLVED | Duration: [time] | Root Cause + Impact + Follow-up actions
False Alarm: No impact, no follow-up needed
External templates (status page):
Customer email template: Timeline, impact description, resolution, prevention, compensation (if applicable)
Expected: Templates save time during incidents, ensure consistent communication, and reduce cognitive load on responders.
On failure:
See Extended Examples for complete Prometheus alert configuration and Grafana dashboard JSON.
Integrate runbook with alerts and dashboards.
Add runbook links to Prometheus alerts:
- alert: HighErrorRate
  expr: |  # Illustrative threshold; mirrors the error-rate query from the diagnostic checklist
    sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m])) > 0.01
  for: 5m
  annotations:
    runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
    dashboard_url: "https://grafana.example.com/d/service-overview"
    incident_channel: "#incident-platform"
Embed quick diagnostic links in runbook:
Create Grafana dashboard panel with runbook links (markdown panel listing all incident runbooks with on-call and escalation info)
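A minimal sketch of such a panel, using Grafana's built-in text panel (runbook names and URLs are placeholders):
{
  "type": "text",
  "title": "Incident Runbooks",
  "options": {
    "mode": "markdown",
    "content": "- [High Error Rate](https://wiki.example.com/runbooks/high-error-rate) (escalate: #incident-platform)"
  }
}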
Expected: Responders can access runbooks directly from alerts or dashboards, with diagnostic queries pre-filled and one-click access to relevant tools.
On failure:
configure-alerting-rules - Link runbooks to alert annotations for immediate access during incidents
build-grafana-dashboards - Embed runbook links in dashboards and diagnostic panels
setup-prometheus-monitoring - Include diagnostic queries from Prometheus in runbook procedures
define-slo-sli-sla - Reference SLO impact in incident severity classification