# Runbooks

Creating effective troubleshooting guides for diagnosing and resolving operational issues. This skill generates troubleshooting runbooks using a 5-step framework, with kubectl, bash, and psql diagnostics for Kubernetes pods, database connections, traffic, and external APIs.

Install:

```bash
npx claudepluginhub thebushidocollective/han --plugin runbooks
```
# Troubleshooting: [Problem Statement]
## Symptoms
What the user/system is experiencing:
- API returning 503 errors
- Response time > 10 seconds
- High CPU usage alerts
## Quick Checks (< 2 minutes)

### 1. Is the service running?

```bash
kubectl get pods -n production | grep api-server
```

Expected: STATUS = Running

### 2. Was there a recent deploy?

```bash
kubectl rollout history deployment/api-server
```

Check: Did we deploy in the last 30 minutes?

### 3. Is the error rate elevated?

Check error rate in Datadog:
## Common Symptoms

| Symptom | Likely Cause | Quick Fix |
|---|---|---|
| 503 errors | Pod crashlooping | Restart deployment |
| Slow responses | Database connection pool | Increase pool size |
| High memory | Memory leak | Restart pods |
Test:

```bash
# Check database connections
kubectl exec -it api-server-abc -- psql -h $DB_HOST -c "SELECT count(*) FROM pg_stat_activity"
```

If connections > 90: Pool is saturated. Next step: Increase pool size or investigate slow queries.
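The saturation rule above can be captured in a small helper for scripts or alerts; a minimal sketch, assuming the connection count has already been extracted from `pg_stat_activity` (the 90-connection threshold is the runbook's):

```shell
# Decide whether the connection pool is saturated.
# $1 = current connection count, $2 = threshold (defaults to 90).
pool_saturated() {
  local count=$1 threshold=${2:-90}
  if [ "$count" -gt "$threshold" ]; then
    echo "SATURATED: $count connections (threshold $threshold)"
  else
    echo "OK: $count connections"
  fi
}

pool_saturated 95   # prints "SATURATED: 95 connections (threshold 90)"
```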
Test:

```bash
# Check request rate
curl -H "Authorization: Bearer $DD_API_KEY" \
  "https://api.datadoghq.com/api/v1/query?query=sum:nginx.requests{*}"
```

If requests 3x normal: Traffic spike. Next step: Scale up pods or enable rate limiting.
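The 3x-normal rule can be scripted the same way; a minimal sketch, assuming current and baseline request rates have already been pulled from your metrics backend (the rates below are illustrative):

```shell
# Flag a traffic spike: current request rate vs. a known baseline.
# The 3x multiplier matches the runbook's "3x normal" rule of thumb.
traffic_spike() {
  local current=$1 baseline=$2
  if [ "$current" -ge $((baseline * 3)) ]; then
    echo "SPIKE: ${current} req/s vs baseline ${baseline} req/s"
  else
    echo "normal: ${current} req/s"
  fi
}

traffic_spike 4500 1000   # prints "SPIKE: 4500 req/s vs baseline 1000 req/s"
```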
Test:

```bash
# Check third-party API
curl -w "@curl-format.txt" https://api.stripe.com/v1/charges
```

If response time > 2s: External service slow. Next step: Implement circuit breaker or increase timeouts.
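The 2-second cutoff can be applied to curl's `%{time_total}` value; a minimal sketch, with the latency passed in as a value rather than fetched live (in production you would capture it with `curl -s -o /dev/null -w '%{time_total}' …`):

```shell
# Classify an external call's latency (seconds, possibly fractional),
# using the runbook's 2-second threshold. awk handles the float compare.
slow_external() {
  local t=$1
  awk -v t="$t" 'BEGIN {
    if (t > 2.0) print "SLOW external service (" t "s)"
    else         print "external OK (" t "s)"
  }'
}

slow_external 3.42   # prints "SLOW external service (3.42s)"
```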
## Remediation

**Restart affected pods:**

```bash
kubectl rollout restart deployment/api-server -n production
```

When to use: Quick mitigation while investigating root cause.

**Scale up resources:**

```bash
kubectl scale deployment/api-server --replicas=10 -n production
```

When to use: Traffic spike or resource exhaustion.

**Fix root cause:**

When to use: After immediate pressure is relieved.
## Prevention

How to prevent this issue in the future:
## Decision Tree Format
````markdown
# Troubleshooting: Slow API Responses

## Start Here

```
      Check response time
              │
      ┌───────┴───────┐
      │               │
   < 500ms         > 500ms
      │               │
NOT THIS RUNBOOK  Continue below
```

## Step 1: Locate the Slowness

```bash
# Check which service is slow
curl -w "@timing.txt" https://api.example.com/users
```

Decision: If the server-side time dominates, continue to Step 2 (database).

## Step 2: Check the Database

```bash
# Check active queries
psql -c "SELECT query, state, query_start FROM pg_stat_activity WHERE state != 'idle'"
```

Decision: If long-running queries appear, the database is the bottleneck. If not:

... (continue with network troubleshooting)
````
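The first branch of the tree reduces to a threshold check; a minimal sketch using the tree's 500ms cutoff:

```shell
# Route an incident on measured response time in milliseconds,
# mirroring the decision tree's first branch (< 500ms vs > 500ms).
route_latency() {
  local ms=$1
  if [ "$ms" -lt 500 ]; then
    echo "NOT THIS RUNBOOK"
  else
    echo "Continue below"
  fi
}

route_latency 820   # prints "Continue below"
```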
## Layered Troubleshooting
### Layer 1: Application
````markdown
## Application Layer Issues

### Check Application Health

1. **Health endpoint:**

   ```bash
   curl https://api.example.com/health
   ```

2. **Application logs:**

   ```bash
   kubectl logs deployment/api-server --tail=100 | grep ERROR
   ```

3. **Application metrics:**

   Review request rate, error rate, and latency dashboards.

### Common Application Issues

- Memory Leak
- Thread Starvation
- Code Bug
````
### Layer 2: Infrastructure
````markdown
## Infrastructure Layer Issues

### Check Infrastructure Health

1. **Node resources:**

   ```bash
   kubectl top nodes
   ```

2. **Pod resources:**

   ```bash
   kubectl top pods -n production
   ```

3. **Network connectivity:**

   ```bash
   kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never -- ping database.internal
   ```

### Common Infrastructure Issues

- Node Under Pressure: check `kubectl describe node` for pressure conditions
- Network Partition
- Disk I/O Saturation
````
### Layer 3: External Dependencies
````markdown
## External Dependencies Issues

### Check External Services

1. **Third-party APIs:**

   ```bash
   curl -w "@timing.txt" https://api.stripe.com/health
   ```

2. **Status pages:**

   Check each provider's public status page.

3. **DNS resolution:**

   ```bash
   nslookup api.stripe.com
   dig api.stripe.com
   ```

### Common External Issues

- API Rate Limiting
- Service Degradation
- DNS Failure
````
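The three layers can be walked in order, stopping at the first unhealthy one; a minimal sketch where each `check_*` function is a hypothetical stand-in for the real commands above (health endpoint, `kubectl top`, third-party curl):

```shell
# Each check_* is a placeholder; in a real runbook script it would run
# the corresponding commands and return 0 (healthy) or nonzero.
check_application()    { [ "$APP_HEALTHY" = "yes" ]; }
check_infrastructure() { [ "$INFRA_HEALTHY" = "yes" ]; }
check_external()       { [ "$EXTERNAL_HEALTHY" = "yes" ]; }

# Report the first failing layer, in application -> infra -> external order.
first_failing_layer() {
  check_application    || { echo "application"; return; }
  check_infrastructure || { echo "infrastructure"; return; }
  check_external       || { echo "external"; return; }
  echo "all layers healthy"
}

APP_HEALTHY=yes INFRA_HEALTHY=no EXTERNAL_HEALTHY=yes
first_failing_layer   # prints "infrastructure"
```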
## Systematic Debugging
### Use the Scientific Method
````markdown
# Debugging: Database Connection Failures

## 1. Observation

**What we know:**

- Error: "connection refused" in logs
- Started: 2025-01-15 14:30 UTC
- Frequency: Every database query fails
- Scope: All pods affected

## 2. Hypothesis

**Possible causes:**

1. Database instance is down
2. Security group blocking traffic
3. Network partition
4. Wrong credentials

## 3. Test Each Hypothesis

### Test 1: Database instance status

```bash
aws rds describe-db-instances --db-instance-identifier prod-db | jq '.DBInstances[0].DBInstanceStatus'
```

- Result: "available"
- Conclusion: Database is running
- ✗ Hypothesis 1 rejected

### Test 2: Security group rules

```bash
aws ec2 describe-security-groups --group-ids sg-abc123 | jq '.SecurityGroups[0].IpPermissions'
```

- Result: Port 5432 open only to 10.0.0.0/16
- Pod IP: 10.1.0.5
- Conclusion: Pod IP not in allowed range
- ✓ ROOT CAUSE FOUND

## 4. Fix

Update security group:

```bash
aws ec2 authorize-security-group-ingress \
  --group-id sg-abc123 \
  --protocol tcp \
  --port 5432 \
  --cidr 10.1.0.0/16
```

## 5. Verify

Test connection from pod:

```bash
kubectl exec -it api-server-abc -- psql -h prod-db.rds.amazonaws.com -c "SELECT 1"
```

Result: Success ✓
````
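The mismatch found in Test 2 can be caught programmatically by testing CIDR membership; a minimal sketch in pure bash, using the IPs from the example:

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Check whether an IP falls inside a CIDR block by masking both sides.
ip_in_cidr() {
  local ip=$1 cidr=$2
  local net=${cidr%/*} bits=${cidr#*/}
  local mask=$(( 0xFFFFFFFF << (32 - bits) & 0xFFFFFFFF ))
  if [ $(( $(ip_to_int "$ip") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]; then
    echo "allowed"
  else
    echo "BLOCKED"
  fi
}

ip_in_cidr 10.1.0.5 10.0.0.0/16   # prints "BLOCKED" — the root cause above
ip_in_cidr 10.1.0.5 10.1.0.0/16   # prints "allowed" — after the fix
```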
## Time-Boxed Investigation
````markdown
# Troubleshooting: Production Outage

**Time Box:** Spend MAX 15 minutes investigating before escalating.

## First 5 Minutes: Quick Wins

- [ ] Check pod status
- [ ] Check recent deploys
- [ ] Check external status pages
- [ ] Review monitoring dashboards

**If issue persists:** Continue to next phase.

## Minutes 5-10: Common Causes

- [ ] Restart pods (quick mitigation)
- [ ] Check database connectivity
- [ ] Review application logs
- [ ] Check resource limits

**If issue persists:** Continue to next phase.

## Minutes 10-15: Deep Dive

- [ ] Enable debug logging
- [ ] Capture thread dump
- [ ] Check for memory leaks
- [ ] Review network traces

**If issue persists:** ESCALATE to senior engineer.

## Escalation

**Escalate to:** Platform Team Lead

**Provide:**

- Timeline of issue
- Tests performed
- Current error rate
- Mitigation attempts
````
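The time box maps elapsed minutes to a phase; a minimal sketch of that mapping, with boundaries taken from the checklist above:

```shell
# Map elapsed investigation time (minutes) to the runbook's phase.
investigation_phase() {
  local minutes=$1
  if   [ "$minutes" -lt 5 ];  then echo "quick wins"
  elif [ "$minutes" -lt 10 ]; then echo "common causes"
  elif [ "$minutes" -lt 15 ]; then echo "deep dive"
  else echo "ESCALATE to senior engineer"
  fi
}

investigation_phase 12   # prints "deep dive"
```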
## Finding Which Service is Slow
Using binary search to narrow down the problem:
1. **Check full request:** 5000ms total
2. **Check first half (API → Database):** 4900ms
→ Problem is in database query
3. **Check database:** Query takes 4800ms
4. **Check query plan:** Sequential scan on large table
5. **Root cause:** Missing index
**Fix:** Add index on frequently queried column.
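The narrowing step reduces to comparing segment timings; a minimal sketch using the example's numbers (5000ms total, 4900ms in the API → Database segment):

```shell
# Given total latency and the API->DB segment's latency (both ms),
# decide which half of the request path to investigate next.
narrow_down() {
  local total=$1 db_segment=$2
  local rest=$(( total - db_segment ))
  if [ "$db_segment" -gt "$rest" ]; then
    echo "investigate database ($db_segment ms of $total ms)"
  else
    echo "investigate application/network ($rest ms of $total ms)"
  fi
}

narrow_down 5000 4900   # prints "investigate database (4900 ms of 5000 ms)"
```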
## Finding Related Events
Look for patterns and correlations:
**Timeline:**
- 14:25 - Deploy completed
- 14:30 - Error rate spike
- 14:35 - Database CPU at 100%
- 14:40 - Requests timing out
**Correlation:** Deploy introduced N+1 query.
**Evidence:**
- No config changes
- No infrastructure changes
- Only code deploy
- Error coincides with deploy
**Action:** Rollback deploy.
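The deploy/error correlation can be checked mechanically by looking for events inside a window after the deploy; a minimal sketch on epoch seconds, with an illustrative 15-minute window and placeholder timestamps:

```shell
# Does an event fall within a window after a deploy? Timestamps are
# epoch seconds; the 900s default window is an assumption, not a rule.
correlated_with_deploy() {
  local deploy_epoch=$1 event_epoch=$2 window_s=${3:-900}
  local delta=$(( event_epoch - deploy_epoch ))
  if [ "$delta" -ge 0 ] && [ "$delta" -le "$window_s" ]; then
    echo "correlated (+${delta}s after deploy)"
  else
    echo "not correlated"
  fi
}

# Deploy at T, error spike 5 minutes later (as in the 14:25/14:30 timeline).
correlated_with_deploy 1000000 1000300   # prints "correlated (+300s after deploy)"
```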
## Best Practices

**Bad: jump to complex solutions**

```markdown
## Database Slow
Must be a query optimization issue. Let's analyze query plans...
```

**Good: check basics first**

```markdown
## Database Slow
1. Is the database actually running?
2. Can we connect to it?
3. Are there any locks?
4. What does the slow query log show?
```

**Bad: random changes**

```markdown
## API Errors
Let's try:
- Restarting the database
- Scaling to 100 pods
- Changing the load balancer config
- Updating the kernel
```

**Good: systematic approach**

```markdown
## API Errors
1. What is the actual error message?
2. When did it start?
3. What changed before it started?
4. Can we reproduce it?
```

**Bad: no notes**

```markdown
## Fixed It
I restarted some pods and now it works.
```

**Good: document findings**

```markdown
## Resolution
**Root Cause:** Memory leak in worker process
**Evidence:** Pod memory climbing linearly over 6 hours
**Temporary Fix:** Restarted pods
**Long-term Fix:** PR #1234 fixes memory leak
**Prevention:** Added memory usage alerts
```