Incident triage orchestrator — classifies severity, diagnoses in parallel, routes to /sre, /ci-fix, or /fix based on evidence. Usage: /incident <description>
The single entry point for anything going wrong. Classifies, diagnoses, and routes to
the right skill — you don't need to figure out if it's an infra issue, CI failure, or
code bug. /incident figures it out for you.
Flow: Classify → Parallel Diagnosis → Correlate → Challenge → Verdict → Route → Resolve → Learn
| Input | Action |
|---|---|
| /incident the site is down | Free-text incident description |
| /incident #234 | Load incident from GitHub issue |
| /incident (no args) | Auto-detect — check health, CI, recent deploys |
From the description, classify severity and type immediately.
| Keywords | Severity | Response Time |
|---|---|---|
| down, outage, crash, 500 in production, data loss, security breach | CRITICAL | Act immediately, no questions |
| slow, timeout, degraded, errors increasing, staging broken | HIGH | Act within minutes |
| failing, broken, error, not working, regression | MEDIUM | Investigate, then act |
| flaky, intermittent, warning, minor, cosmetic | LOW | Queue for next available time |
| Keywords | Type | Primary Route |
|---|---|---|
| CI, pipeline, build failed, lint, test fail, Actions, workflow | CI Failure | /ci-fix |
| deploy, health, down, outage, production, staging, 500, 502, 503 | Infrastructure | /sre debug |
| bug, fix, issue #, regression, broken feature, wrong behavior | Code Bug | /fix |
| slow, performance, timeout, latency, memory, CPU | Performance | /sre debug (perf focus) |
| restart, start, stop, hung, frozen, local | Local Ops | /restart or ops agent |
| database, migration, schema, data corruption | Data Issue | /sre debug (data focus) |
| auth, login, permission, 401, 403 | Auth Issue | Could be infra OR code — diagnose first |
INCIDENT — {severity}: {type}
Description: {user's description}
Don't ask questions for CRITICAL severity. Just proceed to Phase 2. For HIGH/MEDIUM, briefly confirm the description then proceed. For LOW, ask if they want full triage or just a quick check.
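Conceptually, Phase 1 is a first-match keyword scan over the tables above. A minimal Python sketch (the keyword lists mirror the tables; the fallback defaults when nothing matches are assumptions, not part of the spec):

```python
# First match wins; severity is checked from most to least severe.
SEVERITY_KEYWORDS = [
    ("CRITICAL", ["down", "outage", "crash", "500 in production", "data loss", "security breach"]),
    ("HIGH",     ["slow", "timeout", "degraded", "errors increasing", "staging broken"]),
    ("MEDIUM",   ["failing", "broken", "error", "not working", "regression"]),
    ("LOW",      ["flaky", "intermittent", "warning", "minor", "cosmetic"]),
]

TYPE_KEYWORDS = [
    ("CI Failure",     ["ci", "pipeline", "build failed", "lint", "test fail", "workflow"]),
    ("Infrastructure", ["deploy", "health", "down", "outage", "production", "502", "503"]),
    ("Code Bug",       ["bug", "fix", "regression", "broken feature", "wrong behavior"]),
    ("Performance",    ["slow", "performance", "timeout", "latency", "memory", "cpu"]),
    ("Local Ops",      ["restart", "start", "stop", "hung", "frozen", "local"]),
    ("Data Issue",     ["database", "migration", "schema", "data corruption"]),
    ("Auth Issue",     ["auth", "login", "permission", "401", "403"]),
]

def classify(description: str) -> tuple[str, str]:
    """Return (severity, type) for a free-text incident description."""
    text = description.lower()
    # Defaults below are illustrative assumptions, not part of the spec.
    severity = next((s for s, kws in SEVERITY_KEYWORDS
                     if any(k in text for k in kws)), "MEDIUM")
    itype = next((t for t, kws in TYPE_KEYWORDS
                  if any(k in text for k in kws)), "Infrastructure")
    return severity, itype
```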
Spawn 4 cheap agents IN PARALLEL to gather evidence fast. Use explore-light and ops agents (Haiku, 1x cost) — not Sonnet/Opus. Speed and cost matter here.
subagent_type: "explore-light"
name: "health-check"
prompt: "Read CLAUDE.md for health endpoints and infrastructure details.
Hit every health endpoint listed. Also check:
- Production URL: {from CLAUDE.md or project-profile.md}
- Staging URL: {if configured}
Report pass/fail for each endpoint with response time and status code."
subagent_type: "ops"
name: "deploy-check"
prompt: "Check recent deployment activity:
1. git log --oneline -5 main (what recently shipped)
2. gh run list --branch main --limit 5 (CI status)
3. If Railway: railway status (or platform-appropriate command)
Report: what was the last deploy, when, did CI pass, any errors."
subagent_type: "ops"
name: "error-check"
prompt: "Check for error signals:
1. If deploy platform has logs: get last 50 lines, filter for ERROR/WARN/FATAL
2. If Sentry configured (check .env or CLAUDE.md): note DSN existence
3. If Docker running: docker ps --format table, check for unhealthy/restarting containers
4. Check if any services are expected but not running (from CLAUDE.md Local Dev Services)
Report: active errors, unhealthy services, recent error patterns."
subagent_type: "ops"
name: "kb-lookup"
prompt: "Search for similar past incidents:
1. Read .claude/qa-knowledge/incidents/ — any files mentioning {affected area keywords}
2. Read .claude/qa-knowledge/bug-patterns.md — any patterns matching {symptoms}
3. Read .claude/knowledge/agents/sre.md — any past resolutions for similar issues
Report: similar past incidents with root causes and how they were resolved."
CRITICAL: Spawn all 4 in ONE message for maximum parallelism.
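The fan-out itself is four independent tasks gathered concurrently. A sketch using Python's standard library, where run_agent is a hypothetical stand-in for the host's subagent-spawning call:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stub for the host's subagent API -- not a real call.
def run_agent(subagent_type: str, name: str, prompt: str) -> dict: ...

CHECKS = [
    ("explore-light", "health-check", "Hit every health endpoint..."),
    ("ops", "deploy-check", "Check recent deployment activity..."),
    ("ops", "error-check", "Check for error signals..."),
    ("ops", "kb-lookup", "Search for similar past incidents..."),
]

def parallel_diagnosis() -> dict[str, dict]:
    # All four checks are dispatched at once; results are collected as they finish.
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {name: pool.submit(run_agent, t, name, p) for t, name, p in CHECKS}
        return {name: f.result(timeout=60) for name, f in futures.items()}
```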
When all 4 agents report back, correlate their findings:
| Health | Last Deploy | CI | Errors | Diagnosis | Route |
|---|---|---|---|---|---|
| DOWN | Recent (< 1hr) | Green | Deploy errors in logs | Bad deploy | Revert or /fix |
| DOWN | Recent (< 1hr) | Red | Test failures | Broken CI shipped | Revert, then /ci-fix |
| DOWN | None recent | — | Infra errors | Infrastructure failure | /sre debug |
| DOWN | None recent | — | No errors | External dependency | Check 3rd party status |
| UP | — | Red | CI errors | CI broken, prod OK | /ci-fix (not urgent) |
| UP | — | Green | App errors in logs | Code bug in prod | /fix with log context |
| UP | — | Green | No errors | Intermittent/resolved | Monitor, check if still happening |
| SLOW | Recent | — | Timeout errors | Perf regression from deploy | /sre debug + /fix |
| SLOW | None recent | — | DB slow queries | Database performance | /sre debug (data focus) |
| N/A | — | Red | — | CI failure only | /ci-fix |
| N/A | — | — | Local errors | Local dev issue | ops agent or /restart |
DIAGNOSIS:
Severity: {CRITICAL/HIGH/MEDIUM/LOW}
Type: {infrastructure/CI/code bug/performance/data/auth}
Evidence:
Health: {UP/DOWN/SLOW} — {details}
Last deploy: {time} — {commit message}
CI: {GREEN/RED} — {details}
Errors: {summary}
Similar past incident: {if found}
Root cause hypothesis: {what we think happened based on correlation}
Recommended action: {specific skill to invoke}
Preliminary confidence: {HIGH (4 signals aligned) / MEDIUM (3 signals aligned) / LOW (2 or fewer signals aligned)}
Dissenting signals: {list any signals that don't fit the primary hypothesis, or "none"}
Confidence is computed from how many of the 4 diagnostic signals (Health, Last Deploy, CI, Errors) align with the matched Decision Matrix row: HIGH means all 4 aligned, MEDIUM means 3, LOW means 2 or fewer.
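A sketch of that alignment count (treating a matrix cell marked "—" as a wildcard that always aligns is an assumption):

```python
def preliminary_confidence(observed: dict, row: dict) -> str:
    """Count how many of the 4 signals match the Decision Matrix row.

    observed/row map signal name -> value, e.g. {"health": "DOWN", ...}.
    A "—" (wildcard) in the matrix row counts as aligned.
    """
    signals = ["health", "last_deploy", "ci", "errors"]
    aligned = sum(1 for s in signals
                  if row.get(s) in ("—", None) or observed.get(s) == row.get(s))
    return "HIGH" if aligned == 4 else "MEDIUM" if aligned == 3 else "LOW"
```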
Before routing to a resolution skill, subject the primary diagnosis to adversarial challenge. 90 seconds of verification prevents 30 minutes of wrong-path investigation.
| Severity | Confidence | Mode | Time Budget | Agents | Condition |
|---|---|---|---|---|---|
| CRITICAL | HIGH | Skip | 0s | None | Unambiguous evidence — act now |
| CRITICAL | MEDIUM or LOW | Fast Challenge | 30s | 1 Haiku (qa-challenger) | Quick sanity check before committing |
| HIGH | Any | Full Challenge | 90s | 2 Haiku agents | Worth 90s to avoid 30min wrong-path |
| MEDIUM | Any | Full Challenge | 90s | 2 Haiku agents | Same as HIGH |
| LOW | Any | Full Challenge | 120s | 2 Haiku agents | Extra time, low urgency |
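The mode selection is a direct transcription of the table above, sketched for precision:

```python
def challenge_mode(severity: str, confidence: str) -> tuple[str, int, int]:
    """Map (severity, preliminary confidence) -> (mode, time budget in s, agents)."""
    if severity == "CRITICAL":
        # Unambiguous evidence: act now. Otherwise a 30s sanity check.
        return ("skip", 0, 0) if confidence == "HIGH" else ("fast", 30, 1)
    if severity == "LOW":
        return ("full", 120, 2)  # extra time, low urgency
    return ("full", 90, 2)       # HIGH and MEDIUM severity
```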
Spawn a single qa-challenger subagent as devil's advocate:
subagent_type: "qa-challenger"
model: haiku
name: "devil-advocate-fast"
prompt: "You are a devil's advocate reviewing an incident diagnosis.
Incident: {original description}
Evidence gathered:
Health: {Phase 2.1 result}
Last deploy: {Phase 2.2 result}
CI status: {Phase 2.2 CI result}
Error signals: {Phase 2.3 result}
KB matches: {Phase 2.4 result}
Primary diagnosis: {hypothesis from Phase 3}
Recommended action: {skill from Phase 3}
Your job: in 30 seconds, challenge this diagnosis.
- Is there an alternative explanation that fits the evidence better?
- Is any key evidence being ignored?
- Could this be a different failure mode?
Output exactly:
CHALLENGE: {alternative hypothesis} OR NONE — diagnosis looks correct
CONFIDENCE: HIGH / MEDIUM / LOW
KEY EVIDENCE: {the piece of evidence that drives your challenge, or n/a}"
Orchestrator enforces the 30-second time budget. If no response in 30s, proceed with primary hypothesis.
Create an agent team for cross-domain adversarial review:
TeamCreate: "incident-challenge-{timestamp}"
Agent 1 — devil-advocate:
subagent_type: "qa-challenger"
model: haiku
role: Challenges primary hypothesis, proposes alternatives
Agent 2 — alt-hypothesis:
subagent_type: "sre" (if primary routes to /fix or /ci-fix)
OR "ops" (if primary routes to /sre debug — always the OPPOSITE domain)
model: haiku
role: Cross-domain verification of the challenge
Agent 2 domain selection rule: always choose the domain OPPOSITE to where the primary hypothesis routes. If the primary diagnosis says "code bug → /fix", Agent 2 is an SRE (infra perspective). If the primary says "infrastructure → /sre debug", Agent 2 is an ops agent (code/config perspective).
SendMessage flow (max 2 hops):
Hop 1 — devil-advocate → alt-hypothesis:
"Challenge: Primary diagnosis says {primary hypothesis} but I want your cross-domain view.
Evidence:
Health: {Phase 2.1 result}
Last deploy: {Phase 2.2 result}
CI status: {Phase 2.2 CI result}
Error signals: {Phase 2.3 result}
My alternative hypothesis: {alternative — or 'I agree with primary if no better alternative'}
What do you see from your domain perspective?"
Hop 2 — alt-hypothesis → devil-advocate:
"Verdict: {AGREE-PRIMARY | AGREE-CHALLENGE | THIRD-HYPOTHESIS}
Reasoning: {brief — max 2 sentences}
Additional evidence: {any cross-domain signal that supports your verdict}"
devil-advocate → team lead (orchestrator): final positions summary:
CHALLENGE SUMMARY:
devil-advocate position: {primary is correct / alternative: X}
alt-hypothesis verdict: {AGREE-PRIMARY / AGREE-CHALLENGE / THIRD-HYPOTHESIS: X}
Key disagreement: {what they disagreed on, or "none — consensus reached"}
If at any point during Phase 3b the user sends "just fix it" (or equivalent urgency signal), immediately proceed with the primary hypothesis. Skip remaining challenge steps. Document the override in the incident report.
Compute a confidence score from Phase 3b results. This score determines whether to proceed, merge, or escalate.
Start at 1.0 and apply adjustments:
| Signal | Adjustment |
|---|---|
| DA says "NONE — diagnosis looks correct" | +0.0 (stay at 1.0) |
| DA proposes alternative with HIGH confidence | -0.4 |
| DA proposes alternative with MEDIUM confidence | -0.2 |
| DA proposes alternative with LOW confidence | -0.1 |
| alt-hypothesis AGREES with challenge (not primary) | -0.3 |
| alt-hypothesis proposes THIRD hypothesis | -0.2 |
| alt-hypothesis AGREES with primary | +0.1 (cap at 1.0) |
| KB found matching past incident (Phase 2.4) | +0.1 (cap at 1.0) |
| Fast Challenge mode (only 1 agent ran) | No adjustment — use score as-is |
| Skip mode (CRITICAL + HIGH confidence) | Score = 1.0 by definition |
| Score | Verdict | Action |
|---|---|---|
| >= 0.8 | PROCEED | Continue with primary hypothesis unchanged |
| 0.5 – 0.79 | MERGE | Incorporate challenger insights into the fix context. Document fallback hypothesis for Phase 4 agents. Proceed with primary but stay alert for signs of the alternative. |
| < 0.5 | ESCALATE | Do not auto-route. Present competing hypotheses to user. |
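A sketch combining the adjustment table and the verdict thresholds (in Fast Challenge mode, alt_verdict would simply be None):

```python
def challenge_score(da_challenge: str | None, da_confidence: str,
                    alt_verdict: str | None, kb_match: bool) -> float:
    """Apply the adjustment table to a starting score of 1.0."""
    score = 1.0
    if da_challenge:  # devil's advocate proposed an alternative
        score -= {"HIGH": 0.4, "MEDIUM": 0.2, "LOW": 0.1}[da_confidence]
    if alt_verdict == "AGREE-CHALLENGE":
        score -= 0.3
    elif alt_verdict == "THIRD-HYPOTHESIS":
        score -= 0.2
    elif alt_verdict == "AGREE-PRIMARY":
        score += 0.1
    if kb_match:  # Phase 2.4 found a matching past incident
        score += 0.1
    return min(score, 1.0)  # positive adjustments cap at 1.0

def verdict(score: float) -> str:
    if score >= 0.8:
        return "PROCEED"
    return "MERGE" if score >= 0.5 else "ESCALATE"
```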
ESCALATE output format:
Competing hypotheses — your call:
[A] Primary: {primary hypothesis}
Evidence for: {supporting signals}
Confidence: {score before challenge}
[B] Challenger: {alternative hypothesis}
Evidence for: {challenger's key evidence}
From: {da / alt-hypothesis / both}
[C] Investigate both — run parallel diagnosis targeting both hypotheses
Which path? (A / B / C)
VERDICT GATE:
Challenge mode: {skip / fast / full}
Primary hypothesis: {diagnosis}
Challenger finding: {agreed / alternative: X / third hypothesis: X}
Confidence score: {0.0–1.0}
Verdict: {PROCEED / MERGE / ESCALATE}
Fallback hypothesis: {if MERGE — what to watch for during resolution}
Based on the diagnosis, invoke the appropriate skill. Do NOT ask the user which skill to use. The correlation already determined it.
Route: /sre debug
Invoke /sre debug with context:
- Description: {original description}
- Health check results: {from Phase 2.1}
- Deploy history: {from Phase 2.2}
- Error logs: {from Phase 2.3}
- Similar incidents: {from Phase 2.4}
- Challenge results: {verdict from Phase 3c — PROCEED/MERGE/ESCALATE/N/A (challenge skipped) and challenger summary}
- Fallback hypothesis: {if MERGE verdict — alternative hypothesis to watch for during investigation}
The SRE agent gets all the context from parallel diagnosis — doesn't need to re-discover.
Route: /ci-fix
Invoke /ci-fix with context:
- Mode: ci (or staging/prod based on diagnosis)
- Branch: {from deploy check}
- Known patterns: {from KB lookup}
- Challenge results: {verdict from Phase 3c — PROCEED/MERGE/ESCALATE/N/A (challenge skipped) and challenger summary}
- Fallback hypothesis: {if MERGE verdict — alternative hypothesis to watch for during investigation}
Route: /fix
Invoke /fix with context:
- Description: {original description + diagnosis}
- Severity: {from Phase 1}
- Error logs: {from Phase 2.3 — gives the agent a head start on root cause}
- Similar incidents: {from Phase 2.4 — may identify root cause immediately}
- Challenge results: {verdict from Phase 3c — PROCEED/MERGE/ESCALATE/N/A (challenge skipped) and challenger summary}
- Fallback hypothesis: {if MERGE verdict — alternative hypothesis to watch for during investigation}
If an issue number was provided (/incident #234), pass it: /fix #234 --severity {classified severity}
Route: /sre debug (perf focus)
Invoke /sre debug with:
- Description: "Performance degradation: {description}"
- Focus: "Use RED method (Rate/Errors/Duration) for API endpoints.
Check: database query times, connection pool, cache hit rates, memory usage."
- Challenge results: {verdict from Phase 3c — PROCEED/MERGE/ESCALATE/N/A (challenge skipped) and challenger summary}
- Fallback hypothesis: {if MERGE verdict — alternative hypothesis to watch for during investigation}
Route: /restart or ops agent
For local issues (services not starting, local errors):
- Invoke /restart skill
- Or spawn ops agent for specific task
Route: /sre status, then decide
If correlation is inconclusive:
1. Run /sre status for complete system overview
2. Present findings to user
3. Ask: "Based on this, it looks like {hypothesis}. Should I proceed with {skill}?"
Phase 5 starts after the routed skill completes its full lifecycle. Important: different skills have different post-implementation flows — /incident must wait for the ENTIRE flow, not just the fix itself.
| Routed to | What that skill does after fixing | When /incident Phase 5 starts |
|---|---|---|
| /fix | Step 6: QA → restart servers → user tests → /pr --skip-qa → PR created | After PR is created (Step 6.4 completes) |
| /ci-fix | Retries CI, may push fixes | After CI passes or exhausts retries |
| /sre debug | Infra fix (no code) → verify health | Immediately after health verified |
| /sre debug → escalates to /fix | Same as /fix row above | After /fix's full lifecycle completes |
| /restart | Restarts services → verify health | Immediately after health verified |
Key rule: Do NOT duplicate QA, server restarts, or user testing that the routed skill
already handles. /incident Phase 5 is about verification, reporting, and learning — not
re-running the same checks.
For code bug routes (/fix, /ci-fix): The routed skill already ran QA, restarted
servers, and got user confirmation. Phase 5.1 only re-runs health checks to confirm
the deployed fix (if auto-deployed) or that local state is clean:
# Re-run the health checks from Phase 2.1
# Compare: were the failing endpoints now passing?
# Compare: are the error signals gone?
For infra routes (/sre debug, /restart): These don't go through QA/PR.
Phase 5.1 is the primary verification — check health endpoints and error signals.
If the infra fix involved config or environment changes that could affect behavior:
After infra fix, quick sanity check:
1. Hit all health endpoints (from CLAUDE.md Environments table)
2. Check error signals have cleared
3. If config was changed: ask user to smoke-test manually
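A hedged sketch of the re-check, assuming the endpoint list was already parsed from CLAUDE.md and that a plain HTTP GET is an adequate probe:

```python
import urllib.request

def verify_health(endpoints: list[str], failing_before: set[str]) -> bool:
    """Re-hit the Phase 2.1 endpoints; confirm previously failing ones recovered."""
    recovered = True
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                ok = 200 <= resp.status < 300
        except OSError:  # covers URLError, HTTPError, timeouts
            ok = False
        if url in failing_before and not ok:
            recovered = False  # a previously failing endpoint is still down
    return recovered
```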
Create .claude/qa-knowledge/incidents/{date}-{slug}.md:
---
status: resolved
severity: {severity}
type: {type}
affected: {services/endpoints}
duration: {time from report to resolution}
root_cause: {from the skill that fixed it}
resolved_by: {/sre, /ci-fix, /fix, /restart}
---
## Timeline
- {time}: Incident reported — "{original description}"
- {time}: Parallel diagnosis — {what was found}
- {time}: Routed to {skill} — {diagnosis}
- {time}: Resolution — {what was done}
- {time}: Verified — {health checks pass}
## Root Cause
{from the resolving skill}
## Prevention
{what would prevent this from happening again}
## Similar Past Incidents
{from Phase 2.4 KB lookup}
## Challenge Results
- Challenge mode: {skip / fast / full}
- Primary hypothesis: {diagnosis from Phase 3}
- Challenger finding: {agreed / alternative proposed: X / third hypothesis: X}
- Verdict: {PROCEED / MERGE / ESCALATE}
- Confidence score: {0.0–1.0}
Update the knowledge base:
- .claude/knowledge/agents/sre.md: incident summary + resolution
- .claude/qa-knowledge/bug-patterns.md: new bug patterns (if a code bug)
- .claude/knowledge/skills/incident.md: challenge effectiveness data — whether the challenger caught a real issue, whether the primary hypothesis was correct, and the final confidence score. This builds a record of when challenge modes add value vs. when they confirm the obvious.

If Discord MCP is available:
Send to #deployments:
"Incident resolved — {severity} {type}
Root cause: {summary}
Fix: {what was done}
Duration: {time}"
After resolution is verified and reported:
Incident resolved. Duration: {time}.
What's next?
[1] Monitor — keep watching for recurrence (/sre health in 5 min)
[2] Fix another issue — /fix #N or /incident <description>
[3] See project status — /onboard
[4] Done for now
For CRITICAL/HIGH severity, default to option 1 (monitor) and suggest running
/sre health after 5-10 minutes to confirm the fix holds.
/incident (no args)
When invoked without a description, run Phase 2 checks proactively.
If everything is green:
All systems healthy:
✓ Health endpoints: all responding
✓ CI: last 3 runs green
✓ Deploy: last deploy {time ago}, healthy
✓ No error signals detected
Nothing to triage. What prompted the check?
If something is wrong, proceed to Phase 3 correlation, Phase 3b challenge, Phase 3c verdict, and Phase 4 routing automatically.
| Condition | Escalation |
|---|---|
| /sre debug finds a code bug | → Escalate to /fix with SRE's findings as context |
| /ci-fix exhausts 3 attempts | → Escalate to /fix (deeper investigation needed) |
| /fix finds an infra issue (not code) | → Escalate back to /sre debug |
| Any skill fails to resolve in 15 min | → Alert on Discord, present options to user |
| CRITICAL severity not resolved in 30 min | → Suggest revert: git revert HEAD && git push |
| Skill / Agent | How /incident uses it |
|---|---|
| /sre debug | Routed to for infrastructure issues, gets pre-gathered context |
| /ci-fix | Routed to for CI failures, gets branch and pattern context |
| /fix | Routed to for code bugs, gets error logs and similar incidents |
| /restart | Routed to for local ops issues |
| /qa | Run after resolution to verify no regressions |
| explore-light | Phase 2 health checks (1x cost) |
| ops agent | Phase 2 deploy/error/KB checks (1x cost) |
| /review-pr | After /fix creates a PR, review it before merge |
| Phase | Agents | Cost |
|---|---|---|
| Phase 1: Classify | None (pattern matching) | 0 |
| Phase 2: Diagnose | 4 Haiku agents in parallel | 4x |
| Phase 3: Correlate | None (analysis) | 0 |
| Phase 3b: Challenge (skip) | None | 0 |
| Phase 3b: Challenge (fast) | 1 Haiku agent | 1x |
| Phase 3b: Challenge (full) | 2 Haiku agents | 2x |
| Phase 4: Route | 1 Sonnet agent (SRE/fix/ci-fix) | 10x |
| Phase 5: Verify | 1 Haiku agent (health check) | 1x |
| Severity | Challenge Mode | Total Cost | Notes |
|---|---|---|---|
| CRITICAL + HIGH confidence | Skip | ~15x | No challenge overhead |
| CRITICAL + med/low confidence | Fast (1 Haiku) | ~16x | +1x for 30s sanity check |
| HIGH | Full (2 Haiku) | ~17x | +2x for 90s full challenge |
| MEDIUM | Full (2 Haiku) | ~17x | Same as HIGH |
| LOW | Full (2 Haiku) | ~17x | 120s budget, same agent cost |
Break-even analysis: Full challenge adds ~2x cost (2 Haiku agents). A single wrong routing — running /sre debug when the issue is a code bug — wastes a full Sonnet invocation (10x) plus the time to re-diagnose and re-route. Over 5 incidents, full challenge costs 5 × 2x = 10x, the price of exactly one wasted Sonnet invocation, so challenge pays for itself if it catches one misrouting in every 5 incidents. At typical incident rates, that's almost always worth it.
Escalation between skills is automatic: if /sre finds a code bug, it routes to /fix without asking.

Incidents don't happen in a vacuum. Stakeholders need updates throughout — not just at the end.
Check these sources to find configured channels:
| Signal | Channel | How to use |
|---|---|---|
| Discord MCP in .claude/settings.json | Discord | mcp__discord-mcp__send-message(channel, message) |
| slack@claude-plugins-official enabled | Slack | Slack plugin send message |
| SLACK_WEBHOOK_URL in env | Slack webhook | curl -X POST -d '{"text":"..."}' $SLACK_WEBHOOK_URL |
| DISCORD_WEBHOOK_URL in env | Discord webhook | curl -X POST -d '{"content":"..."}' $DISCORD_WEBHOOK_URL |
| PAGERDUTY_* in env | PagerDuty | PagerDuty API for escalation |
| OPSGENIE_* in env | OpsGenie | OpsGenie API for alerting |
| GitHub issue exists for the incident | GitHub | Comment on the issue with updates |
| TEAMS_WEBHOOK_URL in env | Microsoft Teams | Teams webhook API |
| Notion MCP configured | Notion | Create/update incident page |
| Linear MCP configured | Linear | Create/update incident issue |
On first run, detect which channels are available and store in knowledge base.
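A sketch of the env-var half of channel detection (MCP availability would be checked through the host instead; the webhook payload shapes follow the curl examples in the table):

```python
import json, os, urllib.request

def detect_channels() -> dict[str, str]:
    """Probe the env vars from the signal table."""
    channels = {}
    for var, name in [("SLACK_WEBHOOK_URL", "slack"),
                      ("DISCORD_WEBHOOK_URL", "discord"),
                      ("TEAMS_WEBHOOK_URL", "teams")]:
        if os.environ.get(var):
            channels[name] = os.environ[var]
    return channels

def post_webhook(url: str, payload: dict) -> None:
    """Equivalent of the curl commands in the table above."""
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=10)
```

Usage follows the payload shapes in the table: post_webhook(channels["slack"], {"text": "Incident detected..."}) for Slack, {"content": ...} for Discord.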
| Phase | Who to notify | What to say | Channel |
|---|---|---|---|
| Phase 1 (Classify) | On-call / team | "Incident detected: {severity} — {description}. Investigating." | Discord/Slack #incidents |
| Phase 3 (Correlate) | On-call / team | "Diagnosis: {type} — {hypothesis}. Routing to {skill}." | Discord/Slack #incidents |
| Phase 4 (mid-fix) | Stakeholders if CRITICAL | "Update: root cause identified — {cause}. Fix in progress. ETA: {estimate}." | Discord/Slack #incidents + PagerDuty |
| Phase 5 (Resolved) | Everyone | "Resolved: {root cause}. Fix: {what was done}. Duration: {time}." | Discord/Slack #incidents + GitHub issue |
| Post-incident | Team lead | Full incident report with timeline, root cause, prevention | Knowledge base + Notion/Linear |
If the user (or a stakeholder via Discord/Slack) provides additional context during the incident, incorporate it:
User: "actually it only affects users in EU"
→ Update scope: EU region only
→ Check: is there a regional configuration difference?
→ Update diagnosis with this context
User: "we just deployed a config change 10 minutes ago"
→ Highest suspect: the config change
→ Check the config diff
→ Skip broad diagnosis, focus on config
User: "don't revert, we need that deploy"
→ Respect the constraint
→ Find a forward fix instead of revert
→ Update incident notes with the constraint
| Source | What it provides | How to access |
|---|---|---|
| Git history | Recent commits, who changed what | git log, git blame |
| GitHub/GitLab | CI status, PRs, issues, deployments | gh CLI |
| CLAUDE.md | Health endpoints, service topology | Read tool |
| Docker | Container health, resource usage | docker ps, docker stats |
| Project logs | Application errors, access logs | Platform-specific CLI |
| MCP Server | What it unlocks |
|---|---|
| Postgres MCP | Direct DB queries — check connection count, slow queries, table sizes |
| Redis MCP | Cache hit rates, memory usage, connected clients |
| MongoDB MCP | Collection stats, slow operations, replica set health |
| Sentry MCP | Error rates, affected users, stack traces, release health |
| Datadog MCP | APM traces, infrastructure metrics, log patterns |
| AWS MCP | CloudWatch metrics, ECS task health, RDS stats, Lambda errors |
| GCP MCP | Cloud Monitoring, Error Reporting, Trace |
| Cloudflare MCP | Edge analytics, WAF events, origin health |
| PagerDuty MCP | Active incidents, on-call schedule |
If diagnosis is limited by missing data sources:
Incident triage limited — missing data sources:
⚠ No monitoring connected — can't check error rates or performance metrics
Recommend: Sentry MCP (free tier) or Datadog MCP
/calibrate can install this automatically
⚠ No database MCP — can't check query performance or connection health
Recommend: Postgres MCP / Redis MCP based on your stack
/calibrate can install this automatically
⚠ No alerting configured — team won't be notified of future incidents
Recommend: PagerDuty or OpsGenie integration
These recommendations feed back to /calibrate — next time it runs, it includes
incident-driven recommendations alongside the standard ones.
After resolution, analyze the incident for patterns:
# What to learn:
1. Classification accuracy — did we route to the right skill?
If /sre was invoked but it turned out to be a code bug → improve type classification
2. Diagnosis speed — which of the 4 parallel checks found the answer?
If KB lookup found a matching past incident → that was the fastest path
If health check was the key signal → health checks are working well
3. Resolution effectiveness — did the routed skill fix it?
If /fix resolved it → code bug pattern, add to bug-patterns.md
If /sre resolved it → infra pattern, add to sre knowledge
If manual intervention was needed → gap in automation
4. Time to resolution — how long from report to verified fix?
Track per severity and type for trending
After every incident, append to .claude/knowledge/skills/incident.md:
### {date} — {title} ({severity}, {type})
**Classified as**: {type} — {was this correct? yes/no}
**Routed to**: {skill}
**Root cause**: {summary}
**Resolution time**: {duration}
**Key signal**: {which Phase 2 check found the answer}
**Knowledge gap**: {what we didn't know that slowed us down}
**New pattern**: {if this is a new type of incident, describe it}
**Similar past**: {count of similar incidents — is this recurring?}
After 5+ incidents are logged, check for patterns:
Incident patterns detected:
⚠ 3 incidents in backend/app/services/auth.py in last 30 days
→ Consider: comprehensive auth service refactor
⚠ 2 CI failures from dependency updates in last 2 weeks
→ Consider: pin dependencies, add lockfile check
⚠ Database slow query incidents increasing
→ Consider: add query monitoring, review indexes
Surface these in /onboard briefing so the team sees systemic issues, not just
individual incidents.
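A sketch of that pattern scan, assuming entries follow the log template above (the threshold of 3 is an illustrative choice):

```python
import re
from collections import Counter
from pathlib import Path

def recurring_patterns(log: Path = Path(".claude/knowledge/skills/incident.md"),
                       threshold: int = 3) -> list[str]:
    """Flag incident types that recur, assuming entries use the
    '### {date} — {title} ({severity}, {type})' header template."""
    if not log.exists():
        return []
    headers = re.findall(r"^### .+\((\w+), ([^)]+)\)$", log.read_text(), re.M)
    counts = Counter(itype for _sev, itype in headers)
    return [f"{n} incidents of type {t}" for t, n in counts.items() if n >= threshold]
```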
Read .claude/knowledge/skills/incident.md at the start of every incident. If past
incidents show that certain keywords were misclassified:
# From knowledge base:
# "auth timeout" was classified as INFRA but was actually CODE BUG (3 times)
# → Override: "auth.*timeout" → CODE BUG, not INFRA
The classification tables in Phase 1 are defaults. Knowledge base corrections override them for this specific project.
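A sketch of the override check, layered in front of the default classifier from the Phase 1 sketch (the regex is the example from the comment block above):

```python
import re

# Learned overrides, e.g. loaded from .claude/knowledge/skills/incident.md.
OVERRIDES = [
    (re.compile(r"auth.*timeout", re.I), "Code Bug"),  # misclassified as INFRA 3 times
]

def classify_with_overrides(description: str) -> str:
    """Check KB-learned regex overrides before the default Phase 1 tables."""
    for pattern, itype in OVERRIDES:
        if pattern.search(description):
            return itype
    # Fall back to the default keyword classification (see the Phase 1 sketch).
    _severity, itype = classify(description)
    return itype
```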
When /calibrate runs, it discovers which data sources and notification channels are available. After incidents, /incident recommends the integrations it found missing (see the gap report above). These get included in /calibrate's recommendations next time it runs.
If .claude/project-profile.md exists, use it for project context (it supplies the production URL in Phase 2.1). If it doesn't exist, suggest running /calibrate first.

Something is wrong
│
▼
/incident "{description}"
│
├── Phase 1: CLASSIFY (instant)
│ Severity: CRITICAL / HIGH / MEDIUM / LOW
│ Type: infra / CI / code bug / perf / data / auth
│ → Notify team: "Incident detected, investigating"
│
├── Phase 2: PARALLEL DIAGNOSIS (4 agents, < 60s)
│ ├── Health endpoints (explore-light, 1x)
│ ├── Recent deploys + CI (ops, 1x)
│ ├── Error signals + logs (ops, 1x)
│ └── Knowledge base lookup (ops, 1x)
│ → Notify team: "Diagnosis complete, routing"
│
├── Phase 3: CORRELATE
│ Health × Deploy × CI × Errors = Diagnosis
│ → Hypothesis + preliminary confidence (HIGH/MEDIUM/LOW)
│
├── Phase 3b: CHALLENGE (adversarial swarm)
│ ├── CRITICAL + HIGH conf → Skip (0s)
│ ├── CRITICAL + med/low → Fast: 1 Haiku devil's advocate (30s)
│ └── HIGH/MEDIUM/LOW → Full: 2 Haiku agents, 2-hop debate (90-120s)
│ ├── devil-advocate: challenges primary hypothesis
│ └── alt-hypothesis: cross-domain verification
│
├── Phase 3c: VERDICT GATE
│ Confidence score → PROCEED / MERGE / ESCALATE
│ → If ESCALATE: present options [A] primary [B] challenger [C] both
│
├── Phase 4: ROUTE + RESOLVE
│ ├── Infra → /sre debug (with all Phase 2 context)
│ ├── CI → /ci-fix (with branch + patterns)
│ ├── Code bug → /fix (with error logs + similar incidents)
│ ├── Perf → /sre debug (perf focus)
│ └── Local → /restart or ops agent
│ → Notify team: "Root cause: {X}. Fix in progress."
│ → Accept stakeholder input: constraints, context, steering
│
├── Phase 5: VERIFY + REPORT
│ ├── Re-run health checks
│ ├── Write incident report
│ ├── Update knowledge base
│ └── Notify team: "Resolved. Duration: {X}."
│
└── LEARN
├── Was classification correct?
├── Which diagnosis signal was key?
├── Is this a recurring pattern?
└── What data sources were missing?