Incident triage orchestrator — classifies severity, diagnoses in parallel, routes to /sre, /ci-fix, or /fix based on evidence. Usage: /incident <description>
The single entry point for anything going wrong. Classifies, diagnoses, and routes to
the right skill — you don't need to figure out if it's an infra issue, CI failure, or
code bug. /incident figures it out for you.
Flow: Classify → Parallel Diagnosis → Correlate → Challenge → Verdict → Route → Resolve → Learn
| Input | Action |
|---|---|
| /incident the site is down | Free-text incident description |
| /incident #234 | Load incident from GitHub issue |
| /incident (no args) | Auto-detect — check health, CI, recent deploys |
From the description, classify severity and type immediately.
| Keywords | Severity | Response Time |
|---|---|---|
| down, outage, crash, 500 in production, data loss, security breach | CRITICAL | Act immediately, no questions |
| slow, timeout, degraded, errors increasing, staging broken | HIGH | Act within minutes |
| failing, broken, error, not working, regression | MEDIUM | Investigate, then act |
| flaky, intermittent, warning, minor, cosmetic | LOW | Queue for next available time |
| Keywords | Type | Primary Route |
|---|---|---|
| CI, pipeline, build failed, lint, test fail, Actions, workflow | CI Failure | /ci-fix |
| deploy, health, down, outage, production, staging, 500, 502, 503 | Infrastructure | /sre debug |
| bug, fix, issue #, regression, broken feature, wrong behavior | Code Bug | /fix |
| slow, performance, timeout, latency, memory, CPU | Performance | /sre debug (perf focus) |
| restart, start, stop, hung, frozen, local | Local Ops | /restart or ops agent |
| database, migration, schema, data corruption | Data Issue | /sre debug (data focus) |
| auth, login, permission, 401, 403 | Auth Issue | Could be infra OR code — diagnose first |
INCIDENT — {severity}: {type}
Description: {user's description}
Don't ask questions for CRITICAL severity. Just proceed to Phase 2. For HIGH/MEDIUM, briefly confirm the description then proceed. For LOW, ask if they want full triage or just a quick check.
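Conceptually, Phase 1 is a first-match keyword scan over the tables above. A minimal Python sketch (the keyword lists mirror the tables; the fallback defaults when nothing matches are assumptions, not part of the spec):

```python
# First match wins; severity is checked from most to least severe.
SEVERITY_KEYWORDS = [
    ("CRITICAL", ["down", "outage", "crash", "500 in production", "data loss", "security breach"]),
    ("HIGH",     ["slow", "timeout", "degraded", "errors increasing", "staging broken"]),
    ("MEDIUM",   ["failing", "broken", "error", "not working", "regression"]),
    ("LOW",      ["flaky", "intermittent", "warning", "minor", "cosmetic"]),
]

TYPE_KEYWORDS = [
    ("CI Failure",     ["ci", "pipeline", "build failed", "lint", "test fail", "workflow"]),
    ("Infrastructure", ["deploy", "health", "down", "outage", "production", "502", "503"]),
    ("Code Bug",       ["bug", "fix", "regression", "broken feature", "wrong behavior"]),
    ("Performance",    ["slow", "performance", "timeout", "latency", "memory", "cpu"]),
    ("Local Ops",      ["restart", "start", "stop", "hung", "frozen", "local"]),
    ("Data Issue",     ["database", "migration", "schema", "data corruption"]),
    ("Auth Issue",     ["auth", "login", "permission", "401", "403"]),
]

def classify(description: str) -> tuple[str, str]:
    """Return (severity, type) for a free-text incident description."""
    text = description.lower()
    # Defaults below are illustrative assumptions, not part of the spec.
    severity = next((s for s, kws in SEVERITY_KEYWORDS
                     if any(k in text for k in kws)), "MEDIUM")
    itype = next((t for t, kws in TYPE_KEYWORDS
                  if any(k in text for k in kws)), "Infrastructure")
    return severity, itype
```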
Spawn 4 cheap agents IN PARALLEL to gather evidence fast. Use explore-light and ops agents (Haiku, 1x cost) — not Sonnet/Opus. Speed and cost matter here.
subagent_type: "explore-light"
name: "health-check"
prompt: "Read CLAUDE.md for health endpoints and infrastructure details.
Hit every health endpoint listed. Also check:
- Production URL: {from CLAUDE.md or project-profile.md}
- Staging URL: {if configured}
Report pass/fail for each endpoint with response time and status code."
subagent_type: "ops"
name: "deploy-check"
prompt: "Check recent deployment activity:
1. git log --oneline -5 main (what recently shipped)
2. gh run list --branch main --limit 5 (CI status)
3. If Railway: railway status (or platform-appropriate command)
Report: what was the last deploy, when, did CI pass, any errors."
subagent_type: "ops"
name: "error-check"
prompt: "Check for error signals:
1. If deploy platform has logs: get last 50 lines, filter for ERROR/WARN/FATAL
2. If Sentry configured (check .env or CLAUDE.md): note DSN existence
3. If Docker running: docker ps --format table, check for unhealthy/restarting containers
4. Check if any services are expected but not running (from CLAUDE.md Local Dev Services)
Report: active errors, unhealthy services, recent error patterns."
subagent_type: "ops"
name: "kb-lookup"
prompt: "Search for similar past incidents:
1. Read .claude/qa-knowledge/incidents/ — any files mentioning {affected area keywords}
2. Read .claude/qa-knowledge/bug-patterns.md — any patterns matching {symptoms}
3. Read .claude/knowledge/agents/sre.md — any past resolutions for similar issues
Report: similar past incidents with root causes and how they were resolved."
CRITICAL: Spawn all 4 in ONE message for maximum parallelism.
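The fan-out itself is four independent tasks gathered concurrently. A sketch using Python's standard library, where run_agent is a hypothetical stand-in for the host's subagent-spawning call:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stub for the host's subagent API -- not a real call.
def run_agent(subagent_type: str, name: str, prompt: str) -> dict: ...

CHECKS = [
    ("explore-light", "health-check", "Hit every health endpoint..."),
    ("ops", "deploy-check", "Check recent deployment activity..."),
    ("ops", "error-check", "Check for error signals..."),
    ("ops", "kb-lookup", "Search for similar past incidents..."),
]

def parallel_diagnosis() -> dict[str, dict]:
    # All four checks are dispatched at once; results are collected as they finish.
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {name: pool.submit(run_agent, t, name, p) for t, name, p in CHECKS}
        return {name: f.result(timeout=60) for name, f in futures.items()}
```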
When all 4 agents report back, correlate their findings:
| Health | Last Deploy | CI | Errors | Diagnosis | Route |
|---|---|---|---|---|---|
| DOWN | Recent (< 1hr) | Green | Deploy errors in logs | Bad deploy | Revert or /fix |
| DOWN | Recent (< 1hr) | Red | Test failures | Broken CI shipped | Revert, then /ci-fix |
| DOWN | None recent | — | Infra errors | Infrastructure failure | /sre debug |
| DOWN | None recent | — | No errors | External dependency | Check 3rd party status |
| UP | — | Red | CI errors | CI broken, prod OK | /ci-fix (not urgent) |
| UP | — | Green | App errors in logs | Code bug in prod | /fix with log context |
| UP | — | Green | No errors | Intermittent/resolved | Monitor, check if still happening |
| SLOW | Recent | — | Timeout errors | Perf regression from deploy | /sre debug + /fix |
| SLOW | None recent | — | DB slow queries | Database performance | /sre debug (data focus) |
| N/A | — | Red | — | CI failure only | /ci-fix |
| N/A | — | — | Local errors | Local dev issue | ops agent or /restart |
DIAGNOSIS:
Severity: {CRITICAL/HIGH/MEDIUM/LOW}
Type: {infrastructure/CI/code bug/performance/data/auth}
Evidence:
Health: {UP/DOWN/SLOW} — {details}
Last deploy: {time} — {commit message}
CI: {GREEN/RED} — {details}
Errors: {summary}
Similar past incident: {if found}
Root cause hypothesis: {what we think happened based on correlation}
Recommended action: {specific skill to invoke}
Preliminary confidence: {HIGH (4 signals aligned) / MEDIUM (3 signals aligned) / LOW (2 or fewer signals aligned)}
Dissenting signals: {list any signals that don't fit the primary hypothesis, or "none"}
Confidence is computed from how many of the 4 diagnostic signals (Health, Last Deploy, CI, Errors) align with the matched Decision Matrix row: HIGH means all 4 aligned, MEDIUM means 3, LOW means 2 or fewer.
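A sketch of that alignment count (treating a matrix cell marked "—" as a wildcard that always aligns is an assumption):

```python
def preliminary_confidence(observed: dict, row: dict) -> str:
    """Count how many of the 4 signals match the Decision Matrix row.

    observed/row map signal name -> value, e.g. {"health": "DOWN", ...}.
    A "—" (wildcard) in the matrix row counts as aligned.
    """
    signals = ["health", "last_deploy", "ci", "errors"]
    aligned = sum(1 for s in signals
                  if row.get(s) in ("—", None) or observed.get(s) == row.get(s))
    return "HIGH" if aligned == 4 else "MEDIUM" if aligned == 3 else "LOW"
```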
Before routing to a resolution skill, subject the primary diagnosis to adversarial challenge. 90 seconds of verification prevents 30 minutes of wrong-path investigation.
| Severity | Confidence | Mode | Time Budget | Agents | Condition |
|---|---|---|---|---|---|
| CRITICAL | HIGH | Skip | 0s | None | Unambiguous evidence — act now |
| CRITICAL | MEDIUM or LOW | Fast Challenge | 30s | 1 Haiku (qa-challenger) | Quick sanity check before committing |
| HIGH | Any | Full Challenge | 90s | 2 Haiku agents | Worth 90s to avoid 30min wrong-path |
| MEDIUM | Any | Full Challenge | 90s | 2 Haiku agents | Same as HIGH |
| LOW | Any | Full Challenge | 120s | 2 Haiku agents | Extra time, low urgency |
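The mode selection is a direct transcription of the table above, sketched for precision:

```python
def challenge_mode(severity: str, confidence: str) -> tuple[str, int, int]:
    """Map (severity, preliminary confidence) -> (mode, time budget in s, agents)."""
    if severity == "CRITICAL":
        # Unambiguous evidence: act now. Otherwise a 30s sanity check.
        return ("skip", 0, 0) if confidence == "HIGH" else ("fast", 30, 1)
    if severity == "LOW":
        return ("full", 120, 2)  # extra time, low urgency
    return ("full", 90, 2)       # HIGH and MEDIUM severity
```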
Spawn a single qa-challenger subagent as devil's advocate:
subagent_type: "qa-challenger"
model: haiku
name: "devil-advocate-fast"
prompt: "You are a devil's advocate reviewing an incident diagnosis.
Incident: {original description}
Evidence gathered:
Health: {Phase 2.1 result}
Last deploy: {Phase 2.2 result}
CI status: {Phase 2.2 CI result}
Error signals: {Phase 2.3 result}
KB matches: {Phase 2.4 result}
Primary diagnosis: {hypothesis from Phase 3}
Recommended action: {skill from Phase 3}
Your job: in 30 seconds, challenge this diagnosis.
- Is there an alternative explanation that fits the evidence better?
- Is any key evidence being ignored?
- Could this be a different failure mode?
Output exactly:
CHALLENGE: {alternative hypothesis} OR NONE — diagnosis looks correct
CONFIDENCE: HIGH / MEDIUM / LOW
KEY EVIDENCE: {the piece of evidence that drives your challenge, or n/a}"
Orchestrator enforces the 30-second time budget. If no response in 30s, proceed with primary hypothesis.
Create an agent team for cross-domain adversarial review:
TeamCreate: "incident-challenge-{timestamp}"
Agent 1 — devil-advocate:
subagent_type: "qa-challenger"
model: haiku
role: Challenges primary hypothesis, proposes alternatives
Agent 2 — alt-hypothesis:
subagent_type: "sre" (if primary routes to /fix or /ci-fix)
OR "ops" (if primary routes to /sre debug — always the OPPOSITE domain)
model: haiku
role: Cross-domain verification of the challenge
Agent 2 domain selection rule: always choose the domain OPPOSITE to where the primary hypothesis routes. If the primary diagnosis says "code bug → /fix", Agent 2 is an SRE (infra perspective). If the primary says "infrastructure → /sre debug", Agent 2 is an ops agent (code/config perspective).
SendMessage flow (max 2 hops):
Hop 1 — devil-advocate → alt-hypothesis:
"Challenge: Primary diagnosis says {primary hypothesis} but I want your cross-domain view.
Evidence:
Health: {Phase 2.1 result}
Last deploy: {Phase 2.2 result}
CI status: {Phase 2.2 CI result}
Error signals: {Phase 2.3 result}
My alternative hypothesis: {alternative — or 'I agree with primary if no better alternative'}
What do you see from your domain perspective?"
Hop 2 — alt-hypothesis → devil-advocate:
"Verdict: {AGREE-PRIMARY | AGREE-CHALLENGE | THIRD-HYPOTHESIS}
Reasoning: {brief — max 2 sentences}
Additional evidence: {any cross-domain signal that supports your verdict}"
devil-advocate → team lead (orchestrator): final positions summary:
CHALLENGE SUMMARY:
devil-advocate position: {primary is correct / alternative: X}
alt-hypothesis verdict: {AGREE-PRIMARY / AGREE-CHALLENGE / THIRD-HYPOTHESIS: X}
Key disagreement: {what they disagreed on, or "none — consensus reached"}
If at any point during Phase 3b the user sends "just fix it" (or equivalent urgency signal), immediately proceed with the primary hypothesis. Skip remaining challenge steps. Document the override in the incident report.
Compute a confidence score from Phase 3b results. This score determines whether to proceed, merge, or escalate.
Start at 1.0 and apply adjustments:
| Signal | Adjustment |
|---|---|
| DA says "NONE — diagnosis looks correct" | +0.0 (stay at 1.0) |
| DA proposes alternative with HIGH confidence | -0.4 |
| DA proposes alternative with MEDIUM confidence | -0.2 |
| DA proposes alternative with LOW confidence | -0.1 |
| alt-hypothesis AGREES with challenge (not primary) | -0.3 |
| alt-hypothesis proposes THIRD hypothesis | -0.2 |
| alt-hypothesis AGREES with primary | +0.1 (cap at 1.0) |
| KB found matching past incident (Phase 2.4) | +0.1 (cap at 1.0) |
| Fast Challenge mode (only 1 agent ran) | No adjustment — use score as-is |
| Skip mode (CRITICAL + HIGH confidence) | Score = 1.0 by definition |
| Score | Verdict | Action |
|---|---|---|
| >= 0.8 | PROCEED | Continue with primary hypothesis unchanged |
| 0.5 – 0.79 | MERGE | Incorporate challenger insights into the fix context. Document fallback hypothesis for Phase 4 agents. Proceed with primary but stay alert for signs of the alternative. |
| < 0.5 | ESCALATE | Do not auto-route. Present competing hypotheses to user. |
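A sketch combining the adjustment table and the verdict thresholds (in Fast Challenge mode, alt_verdict would simply be None):

```python
def challenge_score(da_challenge: str | None, da_confidence: str,
                    alt_verdict: str | None, kb_match: bool) -> float:
    """Apply the adjustment table to a starting score of 1.0."""
    score = 1.0
    if da_challenge:  # devil's advocate proposed an alternative
        score -= {"HIGH": 0.4, "MEDIUM": 0.2, "LOW": 0.1}[da_confidence]
    if alt_verdict == "AGREE-CHALLENGE":
        score -= 0.3
    elif alt_verdict == "THIRD-HYPOTHESIS":
        score -= 0.2
    elif alt_verdict == "AGREE-PRIMARY":
        score += 0.1
    if kb_match:  # Phase 2.4 found a matching past incident
        score += 0.1
    return min(score, 1.0)  # positive adjustments cap at 1.0

def verdict(score: float) -> str:
    if score >= 0.8:
        return "PROCEED"
    return "MERGE" if score >= 0.5 else "ESCALATE"
```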
ESCALATE output format:
Competing hypotheses — your call:
[A] Primary: {primary hypothesis}
Evidence for: {supporting signals}
Confidence: {score before challenge}
[B] Challenger: {alternative hypothesis}
Evidence for: {challenger's key evidence}
From: {da / alt-hypothesis / both}
[C] Investigate both — run parallel diagnosis targeting both hypotheses
Which path? (A / B / C)
VERDICT GATE:
Challenge mode: {skip / fast / full}
Primary hypothesis: {diagnosis}
Challenger finding: {agreed / alternative: X / third hypothesis: X}
Confidence score: {0.0–1.0}
Verdict: {PROCEED / MERGE / ESCALATE}
Fallback hypothesis: {if MERGE — what to watch for during resolution}
Based on the diagnosis, invoke the appropriate skill. Do NOT ask the user which skill to use. The correlation already determined it.
Route: /sre debug
Invoke /sre debug with context:
- Description: {original description}
- Health check results: {from Phase 2.1}
- Deploy history: {from Phase 2.2}
- Error logs: {from Phase 2.3}
- Similar incidents: {from Phase 2.4}
- Challenge results: {verdict from Phase 3c — PROCEED/MERGE/ESCALATE/N/A (challenge skipped) and challenger summary}
- Fallback hypothesis: {if MERGE verdict — alternative hypothesis to watch for during investigation}
The SRE agent gets all the context from parallel diagnosis — doesn't need to re-discover.
Route: /ci-fix
Invoke /ci-fix with context:
- Mode: ci (or staging/prod based on diagnosis)
- Branch: {from deploy check}
- Known patterns: {from KB lookup}
- Challenge results: {verdict from Phase 3c — PROCEED/MERGE/ESCALATE/N/A (challenge skipped) and challenger summary}
- Fallback hypothesis: {if MERGE verdict — alternative hypothesis to watch for during investigation}
Route: /fix
Invoke /fix with context:
- Description: {original description + diagnosis}
- Severity: {from Phase 1}
- Error logs: {from Phase 2.3 — gives the agent a head start on root cause}
- Similar incidents: {from Phase 2.4 — may identify root cause immediately}
- Challenge results: {verdict from Phase 3c — PROCEED/MERGE/ESCALATE/N/A (challenge skipped) and challenger summary}
- Fallback hypothesis: {if MERGE verdict — alternative hypothesis to watch for during investigation}
If an issue number was provided (/incident #234), pass it: /fix #234 --severity {classified severity}
Route: /sre debug (perf focus)
Invoke /sre debug with:
- Description: "Performance degradation: {description}"
- Focus: "Use RED method (Rate/Errors/Duration) for API endpoints.
Check: database query times, connection pool, cache hit rates, memory usage."
- Challenge results: {verdict from Phase 3c — PROCEED/MERGE/ESCALATE/N/A (challenge skipped) and challenger summary}
- Fallback hypothesis: {if MERGE verdict — alternative hypothesis to watch for during investigation}
Route: /restart or ops agent
For local issues (services not starting, local errors):
- Invoke /restart skill
- Or spawn ops agent for specific task
Route: /sre status, then decide
If correlation is inconclusive:
1. Run /sre status for complete system overview
2. Present findings to user
3. Ask: "Based on this, it looks like {hypothesis}. Should I proceed with {skill}?"
Phase 5 starts after the routed skill completes its full lifecycle. Important: different skills have different post-implementation flows — /incident must wait for the ENTIRE flow, not just the fix itself.
| Routed to | What that skill does after fixing | When /incident Phase 5 starts |
|---|---|---|
| /fix | Step 6: QA → restart servers → user tests → /pr --skip-qa → PR created | After PR is created (Step 6.4 completes) |
| /ci-fix | Retries CI, may push fixes | After CI passes or exhausts retries |
| /sre debug | Infra fix (no code) → verify health | Immediately after health verified |
| /sre debug → escalates to /fix | Same as /fix row above | After /fix's full lifecycle completes |
| /restart | Restarts services → verify health | Immediately after health verified |
Key rule: Do NOT duplicate QA, server restarts, or user testing that the routed skill
already handles. /incident Phase 5 is about verification, reporting, and learning — not
re-running the same checks.
For code bug routes (/fix, /ci-fix): The routed skill already ran QA, restarted
servers, and got user confirmation. Phase 5.1 only re-runs health checks to confirm
the deployed fix (if auto-deployed) or that local state is clean:
# Re-run the health checks from Phase 2.1
# Compare: were the failing endpoints now passing?
# Compare: are the error signals gone?
For infra routes (/sre debug, /restart): These don't go through QA/PR.
Phase 5.1 is the primary verification — check health endpoints and error signals.
If the infra fix involved config or environment changes that could affect behavior:
After infra fix, quick sanity check:
1. Hit all health endpoints (from CLAUDE.md Environments table)
2. Check error signals have cleared
3. If config was changed: ask user to smoke-test manually
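A hedged sketch of the re-check, assuming the endpoint list was already parsed from CLAUDE.md and that a plain HTTP GET is an adequate probe:

```python
import urllib.request

def verify_health(endpoints: list[str], failing_before: set[str]) -> bool:
    """Re-hit the Phase 2.1 endpoints; confirm previously failing ones recovered."""
    recovered = True
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                ok = 200 <= resp.status < 300
        except OSError:  # covers URLError, HTTPError, timeouts
            ok = False
        if url in failing_before and not ok:
            recovered = False  # a previously failing endpoint is still down
    return recovered
```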
Create .claude/qa-knowledge/incidents/{date}-{slug}.md:
---
status: resolved
severity: {severity}
type: {type}
affected: {services/endpoints}
duration: {time from report to resolution}
root_cause: {from the skill that fixed it}
resolved_by: {/sre, /ci-fix, /fix, /restart}
---
## Timeline
- {time}: Incident reported — "{original description}"
- {time}: Parallel diagnosis — {what was found}
- {time}: Routed to {skill} — {diagnosis}
- {time}: Resolution — {what was done}
- {time}: Verified — {health checks pass}
## Root Cause
{from the resolving skill}
## Prevention
{what would prevent this from happening again}
## Similar Past Incidents
{from Phase 2.4 KB lookup}
## Challenge Results
- Challenge mode: {skip / fast / full}
- Primary hypothesis: {diagnosis from Phase 3}
- Challenger finding: {agreed / alternative proposed: X / third hypothesis: X}
- Verdict: {PROCEED / MERGE / ESCALATE}
- Confidence score: {0.0–1.0}
Update the knowledge base:
- .claude/knowledge/agents/sre.md: incident summary + resolution
- .claude/qa-knowledge/bug-patterns.md: new bug patterns (if a code bug)
- .claude/knowledge/skills/incident.md: challenge effectiveness data — whether the challenger caught a real issue, whether the primary hypothesis was correct, and the final confidence score. This builds a record of when challenge modes add value vs. when they confirm the obvious.

If Discord MCP is available:
Send to #deployments:
"Incident resolved — {severity} {type}
Root cause: {summary}
Fix: {what was done}
Duration: {time}"
After resolution is verified and reported:
Incident resolved. Duration: {time}.
What's next?
[1] Monitor — keep watching for recurrence (/sre health in 5 min)
[2] Fix another issue — /fix #N or /incident <description>
[3] See project status — /onboard
[4] Done for now
For CRITICAL/HIGH severity, default to option 1 (monitor) and suggest running
/sre health after 5-10 minutes to confirm the fix holds.
/incident (no args)
When invoked without a description, run Phase 2 checks proactively.
If everything is green:
All systems healthy:
✓ Health endpoints: all responding
✓ CI: last 3 runs green
✓ Deploy: last deploy {time ago}, healthy
✓ No error signals detected
Nothing to triage. What prompted the check?
If something is wrong, proceed to Phase 3 correlation, Phase 3b challenge, Phase 3c verdict, and Phase 4 routing automatically.
| Condition | Escalation |
|---|---|
| /sre debug finds a code bug | → Escalate to /fix with SRE's findings as context |
| /ci-fix exhausts 3 attempts | → Escalate to /fix (deeper investigation needed) |
| /fix finds an infra issue (not code) | → Escalate back to /sre debug |
| Any skill fails to resolve in 15 min | → Alert on Discord, present options to user |
| CRITICAL severity not resolved in 30 min | → Suggest revert: git revert HEAD && git push |
| Skill / Agent | How /incident uses it |
|---|---|
| /sre debug | Routed to for infrastructure issues, gets pre-gathered context |
| /ci-fix | Routed to for CI failures, gets branch and pattern context |
| /fix | Routed to for code bugs, gets error logs and similar incidents |
| /restart | Routed to for local ops issues |
| /qa | Run after resolution to verify no regressions |
| explore-light | Phase 2 health checks (1x cost) |
| ops agent | Phase 2 deploy/error/KB checks (1x cost) |
| /review-pr | After /fix creates a PR, review it before merge |
| Phase | Agents | Cost |
|---|---|---|
| Phase 1: Classify | None (pattern matching) | 0 |
| Phase 2: Diagnose | 4 Haiku agents in parallel | 4x |
| Phase 3: Correlate | None (analysis) | 0 |
| Phase 3b: Challenge (skip) | None | 0 |
| Phase 3b: Challenge (fast) | 1 Haiku agent | 1x |
| Phase 3b: Challenge (full) | 2 Haiku agents | 2x |
| Phase 4: Route | 1 Sonnet agent (SRE/fix/ci-fix) | 10x |
| Phase 5: Verify | 1 Haiku agent (health check) | 1x |
| Severity | Challenge Mode | Total Cost | Notes |
|---|---|---|---|
| CRITICAL + HIGH confidence | Skip | ~15x | No challenge overhead |
| CRITICAL + med/low confidence | Fast (1 Haiku) | ~16x | +1x for 30s sanity check |
| HIGH | Full (2 Haiku) | ~17x | +2x for 90s full challenge |
| MEDIUM | Full (2 Haiku) | ~17x | Same as HIGH |
| LOW | Full (2 Haiku) | ~17x | 120s budget, same agent cost |
Break-even analysis: Full challenge adds ~2x cost (2 Haiku agents). A single wrong routing — running /sre debug when the issue is a code bug — wastes a full Sonnet invocation (10x) plus the time to re-diagnose and re-route. Over 5 incidents, full challenge costs 5 × 2x = 10x, the price of exactly one wasted Sonnet invocation, so challenge pays for itself if it catches one misrouting in every 5 incidents. At typical incident rates, that's almost always worth it.
Escalation between skills is automatic: if /sre finds a code bug, it routes to /fix without asking.

Incidents don't happen in a vacuum. Stakeholders need updates throughout — not just at the end.
Check these sources to find configured channels:
| Signal | Channel | How to use |
|---|---|---|
| Discord MCP in .claude/settings.json | Discord | mcp__discord-mcp__send-message(channel, message) |
| slack@claude-plugins-official enabled | Slack | Slack plugin send message |
| SLACK_WEBHOOK_URL in env | Slack webhook | curl -X POST -d '{"text":"..."}' $SLACK_WEBHOOK_URL |
| DISCORD_WEBHOOK_URL in env | Discord webhook | curl -X POST -d '{"content":"..."}' $DISCORD_WEBHOOK_URL |
| PAGERDUTY_* in env | PagerDuty | PagerDuty API for escalation |
| OPSGENIE_* in env | OpsGenie | OpsGenie API for alerting |
| GitHub issue exists for the incident | GitHub | Comment on the issue with updates |
| TEAMS_WEBHOOK_URL in env | Microsoft Teams | Teams webhook API |
| Notion MCP configured | Notion | Create/update incident page |
| Linear MCP configured | Linear | Create/update incident issue |
On first run, detect which channels are available and store in knowledge base.
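A sketch of the env-var half of channel detection (MCP availability would be checked through the host instead; the webhook payload shapes follow the curl examples in the table):

```python
import json, os, urllib.request

def detect_channels() -> dict[str, str]:
    """Probe the env vars from the signal table."""
    channels = {}
    for var, name in [("SLACK_WEBHOOK_URL", "slack"),
                      ("DISCORD_WEBHOOK_URL", "discord"),
                      ("TEAMS_WEBHOOK_URL", "teams")]:
        if os.environ.get(var):
            channels[name] = os.environ[var]
    return channels

def post_webhook(url: str, payload: dict) -> None:
    """Equivalent of the curl commands in the table above."""
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=10)
```

Usage follows the payload shapes in the table: post_webhook(channels["slack"], {"text": "Incident detected..."}) for Slack, {"content": ...} for Discord.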
| Phase | Who to notify | What to say | Channel |
|---|---|---|---|
| Phase 1 (Classify) | On-call / team | "Incident detected: {severity} — {description}. Investigating." | Discord/Slack #incidents |
| Phase 3 (Correlate) | On-call / team | "Diagnosis: {type} — {hypothesis}. Routing to {skill}." | Discord/Slack #incidents |
| Phase 4 (mid-fix) | Stakeholders if CRITICAL | "Update: root cause identified — {cause}. Fix in progress. ETA: {estimate}." | Discord/Slack #incidents + PagerDuty |
| Phase 5 (Resolved) | Everyone | "Resolved: {root cause}. Fix: {what was done}. Duration: {time}." | Discord/Slack #incidents + GitHub issue |
| Post-incident | Team lead | Full incident report with timeline, root cause, prevention | Knowledge base + Notion/Linear |
If the user (or a stakeholder via Discord/Slack) provides additional context during the incident, incorporate it:
User: "actually it only affects users in EU"
→ Update scope: EU region only
→ Check: is there a regional configuration difference?
→ Update diagnosis with this context
User: "we just deployed a config change 10 minutes ago"
→ Highest suspect: the config change
→ Check the config diff
→ Skip broad diagnosis, focus on config
User: "don't revert, we need that deploy"
→ Respect the constraint
→ Find a forward fix instead of revert
→ Update incident notes with the constraint
| Source | What it provides | How to access |
|---|---|---|
| Git history | Recent commits, who changed what | git log, git blame |
| GitHub/GitLab | CI status, PRs, issues, deployments | gh CLI |
| CLAUDE.md | Health endpoints, service topology | Read tool |
| Docker | Container health, resource usage | docker ps, docker stats |
| Project logs | Application errors, access logs | Platform-specific CLI |
| MCP Server | What it unlocks |
|---|---|
| Postgres MCP | Direct DB queries — check connection count, slow queries, table sizes |
| Redis MCP | Cache hit rates, memory usage, connected clients |
| MongoDB MCP | Collection stats, slow operations, replica set health |
| Sentry MCP | Error rates, affected users, stack traces, release health |
| Datadog MCP | APM traces, infrastructure metrics, log patterns |
| AWS MCP | CloudWatch metrics, ECS task health, RDS stats, Lambda errors |
| GCP MCP | Cloud Monitoring, Error Reporting, Trace |
| Cloudflare MCP | Edge analytics, WAF events, origin health |
| PagerDuty MCP | Active incidents, on-call schedule |
If diagnosis is limited by missing data sources:
Incident triage limited — missing data sources:
⚠ No monitoring connected — can't check error rates or performance metrics
Recommend: Sentry MCP (free tier) or Datadog MCP
/calibrate can install this automatically
⚠ No database MCP — can't check query performance or connection health
Recommend: Postgres MCP / Redis MCP based on your stack
/calibrate can install this automatically
⚠ No alerting configured — team won't be notified of future incidents
Recommend: PagerDuty or OpsGenie integration
These recommendations feed back to /calibrate — next time it runs, it includes
incident-driven recommendations alongside the standard ones.
After resolution, analyze the incident for patterns:
# What to learn:
1. Classification accuracy — did we route to the right skill?
If /sre was invoked but it turned out to be a code bug → improve type classification
2. Diagnosis speed — which of the 4 parallel checks found the answer?
If KB lookup found a matching past incident → that was the fastest path
If health check was the key signal → health checks are working well
3. Resolution effectiveness — did the routed skill fix it?
If /fix resolved it → code bug pattern, add to bug-patterns.md
If /sre resolved it → infra pattern, add to sre knowledge
If manual intervention was needed → gap in automation
4. Time to resolution — how long from report to verified fix?
Track per severity and type for trending
After every incident, append to .claude/knowledge/skills/incident.md:
### {date} — {title} ({severity}, {type})
**Classified as**: {type} — {was this correct? yes/no}
**Routed to**: {skill}
**Root cause**: {summary}
**Resolution time**: {duration}
**Key signal**: {which Phase 2 check found the answer}
**Knowledge gap**: {what we didn't know that slowed us down}
**New pattern**: {if this is a new type of incident, describe it}
**Similar past**: {count of similar incidents — is this recurring?}
After 5+ incidents are logged, check for patterns:
Incident patterns detected:
⚠ 3 incidents in backend/app/services/auth.py in last 30 days
→ Consider: comprehensive auth service refactor
⚠ 2 CI failures from dependency updates in last 2 weeks
→ Consider: pin dependencies, add lockfile check
⚠ Database slow query incidents increasing
→ Consider: add query monitoring, review indexes
Surface these in /onboard briefing so the team sees systemic issues, not just
individual incidents.
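A sketch of that pattern scan, assuming entries follow the log template above (the threshold of 3 is an illustrative choice):

```python
import re
from collections import Counter
from pathlib import Path

def recurring_patterns(log: Path = Path(".claude/knowledge/skills/incident.md"),
                       threshold: int = 3) -> list[str]:
    """Flag incident types that recur, assuming entries use the
    '### {date} — {title} ({severity}, {type})' header template."""
    if not log.exists():
        return []
    headers = re.findall(r"^### .+\((\w+), ([^)]+)\)$", log.read_text(), re.M)
    counts = Counter(itype for _sev, itype in headers)
    return [f"{n} incidents of type {t}" for t, n in counts.items() if n >= threshold]
```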
Read .claude/knowledge/skills/incident.md at the start of every incident. If past
incidents show that certain keywords were misclassified:
# From knowledge base:
# "auth timeout" was classified as INFRA but was actually CODE BUG (3 times)
# → Override: "auth.*timeout" → CODE BUG, not INFRA
The classification tables in Phase 1 are defaults. Knowledge base corrections override them for this specific project.
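A sketch of the override check, layered in front of the default classifier from the Phase 1 sketch (the regex is the example from the comment block above):

```python
import re

# Learned overrides, e.g. loaded from .claude/knowledge/skills/incident.md.
OVERRIDES = [
    (re.compile(r"auth.*timeout", re.I), "Code Bug"),  # misclassified as INFRA 3 times
]

def classify_with_overrides(description: str) -> str:
    """Check KB-learned regex overrides before the default Phase 1 tables."""
    for pattern, itype in OVERRIDES:
        if pattern.search(description):
            return itype
    # Fall back to the default keyword classification (see the Phase 1 sketch).
    _severity, itype = classify(description)
    return itype
```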
When /calibrate runs, it discovers which data sources and notification channels are available. After incidents, /incident recommends the integrations it found missing (see the gap report above). These get included in /calibrate's recommendations next time it runs.
If .claude/project-profile.md exists, use it for project context (it supplies the production URL in Phase 2.1). If it doesn't exist, suggest running /calibrate first.

Something is wrong
│
▼
/incident "{description}"
│
├── Phase 1: CLASSIFY (instant)
│ Severity: CRITICAL / HIGH / MEDIUM / LOW
│ Type: infra / CI / code bug / perf / data / auth
│ → Notify team: "Incident detected, investigating"
│
├── Phase 2: PARALLEL DIAGNOSIS (4 agents, < 60s)
│ ├── Health endpoints (explore-light, 1x)
│ ├── Recent deploys + CI (ops, 1x)
│ ├── Error signals + logs (ops, 1x)
│ └── Knowledge base lookup (ops, 1x)
│ → Notify team: "Diagnosis complete, routing"
│
├── Phase 3: CORRELATE
│ Health × Deploy × CI × Errors = Diagnosis
│ → Hypothesis + preliminary confidence (HIGH/MEDIUM/LOW)
│
├── Phase 3b: CHALLENGE (adversarial swarm)
│ ├── CRITICAL + HIGH conf → Skip (0s)
│ ├── CRITICAL + med/low → Fast: 1 Haiku devil's advocate (30s)
│ └── HIGH/MEDIUM/LOW → Full: 2 Haiku agents, 2-hop debate (90-120s)
│ ├── devil-advocate: challenges primary hypothesis
│ └── alt-hypothesis: cross-domain verification
│
├── Phase 3c: VERDICT GATE
│ Confidence score → PROCEED / MERGE / ESCALATE
│ → If ESCALATE: present options [A] primary [B] challenger [C] both
│
├── Phase 4: ROUTE + RESOLVE
│ ├── Infra → /sre debug (with all Phase 2 context)
│ ├── CI → /ci-fix (with branch + patterns)
│ ├── Code bug → /fix (with error logs + similar incidents)
│ ├── Perf → /sre debug (perf focus)
│ └── Local → /restart or ops agent
│ → Notify team: "Root cause: {X}. Fix in progress."
│ → Accept stakeholder input: constraints, context, steering
│
├── Phase 5: VERIFY + REPORT
│ ├── Re-run health checks
│ ├── Write incident report
│ ├── Update knowledge base
│ └── Notify team: "Resolved. Duration: {X}."
│
└── LEARN
├── Was classification correct?
├── Which diagnosis signal was key?
├── Is this a recurring pattern?
└── What data sources were missing?