Structured production incident triage using a four-phase workflow (hypothesis → reproduce → isolate → fix + verify) with service-context gathering and a durable RCA summary artifact.
From workflow-orchestration: `npx claudepluginhub mikecubed/agent-orchestration --plugin workflow-orchestration`. This skill uses the workspace's default tool permissions.
Use this skill when a developer or on-call responder needs a structured process for triaging a production incident — a live alert, service degradation, SLA breach, or customer-impacting failure — and driving it from symptom through root cause to verified fix.
This skill is not for general-purpose debugging. For bugs discovered during development, test failures, or non-production issues, use systematic-debugging instead. incident-rca is specifically designed for situations where a production system is actively degraded: it adds a service-context gathering phase (metrics, logs, traces, deployment history) before hypothesis formation, and it produces a durable RCA summary artifact suitable for post-incident review.
Persistent team, squad, or fleet-style long-lived orchestration is out of scope for this skill. Use a separate orchestration layer if persistent coordination is needed.
Activate when the developer or on-call engineer describes a live production problem:
Also activate when:
Do not activate for:
- bugs discovered during development or test failures (use systematic-debugging);
- architecture or design assessments (use architecture-review);
- end-to-end test generation (use e2e-test-generation).

Before you start, identify:
- `max-failed-hypotheses` — number of failed hypotheses before narrowing scope (default: 3);
- the durable RCA artifact path (fallback: `.agent/rca-summary.md`).

If any critical inputs are missing, ask the developer before proceeding.
Use separate roles for scouting, implementing, and reviewing.

The scout produces a factual context brief of the incident: what is failing, what changed recently, and what the observability data shows. The implementer and reviewer consume this brief independently.
Resolve the active model for each role using this priority chain:
Project config — look for the runtime-specific config file in the current project root:
- Copilot CLI: `.copilot/models.yaml`
- Claude Code: `.claude/models.yaml`

Read the `implementer`, `reviewer`, and `scout` keys directly. If a key is absent, fall back to the baked-in default for that role.
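As an illustration, a minimal `.claude/models.yaml` might look like this. This is an assumed sketch inferred from the key names above, not a documented schema:

```yaml
# Hypothetical example — keys match the roles this skill reads.
implementer: claude-opus-4.6
reviewer: claude-opus-4.6
scout: claude-haiku-4.5
```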
Session cache — if models were already confirmed earlier in this session, reuse them.
Baked-in defaults — if neither config file nor session cache exists, use the defaults below, ask the developer to confirm or override once, then cache for the session.
| Runtime | Role | Default model |
|---|---|---|
| Copilot CLI | Implementer | claude-opus-4.6 |
| Copilot CLI | Reviewer | gpt-5.4 |
| Copilot CLI | Scout | claude-haiku-4.5 |
| Claude Code | Implementer | claude-opus-4.6 |
| Claude Code | Reviewer | claude-opus-4.6 |
| Claude Code | Scout | claude-haiku-4.5 |
Run a scout pass to gather the factual context brief for the incident:
Using the gathered context, list candidate root causes ranked by likelihood. Present the ranked list to the developer and confirm the most likely hypothesis to investigate first.
Gate: Hypothesis confirmed
Write .agent/SESSION.md with current-phase: "hypothesis" after this phase completes.
Attempt to reproduce the failure in a safe environment (staging, canary, or local).
If the failure cannot be reproduced, record the attempt under `## Failed Hypotheses` and return to Phase 1 for the next hypothesis.

Gate: Reproduction confirmed
Write .agent/SESSION.md with current-phase: "reproduce" after this phase completes.
Narrow the failure to a specific component, service, function, or code path.
Record each disproved hypothesis under `## Failed Hypotheses`. After `max-failed-hypotheses` failed attempts, apply the rescue policy: narrow scope to the next-largest suspect component and retry.

Gate: Root cause isolated
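The hypothesis loop with the rescue policy can be sketched like this. It is a simplification with hypothetical `test` and `narrow` callbacks; the real skill drives these steps interactively with the developer:

```python
def isolate_root_cause(hypotheses, test, narrow, max_failed=3):
    """Test ranked hypotheses; after max_failed failures, narrow scope and retry."""
    queue = list(hypotheses)
    failed = []                      # recorded under "## Failed Hypotheses"
    while queue:
        hyp = queue.pop(0)
        if test(hyp):
            return hyp, failed       # gate passes: root cause isolated
        failed.append(hyp)
        if len(failed) % max_failed == 0:
            # Rescue policy: narrow to the next-largest suspect component.
            queue = list(narrow(hyp))
    return None, failed              # gate fails: stop and preserve partial results
```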
Write .agent/SESSION.md with current-phase: "isolate" after this phase completes.
Gate: Fix verified
Write .agent/SESSION.md with current-phase: "fix-verified" after this phase completes.
At the end of every invocation — whether the workflow completes fully, stops early, or is interrupted — produce a durable RCA summary written to the confirmed artifact path. This artifact must contain:
"Durable" means written to a repository-appropriate sink — a committed document, a PR comment, or an issue — not only to chat. Chat-only summaries do not satisfy this requirement.
Write .agent/SESSION.md using the full schema defined in docs/session-md-schema.md. All five YAML frontmatter fields are required on every write:
- `current-task`: the incident description
- `current-phase`: the current phase name
- `next-action`: what happens next
- `workspace`: the active branch or PR reference
- `last-updated`: current ISO-8601 timestamp

Required sections: `## Decisions`, `## Files Touched`, `## Open Questions`, `## Blockers`, `## Failed Hypotheses`.
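A SESSION.md written after the reproduce gate might look like this. It is an illustrative sketch (branch name and section contents are invented for the example); the authoritative schema lives in docs/session-md-schema.md:

```markdown
---
current-task: 5xx spike on /api/payments
current-phase: reproduce
next-action: isolate the failing code path in payment-service
workspace: fix/payments-npe
last-updated: 2025-07-20T16:05:00Z
---

## Decisions
- Investigate H1 (nullable merchant.bankCode) first.

## Files Touched
_None yet._

## Open Questions
- Why did the migration review miss the nullability change?

## Blockers
_None._

## Failed Hypotheses
_None._
```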
Write SESSION.md after each phase gate. If the write fails, log a warning and continue.
Each phase must satisfy its specific gate condition (see workflow above) before advancing to the next phase. A failed gate halts forward progress; the skill must either rescue or stop.
The durable RCA summary artifact must be written to disk at the end of every invocation. If both the primary path and the fallback path (.agent/rca-summary.md) fail, the gate fails.
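The write-with-fallback behavior can be sketched as follows. This is illustrative only; the helper name and error handling are assumptions, and only the fallback path `.agent/rca-summary.md` comes from the spec above:

```python
from pathlib import Path

def write_rca_artifact(summary: str, primary: Path,
                       fallback: Path = Path(".agent/rca-summary.md")) -> Path:
    """Write the RCA summary to the primary path, falling back once; raise if both fail."""
    for path in (primary, fallback):
        try:
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(summary)
            return path              # durable artifact written; gate passes
        except OSError:
            continue                 # try the fallback sink
    raise RuntimeError("RCA artifact gate failed: neither path was writable")
```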
Before declaring the incident RCA complete, confirm ALL of the following. Any failing item blocks the "RCA complete" declaration.
If any item is FAIL: report the failing item(s) by name, state what must be done to resolve each, and do not advance past the gate.
Before stopping, ensure any partial results are preserved as a durable artifact so work is not lost. A partial RCA must still contain the symptom, evidence gathered, hypotheses attempted, and the reason for stopping.
Developer (on-call): We're seeing a 5xx spike on /api/payments — started about 2 hours ago
Phase 1 — Hypothesis
The scout gathers the factual context brief:
- A DB migration (`add_nullable_bank_code`) was deployed 2 hours ago.
- Logs show a `NullPointerException` in `PaymentService.charge()` at line 142, reading `merchant.bankCode`.
- The 5xx spike on `/api/payments` correlated exactly with the migration deployment timestamp.
- Blast radius: `payment-service` only; upstream `order-service` sees elevated error rates from downstream.

Candidate hypotheses (ranked):
1. H1: the migration made `merchant.bankCode` nullable but `PaymentService.charge()` assumes non-null (most likely — correlates with migration timing and NPE location).

Developer confirms H1 as the primary hypothesis.
Phase 2 — Reproduce
Run the migration on staging, then execute a test charge. PaymentService.charge() throws the same NullPointerException. Reproduction confirmed.
Phase 3 — Isolate
Root cause: PaymentService.charge() line 142 reads merchant.bankCode which is now nullable after the migration. The code path has no null guard. Evidence: stack trace, migration diff, schema before/after comparison.
Phase 4 — Fix + Verify
Fix: add a null guard on merchant.bankCode in PaymentService.charge() (default to the merchant's primary bank code from the accounts table). Add a migration rollback path that restores the NOT NULL constraint with a default value. Run the validation suite — all tests pass. Staging charge succeeds with the fix deployed.
# RCA — 5xx spike on /api/payments
**Symptom:** 5xx errors on POST /api/payments starting 2025-07-20T14:00Z.
**Hypothesis confirmed:** DB migration `add_nullable_bank_code` made `merchant.bankCode` nullable; `PaymentService.charge()` line 142 has no null guard.
**Evidence:** NullPointerException stack trace, migration timing correlation, staging reproduction.
**Fix:** Null guard + migration rollback path. PR #347.
**Validation:** All tests pass, staging charge succeeds.
**Follow-ups:** Add NOT NULL constraint back with default value; add null-safety lint rule for merchant fields.