From build-like-amazon
Guides blameless post-incident analysis using timeline reconstruction, 5 Whys root cause analysis, and concrete action items with owners.
npx claudepluginhub robisson/build-like-amazon-agent-skills
Slash command
/build-like-amazon:correction-of-errors
A [Correction of Errors (COE)](https://www.youtube.com/watch?v=Prd2VvSo_p8) is a blameless post-incident document that analyzes what happened, why it happened, and what mechanisms will prevent recurrence. The goal is not to assign blame; it is to improve the system so the same class of failure cannot happen again. COEs produce mechanisms (automated safeguards), not promises ("we'll be more careful"). Every COE must result in concrete action items with owners and due dates that make the system more resilient.
Load agents/coe-reviewer.md when reviewing a COE. Use it to enforce blameless analysis, timeline accuracy, root-cause depth, mechanism quality, and concrete action items with owners and dates.
Amazon's COE process is one of its most powerful learning mechanisms. The document is written by the team that experienced the incident, reviewed by senior leadership, and shared broadly for organizational learning. The key insight: people don't cause failures—systems allow failures. If a single person's mistake can cause a customer-impacting incident, the system lacks sufficient safeguards. COEs that conclude "the engineer should have been more careful" are rejected. COEs that conclude "we will add an automated check that prevents this class of error" are accepted.
Start with the customer impact and ask "why" until you reach systemic root causes:
Root causes from this example:
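
A minimal sketch of such a chain, using an invented incident (every service name and answer below is hypothetical, not taken from a real COE). It represents the 5 Whys as data, checks that the final answer names a systemic gap rather than a person, and treats the deepest answers as candidate root causes. The phrase list and the `depth` default are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Why:
    question: str
    answer: str


# Hypothetical incident, invented purely for illustration.
CHAIN = [
    Why("Why did customers see checkout errors?",
        "The order service returned 500s."),
    Why("Why did the order service return 500s?",
        "A config change pointed it at a nonexistent database host."),
    Why("Why did a bad config reach production?",
        "The change was deployed without validation."),
    Why("Why was there no validation?",
        "The pipeline has no pre-deploy config check."),
    Why("Why does the pipeline lack that check?",
        "Config changes bypass the standard deployment pipeline."),
]

# Phrases suggesting the analysis stopped at a person instead of a system.
# Illustrative only; a real reviewer would use judgment, not string matching.
BLAME_MARKERS = ("should have", "forgot", "was careless", "human error")


def stops_at_blame(chain):
    """Return True if the final answer blames a person rather than a systemic gap."""
    final = chain[-1].answer.lower()
    return any(marker in final for marker in BLAME_MARKERS)


def root_causes(chain, depth=2):
    """Return the deepest `depth` answers as candidate systemic root causes."""
    return [why.answer for why in chain[-depth:]]
```

For this chain, `stops_at_blame(CHAIN)` is `False`, and `root_causes(CHAIN)` returns the two deepest answers: the missing pre-deploy check and the pipeline bypass. A chain whose last answer is "The engineer should have been more careful" would be flagged and needs another "why".
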
Section 1: Summary
Section 2: Customer Impact
Section 3: Timeline
Section 4: Root Cause Analysis (5 Whys)
Section 5: Action Items
Section 6: Lessons Learned
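
The six sections above, together with the requirement that every action item has an owner and a due date, can be checked mechanically. The following is a rough sketch under assumed names (`COE`, `ActionItem`, and `validate` are hypothetical and not part of the skill itself):

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional

REQUIRED_SECTIONS = [
    "Summary",
    "Customer Impact",
    "Timeline",
    "Root Cause Analysis (5 Whys)",
    "Action Items",
    "Lessons Learned",
]


@dataclass
class ActionItem:
    description: str
    owner: str = ""
    due: Optional[date] = None


@dataclass
class COE:
    sections: Dict[str, str] = field(default_factory=dict)
    action_items: List[ActionItem] = field(default_factory=list)


def validate(coe):
    """Return a list of problems; an empty list means the COE is structurally complete."""
    problems = []
    for name in REQUIRED_SECTIONS:
        if not coe.sections.get(name, "").strip():
            problems.append(f"missing section: {name}")
    if not coe.action_items:
        problems.append("no action items")
    for item in coe.action_items:
        if not item.owner:
            problems.append(f"action item has no owner: {item.description}")
        if item.due is None:
            problems.append(f"action item has no due date: {item.description}")
    return problems
```

A check like this only confirms structure. Whether the timeline is accurate or the root cause is deep enough still needs human review.
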
| Intention | Mechanism |
|---|---|
| "We'll be more careful during code review" | Automated static analysis that detects the specific pattern that caused the incident |
| "We'll remember to check that next time" | Automated pre-deploy check that verifies the condition |
| "The on-call will know to look for this" | Alarm that automatically detects this condition and pages |
| "We'll test this scenario" | Integration test added to the pipeline that fails if this scenario recurs |
| "We'll document this for future reference" | Runbook entry with specific steps, linked to the relevant alarm |
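
The distinction in the table above can itself be turned into a small mechanism: a check that flags action items phrased as intentions. This is a sketch only; the pattern list is drawn from the "Intention" column and is an illustrative assumption, not an exhaustive rule set.

```python
import re

# Patterns based on the "Intention" column above; illustrative, not exhaustive.
INTENTION_PATTERNS = [
    r"\bbe more careful\b",
    r"\bremember to\b",
    r"\bwill know to\b",
    r"\bwe(?:'ll| will) test\b",
    r"\bdocument this\b",
]


def looks_like_intention(action_item: str) -> bool:
    """Return True if the action item reads like a promise rather than a mechanism."""
    text = action_item.lower()
    return any(re.search(pattern, text) for pattern in INTENTION_PATTERNS)
```

Run against the table, every entry in the left column is flagged and none in the right column is. A real review would still rely on judgment, since a promise can be worded in ways no pattern list anticipates.
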
| What They Say | Why It's Wrong | What To Do Instead |
|---|---|---|
| "It was a human error" | Humans always make errors. The question is: why did the system allow it to cause customer impact? | Identify the missing safeguard. Add automation that catches this error before it reaches production |
| "It was an edge case we couldn't predict" | If it happened, it can happen again. Every edge case needs a mechanism once discovered | Add the test case, add the alarm, add the guard. "Unpredictable" means "we hadn't thought about it yet" |
| "We were unlucky" | Luck is not an operational strategy. If a system can fail given bad luck, it will | Design for resilience. Assume failure will happen and build safeguards |
| "We need to train people better" | Training fades. People rotate. Knowledge is lost | Build the knowledge into the system: automated checks, guardrails, circuit breakers |
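
As one illustration of "identify the missing safeguard", here is a minimal sketch of a pre-deploy guard that blocks a change instead of relying on someone noticing the problem. The config keys, the `DeployBlocked` exception, and the function name are all hypothetical.

```python
# Hypothetical config keys, chosen only for illustration.
REQUIRED_KEYS = ("db_host", "timeout_seconds")


class DeployBlocked(Exception):
    """Raised when a change fails the pre-deploy guard."""


def pre_deploy_guard(config: dict) -> None:
    """Raise DeployBlocked if the config is missing keys or has an invalid timeout."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise DeployBlocked(f"missing required keys: {', '.join(missing)}")
    if config["timeout_seconds"] <= 0:
        raise DeployBlocked("timeout_seconds must be positive")
```

The point of the design is that the guard fails closed: a mistake in the config stops the deployment, so the error never depends on a person catching it in review.
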
After corrective actions are defined, check whether any action item is an implementation-level corrective action (something that should change how code is written in future builds — not an org/process/people action). If so:
skills/implementation-memory/SKILL.md.
docs/implementation-memory.md.
If no implementation-level action item exists (all actions are org/process/infra), skip this step silently.