npx claudepluginhub houseofmvps/ultraship --plugin ultraship<url-or-error-description>This skill uses the workspace's default tool permissions.
When production is down, every minute costs trust. This skill runs an incident like a principal SRE — fast triage, clear decision-making, structured recovery, and prevention so it never happens again.
Diagnoses production incidents by detecting environment, gathering symptoms, reading logs with Grep/Bash, checking metrics, tracing requests to find root causes and propose fixes with rollbacks.
Executes structured production incident response: triages P1-P3 severity, contains blast radius (rollback, mitigation), root-causes after stabilization, logs timeline, generates postmortem. Triggers on outages or 'incident'.
Classifies incidents by severity (SEV1-4), constructs timelines, assesses impact, performs 5 Whys root cause analysis, and generates blameless post-mortems for production issues.
Share bugs, ideas, or general feedback.
When production is down, every minute costs trust. This skill runs an incident like a principal SRE — fast triage, clear decision-making, structured recovery, and prevention so it never happens again.
Before doing anything, classify the incident:
| Severity | Definition | Response Time | Example |
|---|---|---|---|
| SEV-1 | Service completely down, all users affected | Immediately | Site returns 500, database unreachable |
| SEV-2 | Major feature broken, many users affected | Within 15 min | Auth broken, payments failing, data loss |
| SEV-3 | Minor feature broken, some users affected | Within 1 hour | One API endpoint slow, email not sending |
| SEV-4 | Cosmetic or edge case | Next business day | UI glitch on one browser, non-critical error log |
Severity determines urgency. SEV-1/2: restore first, investigate later. SEV-3/4: investigate first, then fix.
Ask for:
If the user is panicking, skip questions and use whatever info is available. Speed > completeness for SEV-1.
node ${CLAUDE_PLUGIN_ROOT}/tools/incident-commander.mjs <project-directory> --url=<production-url>
Parse the JSON output.
Present findings in order of urgency:
Site Status:
Likely Culprit:
Error Patterns Found:
Resource Issues:
Present in order of speed — for SEV-1/2, always recommend Option 1 first:
Option 1: Rollback (fastest — 2-5 min)
git revert <culprit-hash> --no-edit && git push
This is almost always the right first move. Restore service, then investigate.
When NOT to rollback:
Option 2: Hot Fix (5-15 min) If the error pattern is clear and the fix is small:
fix: [what was broken] — incident [date]Option 3: Traffic Management (immediate) If the issue is load-related:
Option 4: Investigate Further If the cause isn't clear:
railway logs, Vercel: function logs)After applying a fix:
node ${CLAUDE_PLUGIN_ROOT}/tools/health-check.mjs <production-url>
Confirm the site is back to healthy status. Check:
For SEV-1/2, the user needs to communicate with their users:
Status page update template:
[Investigating] We're aware of [issue description] and are actively working on a fix.
[Identified] We've identified the cause and are deploying a fix.
[Resolved] The issue has been resolved. [Brief explanation]. We apologize for the disruption.
If the user has a status page: help them post the update. If they don't: suggest setting up a simple one (Instatus, Betteruptime, or a static page).
Generate a post-mortem document from the incident-commander output:
# Incident Post-Mortem — [Date]
## Summary
- **What happened:** [One sentence]
- **Severity:** SEV-[N]
- **Duration:** [start time] to [end time] ([N] minutes)
- **Impact:** [Who was affected, what they experienced]
- **Root cause:** [One sentence]
## Timeline
| Time | Event |
|---|---|
| HH:MM | Issue detected (how: monitoring/user report/deploy) |
| HH:MM | Investigation started |
| HH:MM | Root cause identified |
| HH:MM | Fix deployed |
| HH:MM | Service restored |
## Root Cause Analysis
[Detailed explanation of what went wrong and why]
## What Went Well
- [Fast detection, quick recovery, etc.]
## What Went Wrong
- [Missed in review, no test coverage, no monitoring, etc.]
## Action Items
| Action | Priority | Owner | Deadline |
|---|---|---|---|
| Add test for this failure case | High | [user] | This week |
| Add monitoring for [pattern] | High | [user] | This week |
| Add pre-deploy check that would have caught this | Medium | [user] | This sprint |
| [Update runbook/docs] | Low | [user] | This month |
Save to docs/incidents/YYYY-MM-DD-incident.md.
Based on the incident, suggest concrete preventive measures:
Immediate (today):
/ship pre-deploy auditThis week:
/health or /api/health)This month: