From harness-claude
Generates runbooks, postmortem analyses, and tracks SLOs/SLAs. Diagnoses incidents by tracing symptoms through services, produces structured postmortems, and maintains error budgets.
`npx claudepluginhub intense-visions/harness-engineering --plugin harness-claude`

This skill uses the workspace's default tool permissions.
Identify the incident signal. Scan available evidence to determine what triggered the investigation:
- `docs/incidents/` or `docs/postmortems/` for prior related incidents
- `git log --oneline --since="48 hours ago"` for correlated changes

Map affected services. Trace the blast radius from the incident signal:
- dependency declarations (`docker-compose.yml`, `kubernetes/`, service mesh configs)

Classify severity. Apply the project's severity matrix if one exists in `docs/runbooks/severity-matrix.md`. Otherwise, use standard classification, from SEV1 (complete outage) through SEV4 (minor or cosmetic impact).
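As a sketch, the standard SEV1-4 classification could be encoded as a small helper. The metric names and thresholds below are illustrative assumptions, not part of the skill:

```python
def classify_severity(availability_pct: float, users_affected_pct: float) -> str:
    """Map impact metrics to a SEV1-4 level (illustrative thresholds)."""
    if availability_pct == 0 or users_affected_pct >= 90:
        return "SEV1"  # complete outage
    if users_affected_pct >= 25:
        return "SEV2"  # major degradation
    if users_affected_pct >= 5:
        return "SEV3"  # partial impact, workaround exists
    return "SEV4"      # minor or cosmetic
```

A project-specific severity matrix in `docs/runbooks/severity-matrix.md` would override thresholds like these.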
Establish timeline boundaries. Determine when the first symptom appeared, when the incident was detected, and when service was fully restored.
Check for existing runbooks. Search `docs/runbooks/` and `runbooks/` for procedures matching the affected service or failure mode. If a runbook exists, evaluate whether it was followed and whether it was effective.
Correlate with recent changes. Run `git log --oneline --since="7 days ago"` and cross-reference commits with the incident timeline. Flag commits that touched affected services or their dependencies.
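The correlation step can be sketched as a pure function over already-parsed git history. The tuple shape and path prefixes here are assumptions for illustration (e.g. as parsed from `git log --name-only`):

```python
from datetime import datetime, timedelta

def flag_suspect_commits(commits, incident_start, affected_paths, lookback_days=7):
    """Return SHAs of commits in the lookback window that touched affected services.

    `commits` is a list of (sha, committed_at, files) tuples; this shape is
    an illustrative assumption, not the skill's actual data model.
    """
    window_start = incident_start - timedelta(days=lookback_days)
    suspects = []
    for sha, committed_at, files in commits:
        if not (window_start <= committed_at <= incident_start):
            continue  # outside the correlation window
        if any(f.startswith(p) for f in files for p in affected_paths):
            suspects.append(sha)
    return suspects
```

In the first example below, this is the step that would surface commit `abc123`, deployed seven minutes before the alert fired.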
Analyze error patterns. Search the codebase for error handling related to the failure.

Trace data flow. Map the request path from entry point to failure point.

Identify contributing factors. Distinguish between the root cause and the contributing factors that allowed it to reach production.

Validate the hypothesis. Confirm the root cause by checking it against the full incident timeline and the observed symptoms.
Generate the postmortem report. Create a structured document in `docs/postmortems/YYYY-MM-DD-<slug>.md` with these sections: summary, impact, timeline, root cause, contributing factors, and action items.
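The filename convention above could be produced by a small helper; the function name is my own, not part of the skill:

```python
import re
from datetime import date

def postmortem_path(title: str, day: date) -> str:
    """Build a docs/postmortems/YYYY-MM-DD-<slug>.md path from an incident title."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"docs/postmortems/{day.isoformat()}-{slug}.md"
```

For example, "User Service Timeout" on 2026-03-15 slugs to the same path shown in the first worked example.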
Create or update runbooks. For each failure mode identified, write or update `docs/runbooks/<service>-<failure-mode>.md`.

Update the incident log. If `docs/incidents/index.md` exists, append the new incident with date, severity, MTTR, and a link to the postmortem.
Tag related code. Add or update `// INCIDENT-YYYY-MM-DD: <description>` comments at the code locations involved in the root cause. This creates a searchable history of incident-prone code.
Calculate SLO impact. If `slo.yaml` or equivalent SLO definitions exist, compute how much error budget the incident consumed.

Evaluate alerting effectiveness. For each alert that fired (or should have fired), assess whether it detected the issue promptly and pointed to the right runbook.

Propose SLO adjustments. Based on the incident analysis, recommend tightening, loosening, or adding objectives.

Generate preventive action items. Categorize actions by type (e.g., prevent recurrence, improve detection, reduce blast radius).
Produce the improvement summary. Output a prioritized action list with effort estimates and expected impact on MTTD and MTTR.
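The error budget arithmetic behind the SLO-impact step is simple enough to show directly. This is a generic sketch of the standard calculation, not the skill's implementation:

```python
def error_budget_consumed(slo_target: float, window_minutes: int, bad_minutes: float) -> float:
    """Fraction of the window's error budget consumed by an incident.

    slo_target: e.g. 0.999 for a 99.9% availability SLO.
    bad_minutes: incident duration weighted by the fraction of traffic affected.
    """
    budget_minutes = (1 - slo_target) * window_minutes
    return bad_minutes / budget_minutes
```

For a 99.9% monthly SLO (43,200 minutes), the budget is 43.2 minutes of full unavailability; a partial outage consumes budget in proportion to the traffic it affects.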
Commands and tools:
- `harness skill run harness-incident-response`: primary CLI entry point; runs all four phases.
- `harness validate`: run after generating documents to ensure project structure is intact.
- `harness check-deps`: verify service dependency declarations match the incident trace.
- `emit_interaction`: used at severity classification (`checkpoint:decision`) to confirm severity with the operator before proceeding.
- `Glob`: discover existing runbooks, postmortems, and SLO definitions.
- `Grep`: search for error patterns, alert configurations, and incident-related code comments.
- `Write`: generate postmortem reports and runbook documents.
- `Edit`: update existing runbooks and incident indexes.

Phase 1: ASSESS
Signal: Datadog alert "api-gateway p99 latency > 2000ms" fired at 14:32 UTC
Affected: api-gateway -> user-service -> PostgreSQL
Severity: SEV2 (major degradation, 40% of requests timing out)
MTTD: 4 minutes (alert fired 4 min after first error)
MTTR: 47 minutes (resolved at 15:19 UTC)
Phase 2: INVESTIGATE
Correlated change: commit abc123 "add user preferences join" deployed at 14:25 UTC
Root cause: N+1 query in GET /api/users/:id/preferences — new LEFT JOIN
on unindexed column `preferences.user_id` caused full table scan
Contributing factors:
- No query performance test for the preferences endpoint
- Missing database index on preferences.user_id
- No circuit breaker between api-gateway and user-service
Phase 3: DOCUMENT
Created: docs/postmortems/2026-03-15-user-service-timeout.md
Created: docs/runbooks/user-service-database-slow-query.md
Updated: docs/incidents/index.md
Phase 4: IMPROVE
SLO impact: Consumed 12% of monthly error budget (88% remaining)
Action items:
1. [P0] Add index on preferences.user_id (owner: @backend, due: 2026-03-16)
2. [P1] Add query execution time assertions to integration tests (owner: @backend, due: 2026-03-22)
3. [P1] Add circuit breaker on api-gateway -> user-service (owner: @platform, due: 2026-03-22)
4. [P2] Add Datadog query performance monitor for user-service (owner: @sre, due: 2026-03-29)
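Action item 3 above calls for a circuit breaker between api-gateway and user-service. A minimal Python sketch of the pattern, with illustrative thresholds (a production system would use a library or mesh-level breaker):

```python
import time

class CircuitBreaker:
    """Open after `max_failures` consecutive failures, then fail fast
    until `reset_after` seconds have passed (illustrative sketch)."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success resets the failure count
        return result
```

The point of the breaker in this incident: once user-service queries started timing out, api-gateway would fail fast instead of letting requests pile up behind the slow database.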
Phase 1: ASSESS
Signal: PagerDuty incident #4521 — payment-service pods in CrashLoopBackOff
Affected: payment-service -> Stripe API -> order-service (downstream)
Severity: SEV1 (payment processing completely down)
MTTD: 2 minutes (PagerDuty auto-detected from Kubernetes health checks)
MTTR: 23 minutes
Phase 2: INVESTIGATE
Root cause: Environment variable STRIPE_WEBHOOK_SECRET rotated in Vault
but payment-service pods were not restarted to pick up new value.
Stripe signature verification failed on all incoming webhooks, causing
panic in the webhook handler (no error recovery).
Contributing factors:
- Vault secret rotation did not trigger pod restart
- Webhook handler used panic instead of returning error
- No runbook for secret rotation procedures
Phase 3: DOCUMENT
Created: docs/postmortems/2026-03-20-payment-service-crashloop.md
Created: docs/runbooks/payment-service-secret-rotation.md
Created: docs/runbooks/payment-service-stripe-webhook-failure.md
Updated: docs/incidents/index.md
Phase 4: IMPROVE
SLO impact: Consumed 100% of weekly error budget. Feature freeze recommended.
Action items:
1. [P0] Add Vault agent sidecar with auto-restart on secret change (owner: @platform)
2. [P0] Replace panic with error return in webhook handler (owner: @payments)
3. [P1] Add synthetic Stripe webhook test to canary suite (owner: @payments)
4. [P2] Create secret rotation runbook for all services (owner: @sre)
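Action item 2 replaces the panic with an error return. A Python sketch of the same idea, assuming a Stripe-style HMAC-SHA256 signature (the real Stripe scheme also signs a timestamp; this is simplified):

```python
import hashlib
import hmac

def handle_webhook(payload: bytes, signature: str, secret: str):
    """Verify a webhook signature and return (status, body) instead of crashing.

    A failed verification is an expected condition: it maps to a 400 response
    rather than an unhandled exception that kills the worker process.
    """
    expected = hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature):
        return 400, "invalid signature"  # reject this request, keep serving
    return 200, "ok"
```

With this shape, a rotated `STRIPE_WEBHOOK_SECRET` produces a stream of 400s and an alert, not a CrashLoopBackOff.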
| Rationalization | Reality |
|---|---|
| "The root cause was human error — someone pushed a bad config" | Human error is a symptom, not a root cause. The root cause is the system that allowed a bad config to reach production undetected. A postmortem that stops at "human error" prevents no future incidents because it identifies no systemic fix. |
| "We know what happened — we don't need to write a full postmortem for a minor incident" | The decision about what is "minor" is made under the stress of recovery, not under calm analysis. Contributing factors and near-misses that look minor in the moment are frequently the root cause of the next major incident. Document while the context is fresh. |
| "The action items are in Slack — we don't need to track them formally" | Action items not tracked in a formal system with owners and due dates are not completed. Slack messages are buried within hours. The improvement phase of an incident exists only if its outputs are tracked to completion. |
| "We don't have SLOs yet so we can't calculate error budget impact" | The absence of SLOs is itself a finding. Without SLOs, there is no objective basis for deciding whether reliability is acceptable. The incident is the forcing function to establish baseline SLOs. Document this gap as a P0 action item. |
| "The incident was caused by a third-party outage — nothing we could have done" | Third-party outages expose missing circuit breakers, absent fallbacks, and insufficient multi-region routing. The postmortem should document why the third-party outage caused a customer-visible incident and what resilience improvements would have isolated the blast radius. |