From harness-claude
Generates runbooks, postmortem analyses, and tracks SLOs/SLAs. Diagnoses incidents by tracing symptoms through services, produces structured postmortems, and maintains error budgets.
`npx claudepluginhub intense-visions/harness-engineering --plugin harness-claude`

This skill uses the workspace's default tool permissions.
Identify the incident signal. Scan available evidence to determine what triggered the investigation:
- `docs/incidents/` or `docs/postmortems/` for prior related incidents
- `git log --oneline --since="48 hours ago"` for correlated changes

Map affected services. Trace the blast radius from the incident signal:
- dependency declarations (`docker-compose.yml`, `kubernetes/`, service mesh configs)

Classify severity. Apply the project's severity matrix if one exists in `docs/runbooks/severity-matrix.md`. Otherwise, use standard classification, from SEV1 (complete outage) through SEV4 (minor or cosmetic impact).
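As a sketch, the standard SEV1-4 classification could be encoded as a small helper. The metric names and thresholds below are illustrative assumptions, not part of the skill:

```python
def classify_severity(availability_pct: float, users_affected_pct: float) -> str:
    """Map impact metrics to a SEV1-4 level (illustrative thresholds)."""
    if availability_pct == 0 or users_affected_pct >= 90:
        return "SEV1"  # complete outage
    if users_affected_pct >= 25:
        return "SEV2"  # major degradation
    if users_affected_pct >= 5:
        return "SEV3"  # partial impact, workaround exists
    return "SEV4"      # minor or cosmetic
```

A project-specific severity matrix in `docs/runbooks/severity-matrix.md` would override thresholds like these.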
Establish timeline boundaries. Determine when the first symptom appeared, when the incident was detected, and when service was fully restored.
Check for existing runbooks. Search `docs/runbooks/` and `runbooks/` for procedures matching the affected service or failure mode. If a runbook exists, evaluate whether it was followed and whether it was effective.
Correlate with recent changes. Run `git log --oneline --since="7 days ago"` and cross-reference commits with the incident timeline. Flag commits that touched affected services or their dependencies.
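The correlation step can be sketched as a pure function over already-parsed git history. The tuple shape and path prefixes here are assumptions for illustration (e.g. as parsed from `git log --name-only`):

```python
from datetime import datetime, timedelta

def flag_suspect_commits(commits, incident_start, affected_paths, lookback_days=7):
    """Return SHAs of commits in the lookback window that touched affected services.

    `commits` is a list of (sha, committed_at, files) tuples; this shape is
    an illustrative assumption, not the skill's actual data model.
    """
    window_start = incident_start - timedelta(days=lookback_days)
    suspects = []
    for sha, committed_at, files in commits:
        if not (window_start <= committed_at <= incident_start):
            continue  # outside the correlation window
        if any(f.startswith(p) for f in files for p in affected_paths):
            suspects.append(sha)
    return suspects
```

In the first example below, this is the step that would surface commit `abc123`, deployed seven minutes before the alert fired.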
Analyze error patterns. Search the codebase for error handling related to the failure.

Trace data flow. Map the request path from entry point to failure point.

Identify contributing factors. Distinguish between the root cause and the contributing factors that allowed it to reach production.

Validate the hypothesis. Confirm the root cause by checking it against the full incident timeline and the observed symptoms.
Generate the postmortem report. Create a structured document in `docs/postmortems/YYYY-MM-DD-<slug>.md` with these sections: summary, impact, timeline, root cause, contributing factors, and action items.
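The filename convention above could be produced by a small helper; the function name is my own, not part of the skill:

```python
import re
from datetime import date

def postmortem_path(title: str, day: date) -> str:
    """Build a docs/postmortems/YYYY-MM-DD-<slug>.md path from an incident title."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"docs/postmortems/{day.isoformat()}-{slug}.md"
```

For example, "User Service Timeout" on 2026-03-15 slugs to the same path shown in the first worked example.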
Create or update runbooks. For each failure mode identified, write or update `docs/runbooks/<service>-<failure-mode>.md`.

Update the incident log. If `docs/incidents/index.md` exists, append the new incident with date, severity, MTTR, and a link to the postmortem.
Tag related code. Add or update `// INCIDENT-YYYY-MM-DD: <description>` comments at the code locations involved in the root cause. This creates a searchable history of incident-prone code.
Calculate SLO impact. If `slo.yaml` or equivalent SLO definitions exist, compute how much error budget the incident consumed.

Evaluate alerting effectiveness. For each alert that fired (or should have fired), assess whether it detected the issue promptly and pointed to the right runbook.

Propose SLO adjustments. Based on the incident analysis, recommend tightening, loosening, or adding objectives.

Generate preventive action items. Categorize actions by type (e.g., prevent recurrence, improve detection, reduce blast radius).
Produce the improvement summary. Output a prioritized action list with effort estimates and expected impact on MTTD and MTTR.
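The error budget arithmetic behind the SLO-impact step is simple enough to show directly. This is a generic sketch of the standard calculation, not the skill's implementation:

```python
def error_budget_consumed(slo_target: float, window_minutes: int, bad_minutes: float) -> float:
    """Fraction of the window's error budget consumed by an incident.

    slo_target: e.g. 0.999 for a 99.9% availability SLO.
    bad_minutes: incident duration weighted by the fraction of traffic affected.
    """
    budget_minutes = (1 - slo_target) * window_minutes
    return bad_minutes / budget_minutes
```

For a 99.9% monthly SLO (43,200 minutes), the budget is 43.2 minutes of full unavailability; a partial outage consumes budget in proportion to the traffic it affects.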
Commands and tools:
- `harness skill run harness-incident-response`: primary CLI entry point; runs all four phases.
- `harness validate`: run after generating documents to ensure project structure is intact.
- `harness check-deps`: verify service dependency declarations match the incident trace.
- `emit_interaction`: used at severity classification (`checkpoint:decision`) to confirm severity with the operator before proceeding.
- `Glob`: discover existing runbooks, postmortems, and SLO definitions.
- `Grep`: search for error patterns, alert configurations, and incident-related code comments.
- `Write`: generate postmortem reports and runbook documents.
- `Edit`: update existing runbooks and incident indexes.

Phase 1: ASSESS
Signal: Datadog alert "api-gateway p99 latency > 2000ms" fired at 14:32 UTC
Affected: api-gateway -> user-service -> PostgreSQL
Severity: SEV2 (major degradation, 40% of requests timing out)
MTTD: 4 minutes (alert fired 4 min after first error)
MTTR: 47 minutes (resolved at 15:19 UTC)
Phase 2: INVESTIGATE
Correlated change: commit abc123 "add user preferences join" deployed at 14:25 UTC
Root cause: N+1 query in GET /api/users/:id/preferences — new LEFT JOIN
on unindexed column `preferences.user_id` caused full table scan
Contributing factors:
- No query performance test for the preferences endpoint
- Missing database index on preferences.user_id
- No circuit breaker between api-gateway and user-service
Phase 3: DOCUMENT
Created: docs/postmortems/2026-03-15-user-service-timeout.md
Created: docs/runbooks/user-service-database-slow-query.md
Updated: docs/incidents/index.md
Phase 4: IMPROVE
SLO impact: Consumed 12% of monthly error budget (88% remaining)
Action items:
1. [P0] Add index on preferences.user_id (owner: @backend, due: 2026-03-16)
2. [P1] Add query execution time assertions to integration tests (owner: @backend, due: 2026-03-22)
3. [P1] Add circuit breaker on api-gateway -> user-service (owner: @platform, due: 2026-03-22)
4. [P2] Add Datadog query performance monitor for user-service (owner: @sre, due: 2026-03-29)
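Action item 3 above calls for a circuit breaker between api-gateway and user-service. A minimal Python sketch of the pattern, with illustrative thresholds (a production system would use a library or mesh-level breaker):

```python
import time

class CircuitBreaker:
    """Open after `max_failures` consecutive failures, then fail fast
    until `reset_after` seconds have passed (illustrative sketch)."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success resets the failure count
        return result
```

The point of the breaker in this incident: once user-service queries started timing out, api-gateway would fail fast instead of letting requests pile up behind the slow database.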
Phase 1: ASSESS
Signal: PagerDuty incident #4521 — payment-service pods in CrashLoopBackOff
Affected: payment-service -> Stripe API -> order-service (downstream)
Severity: SEV1 (payment processing completely down)
MTTD: 2 minutes (PagerDuty auto-detected from Kubernetes health checks)
MTTR: 23 minutes
Phase 2: INVESTIGATE
Root cause: Environment variable STRIPE_WEBHOOK_SECRET rotated in Vault
but payment-service pods were not restarted to pick up new value.
Stripe signature verification failed on all incoming webhooks, causing
panic in the webhook handler (no error recovery).
Contributing factors:
- Vault secret rotation did not trigger pod restart
- Webhook handler used panic instead of returning error
- No runbook for secret rotation procedures
Phase 3: DOCUMENT
Created: docs/postmortems/2026-03-20-payment-service-crashloop.md
Created: docs/runbooks/payment-service-secret-rotation.md
Created: docs/runbooks/payment-service-stripe-webhook-failure.md
Updated: docs/incidents/index.md
Phase 4: IMPROVE
SLO impact: Consumed 100% of weekly error budget. Feature freeze recommended.
Action items:
1. [P0] Add Vault agent sidecar with auto-restart on secret change (owner: @platform)
2. [P0] Replace panic with error return in webhook handler (owner: @payments)
3. [P1] Add synthetic Stripe webhook test to canary suite (owner: @payments)
4. [P2] Create secret rotation runbook for all services (owner: @sre)
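Action item 2 replaces the panic with an error return. A Python sketch of the same idea, assuming a Stripe-style HMAC-SHA256 signature (the real Stripe scheme also signs a timestamp; this is simplified):

```python
import hashlib
import hmac

def handle_webhook(payload: bytes, signature: str, secret: str):
    """Verify a webhook signature and return (status, body) instead of crashing.

    A failed verification is an expected condition: it maps to a 400 response
    rather than an unhandled exception that kills the worker process.
    """
    expected = hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature):
        return 400, "invalid signature"  # reject this request, keep serving
    return 200, "ok"
```

With this shape, a rotated `STRIPE_WEBHOOK_SECRET` produces a stream of 400s and an alert, not a CrashLoopBackOff.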
| Rationalization | Reality |
|---|---|
| "The root cause was human error — someone pushed a bad config" | Human error is a symptom, not a root cause. The root cause is the system that allowed a bad config to reach production undetected. A postmortem that stops at "human error" prevents no future incidents because it identifies no systemic fix. |
| "We know what happened — we don't need to write a full postmortem for a minor incident" | The decision about what is "minor" is made under the stress of recovery, not under calm analysis. Contributing factors and near-misses that look minor in the moment are frequently the root cause of the next major incident. Document while the context is fresh. |
| "The action items are in Slack — we don't need to track them formally" | Action items not tracked in a formal system with owners and due dates are not completed. Slack messages are buried within hours. The improvement phase of an incident exists only if its outputs are tracked to completion. |
| "We don't have SLOs yet so we can't calculate error budget impact" | The absence of SLOs is itself a finding. Without SLOs, there is no objective basis for deciding whether reliability is acceptable. The incident is the forcing function to establish baseline SLOs. Document this gap as a P0 action item. |
| "The incident was caused by a third-party outage — nothing we could have done" | Third-party outages expose missing circuit breakers, absent fallbacks, and insufficient multi-region routing. The postmortem should document why the third-party outage caused a customer-visible incident and what resilience improvements would have isolated the blast radius. |