Skill

incident-response

Automates incident response for CloudWatch/Prometheus alarms: classifies severity, retrieves runbooks, generates hypotheses, runs MCP diagnostics, and executes human-approved remediations. Pages on-call for SEV1.

AWS

Prometheus

Bash

devops

monitoring

npx claudepluginhub aws-samples/sample-oh-my-aidlcops --plugin agenticops

Configuration

Model: claude-opus-4-7

Tool Access

This skill is limited to using the following tools:

Read, Grep, Bash, mcp__cloudwatch, mcp__prometheus

Preview

- When a CloudWatch Alarm or Prometheus AlertManager fires a threshold-breach alert

SKILL.md


Stats

Parent Repo Stars: 7

Parent Repo Forks: 2

Last Commit: Apr 30, 2026


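Severity	Criteria	Human involvement
SEV1	Positive toxicity/PII leakage detection, data leak, full production outage, errors on 30% or more of traffic	Page on-call immediately. The agent only runs diagnostics and does not execute remediation.
SEV2	Partial service outage, P99 latency increase of 2× or more, circuit breaker trip, outage in a specific region	The agent prepares a drafted response → executed after human approval.
SEV3	Quality regression (e.g. faithfulness -5pp), cost spike, rising error rate for a single agent	The agent prepares a drafted response → executed after human approval.
SEV4	Warning-level signals (e.g. log volume growth, 15% increase in token usage)	The agent only generates a report and adds it to the weekly review queue.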
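The numeric thresholds in the table can be sketched as a small classifier. This is a minimal illustration, not the skill's actual logic: the function name `classify_severity` and its four arguments are assumptions, and the non-numeric SEV1 criteria (PII leakage, data leak) are not covered here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a few numeric signals onto the severity table.
# All argument names are assumptions made for illustration:
#   $1 error_pct        share of traffic returning errors, in percent
#   $2 p99_ratio        current P99 latency divided by the baseline P99
#   $3 faithfulness_pp  change in faithfulness score, in percentage points
#   $4 token_growth_pct growth in token usage, in percent
classify_severity() {
  awk -v e="$1" -v p="$2" -v f="$3" -v t="$4" 'BEGIN {
    if (e >= 30)      print "SEV1"   # 30% or more of traffic erroring
    else if (p >= 2)  print "SEV2"   # P99 latency at least 2x baseline
    else if (f <= -5) print "SEV3"   # faithfulness dropped 5pp or more
    else if (t >= 15) print "SEV4"   # token usage up 15% or more
    else              print "NONE"   # below every threshold in the table
  }'
}
```

The checks run from most to least severe, so an alarm that satisfies several rows is assigned the highest matching severity.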


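Step 1: Receive & Classify

ALARM_ID="$1"
ALARM=$(aws cloudwatch describe-alarms --alarm-names "$ALARM_ID" \
  --query 'MetricAlarms[0]' --output json)
# Or via MCP
# mcp__cloudwatch__get_alarm --name "$ALARM_ID"

# describe-alarms does not return tags, so look them up by the alarm's ARN.
ALARM_ARN=$(jq -r '.AlarmArn' <<< "$ALARM")
SEVERITY=$(aws cloudwatch list-tags-for-resource --resource-arn "$ALARM_ARN" --output json \
  | jq -r '.Tags[] | select(.Key=="severity") | .Value')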
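Once the severity tag is read, the human-involvement column of the severity table suggests a simple dispatch. The sketch below assumes a hypothetical function `route_action`; the action names it prints are illustrative labels, not identifiers from the skill.

```bash
#!/usr/bin/env bash
# Hypothetical dispatcher: print the action the severity table prescribes.
# The action names are illustrative labels only.
route_action() {
  case "$1" in
    SEV1)      echo "page_oncall_diagnose_only" ;;
    SEV2|SEV3) echo "draft_remediation_await_approval" ;;
    SEV4)      echo "report_to_weekly_review" ;;
    *)         echo "unknown_severity" >&2; return 1 ;;  # missing or malformed tag
  esac
}
```

Returning a non-zero status for an empty or unrecognized tag lets the caller fail closed, for example by treating an untagged alarm as requiring human triage.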

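Step 2: Runbook Lookup

# Lowercase first so that uppercase letters are not replaced with hyphens.
SYMPTOM=$(jq -r '.AlarmDescription' <<< "$ALARM" | tr '[:upper:]' '[:lower:]' | sed 's/[^a-z0-9-]/-/g')
RUNBOOK=$(ls .omao/plans/runbooks/*.md 2>/dev/null | grep -i -- "$SYMPTOM" | head -1)
if [ -z "$RUNBOOK" ]; then
  echo "No matching runbook. Proceeding with generic diagnostic flow."
fi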
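The runbook match depends on turning a free-text alarm description into a filename-friendly slug. A minimal sketch of that normalization follows; the function name `slugify` is an assumption, and collapsing repeated separators is an addition that the original one-liner does not perform.

```bash
#!/usr/bin/env bash
# Hypothetical helper: turn an alarm description into a runbook-style slug.
# Lowercases the text, collapses every run of non-alphanumerics into one
# hyphen, and trims hyphens from both ends.
slugify() {
  printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//'
}

slugify "RAG QA Error Spike"   # prints rag-qa-error-spike
```

Collapsing separators makes a description such as "RAG QA: Error Spike!" produce the same slug as "RAG QA Error Spike", which matters when the slug is used as a substring match against runbook filenames.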

Step 3: Hypothesis Generation

{
  "hypotheses": [
    {
      "id": "H1",
      "claim": "Retrieval index outdated after 2026-04-20 reindex job",
      "diagnostic_query": "cloudwatch: /aws/lambda/reindex-job last 24h",
      "confidence_prior": 0.4
    },
    {
      "id": "H2",
      "claim": "New model version v2.3.1 introduced context window truncation",
      "diagnostic_query": "prometheus: agent_context_truncation_total{version='v2.3.1'}",
      "confidence_prior": 0.3
    },
    {
      "id": "H3",
      "claim": "Vector DB (Milvus) slow query due to compaction backlog",
      "diagnostic_query": "prometheus: milvus_compaction_queue_length",
      "confidence_prior": 0.3
    }
  ]
}
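The source does not say how a `confidence_prior` turns into a post-diagnostic confidence. One possible approach, shown here purely as an assumption, is a Bayes-style update: multiply each prior by a likelihood score derived from the diagnostic result, then normalize. The function name `update_confidence` and the likelihood values in the usage line are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: Bayes-style update of hypothesis confidences.
# Reads lines of "<id> <prior> <likelihood>" on stdin and prints
# "<id> <posterior>", where posterior = prior * likelihood / sum of all products.
update_confidence() {
  awk '{ id[NR] = $1; w[NR] = $2 * $3; total += w[NR] }
       END {
         if (total == 0) {
           print "no hypothesis is supported by the evidence" > "/dev/stderr"
           exit 1
         }
         for (i = 1; i <= NR; i++) printf "%s %.2f\n", id[i], w[i] / total
       }'
}

# Priors from the hypotheses above; the likelihoods are made-up examples.
printf 'H1 0.4 0.1\nH2 0.3 0.2\nH3 0.3 0.9\n' | update_confidence
# prints:
# H1 0.11
# H2 0.16
# H3 0.73
```

Normalizing keeps the confidences summing to 1, so a strong diagnostic signal for one hypothesis automatically lowers the others.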

Step 4: Diagnostic MCP Queries

# Hypothesis H1: reindex job health
mcp__cloudwatch__filter_log_events \
  --log-group /aws/lambda/reindex-job \
  --start-time $(date -u -d '-24 hours' +%s)000 \
  --filter-pattern "ERROR"

# Hypothesis H2: context truncation
mcp__prometheus__query_range \
  --query 'rate(agent_context_truncation_total{version="v2.3.1"}[5m])' \
  --start "$(date -u -d '-6 hours' +%s)" \
  --end "$(date -u +%s)"

# Hypothesis H3: Milvus compaction
mcp__prometheus__query \
  --query 'milvus_compaction_queue_length'
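The `date -u -d '-24 hours'` form in the queries above relies on GNU date and is not accepted by BSD/macOS date. A portable way to compute the same window start with shell arithmetic might look like this; the function name `epoch_ms_ago` and its optional second argument are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: epoch milliseconds for "N hours ago", without GNU-only
# date flags. The optional second argument overrides "now" (epoch seconds),
# which makes the function deterministic in tests.
epoch_ms_ago() {
  local hours="$1"
  local now="${2:-$(date -u +%s)}"
  echo $(( (now - hours * 3600) * 1000 ))
}

epoch_ms_ago 24 1000000   # prints 913600000
```

For the Prometheus range query, which takes seconds rather than milliseconds, the same arithmetic applies without the final multiplication by 1000.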

Step 5: Remediation (Drafted Response + Human Approval)

INCIDENT_DIR=.omao/state/incident/sev2-20260421-1023
mkdir -p "$INCIDENT_DIR"  # the redirect below fails if the directory is missing
cat > "$INCIDENT_DIR/remediation.sh" <<'EOF'
#!/bin/bash
# Proposed remediation for SEV2 incident
# Root cause: Milvus compaction backlog (H3 confidence 0.82)
# Reviewer: please approve before execution
kubectl -n milvus exec milvus-proxy-0 -- milvus-cli \
  --command "compact -collection=agent_kb"

# Verify
kubectl -n milvus exec milvus-proxy-0 -- milvus-cli \
  --command "describe -collection=agent_kb" | grep "compaction_state"
EOF

echo "Drafted remediation at $INCIDENT_DIR/remediation.sh"
echo "Approve via: gh issue comment <issue-id> --body '/approve-remediation'"
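The drafted script must not run until a human posts the approval command. A minimal sketch of such a gate is below, assuming the issue's comment bodies are piped in one per line; the function name `is_approved` is hypothetical, and how the skill actually detects approval is not stated in the source.

```bash
#!/usr/bin/env bash
# Hypothetical gate: succeed only if some input line is exactly the approval
# command. Comment bodies are expected on stdin.
is_approved() {
  grep -qxF '/approve-remediation'
}

# Example with inline input instead of real issue comments:
if printf 'looking into it\n/approve-remediation\n' | is_approved; then
  echo "approved"
else
  echo "still awaiting approval"
fi
# prints approved
```

In practice the input could come from something like `gh issue view <issue-id> --json comments --jq '.comments[].body'`, which is an assumption about the workflow. Exact-line matching avoids treating a comment such as "do not /approve-remediation yet" as approval, but this sketch does not verify who posted the comment, which a real gate should do.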

# Page on-call immediately
curl -X POST "$PAGERDUTY_INCIDENT_URL" \
  -H "Authorization: Token token=$PD_TOKEN" \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg id "$ALARM_ID" --arg desc "$SEVERITY $SYMPTOM" \
    '{incident:{type:"incident",title:$desc,service:{id:"PXXXXX",type:"service_reference"},urgency:"high"}}')"

# Freeze autopilot-deploy
mkdir -p .omao/state/autopilot-deploy
echo '{"circuit_breaker_status":"tripped","reason":"SEV1 incident"}' \
  > .omao/state/autopilot-deploy/freeze.json
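The freeze file is only useful if deploy tooling checks it before acting. A sketch of such a check follows; the function name `is_frozen` is an assumption, and it uses `grep` rather than `jq` so that it has no extra dependencies.

```bash
#!/usr/bin/env bash
# Hypothetical check: succeed if the freeze file exists and marks the circuit
# breaker as tripped. The default path is the one written by the freeze step.
is_frozen() {
  local freeze_file="${1:-.omao/state/autopilot-deploy/freeze.json}"
  [ -f "$freeze_file" ] &&
    grep -Eq '"circuit_breaker_status"[[:space:]]*:[[:space:]]*"tripped"' "$freeze_file"
}

if is_frozen; then
  echo "autopilot-deploy is frozen; skipping deploy"
fi
```

A missing file is treated as "not frozen" here. Whether that is the right default depends on how the state directory is managed, which the source does not specify.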

[12:35Z] Received alarm: rag-qa-error-rate-spike
[12:35Z] Severity: SEV2 (error rate 2.1× baseline for 5m)
[12:35Z] Runbook match: .omao/plans/runbooks/rag-qa-error-spike.md
[12:36Z] Generated 3 hypotheses
[12:38Z] Diagnostic MCP queries complete
[12:38Z] Root cause candidate: H3 (Milvus compaction backlog, confidence=0.82)
[12:39Z] Drafted remediation at .omao/state/incident/sev2-20260421-1235/remediation.sh
[12:39Z] AWAITING HUMAN APPROVAL. autopilot-deploy frozen.

[14:02Z] Received alarm: pii-leak-detected
[14:02Z] Severity: SEV1 (PII token found in agent response)
[14:02Z] On-call paged: PagerDuty incident P-A1B2C3
[14:02Z] autopilot-deploy frozen for all agents
[14:02Z] Agent diagnosis continuing; no remediation will be auto-drafted for SEV1
[14:05Z] Diagnostic complete: see .omao/state/incident/sev1-20260421-1402/
[14:05Z] Human responder in control. Agent awaiting /release-sev1 command.
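The incident directories in these logs appear to follow a `<severity>-<YYYYMMDD>-<HHMM>` pattern in UTC. Assuming that reading is right, the path could be derived as sketched below; the function name `incident_dir` and its optional timestamp argument are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the per-incident state directory path.
# $1 is the severity (e.g. SEV2); $2 optionally overrides the UTC timestamp
# in YYYYMMDD-HHMM form and defaults to the current time.
incident_dir() {
  local sev stamp
  sev=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  stamp="${2:-$(date -u +%Y%m%d-%H%M)}"
  printf '.omao/state/incident/%s-%s\n' "$sev" "$stamp"
}

incident_dir SEV2 20260421-1235   # prints .omao/state/incident/sev2-20260421-1235
```

Computing the path once at alarm receipt and reusing it keeps the drafted remediation, diagnostics, and logs for one incident in the same directory.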



SKILL.md

When to Use

Prerequisites

Severity Classification Criteria

5-Step Response Playbook

Step 1: Receive & Classify

Step 2: Runbook Lookup

Step 3: Hypothesis Generation

Step 4: Diagnostic MCP Queries

Step 5: Remediation (Drafted Response + Human Approval)

State Management

Example Inputs/Outputs

References

Official Documentation

Technical Blogs

Related Documents (Internal)
