Analyze Prometheus Alert Command
Purpose
Investigate a Prometheus alert by fetching alert details, querying related metrics, searching the codebase for relevant code, analyzing git history, and generating a root cause analysis report.
Instructions for Claude
When this command is invoked, perform comprehensive root cause analysis for a Prometheus alert:
Step 1: Gather Alert Information
If the user provided an alert name:
- Use the prometheus MCP tool to fetch alert details
- Extract the alert expression, labels, severity, and active time
If the user provided alert JSON:
- Parse the alert JSON
- Extract key fields: alertname, expr, labels, annotations, activeAt
If no argument was provided:
- Use AskUserQuestion to ask:
  - "What is the alert name?"
  - Or "Please paste the alert JSON"
- Fetch or parse accordingly
Extract from the alert (an illustrative fetch-and-extract sketch follows this list):
- Alert name
- The PromQL expression that triggered it
- Threshold value
- When the alert started firing
- Affected labels (service, instance, job, etc.)
- Alert severity
- Alert annotations (description, summary)
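These lookups normally go through the prometheus MCP server. As a hedged illustration of the same steps against the raw Prometheus HTTP API, assuming a server at http://prometheus:9090 and the HighErrorRate alert from the example invocation below:

```bash
# Minimal sketch; prefer the prometheus MCP tools inside the command itself.
# The server URL and alert name are assumptions for illustration only.
PROM_URL="http://prometheus:9090"
ALERT="HighErrorRate"

# Active alert instances: labels, annotations, state, activeAt, value
curl -s "$PROM_URL/api/v1/alerts" |
  jq --arg a "$ALERT" '.data.alerts[] | select(.labels.alertname == $a)'

# The alerting rule carries the PromQL expression (query), duration, and annotations
curl -s "$PROM_URL/api/v1/rules?type=alert" |
  jq --arg a "$ALERT" '.data.groups[].rules[] | select(.name == $a)
                       | {query, duration, labels, annotations}'
```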
Step 2: Query Prometheus Metrics
Use prometheus MCP tools to query related metrics (a shell-level sketch follows this list):
- Execute the alert expression to see the current value:
  - Query the alert's PromQL expression
  - Check how far above the threshold the metric is
- Query the metric over time to see the pattern:
  - Use query_range for the last 2-6 hours
  - Identify whether it is a sudden spike, a gradual increase, or another pattern
- Query breakdown metrics:
  - Break down by labels (endpoint, instance, status code)
  - Identify which specific component is affected
- Query correlated metrics:
  - Error rate metrics
  - Latency metrics (p95, p99)
  - Resource usage (CPU, memory)
  - Request rate metrics
  - Dependency health metrics
- Analyze metric patterns:
  - When did the anomaly start?
  - Was it sudden or gradual?
  - Is it correlated with other metric changes?
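A minimal sketch of these queries against the Prometheus HTTP API, assuming an error-rate expression for an "api" service (the URL, metric names, and label values are illustrative assumptions; inside the command, use the prometheus MCP query tools):

```bash
PROM_URL="http://prometheus:9090"
# Assumed alert expression: 5xx rate divided by total request rate
EXPR='sum(rate(http_requests_total{service="api",code=~"5.."}[5m])) / sum(rate(http_requests_total{service="api"}[5m]))'

# Current value of the alert expression
curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$EXPR"

# Same expression over the last 6 hours at 1-minute resolution to see the pattern
# (date -d is GNU date; adjust on macOS/BSD)
curl -sG "$PROM_URL/api/v1/query_range" \
  --data-urlencode "query=$EXPR" \
  --data-urlencode "start=$(date -u -d '6 hours ago' +%s)" \
  --data-urlencode "end=$(date -u +%s)" \
  --data-urlencode "step=60"

# Breakdown by endpoint, plus a correlated p99 latency query
curl -sG "$PROM_URL/api/v1/query" --data-urlencode \
  'query=sum by (endpoint) (rate(http_requests_total{service="api",code=~"5.."}[5m]))'
curl -sG "$PROM_URL/api/v1/query" --data-urlencode \
  'query=histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{service="api"}[5m])))'
```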
Step 3: Search Codebase
Based on the alert and metrics, search the codebase for relevant files (shell equivalents are sketched after this list):
- From the alert labels, identify:
  - Service name
  - Endpoint or feature affected
  - Component mentioned in annotations
- Use Grep to search for:
  - Error messages from alert annotations
  - Service names
  - Endpoint paths
  - Metric names used in the alert
- Use Glob to find relevant files:
  - Configuration files
  - Service implementation files
  - Database connection files
  - API endpoint files
- Read the suspicious files to understand the code
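Rough shell equivalents of these searches (the error string, metric name, endpoint path, and file layout are assumptions; inside the command, prefer the Grep and Glob tools):

```bash
# Error message taken from the alert annotations (assumed wording)
grep -rn "connection pool exhausted" src/

# Metric name used in the alert expression
grep -rn "http_requests_total" src/ config/

# Affected endpoint path from the alert labels (assumed path)
grep -rn "/api/orders" src/

# Candidate configuration files worth reading (assumed project layout)
find . -type f \( -name '*.yml' -o -name '*.yaml' -o -name '*.env' \) -not -path './node_modules/*'
```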
Step 4: Analyze Git History
Use Bash to run git commands (a filled-in example follows this list):
- Determine the investigation time window:
  - Start: 6-24 hours before the alert fired
  - End: when the alert started
- List recent commits:
  `git log --since="<time-window-start>" --until="<alert-start-time>" --oneline`
- Check commits to affected files:
  `git log --since="<time-window>" -- path/to/relevant/file.js`
- For suspicious files, use git blame:
  `git blame path/to/file.js | grep "suspicious code"`
- Show commit details for recent changes:
  `git show <commit-sha>`
- Correlate commit timestamps with the alert start time
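Filled in with the assumed alert start time from the example invocation below (2025-12-19T14:32:00Z), a 12-hour lookback, and a hypothetical suspect file, the sequence might look like:

```bash
# Alert started firing at 2025-12-19T14:32:00Z; look back 12 hours (assumption)
git log --since="2025-12-19T02:32:00Z" --until="2025-12-19T14:32:00Z" --oneline

# Changes to a hypothetical suspect file in that window, with diffs
git log --since="2025-12-19T02:32:00Z" --until="2025-12-19T14:32:00Z" -p -- src/db/pool.js

# Inspect the most suspicious commit; <commit-sha> comes from the log output above
git show --stat <commit-sha>
```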
Step 5: Identify Root Cause
Synthesize all evidence to identify root cause:
- Timeline correlation:
  - When did the metric anomaly start?
  - Were there deployments/commits just before?
  - Do the timestamps align?
- Code analysis:
  - What code changes were made?
  - Do the changes affect the alerted metric?
  - Are there obvious bugs or config errors?
- Metric validation:
  - Do the metrics support the hypothesis?
  - Are there correlated metric changes?
- Apply the Five Whys:
  - Why did the alert fire? → Metric exceeded threshold
  - Why did the metric exceed the threshold? → (dig deeper)
  - Continue until reaching the root cause
- Identify the specific root cause:
  - File: exact file path
  - Line: specific line or section
  - Commit: SHA and author
  - Change: what was changed
  - Why: why it caused the alert
Step 6: Generate RCA Report
Use the Write tool to create the RCA report (a skeleton sketch follows this list):
- Create the report file: rca-reports/prometheus-<alert-name>-<date>.md
- Use the RCA report template from the root-cause-analysis skill
- Include:
  - Summary: brief overview of the alert and root cause
  - Alert Details: name, expression, threshold, labels
  - Timeline: when the alert fired, when the anomaly started, recent commits
  - Metrics Evidence: graphs or data showing the metric pattern
  - Root Cause: the specific code or config change, with commit SHA
  - Evidence: git commit details, code diffs, metric correlations
  - Suggested Fix: how to resolve the issue
  - Prevention: how to prevent recurrence
- Output a summary to the console for the user
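A minimal skeleton of the report file, using the HighErrorRate example and its startsAt date; the actual section content and wording should come from the root-cause-analysis skill's RCA template:

```bash
# Illustrative skeleton only; alert name and date are taken from the example below.
mkdir -p rca-reports
cat > "rca-reports/prometheus-HighErrorRate-2025-12-19.md" <<'EOF'
# RCA: HighErrorRate (2025-12-19)

## Summary
## Alert Details
## Timeline
## Metrics Evidence
## Root Cause
## Evidence
## Suggested Fix
## Prevention
EOF
```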
Step 7: Suggest Next Steps
After generating the report, suggest:
- Immediate mitigation (revert the commit, adjust config, scale resources); a revert-and-verify sketch follows this list
- Validation steps (test the fix in staging, verify metrics recover)
- Prevention measures (add tests, improve monitoring, update the code review checklist)
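For the common revert-and-verify path, a hedged sketch (the commit SHA is whatever Step 4 implicated; the verification query reuses the assumed $PROM_URL and $EXPR variables from the earlier sketches):

```bash
# Revert the implicated commit so it can be reviewed and redeployed
git revert --no-edit <commit-sha>

# After the fix is deployed, re-run the alert expression and confirm the
# value drops back below the alert threshold
curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$EXPR"
```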
Example Invocations
With alert name:
```
/rca:analyze-prometheus HighErrorRate
```
With alert JSON:
```
/rca:analyze-prometheus {
  "labels": {"alertname": "HighErrorRate", "service": "api"},
  "annotations": {"summary": "Error rate above 5%"},
  "startsAt": "2025-12-19T14:32:00Z"
}
```
Interactive (no arguments):
```
/rca:analyze-prometheus
```
→ Prompts the user for alert details
Key Considerations
- Load skills: Activate prometheus-analysis, root-cause-analysis, and git-investigation skills as needed
- Use MCP tools: Leverage prometheus MCP server for metrics
- Correlate timestamps: Always align git commits with alert start time
- Be specific: Root cause should point to exact file/line/commit
- Evidence-based: Support conclusions with metrics and code
- Actionable: Provide clear fix and prevention steps
Output
- Comprehensive RCA report saved to the rca-reports/ directory
- Console summary of findings
- Clear root cause identification
- Suggested remediation steps
Perform systematic, thorough root cause analysis that transforms a Prometheus alert into actionable insights with clear evidence and specific fixes.