Manages Rootly incident operations using CLI and API including searching incidents, analyzing service criticality, and discovering resolution patterns. Use when analyzing incidents, assessing service tiers, or mining historical incident data.
npx claudepluginhub andercore-labs/claudes-kitchen --plugin operational-excellenceThis skill uses the workspace's default tool permissions.
**SCOPE:** Incident management and analysis using Rootly CLI and API.
Conducts multi-round deep research on GitHub repos via API and web searches, generating markdown reports with executive summaries, timelines, metrics, and Mermaid diagrams.
Dynamically discovers and combines enabled skills into cohesive, unexpected delightful experiences like interactive HTML or themed artifacts. Activates on 'surprise me', inspiration, or boredom cues.
Generates images from structured JSON prompts via Python script execution. Supports reference images and aspect ratios for characters, scenes, products, visuals.
SCOPE: Incident management and analysis using Rootly CLI and API.
PHILOSOPHY: "Learn from past incidents. Similar problems often have similar solutions."
Rootly CLI:
├── rootly pulse (send pulse notifications)
└── rootly pulse-run (trigger pulse workflows)
Rootly API:
├── /v1/incidents (search, filter, analyze)
├── /v1/services (service metadata)
├── /v1/teams (team ownership)
└── /v1/severities (severity definitions)
Common Patterns:
├── Service incident history
├── Severity distribution analysis
├── MTTR calculation
└── Service criticality assessment
SLO tier determination | Incident analysis | Service reliability assessment | Historical pattern discovery | MTTR analysis | Criticality scoring
Install Rootly CLI and dependencies:
# Run installation script
./plugins/operational-excellence/scripts/install-rootly.sh
# Verify installation
rootly --version
jq --version
curl --version
Environment variables:
export ROOTLY_API_KEY="your-api-key" # Get from https://rootly.com/account/api-keys
# Add to shell profile for persistence
echo 'export ROOTLY_API_KEY="your-api-key"' >> ~/.zshrc
Verify API access:
curl -s -H "Authorization: Bearer $ROOTLY_API_KEY" \
https://api.rootly.com/v1/incidents?page[size]=1 | jq .
Find incidents by service or query:
# Search by service
curl -s -H "Authorization: Bearer $ROOTLY_API_KEY" \
"https://api.rootly.com/v1/incidents?filter[services]=payment-service&page[size]=10" | \
jq '.data[] | {
id: .id,
title: .attributes.title,
severity: .attributes.severity,
status: .attributes.status,
started_at: .attributes.started_at,
resolved_at: .attributes.resolved_at
}'
# Search by keyword
curl -s -H "Authorization: Bearer $ROOTLY_API_KEY" \
"https://api.rootly.com/v1/incidents?filter[search]=database+timeout&page[size]=5" | \
jq '.data[] | {id, title: .attributes.title, summary: .attributes.summary}'
# Filter by status
curl -s -H "Authorization: Bearer $ROOTLY_API_KEY" \
"https://api.rootly.com/v1/incidents?filter[status]=resolved&page[size]=20" | jq .
# Filter by severity
curl -s -H "Authorization: Bearer $ROOTLY_API_KEY" \
"https://api.rootly.com/v1/incidents?filter[severity]=critical&page[size]=10" | jq .
# Date range filtering
curl -s -H "Authorization: Bearer $ROOTLY_API_KEY" \
"https://api.rootly.com/v1/incidents?filter[started_at_gte]=2025-01-01T00:00:00Z&page[size]=50" | jq .
Response structure:
{
"data": [
{
"id": "incident-123",
"type": "incidents",
"attributes": {
"title": "Payment API high latency",
"summary": "p99 latency exceeded 2s",
"severity": "high",
"status": "resolved",
"started_at": "2025-01-15T14:30:00Z",
"resolved_at": "2025-01-15T15:45:00Z"
}
}
],
"meta": {
"total_count": 42,
"page": 1
}
}
Fetch all incidents across multiple pages:
#!/bin/bash
# Fetch all incidents for a service (handles pagination)
SERVICE="payment-service"
PAGE=1
ALL_INCIDENTS="[]"
while true; do
RESPONSE=$(curl -s -H "Authorization: Bearer $ROOTLY_API_KEY" \
"https://api.rootly.com/v1/incidents?filter[services]=$SERVICE&page[number]=$PAGE&page[size]=20")
INCIDENTS=$(echo "$RESPONSE" | jq '.data')
TOTAL=$(echo "$RESPONSE" | jq '.meta.total_count')
if [ "$(echo "$INCIDENTS" | jq 'length')" -eq 0 ]; then
break
fi
ALL_INCIDENTS=$(echo "$ALL_INCIDENTS" | jq ". + $INCIDENTS")
PAGE=$((PAGE + 1))
done
echo "$ALL_INCIDENTS" | jq 'length' # Total incidents fetched
Determine service tier from incident history:
#!/bin/bash
# Assess service criticality based on incident severity distribution
SERVICE="my-service"
# Fetch 90 days of incidents
INCIDENTS=$(curl -s -H "Authorization: Bearer $ROOTLY_API_KEY" \
"https://api.rootly.com/v1/incidents?filter[services]=$SERVICE&filter[started_at_gte]=$(date -u -v-90d +%Y-%m-%dT%H:%M:%SZ)&page[size]=100")
# Count by severity
TOTAL=$(echo "$INCIDENTS" | jq '.data | length')
CRITICAL=$(echo "$INCIDENTS" | jq '[.data[] | select(.attributes.severity == "critical")] | length')
HIGH=$(echo "$INCIDENTS" | jq '[.data[] | select(.attributes.severity == "high")] | length')
# Calculate criticality percentage
CRITICALITY=$(echo "scale=2; ($CRITICAL + $HIGH) * 100 / $TOTAL" | bc)
# Determine tier
if (( $(echo "$CRITICALITY >= 70" | bc -l) )); then
TIER=1
SLO="99.9%"
elif (( $(echo "$CRITICALITY >= 40" | bc -l) )); then
TIER=2
SLO="99.5%"
else
TIER=3
SLO="99.0%"
fi
echo "Service: $SERVICE"
echo "Total incidents (90d): $TOTAL"
echo "Critical + High: $(($CRITICAL + $HIGH)) (${CRITICALITY}%)"
echo "Recommended Tier: $TIER (${SLO} uptime SLO)"
Criticality thresholds:
| Criticality % | Tier | Uptime SLO | Service Type |
|---|---|---|---|
| ≥ 70% | 1 | 99.9% | Critical (payment, auth, core API) |
| 40-69% | 2 | 99.5% | Important (notifications, reporting) |
| < 40% | 3 | 99.0% | Non-critical (admin tools, analytics) |
Calculate mean time to recovery:
# Calculate MTTR for resolved incidents
curl -s -H "Authorization: Bearer $ROOTLY_API_KEY" \
"https://api.rootly.com/v1/incidents?filter[services]=my-service&filter[status]=resolved&page[size]=50" | \
jq -r '.data[] |
select(.attributes.started_at != null and .attributes.resolved_at != null) |
{
id: .id,
started: .attributes.started_at,
resolved: .attributes.resolved_at,
duration: (((.attributes.resolved_at | fromdateiso8601) - (.attributes.started_at | fromdateiso8601)) / 60)
}' | \
jq -s 'add / length' | \
awk '{printf "%.2f minutes\n", $1}'
MTTR by severity:
# Critical incidents MTTR
curl -s -H "Authorization: Bearer $ROOTLY_API_KEY" \
"https://api.rootly.com/v1/incidents?filter[severity]=critical&filter[status]=resolved&page[size]=50" | \
jq -r '.data[] |
{duration: (((.attributes.resolved_at | fromdateiso8601) - (.attributes.started_at | fromdateiso8601)) / 60)}' | \
jq -s 'map(.duration) | add / length' | \
awk '{printf "Critical MTTR: %.2f minutes\n", $1}'
Notify about deployments or events:
# Send deployment pulse
rootly pulse --api-key "$ROOTLY_API_KEY" \
--environments prod \
"Deployed payment-service v2.3.4 to production"
# Send custom pulse
rootly pulse --api-key "$ROOTLY_API_KEY" \
--services payment-service \
--environments prod \
"Database migration completed successfully"
# Trigger pulse workflow
rootly pulse-run --api-key "$ROOTLY_API_KEY" \
--workflow-id workflow-123 \
--payload '{"service": "payment-service", "version": "v2.3.4"}'
Workflow: Use Rootly data to inform SLO targets:
#!/bin/bash
# SLO validation workflow using Rootly incident data
SERVICE="payment-api"
# 1. Fetch 90-day incident history
echo "📊 Fetching incident history..."
INCIDENTS=$(curl -s -H "Authorization: Bearer $ROOTLY_API_KEY" \
"https://api.rootly.com/v1/incidents?filter[services]=$SERVICE&filter[started_at_gte]=$(date -u -v-90d +%Y-%m-%dT%H:%M:%SZ)&page[size]=100")
# 2. Calculate criticality
TOTAL=$(echo "$INCIDENTS" | jq '.data | length')
CRITICAL=$(echo "$INCIDENTS" | jq '[.data[] | select(.attributes.severity == "critical")] | length')
HIGH=$(echo "$INCIDENTS" | jq '[.data[] | select(.attributes.severity == "high")] | length')
CRITICALITY=$(echo "scale=2; ($CRITICAL + $HIGH) * 100 / $TOTAL" | bc)
# 3. Determine tier
if (( $(echo "$CRITICALITY >= 70" | bc -l) )); then
TIER=1; SLO="99.9"
elif (( $(echo "$CRITICALITY >= 40" | bc -l) )); then
TIER=2; SLO="99.5"
else
TIER=3; SLO="99.0"
fi
# 4. Generate baseline-monitoring config
cat <<EOF
baseline-monitoring:
slos:
uptime:
enabled: true
target: "$SLO"
timeframe: "7d"
description: "Tier $TIER service - ${CRITICALITY}% critical/high incidents (90d)"
global:
mandatory:
tier: $TIER
app: "$SERVICE"
EOF
Pattern matching for incident correlation:
#!/bin/bash
# Find incidents with similar descriptions
SEARCH_TERM="database timeout"
# Search incidents
curl -s -H "Authorization: Bearer $ROOTLY_API_KEY" \
"https://api.rootly.com/v1/incidents?filter[search]=$SEARCH_TERM&page[size]=10" | \
jq '.data[] | {
id: .id,
title: .attributes.title,
severity: .attributes.severity,
status: .attributes.status,
summary: .attributes.summary
}'
# Filter by resolved for solution mining
curl -s -H "Authorization: Bearer $ROOTLY_API_KEY" \
"https://api.rootly.com/v1/incidents?filter[search]=$SEARCH_TERM&filter[status]=resolved&page[size]=5" | \
jq '.data[] | {
title: .attributes.title,
summary: .attributes.summary,
resolution: .attributes.resolution_summary,
mttr_minutes: (((.attributes.resolved_at | fromdateiso8601) - (.attributes.started_at | fromdateiso8601)) / 60)
}'
#!/bin/bash
# Generate service reliability report
SERVICE="my-service"
# Fetch incidents
INCIDENTS=$(curl -s -H "Authorization: Bearer $ROOTLY_API_KEY" \
"https://api.rootly.com/v1/incidents?filter[services]=$SERVICE&filter[started_at_gte]=$(date -u -v-90d +%Y-%m-%dT%H:%M:%SZ)&page[size]=200")
# Calculate metrics
TOTAL=$(echo "$INCIDENTS" | jq '.data | length')
CRITICAL=$(echo "$INCIDENTS" | jq '[.data[] | select(.attributes.severity == "critical")] | length')
HIGH=$(echo "$INCIDENTS" | jq '[.data[] | select(.attributes.severity == "high")] | length')
RESOLVED=$(echo "$INCIDENTS" | jq '[.data[] | select(.attributes.status == "resolved")] | length')
# Calculate average MTTR
AVG_MTTR=$(echo "$INCIDENTS" | jq -r '.data[] |
select(.attributes.status == "resolved" and .attributes.started_at != null and .attributes.resolved_at != null) |
(((.attributes.resolved_at | fromdateiso8601) - (.attributes.started_at | fromdateiso8601)) / 60)' | \
jq -s 'add / length')
# Report
cat <<EOF
=== Service Reliability Report ===
Service: $SERVICE
Period: Last 90 days
Total incidents: $TOTAL
├── Critical: $CRITICAL
├── High: $HIGH
├── Resolved: $RESOLVED
└── Avg MTTR: $(printf "%.2f" $AVG_MTTR) minutes
Criticality: $(echo "scale=2; ($CRITICAL + $HIGH) * 100 / $TOTAL" | bc)%
EOF
# Last 7 days incident trend
curl -s -H "Authorization: Bearer $ROOTLY_API_KEY" \
"https://api.rootly.com/v1/incidents?filter[started_at_gte]=$(date -u -v-7d +%Y-%m-%dT%H:%M:%SZ)&page[size]=100" | \
jq -r '.data[] |
[.attributes.started_at[0:10], .attributes.severity, .attributes.services[0].name] |
@tsv' | \
sort | uniq -c | \
awk '{print $1, $2, $3, $4}'
# Find all incidents affecting multiple services (cascade failures)
curl -s -H "Authorization: Bearer $ROOTLY_API_KEY" \
"https://api.rootly.com/v1/incidents?filter[started_at_gte]=$(date -u -v-30d +%Y-%m-%dT%H:%M:%SZ)&page[size]=50" | \
jq '.data[] | select(.attributes.services | length > 1) | {
title: .attributes.title,
services: [.attributes.services[].name],
severity: .attributes.severity
}'
When no incidents found:
INCIDENTS=$(curl -s -H "Authorization: Bearer $ROOTLY_API_KEY" \
"https://api.rootly.com/v1/incidents?filter[services]=service-name&page[size]=100")
TOTAL=$(echo "$INCIDENTS" | jq '.data | length')
if [ "$TOTAL" -eq 0 ]; then
echo "Zero incidents found - either highly reliable or not yet deployed"
echo "Recommendation: Use Datadog metrics + service type to determine tier"
echo ""
echo "Default tier selection:"
echo " - Real-time services (audio/video) → Tier 1 (conservative)"
echo " - Standard APIs → Tier 2 (conservative)"
echo " - Batch/worker services → Tier 3"
fi
Common API errors:
# Test API connectivity
RESPONSE=$(curl -s -w "\n%{http_code}" -H "Authorization: Bearer $ROOTLY_API_KEY" \
https://api.rootly.com/v1/incidents?page[size]=1)
HTTP_CODE=$(echo "$RESPONSE" | tail -n1)
case $HTTP_CODE in
200)
echo "✅ API access successful"
;;
401)
echo "❌ Unauthorized: Check ROOTLY_API_KEY"
;;
404)
echo "❌ Not found: Check endpoint URL"
;;
429)
echo "⚠️ Rate limit exceeded: Wait before retrying"
;;
*)
echo "❌ Error: HTTP $HTTP_CODE"
;;
esac
# Issue: Empty results
curl -s -H "Authorization: Bearer $ROOTLY_API_KEY" \
"https://api.rootly.com/v1/incidents?filter[services]=my-service" | jq '.data | length'
# → 0
# Troubleshooting steps:
# 1. Verify service name (case-sensitive)
curl -s -H "Authorization: Bearer $ROOTLY_API_KEY" \
"https://api.rootly.com/v1/services" | jq '.data[] | .attributes.name'
# 2. Try broader search
curl -s -H "Authorization: Bearer $ROOTLY_API_KEY" \
"https://api.rootly.com/v1/incidents?filter[search]=my-service"
# 3. Check date range
curl -s -H "Authorization: Bearer $ROOTLY_API_KEY" \
"https://api.rootly.com/v1/incidents?page[size]=1" | jq '.data[0].attributes.started_at'
# Issue: jq parse error
# Fix: Add error handling
RESPONSE=$(curl -s -H "Authorization: Bearer $ROOTLY_API_KEY" \
"https://api.rootly.com/v1/incidents?page[size]=5")
if echo "$RESPONSE" | jq empty 2>/dev/null; then
echo "$RESPONSE" | jq .
else
echo "❌ Invalid JSON response:"
echo "$RESPONSE"
fi
Rootly API: https://docs.rootly.com/api-reference
Rootly CLI: https://github.com/rootlyhq/cli
Installation: Run ./plugins/operational-excellence/scripts/install-rootly.sh
SLO Integration: See skill:operational-excellence:slo-management-recipe