From ai-toolkit
Investigates monitoring alerts end-to-end by pulling metrics, logs, traces, and recent code changes to identify root causes. For on-call engineers handling alerts via Datadog, Grafana, or PagerDuty MCPs.
npx claudepluginhub c0x12c/ai-toolkit --plugin ai-toolkit

This skill uses the workspace's default tool permissions.
Investigate a monitoring alert by pulling metrics, logs, traces, and related service code. Symptoms in, root cause hypothesis out.
Check which monitoring MCP servers are available. Look for any mcp__* tools related to monitoring platforms (Datadog, Grafana, PagerDuty, etc.).
Recommended: Datadog MCP — provides the richest investigation surface (monitors, metrics, logs, traces, events in one platform).
If no monitoring MCP is available, stop with:
Error: No monitoring MCP server found. This skill requires a monitoring MCP to query alert data. Recommended: add the Datadog MCP to your Claude Code MCP settings.
Also check for optional tools:
- GitHub CLI (gh): for reading related service code and recent deploys

Note which tools are available and adapt the investigation accordingly.
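The MCP check can also be done from the shell before the session starts; a minimal sketch, assuming the Claude Code CLI is installed (claude mcp list is its command for listing configured MCP servers):

```shell
# List configured MCP servers if the Claude Code CLI is on PATH;
# otherwise fall back to a reminder to check the config manually.
if command -v claude >/dev/null 2>&1; then
  claude mcp list
else
  echo "claude CLI not found; inspect .mcp.json or MCP settings manually"
fi
```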
Identify the alert from the user's input:
- If given a monitoring platform URL: extract the monitor or alert ID from the URL.
- If given an alert name or description: search monitors by name to find the matching one.
Retrieve the monitor configuration and current state:
Fetch the metric(s) that triggered the alert:
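When pulling metrics and logs, bracket the trigger time rather than querying open-ended ranges. A minimal sketch of computing a lookback window, assuming GNU date and a hypothetical trigger timestamp:

```shell
# Bracket the alert: look 30 minutes back and 10 minutes forward
# from the trigger time (hypothetical timestamp; GNU date syntax).
TRIGGER="2024-05-01T12:30:00Z"
FROM=$(date -u -d "$TRIGGER - 30 minutes" +%Y-%m-%dT%H:%M:%SZ)
TO=$(date -u -d "$TRIGGER + 10 minutes" +%Y-%m-%dT%H:%M:%SZ)
echo "query window: $FROM to $TO"
```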
Search logs for the affected service and environment:
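For Datadog-backed setups, log searches use Datadog's log search syntax; an illustrative query (service and env names are placeholders):

```
service:<service> env:<env> status:error
```

Facet filters (service:, env:, status:) can be combined with quoted free-text terms, e.g. "OutOfMemoryError", to narrow results to the affected service within the alert window.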
Search for distributed traces:
If Kubernetes MCP or cloud CLI is available:
If not available (VPN, permissions), note it and continue with available data.
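If a cloud CLI rather than a Kubernetes MCP is available, the infrastructure checks can be sketched with standard kubectl commands (namespace and pod names are placeholders):

```shell
kubectl get pods -n <namespace> -o wide        # restarts, status, node placement
kubectl top pods -n <namespace>                # CPU/memory pressure (needs metrics-server)
kubectl describe pod <pod> -n <namespace>      # recent events: OOMKilled, failed probes, scheduling
kubectl logs <pod> -n <namespace> --since=30m  # container logs around the trigger time
```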
If gh is available, verify authentication first:

gh auth status
If authenticated:
gh api repos/<org>/<service>/tags --jq '.[0:3] | .[] | {name: .name, sha: .commit.sha}'
gh api repos/<org>/<service>/compare/<prev-tag>...<latest-tag> --jq '.commits[] | {sha: .sha[:7], message: .commit.message, author: .commit.author.name}'
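To narrow the diff to commits that could plausibly have caused the alert, the compare output can be filtered by commit date against the trigger time; an illustrative variant of the command above (repo, tags, and the cutoff timestamp are placeholders):

```shell
gh api repos/<org>/<service>/compare/<prev-tag>...<latest-tag> \
  --jq '.commits[] | select(.commit.author.date >= "<trigger-minus-1h>")
        | {sha: .sha[:7], date: .commit.author.date, message: .commit.message}'
```

ISO 8601 timestamps compare correctly as strings, so a plain >= in jq is enough here.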
NEVER create, push, or modify tags.
## Alert Investigation: <Alert Name>
**Status:** <OK / Alert / Warn / No Data>
**Service:** <service> | **Env:** <env>
**Triggered:** <timestamp> | **Duration:** <duration or "Ongoing">
### Metrics
<key observations — spike at X time, value Y vs threshold Z>
### Logs
<key log lines or patterns — N errors of type X, stack trace summary>
### Traces
<latency or error observations — if available>
### Infrastructure
<pod status, resource usage — if available>
### Recent Code Changes
<commits near trigger time, or "No recent changes" or "gh CLI not available">
### Root Cause Hypothesis
<best assessment based on available data — be explicit about confidence level>
### Recommended Next Steps
1. <most impactful action>
2. <secondary action>
3. <what to check if hypothesis is wrong>
If data is inconclusive, say so explicitly and suggest what to check manually (e.g., VPN access to k8s, direct DB query, checking with the team).
Present the investigation summary inline in the conversation. No file output unless the user asks to save it.