# sdlc

Query the Datadog API for alerting/warning monitors, investigate each one, then fix real bugs or tune noisy alarms. Filters by environment (default: production).

Install the plugin with:

```sh
npx claudepluginhub jerrod/agent-plugins --plugin sdlc
```
Query the Datadog API for monitors in alert, warn, or no-data state, investigate each one, and either fix the underlying issue or tune the alarm to reduce noise.
The following environment variables must be set (typically in ~/.zshrc):
- `DD_API_KEY` — Datadog API key
- `DD_APP_KEY` — Datadog Application key

The Datadog site is `api.datadoghq.com` (US1). If the org moves to a different site, update the `DD_SITE` variable below.
Parse the user's argument: $ARGUMENTS
The argument is an environment filter: `production` (the default), `staging`, `development`, or `all`. If `all`, do not filter by environment.

Use the Datadog Monitor Search API to find monitors that are alerting, warning, or in no-data state.
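The argument handling can be sketched in shell. This is a minimal sketch; the `ARGUMENTS` variable name and the `ENV`/`QUERY` outputs are assumptions, not part of the skill's actual interface:

```sh
# Sketch: normalize the environment argument.
# ARGUMENTS is assumed to hold the raw user argument; default is production.
ARG="${ARGUMENTS:-production}"
case "$ARG" in
  production|staging|development)
    ENV="$ARG"
    QUERY="env:${ENV}"   # filter by this environment
    ;;
  all)
    ENV="all"
    QUERY=""             # "all": no env filter
    ;;
  *)
    echo "unknown environment: $ARG" >&2
    exit 1
    ;;
esac
```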
```sh
# Set site (default US1)
DD_SITE="${DD_SITE:-api.datadoghq.com}"

# Search monitors for a specific environment (status is checked per monitor below)
curl -s "https://$DD_SITE/api/v1/monitor/search?query=env:${ENV}" \
  -H "DD-API-KEY: $DD_API_KEY" \
  -H "DD-APPLICATION-KEY: $DD_APP_KEY"

# For "all" environments, omit the env filter
curl -s "https://$DD_SITE/api/v1/monitor/search" \
  -H "DD-API-KEY: $DD_API_KEY" \
  -H "DD-APPLICATION-KEY: $DD_APP_KEY"
```
For each monitor returned, check the `status` field. Focus on monitors that are not OK:

- **Alert** — actively firing
- **Warn** — warning threshold breached
- **No Data** — monitor is not receiving expected metrics

For monitors in Alert or Warn status, fetch full details including group states:
```sh
curl -s "https://$DD_SITE/api/v1/monitor/${MONITOR_ID}?group_states=all" \
  -H "DD-API-KEY: $DD_API_KEY" \
  -H "DD-APPLICATION-KEY: $DD_APP_KEY"
```
From each monitor, extract the fields needed for triage: name, service, environment, status, and the monitor query.
Group monitors by priority: Alert first, then Warn, then No Data.
Present a numbered summary table to the user:
| # | Monitor | Service | Env | Status | Assessment |
|---|---------|---------|-----|--------|------------|
| 1 | Error Count | arqu-web | prod | Alert | Investigate |
| 2 | CPU Usage | core-api | prod | Warn | Tune candidate |
| 3 | API Latency | core-api | prod | No Data | Fix or silence |
Ask the user which alerts to investigate, or proceed with all Alert + Warn monitors if they say "all".
For each alert being investigated, run these diagnostic steps in parallel where possible:
The alarm definitions live in this repo at:
```
infra/devops/terraform/modules/monitoring/alarms/{service_name}/
```
Service name mapping:

- `arqu-web` → `arqu_web/`
- `core-api` → `core_api/`
- `config-api` → `config_api/`
- `eventproc` → `eventproc/`
- `redis-cache` → `redis_cache/`
- `doubtfire-client` → `doubtfire_client/`
- system → `system/`

Read the relevant `.tf` file(s) to understand:
- the query and any exclusion filters (e.g. `-@error.message:*...`)
- the evaluation window (e.g. `last_5m`, `last_15m`)

Also read `locals.tf` and `variables.tf` if present for environment-specific overrides.
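The hyphen-to-underscore mapping of service names above can be captured in a small helper. The path comes from this doc; the function name is invented for the sketch:

```sh
# Map a Datadog service tag (hyphenated) to its alarm module directory (underscored)
alarm_dir() {
  printf 'infra/devops/terraform/modules/monitoring/alarms/%s/\n' \
    "$(printf '%s' "$1" | tr '-' '_')"
}
```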
For monitors in Alert/Warn, fetch recent metric data to understand the trend:
```sh
# Get the monitor's recent state changes
curl -s "https://$DD_SITE/api/v1/monitor/${MONITOR_ID}?group_states=all" \
  -H "DD-API-KEY: $DD_API_KEY" \
  -H "DD-APPLICATION-KEY: $DD_APP_KEY"
```
Check `state.groups` in the response to see:

- which groups are currently triggering
- when each group last triggered (`last_triggered_ts`)

Based on the alert type, search the relevant application codebases:
- Use Grep and Glob to find the source of the error.
- Check recent changes (`git log --oneline -20` in the service repo).

For each alert, classify it as one of:

- **Real bug** — a genuine bug or regression; fix it
- **Noisy alarm** — firing on expected or benign behavior; tune it
- **Transient** — fired once and recovered; no action needed
- **No Data** — the monitor is not receiving metrics
If the alert indicates a genuine bug or regression, fix the underlying issue in the relevant codebase.
If the alert is firing on expected/benign behavior:
Common tuning actions (edit the .tf file):
- Add `-@error.message:*pattern*` exclusions for known benign errors
- Widen the evaluation window (e.g. `last_5m` to `last_15m`) to smooth out spikes
- Set `criticalRecovery` / `warningRecovery` thresholds to prevent flapping

Example — adding an exclusion to a RUM error count monitor:
```yaml
# Before
query: "rum(\"service:arqu-web @type:error env:production\").rollup(\"count\").last(\"5m\") > 50"

# After — exclude a known benign error
query: "rum(\"service:arqu-web @type:error -@error.message:*ResizeObserver\\ loop\\ completed* env:production\").rollup(\"count\").last(\"5m\") > 50"
```
IMPORTANT: When editing RUM queries, escape spaces with `\\ ` (double backslash + space) inside the query string. Wildcards use `*`.
If a monitor is in "No Data" state:

- Check whether the service is still emitting the expected metric (it may have been renamed, moved, or decommissioned).
- If the gap is expected, consider setting `notify_no_data: false` or muting the monitor.

If the alert triggered once and recovered, and diagnostics show the system is healthy:
- Consider adjusting `renotifyInterval` or a minimum duration to prevent noise.

For each alarm tune or code fix:
When tuning alarms, follow the existing patterns in the .tf files:
- Keep the existing `kubectl_manifest` resource structure
- Change only `query` or `options.thresholds`, or add exclusion filters

After processing all alerts, present a summary:
## Datadog Triage Summary
### Fixes Applied
- [service] Description of fix (file changed)
### Alarms Tuned
- [monitor-name] What was changed and why (file changed)
### No Action Needed
- [monitor-name] Transient / already recovered
### No Data Monitors
- [monitor-name] Status and recommended action
### Recommended Follow-ups
- Any longer-term suggestions (e.g., "consider splitting this monitor", "add SLO")
- Preserve notification handles in monitor messages (e.g. `@slack-datadog-{env}`, `@pagerduty-{service}`).
- Read `DD_API_KEY` and `DD_APP_KEY` from the environment — never hardcode these.
- Monitor permalinks follow `https://app.datadoghq.com/monitors/{monitor_id}` for direct access.