Investigate production issues by querying Datadog logs, metrics, and APM traces, then correlating findings with the codebase. Useful for debugging errors, latency spikes, and alerts in deployed services.

npx claudepluginhub clipboardhealth/core-utils --plugin core

This skill uses the workspace's default tool permissions.
Requires the Datadog CLI (dog) installed and configured via ~/.dogrc with apikey and appkey.

Every Datadog API call needs authentication. Extract credentials once and reuse them to keep commands readable:
DD_API_KEY=$(grep apikey ~/.dogrc | cut -d= -f2 | tr -d ' ')
DD_APP_KEY=$(grep appkey ~/.dogrc | cut -d= -f2 | tr -d ' ')
Use these variables in all subsequent curl calls. If a shell session is lost, re-extract them.
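To confirm the keys work before investigating, hit the key-validation endpoint once (this assumes the default datadoghq.com site; regional orgs such as EU use a different API domain):

curl -s "https://api.datadoghq.com/api/v1/validate" \
  -H "DD-API-KEY: $DD_API_KEY" | jq '.valid'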
Filter by env:production unless the user specifies otherwise. Production is the default because that's where real user-impacting issues live — staging and dev issues rarely warrant this investigation workflow.
Use Node.js for portable timestamp calculations (works on macOS and Linux):
node -e "console.log(Math.floor(Date.now()/1000))" # now
node -e "console.log(Math.floor(Date.now()/1000) - 3600)" # 1 hour ago
node -e "console.log(Math.floor(Date.now()/1000) - 86400)" # 24 hours ago
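Capturing these once per session keeps later commands short (a small convenience sketch; the variable names are illustrative):

NOW=$(node -e "console.log(Math.floor(Date.now()/1000))")
HOUR_AGO=$((NOW - 3600))
DAY_AGO=$((NOW - 86400))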
When a user reports an issue, follow this flow. The goal is to move from symptoms to root cause to fix as quickly as possible.
Clarify the problem — Get service name, time range, error messages, or trace IDs. If the user is vague, start with the last hour of errors for their service.
Query logs first — Logs are the richest signal. Look for error patterns, stack traces, and trace IDs.
Correlate with traces — Use trace IDs from logs to get the full request lifecycle. This reveals which downstream service or operation actually failed (see the sketch after this list).
Check metrics — Look for error rate spikes, latency increases, or resource exhaustion that coincide with the issue timeframe.
Find the code — Use error messages, stack traces, and endpoint paths to locate the relevant code. Use Serena's symbolic tools (find_symbol, search_for_pattern) rather than grep — they understand code structure and give better results.
Propose a fix — After understanding the root cause, suggest targeted code changes.
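A minimal sketch of the logs-to-traces pivot (steps 2 and 3), assuming a service named my-service. The trace-ID attribute path varies by log pipeline; dd.trace_id shown here is common but not guaranteed:

# Grab the most recent error log for the service
latest_error=$(curl -s -X POST "https://api.datadoghq.com/api/v2/logs/events/search" \
  -H "Content-Type: application/json" \
  -H "DD-API-KEY: $DD_API_KEY" \
  -H "DD-APPLICATION-KEY: $DD_APP_KEY" \
  -d '{"filter": {"query": "service:my-service status:error env:production", "from": "now-1h", "to": "now"}, "sort": "-timestamp", "page": {"limit": 1}}')
# Extract the trace ID (attribute path is an assumption; adjust to your pipeline)
trace_id=$(echo "$latest_error" | jq -r '.data[0].attributes.attributes.dd.trace_id // empty')
# Feed the ID into the trace search below to see the full request lifecycle
[ -n "$trace_id" ] && echo "Next query: trace_id:$trace_id env:production"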
Use the Logs Search API. Default to the last 1 hour if the user doesn't specify a time range.
curl -s -X POST "https://api.datadoghq.com/api/v2/logs/events/search" \
-H "Content-Type: application/json" \
-H "DD-API-KEY: $DD_API_KEY" \
-H "DD-APPLICATION-KEY: $DD_APP_KEY" \
-d '{
"filter": {
"query": "service:SERVICE_NAME status:error env:production",
"from": "now-1h",
"to": "now"
},
"sort": "-timestamp",
"page": { "limit": 50 }
}' | jq '.data[] | {timestamp: .attributes.timestamp, message: .attributes.message, status: .attributes.status, service: .attributes.service}'
Common query patterns:

service:my-service status:error env:production
trace_id:123456789 env:production
service:my-service "NullPointerException" env:production
service:my-service host:ip-10-0-1-123 env:production
service:my-service status:error env:production @http.status_code:500
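When individual log lines are too noisy, grouping shows which errors dominate. A sketch using the logs aggregate endpoint; the @error.kind facet is an assumption, so substitute a facet your pipeline actually populates:

curl -s -X POST "https://api.datadoghq.com/api/v2/logs/analytics/aggregate" \
  -H "Content-Type: application/json" \
  -H "DD-API-KEY: $DD_API_KEY" \
  -H "DD-APPLICATION-KEY: $DD_APP_KEY" \
  -d '{
  "filter": {"query": "service:my-service status:error env:production", "from": "now-1h", "to": "now"},
  "compute": [{"aggregation": "count"}],
  "group_by": [{"facet": "@error.kind", "limit": 10}]
}' | jq '.data.buckets[] | {kind: .by["@error.kind"], count: .computes.c0}'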
Time ranges accept relative values (now-15m, now-1h, now-24h, now-7d) or absolute ISO 8601 timestamps (2024-01-15T10:00:00Z).

API responses are paginated. Extract the cursor from the response to fetch more:
response=$(curl -s -X POST "https://api.datadoghq.com/api/v2/logs/events/search" \
-H "Content-Type: application/json" \
-H "DD-API-KEY: $DD_API_KEY" \
-H "DD-APPLICATION-KEY: $DD_APP_KEY" \
-d '{"filter": {"query": "service:my-service env:production", "from": "now-1h", "to": "now"}, "page": {"limit": 50}}')
cursor=$(echo "$response" | jq -r '.meta.page.after // empty')
if [ -n "$cursor" ]; then
curl -s -X POST "https://api.datadoghq.com/api/v2/logs/events/search" \
-H "Content-Type: application/json" \
-H "DD-API-KEY: $DD_API_KEY" \
-H "DD-APPLICATION-KEY: $DD_APP_KEY" \
-d '{"filter": {"query": "service:my-service env:production", "from": "now-1h", "to": "now"}, "page": {"limit": 50, "cursor": "'"$cursor"'"}}'
fi
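To walk every page rather than just the second, loop until no cursor remains (a sketch; jq rewrites the request body between iterations):

body='{"filter": {"query": "service:my-service env:production", "from": "now-1h", "to": "now"}, "page": {"limit": 50}}'
while :; do
  response=$(curl -s -X POST "https://api.datadoghq.com/api/v2/logs/events/search" \
    -H "Content-Type: application/json" \
    -H "DD-API-KEY: $DD_API_KEY" \
    -H "DD-APPLICATION-KEY: $DD_APP_KEY" \
    -d "$body")
  echo "$response" | jq '.data[].attributes.message'
  cursor=$(echo "$response" | jq -r '.meta.page.after // empty')
  [ -z "$cursor" ] && break
  # Inject the cursor into the next request
  body=$(echo "$body" | jq -c --arg c "$cursor" '.page.cursor = $c')
done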
Use the dog CLI for metrics. Metrics are useful for spotting patterns (error rate spikes, latency increases) that logs alone might not reveal.
# CPU usage for a service (last hour)
dog --pretty metric query "avg:system.cpu.user{service:my-service,env:production}" \
$(node -e "console.log(Math.floor(Date.now()/1000) - 3600)") \
$(node -e "console.log(Math.floor(Date.now()/1000))")
# Request duration
dog --pretty metric query "avg:trace.http.request.duration{service:my-service,env:production}" \
$(node -e "console.log(Math.floor(Date.now()/1000) - 3600)") \
$(node -e "console.log(Math.floor(Date.now()/1000))")
# Error count
dog --pretty metric query "sum:trace.http.request.errors{service:my-service,env:production}.as_count()" \
$(node -e "console.log(Math.floor(Date.now()/1000) - 3600)") \
$(node -e "console.log(Math.floor(Date.now()/1000))")
Use the Traces API to get the full request lifecycle for specific requests.
curl -s -X POST "https://api.datadoghq.com/api/v2/spans/events/search" \
-H "Content-Type: application/json" \
-H "DD-API-KEY: $DD_API_KEY" \
-H "DD-APPLICATION-KEY: $DD_APP_KEY" \
-d '{
"filter": {
"query": "service:SERVICE_NAME @http.status_code:500 env:production",
"from": "now-15m",
"to": "now"
},
"sort": "-timestamp",
"page": { "limit": 25 }
}' | jq '.data[] | {trace_id: .attributes.attributes.trace_id, resource: .attributes.resource_name, duration_ns: .attributes.duration, status: .attributes.attributes["http.status_code"]}'
curl -s -X GET "https://api.datadoghq.com/api/v1/trace/TRACE_ID" \
-H "DD-API-KEY: $DD_API_KEY" \
-H "DD-APPLICATION-KEY: $DD_APP_KEY" | jq '.'
# List all monitors
dog --pretty monitor show_all
# Show specific monitor
dog --pretty monitor show MONITOR_ID
# Search monitors by name
dog --pretty monitor show_all | jq '.monitors[] | select(.name | contains("my-service"))'
# Recent events (deployments, alerts)
dog --pretty event stream --start 1h --tags "service:my-service,env:production"
For repeated log searches, this function avoids re-typing the full curl command:
dd_logs() {
local query="$1"
[[ ! "$query" =~ env: ]] && query="$query env:production"
local limit="${3:-25}"
jq -n --arg q "$query" --arg from "${2:-now-1h}" --argjson limit "$limit" \
'{filter: {query: $q, from: $from, to: "now"}, sort: "-timestamp", page: {limit: $limit}}' | \
curl -s -X POST "https://api.datadoghq.com/api/v2/logs/events/search" \
-H "Content-Type: application/json" \
-H "DD-API-KEY: $DD_API_KEY" \
-H "DD-APPLICATION-KEY: $DD_APP_KEY" \
-d @-
}
# Usage: dd_logs "service:my-service status:error" "now-15m" 10
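Combined with jq, the helper makes quick triage one-liners possible, for example ranking the most frequent error messages:

dd_logs "service:my-service status:error" "now-1h" 50 \
  | jq -r '.data[].attributes.message' \
  | sort | uniq -c | sort -rn | head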
| Error | Likely Cause | Fix |
|---|---|---|
| Empty results | Query too narrow or wrong time range | Expand time range (now-24h), remove filters one at a time |
| 401 Unauthorized | Invalid or missing API key | Verify ~/.dogrc has valid apikey and appkey |
| 403 Forbidden | API key lacks permissions | Check Datadog org settings for API key scopes |
| 429 Too Many Requests | Rate limited | Wait 30 seconds, reduce page.limit, narrow time range |
| Timeout | Query spans too much data | Narrow time range, add more specific filters |
Use jq to format all JSON output; raw API responses are unreadable.