From sdlc
Systematic root-cause debugging across Datadog logs/RUM/APM, Google Cloud logging, and the local codebase. Mirrors the gstack /investigate discipline (Iron Law, 3-strike rule, regression test, structured DEBUG REPORT) and routes per service. Optionally posts findings to Slack.
npx claudepluginhub jerrod/agent-plugins --plugin sdlc
Systematic debugging skill. Driven by symptom, not by monitor.
NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST. Fixing symptoms creates whack-a-mole debugging.
Required for full functionality (skill fails soft when missing):
| Variable | Purpose | Required for |
|---|---|---|
| DD_API_KEY | Datadog API key | core-api, core-consumer, RUM (arqu-web) |
| DD_APP_KEY | Datadog Application key | same as above |
| gcloud CLI authenticated | Google Cloud logging | arqu-atlas-* services |
| SLACK_BOT_TOKEN | Slack bot token (xoxb-...) | optional Slack pull/push |
Datadog site default: api.datadoghq.com (US1). Override with DD_SITE.
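The fail-soft behavior could be sketched as a preflight check that reports which backends are usable instead of aborting. The function name `investigate_preflight` is hypothetical and is not claimed to exist in lib.sh; only the variable names and the DD_SITE default come from the table above.

```shell
# Hypothetical preflight: print one status line per backend and never fail.
investigate_preflight() {
  # Datadog needs both keys; the site falls back to US1 when DD_SITE is unset or empty.
  if [ -n "${DD_API_KEY:-}" ] && [ -n "${DD_APP_KEY:-}" ]; then
    echo "datadog: ok (site ${DD_SITE:-api.datadoghq.com})"
  else
    echo "datadog: skipped (DD_API_KEY or DD_APP_KEY missing)"
  fi
  # Presence check only; this does not verify that gcloud is authenticated.
  if command -v gcloud >/dev/null 2>&1; then
    echo "gcloud: found"
  else
    echo "gcloud: skipped (CLI not found)"
  fi
  # Slack is optional.
  if [ -n "${SLACK_BOT_TOKEN:-}" ]; then
    echo "slack: ok"
  else
    echo "slack: skipped (SLACK_BOT_TOKEN missing)"
  fi
  return 0
}
```

Calling `investigate_preflight` at the start of a session would show up front which queries can run and which will be skipped.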
The skill does NOT hardcode a service catalog. Datadog APM traces are available for every service — always query traces. Logs and frontend errors split by where the service is hosted:
| Signal | Where to query | Applies to |
|---|---|---|
| APM traces | Datadog (dd_spans_search) | Every service — always |
| Backend logs | Datadog (dd_logs_search) | Non-atlas services (core-api, core-consumer, risklab, grs, etc.) |
| Backend logs | gcloud logging (gcloud_logs) | atlas services (arqu-atlas-*) |
| Frontend errors | Datadog RUM (dd_rum_search) | Browser apps (arqu-web, doubtfire-client, etc.) |
Heuristic for routing the LOGS query (traces are unconditional):
- arqu-atlas-* → gcloud logging
- everything else → Datadog logs

If uncertain, try Datadog first and fall back to gcloud. If still unclear
how a service is hosted or configured, inspect the infra project
(~/src/infra or whichever local checkout exists) — Terraform, Helm charts,
deployment manifests, and Datadog monitor definitions there are the source
of truth for where logs/metrics live and which service: tag is used.
Grep the infra repo for the service name and read the surrounding config
before falling back to AskUserQuestion. The lib.sh helpers validate service
names against [a-zA-Z0-9_.-]+ (format, not catalog) and reject anything
with spaces, quotes, or query operators.
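The routing heuristic and the format check can be sketched together as follows. Both function names are hypothetical, and lib.sh may implement the same rules differently; the sketch only encodes what the text above states.

```shell
# Hypothetical helpers illustrating the rules above.

# Accept only names made of [a-zA-Z0-9_.-]+ (a format check, not a catalog lookup).
valid_service_name() {
  case "$1" in
    ''|*[!a-zA-Z0-9_.-]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Pick the LOGS backend. APM traces are always queried from Datadog regardless.
logs_backend_for() {
  valid_service_name "$1" || { echo "invalid service name: $1" >&2; return 1; }
  case "$1" in
    arqu-atlas-*) echo "gcloud" ;;
    *)            echo "datadog" ;;
  esac
}
```

For example, `logs_backend_for arqu-atlas-celery-worker` prints `gcloud`, while `logs_backend_for core-api` prints `datadog`. A name such as `service:core-api` is rejected because `:` is a query operator.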
Helper functions for all of these live at lib.sh in this directory. Source it at the
start of any Bash block that calls a backend:
```shell
# Pick the highest-versioned copy of lib.sh under ~/.claude.
_LIB=$(find "$HOME/.claude" -path "*/investigate/lib.sh" 2>/dev/null | sort -V | tail -1)
[ -z "$_LIB" ] && { echo "FATAL: investigate/lib.sh not found" >&2; exit 1; }
source "$_LIB"
```
Read $ARGUMENTS. Extract:
- the symptom
- the service name
- an optional Slack thread URL (https://<workspace>.slack.com/archives/...)

Apply the routing heuristic above to pick the backend. If the service name is missing entirely, use AskUserQuestion to get one. Don't enumerate options; ask the user to type the service name.
If a Slack URL is present, fetch the thread first (Section: Slack pull) and incorporate the messages into the symptom set before continuing.
Always query Datadog APM traces — they exist for every service:
```shell
dd_spans_search "<service>" 1   # last 1h, error spans + trace deep links
```
Trace IDs link to https://app.datadoghq.com/apm/trace/<trace_id>.
Then query logs from the host-appropriate backend:
atlas services (arqu-atlas-*) → gcloud logging
```shell
gcloud config get-value project   # confirm correct project first
gcloud_logs "<container>" 1       # severity>=ERROR
```
If the active project looks wrong, use AskUserQuestion to confirm before querying.
everything else → Datadog logs
```shell
dd_logs_search "<service>" 1
```
If the service is a frontend / browser app (e.g. arqu-web,
doubtfire-client, or any FE project — RUM is enabled across all FE), also query RUM:
```shell
dd_rum_search "<service>" 1
```
Group by @error.message and @view.url. Errors clustered on a single
browser version → likely client-side, not a backend regression.
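The grouping step might look like the sketch below. It assumes the RUM results have already been reduced to one tab-separated `message<TAB>url` line per error. The output format of `dd_rum_search` is not specified here, so that reduction is left out, and `group_rum_errors` is a hypothetical name.

```shell
# Count occurrences of each (error message, view URL) pair, most frequent first.
# Input on stdin: one "message<TAB>url" line per RUM error (assumed format).
group_rum_errors() {
  sort | uniq -c | sort -rn
}
```

A single pair that dominates the top of this list is a candidate for the root-cause error.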
Look for: a single dominant error (likely root cause), a cluster correlated with a recent deploy (regression), or a new error appearing for the first time today (recent change).
Locate the affected repo for the service. Look in ~/src/ for a directory
matching the service name or its parent project. If you can't find it, ask
the user. Use Grep to locate the error message verbatim. Read the surrounding
code (functions calling this code path).
Check recent history with git -C <repo> log --oneline -20 -- <affected files>. Was this working
before? If this is a regression, the root cause is in the diff.
State a single, specific, testable hypothesis, with file:line if known.
Identify the narrowest directory containing the affected files. Tell the user:
Edits restricted to <dir>/ for this debug session. This prevents unrelated changes during root-cause work.
Self-enforce: do not edit files outside that directory until Phase 4 begins. If the bug genuinely spans the whole repo, skip the lock and note why.
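One way to make the self-enforced lock checkable is a small path guard, sketched below. `LOCK_DIR` and `in_scope` are illustrative names, not something the skill defines. The comparison is purely textual, so paths containing `..` would need normalizing first.

```shell
# Hypothetical guard: succeed only when a path is inside the locked directory.
# LOCK_DIR holds the narrowest directory chosen for this debug session.
in_scope() {
  case "$1" in
    "${LOCK_DIR%/}"/*) return 0 ;;   # strip one trailing slash, then require a child path
    *) return 1 ;;
  esac
}
```

With `LOCK_DIR=src/billing`, `in_scope src/billing/cart.py` succeeds, while `in_scope src/auth/login.py` and the sibling `src/billing-old/x.py` both fail.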
Match the symptom against known patterns:
| Pattern | Signature | Where to look |
|---|---|---|
| Race condition | Intermittent, timing-dependent | Concurrent access to shared state |
| Nil/null propagation | NoneType / TypeError | Missing guards on optional values |
| State corruption | Inconsistent data, partial updates | Transactions, callbacks, hooks |
| Integration failure | Timeout, unexpected response | External API calls, service boundaries |
| Configuration drift | Works locally, fails in staging/prod | Env vars, feature flags, DB state |
| Stale cache | Shows old data, fixes on cache clear | Redis, CDN, browser cache |
| Celery task retry storm | Same task ID alerting repeatedly in any celery worker | Task signature, idempotency, exception handling |
Also check:
- TODOS.md if it exists in the affected repo
- git log --grep="<error keyword>" for prior fixes touching the same files

Before writing ANY fix:
Verify the hypothesis. Add a temporary log statement or assertion, or run a targeted Datadog/gcloud query that would confirm or refute it. Examples:
- Add print(f"DEBUG cart={cart}") at the suspected line and trigger the repro.
- Run dd_logs_search "core-api" 1 | grep "promo_code" to confirm the error correlates with carts missing that field.

If the hypothesis is wrong, return to Phase 1 and gather more evidence. Do not guess.
3-strike rule: track hypotheses tested in this session. After 3 failed hypotheses, STOP and AskUserQuestion:
3 hypotheses tested, none match. This may be architectural rather than a simple bug.
A) Continue investigating — I have a new hypothesis: [describe]
B) Escalate for human review — this needs someone who knows the system
C) Add logging and wait — instrument the area and catch it next time
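Tracking the 3-strike rule only needs a counter. The sketch below uses hypothetical names (`FAILED_HYPOTHESES`, `record_failed_hypothesis`) and assumes the counter lives for one debug session.

```shell
# Hypothetical session counter for refuted hypotheses.
FAILED_HYPOTHESES=0

# Record one refuted hypothesis. Returns non-zero once 3 have failed,
# which is the signal to stop and ask the user how to proceed.
record_failed_hypothesis() {
  FAILED_HYPOTHESES=$((FAILED_HYPOTHESES + 1))
  [ "$FAILED_HYPOTHESES" -lt 3 ]
}
```

The first two calls succeed and the third returns non-zero, at which point the A/B/C question above applies.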
Red flags: "quick fix for now" (there is none); proposing a fix before tracing data flow (you're guessing); each fix revealing a new problem (wrong layer).
Only after the hypothesis is verified:
Smallest change that eliminates the root cause. Resist refactoring adjacent code.
Write a regression test that fails without the fix and passes with it.
Test runner per repo:
- pytest
- vitest

Run the full test suite for the affected repo. Paste output. No regressions allowed.
If the fix touches >5 files, AskUserQuestion:
This fix touches N files. That's a large blast radius for a bug fix.
A) Proceed — the root cause genuinely spans these files
B) Split — fix the critical path now, defer the rest
C) Rethink — maybe there's a more targeted approach
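The blast-radius check can be approximated by counting changed files. In this sketch both function names are hypothetical, and the input is assumed to be a newline-separated file list such as the output of `git diff --name-only`.

```shell
# Count non-empty lines on stdin, e.g. the output of `git diff --name-only`.
changed_file_count() {
  grep -c .
}

# Succeed when the count exceeds the 5-file threshold that triggers the question.
exceeds_blast_radius() {
  [ "$1" -gt 5 ]
}
```

A possible use is `n=$(git -C <repo> diff --name-only | changed_file_count)` followed by `exceeds_blast_radius "$n"` before committing.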
Revert any temporary diagnostic edits from Phase 3 before committing.
Fresh verification: Reproduce the original bug scenario (or re-query the backend after the fix lands and confirm error rate dropped). This is not optional.
Emit the structured DEBUG REPORT:
```
DEBUG REPORT
════════════════════════════════════════
Symptom:         [what the user observed]
Service:         [core-api / arqu-atlas-celery-worker / etc]
Backend used:    [Datadog logs+APM / gcloud logging / RUM / code-only]
Root cause:      [what was actually wrong]
Fix:             [file:line references]
Evidence:        [test output + Datadog/gcloud query results]
Regression test: [file:line of new test]
Datadog link:    [https://app.datadoghq.com/... if applicable]
Related:         [TODOS items, prior bugs in same area]
Status:          DONE | DONE_WITH_CONCERNS | BLOCKED
════════════════════════════════════════
```
If --slack-channel <C> was passed (or INVESTIGATE_SLACK_CHANNEL is set), post
the report to Slack now (Section: Slack push).
Prefer the Slack MCP server when available. If MCP Slack tools are present
in the session (tool names like mcp__slack__*), use them directly — they
handle auth and rate limiting for you. The lib.sh slack_* helpers are a
fallback for when MCP is not available, using SLACK_BOT_TOKEN from env.
Detection order at the top of any Slack step:
1. MCP Slack tools (mcp__slack__*) present → use them directly
2. SLACK_BOT_TOKEN set → use lib.sh helpers (slack_post, slack_thread_fetch)

If $ARGUMENTS contains a Slack thread URL, fetch the thread and use the
messages as additional symptom context in Phase 1. With MCP: call the
appropriate mcp__slack__* tool for thread replies. Without MCP:
```shell
slack_thread_fetch "<url>"
```
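For context, a Slack message permalink carries the channel ID and the message timestamp in its path, with the timestamp written without its dot after a leading `p`. The sketch below shows how a URL could be split into those two values. `parse_slack_url` is a hypothetical name, and how `slack_thread_fetch` actually parses its argument is not shown here.

```shell
# Hypothetical parser: print "<channel> <ts>" for a Slack message permalink.
# Expected shape: https://<workspace>.slack.com/archives/<channel>/p<digits>
parse_slack_url() {
  path=${1#*/archives/}              # keep everything after /archives/
  [ "$path" != "$1" ] || return 1    # no /archives/ segment found
  path=${path%%\?*}                  # drop any query string
  channel=${path%%/*}
  raw=${path#*/p}                    # digits after the leading "p"
  case "$raw" in
    ''|*[!0-9]*) return 1 ;;         # reject anything that is not all digits
  esac
  [ "${#raw}" -gt 6 ] || return 1
  # The permalink omits the dot; the last 6 digits are the fractional part.
  seconds=${raw%??????}
  frac=${raw#"$seconds"}
  echo "$channel $seconds.$frac"
}
```

For a reply permalink, the `thread_ts` query parameter identifies the parent message. This sketch ignores it and returns the timestamp of the linked message itself.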
After the DEBUG REPORT is emitted, if --slack-channel <C> was passed (or
INVESTIGATE_SLACK_CHANNEL is set), post the report to that channel. With
MCP: call the appropriate mcp__slack__* post-message tool. Without MCP:
```shell
slack_post "<channel>" "$(cat <<'EOF'
DEBUG REPORT
...full report...
EOF
)"
```
Use a code block (triple backticks) inside the message. On any Slack error (channel not found, rate limit, auth), warn and continue — never block.
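The warn-and-continue rule could be wrapped once and reused for every Slack call. In this sketch `slack_best_effort` is a hypothetical wrapper and is not claimed to be part of lib.sh.

```shell
# Hypothetical wrapper: run a Slack command, warn on failure, never propagate it.
slack_best_effort() {
  if ! "$@"; then
    echo "WARN: Slack step failed, continuing: $1" >&2
  fi
  return 0
}
```

Used as `slack_best_effort slack_post "<channel>" "$report"`, a missing channel or an auth error produces a warning on stderr and the investigation carries on.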
Report exactly one final status: DONE, DONE_WITH_CONCERNS, or BLOCKED.