From pup
Investigates a specific flaky test: fetches history, failure pattern, and category from Datadog, then recommends fix, quarantine, or escalate.
npx claudepluginhub datadog/pup --plugin pup
Slash command
/pup:dd-triage-flaky-test
One-line summary: Investigate a specific flaky test — get history, failure pattern, and category, then recommend fix, quarantine, or escalate.
Requires: dd-pup skill (pup CLI installed and authenticated).
| Parameter | Description |
|---|---|
| Test name | Fully qualified test name (e.g. TestMyFunc or com.example.MyTest) |
| Repository | Lowercase URL without the scheme (e.g. github.com/org/repo). Derive from git remote get-url origin if not provided. |
Derive repository ID from git if not provided:
```bash
git remote get-url origin
# Strip protocol and trailing .git, then lowercase the result
# e.g. https://github.com/DataDog/my-repo.git → github.com/datadog/my-repo
```
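The normalization described in the comments above can be sketched as a small shell function. This is a minimal sketch and not part of pup: the name `normalize_repo` is hypothetical, and the handling of SSH-style remotes (`git@host:org/repo.git`) is an assumption that goes beyond the HTTPS example given here.

```bash
# Hypothetical helper: turn a git remote URL into the lowercase,
# scheme-less repository ID form (e.g. github.com/org/repo).
normalize_repo() {
  local url="$1"
  case "$url" in
    *://*)
      url="${url#*://}"   # drop the scheme (https://, ssh://, ...)
      url="${url#*@}"     # drop user info such as git@, if present
      ;;
    *@*:*)
      # Assumption: scp-style remote, git@host:org/repo.git
      url="${url#*@}"                  # drop git@
      url="${url%%:*}/${url#*:}"       # host:org/repo -> host/org/repo
      ;;
  esac
  url="${url%/}"       # drop a trailing slash
  url="${url%.git}"    # drop a trailing .git
  printf '%s\n' "$url" | tr '[:upper:]' '[:lower:]'
}
```

A possible use is `repo=$(normalize_repo "$(git remote get-url origin)")`, with the result passed as the `@git.repository.id_v2` value in the queries that follow.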
Validation fallback: If STEP 1 returns no results, confirm the correct repository by searching without a repo filter:
```bash
pup cicd tests search \
  --query "@test.name:\"<test-name>\"" \
  --from 30d \
  --limit 5
```
Extract @git.repository.id_v2 from results and retry STEP 1 with the confirmed value.
Preferred — use fingerprint_fqn if known (fingerprint_fqn is a valid CI Visibility search facet, distinct from flaky_state):
```bash
pup cicd flaky-tests search \
  --query "fingerprint_fqn:<fqn>" \
  --sort="-last_flaked" \
  --limit 5
```
Fallback — use name + suite + repo:
```bash
pup cicd flaky-tests search \
  --query "@test.name:\"<test-name>\" @test.suite:\"<suite>\" @git.repository.id_v2:\"<repo>\"" \
  --sort="-last_flaked" \
  --limit 10
```
Omit @test.suite if unknown; if the same test name appears in multiple suites, pick the entry whose suite matches the failing test.
Do not filter by flaky_test_state — return the test regardless of state.
Note: the query filter facet is flaky_test_state; the returned response attribute is flaky_state — these are different names for the same concept; do not use flaky_state:active as a query filter.
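The fallback query, including the rule to omit `@test.suite` when the suite is unknown, can be assembled with a helper along these lines. This is a sketch under assumptions: `build_flaky_query` is a hypothetical name, and it does not escape double quotes embedded in the test name. It deliberately adds no `flaky_test_state` filter.

```bash
# Hypothetical helper: build the name + suite + repo query string.
# Pass an empty string as the suite to leave @test.suite out.
build_flaky_query() {
  local name="$1" suite="$2" repo="$3"
  local q="@test.name:\"${name}\""
  if [ -n "$suite" ]; then
    q="${q} @test.suite:\"${suite}\""
  fi
  q="${q} @git.repository.id_v2:\"${repo}\""
  printf '%s\n' "$q"
}
```

The output can then be passed as the `--query` value, for example `--query "$(build_flaky_query "TestMyFunc" "" "github.com/org/repo")"`.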
Extract from results:
- fingerprint_fqn: unique test identifier; used as the id in the STEP 5 write call. If absent, do not proceed to quarantine (see STEP 5).
- flaky_state: current state (active / quarantined / disabled / fixed)
- test_stats.failure_rate_pct: percentage of runs that fail
- flaky_category: root cause category
- codeowners: owning team
- pipeline_stats.total_lost_time_ms: total CI time lost

```bash
pup cicd tests search \
  --query "@test.name:\"<test-name>\" @test.suite:\"<suite>\" @test.status:fail @git.repository.id_v2:\"<repo>\"" \
  --from 7d \
  --limit 20
```
Extract:
- Error details (@error.message, @error.stack)
- Branches (@git.branch): branch-specific vs. widespread
- @ci.pipeline.id values for blast radius (STEP 3)

Count distinct pipelines impacted using pipeline IDs from STEP 2:

```bash
pup cicd events aggregate \
  --query "@ci.status:error @ci.pipeline.id:(<id1> OR <id2> OR ...) @git.repository.id_v2:\"<repo>\"" \
  --compute count \
  --group-by "@ci.pipeline.name" \
  --from 7d
```
Use the first 10 pipeline IDs from STEP 2 (cap at 10; if more are available, run a second batch and merge results by summing counts per @ci.pipeline.name across batches). Report blast radius as: total number of unique pipelines impacted and whether failures are branch-specific or widespread.
Note: a pipeline failure is not necessarily caused solely by this flaky test — treat blast radius as a signal, not a definitive count.
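The batching and merging rule above (at most 10 pipeline IDs per query, counts summed per `@ci.pipeline.name` across batches) can be sketched as two shell helpers. Both names are hypothetical, and the tab-separated `name<TAB>count` input to `merge_counts` is an assumed intermediate format, not the documented output of pup.

```bash
# Hypothetical helper: join up to 10 pipeline IDs into an OR clause.
# Extra IDs beyond the first 10 are ignored; pass them in a second call.
pipeline_clause() {
  [ "$#" -gt 0 ] || return 1
  local clause="" id n=0
  for id in "$@"; do
    [ "$n" -ge 10 ] && break
    if [ -z "$clause" ]; then
      clause="$id"
    else
      clause="$clause OR $id"
    fi
    n=$((n + 1))
  done
  printf '@ci.pipeline.id:(%s)\n' "$clause"
}

# Hypothetical helper: sum counts per pipeline name across batches.
# Reads "name<TAB>count" lines on stdin (assumed format) and prints
# one sorted "name<TAB>total" line per pipeline.
merge_counts() {
  awk -F'\t' '{ sum[$1] += $2 }
              END { for (k in sum) printf "%s\t%d\n", k, sum[k] }' | sort
}
```

The number of lines printed by `merge_counts` is the unique-pipeline figure to report as blast radius.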
Use flaky_category from STEP 1 and error messages from STEP 2.
Root cause first:
infra and recommend retry instead.
Fix at the correct layer:
Forbidden — do not propose these:
Fix patterns by category:
| Category | Approach |
|---|---|
| timeout | Identify the slow operation and make it synchronous or deterministic; do NOT simply raise the timeout constant |
| concurrency | Add deterministic synchronization (barriers, channels, locks); remove shared mutable state between tests |
| network | Mock or stub network calls at the boundary; if the test requires a real connection, isolate it with a test server |
| time | Inject a controllable clock; replace wall-clock assertions with relative or event-driven checks |
| order_dependency | Isolate test state with setup/teardown; eliminate dependencies on execution order or global state |
| environment_dependency | Mock env variables and external config; use test-local fixtures, not shared directories or singletons |
| resource_leak | Ensure every resource opened in a test is closed in teardown; use cleanup hooks that run even on failure |
| randomness | Fix the random seed for the test run; use deterministic inputs instead of random generation |
| asynchronous_wait | Replace fixed sleeps with condition polling or event/signal-driven waits with a hard timeout |
| io | Use temp files/dirs cleaned up in teardown; mock or stub filesystem interactions |
| unknown | Skip fix attempt → go to quarantine |
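As an illustration of the asynchronous_wait row, a fixed sleep can be replaced by condition polling with a hard timeout. The sketch below is one possible shape for a shell-based test, not a prescribed fix; `wait_until` is a hypothetical helper, and the real fix belongs in the test's own language and framework.

```bash
# Hypothetical helper: poll a condition until it succeeds or a hard
# timeout expires, instead of sleeping for a fixed duration.
# Usage: wait_until <timeout_seconds> <interval_seconds> <command...>
wait_until() {
  local timeout_s="$1" interval_s="$2"
  shift 2
  local deadline=$(( $(date +%s) + timeout_s ))
  while ! "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1   # condition never became true within the timeout
    fi
    sleep "$interval_s"
  done
  return 0
}
```

For example, `sleep 10` followed by an assertion on a file could become `wait_until 10 1 test -e "$ready_file"`, where `ready_file` is a placeholder: the check returns as soon as the condition holds and fails after 10 seconds if it never does.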
Before proposing code changes, verify all of the following — if any fails, skip fix and recommend quarantine:
Decision:
- Category unknown OR verification above fails → skip fix, recommend quarantine

```
Flaky Test Triage Brief
=======================
Test: <fully qualified test name>
Service: <@test.service>
Category: <flaky_category>
Failure Rate: <test_stats.failure_rate_pct>%
Duration Lost: <pipeline_stats.total_lost_time_ms>ms
Codeowners: <codeowners>
Blast Radius: <N> pipelines (<branch-specific | widespread>) [approximate; other failures in the same pipeline runs may not be related]
Evidence:
<1-2 key error message lines from STEP 2>
Recommendation: <fix | quarantine | escalate>
Confidence: <high | medium | low>
Action: <specific next step>
```
Decision thresholds:
- failure_rate_pct > 10 OR blast radius > 5 pipelines → quarantine
- failure_rate_pct ≤ 10 AND known category AND clear fix → fix
- failure_rate_pct ≤ 10 AND category unknown → escalate to codeowners with triage brief

If recommending quarantine, present the following and require explicit user approval before writing:

```
Proposed action: quarantine "<test-name>"
id (fingerprint_fqn): <fingerprint_fqn from STEP 1>
Effect: test still runs but failures are suppressed (CI will not be blocked)
Reversible: yes; update new_state to active to restore
Approve? (yes/no)
```
If fingerprint_fqn was not returned in STEP 1 (the test is not yet in Flaky Test Management, or the query returned no results), do not attempt the write. Surface an error and ask the user to quarantine the test manually in the Flaky Test Management UI.
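The decision thresholds and the two quarantine preconditions can be sketched as shell functions. Both names are hypothetical. The thresholds do not cover a low failure rate with a known category but no clear fix; the sketch assumes quarantine in that case, following the earlier rule that a failed fix verification leads to quarantine.

```bash
# Hypothetical helper: map triage inputs to a recommendation.
# Args: failure_rate_pct, blast radius (pipelines), flaky_category,
#       clear_fix ("yes" or "no").
recommend() {
  local rate="$1" pipelines="$2" category="$3" clear_fix="$4"
  # awk handles fractional rates such as 12.5
  if awk -v r="$rate" -v p="$pipelines" 'BEGIN { exit !(r > 10 || p > 5) }'; then
    echo quarantine
  elif [ "$category" = "unknown" ]; then
    echo escalate
  elif [ "$clear_fix" = "yes" ]; then
    echo fix
  else
    # Assumption: known category but no verified fix -> quarantine
    echo quarantine
  fi
}

# Hypothetical guard: the write is allowed only with a non-empty
# fingerprint_fqn and an explicit "yes" from the user.
can_quarantine() {
  local fqn="$1" answer="$2"
  [ -n "$fqn" ] && [ "$answer" = "yes" ]
}
```

For example, `recommend 12.5 2 timeout yes` prints `quarantine`, and `can_quarantine "" yes` fails because no fingerprint_fqn is available.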
Only after explicit approval and a confirmed fingerprint_fqn, write the body file and run:
```bash
cat > /tmp/flaky-update.json <<'EOF'
{
  "data": {
    "type": "UpdateFlakyTestsRequest",
    "attributes": {
      "tests": [{"id": "<fingerprint_fqn>", "new_state": "quarantined"}]
    }
  }
}
EOF
pup test-optimization flaky-tests update --file /tmp/flaky-update.json
```
To undo: repeat with "new_state": "active".
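Because quarantine and undo differ only in `new_state`, the body file can be produced by one parameterized helper. This is a sketch: `write_update_body` is a hypothetical name, and restricting the state to `quarantined` or `active` is an assumption based on the two writes described here. The helper only writes the file; it does not call pup.

```bash
# Hypothetical helper: write the update request body for one test.
# Args: fingerprint_fqn, new_state (quarantined|active), output path.
write_update_body() {
  local fqn="$1" state="$2" out="$3"
  [ -n "$fqn" ] || { echo "missing fingerprint_fqn" >&2; return 1; }
  case "$state" in
    quarantined|active) ;;
    *) echo "unsupported state: $state" >&2; return 1 ;;
  esac
  # Escape backslashes and double quotes so the id stays valid JSON
  local escaped
  escaped=$(printf '%s' "$fqn" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  cat > "$out" <<EOF
{
  "data": {
    "type": "UpdateFlakyTestsRequest",
    "attributes": {
      "tests": [{"id": "${escaped}", "new_state": "${state}"}]
    }
  }
}
EOF
}
```

After explicit approval, `write_update_body "$fqn" quarantined /tmp/flaky-update.json` followed by the `pup test-optimization flaky-tests update --file /tmp/flaky-update.json` command performs the quarantine; passing `active` produces the undo body.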