Debugs and fixes flaky Playwright E2E tests using LLM reports from GitHub Actions and Datadog. Use for investigating intermittent failures, triaging flakiness, or stabilizing tests.
`npx claudepluginhub clipboardhealth/core-utils --plugin core`

This skill uses the workspace's default tool permissions.
Work through these phases in order. Skip phases only when you already have the information they produce.
Capture these details first so the investigation is reproducible. If the user hasn't provided them, ask.
Downloads the playwright-llm-report artifact from a GitHub Actions run.
bash scripts/fetch-llm-report.sh "<github-actions-url>"
This downloads and extracts to /tmp/playwright-llm-report-{runId}/. The report is a single llm-report.json file.
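Before walking the fields below, it can help to get a quick overview of what failed. A minimal TypeScript sketch, assuming Node and the field names described in the structure below; the `<runId>` placeholder in the path comes from the fetch step above:

```typescript
// list-failures.ts: quick orientation pass over the report (sketch; field names per the structure below).
import { readFileSync } from "node:fs";

// Path produced by fetch-llm-report.sh; substitute the actual run ID.
const reportPath = "/tmp/playwright-llm-report-<runId>/llm-report.json";
const report = JSON.parse(readFileSync(reportPath, "utf8"));

console.log("summary:", JSON.stringify(report.summary));

for (const test of report.tests ?? []) {
  if (test.status !== "failed" && !test.flaky) continue;
  const attempts = (test.attempts ?? []).map((a: any) => a.status).join(" -> ");
  console.log(`\n${test.status}${test.flaky ? " (flaky)" : ""} at ${test.location?.file}:${test.location?.line}`);
  console.log(`  attempts: ${attempts}`);
  for (const err of test.errors ?? []) {
    console.log(`  error: ${String(err.message ?? "").split("\n")[0]}`);
  }
}
```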
LLM report structure:
- `summary` -- quick pass/fail counts
- `tests[].errors[].message` -- ANSI-stripped, clean error text
- `tests[].errors[].diff` -- extracted expected/actual from assertion errors
- `tests[].errors[].location` -- exact file and line of failure
- `tests[].flaky` -- true if test passed after retry
- `tests[].attempts[]` -- full retry history with per-attempt status, timing, stdio, attachments, steps, and network
- `tests[].attempts[].consoleMessages[]` -- warning/error/pageerror/page-closed/page-crashed trace entries only (2KB text cap with `[truncated]` marker, max 50 per attempt, high-signal entries prioritized over low-signal)
- `tests[].steps` / `tests[].network` / `tests[].timeline` -- convenience aliases from the final attempt
- `tests[].attempts[].timeline[]` -- unified, sorted-by-offsetMs array of all retained events (`kind: "step" | "network" | "console"`). Slimmed-down entries for quick temporal scanning; full details remain in the source arrays
- `offsetMs` -- milliseconds since the attempt's `startTime`. Always present on steps (from `TestStep.startTime`). Optional on network entries (from trace `_monotonicTime` or `startedDateTime`, converted via the trace's context-options anchor) and console entries (from the trace monotonic time field + anchor). Absent when the trace lacks a context-options event. Entries without `offsetMs` are excluded from the timeline
- `tests[].attempts[].network[].traceId` -- promoted from the `x-datadog-trace-id` header for direct access
- `tests[].attempts[].network[]` -- max 200 per attempt, priority-based: fetch/xhr requests, error responses (status >= 400), failed, and aborted requests are retained over static assets (script, stylesheet, image, font). Includes failure details (`failureText`, `wasAborted`), redirect chain (`redirectToUrl`, `redirectFromUrl`, `redirectChain`), timing breakdown (`timings`), `durationMs` derived from available timing components, and allowlisted headers (`requestHeaders`, `responseHeaders`)
- `tests[].attempts[].network[].responseHeaders` -- includes `x-datadog-trace-id` and `x-datadog-span-id` when present (values capped to 256 chars)
- `tests[].attempts[].failureArtifacts` -- for failing/timed-out/interrupted attempts: `screenshotBase64` (base64-encoded screenshot, max 512KB), `videoPath` (first video attachment path). Omitted entirely when neither screenshot nor video is available
- `tests[].attachments[].path` -- relative to the Playwright `outputDir`
- `tests[].stdout` / `tests[].stderr` -- capped at 4KB with `[truncated]` marker

Classify the flake to narrow the search space:
| Category | Signal | Timeline Pattern |
|---|---|---|
| Test-state leakage | Retries or earlier tests leave auth, cookies, storage, or server state behind | attempts[] — different outcomes across retries |
| Data collision | "Random" identities aren't unique enough and collide with existing users/entities | errors[] — duplicate key or conflict errors |
| Backend stale data | API returned 200 but response body shows old state | step(action) → network(GET, 200) → step(assert) FAIL — API succeeded but data was stale |
| Frontend cache stale | No network request after navigation/reload for the relevant endpoint | step(reload) → step(assert) FAIL — no intervening network call for expected endpoint |
| Silent network failure | CORS, DNS, or transport error prevented the request from completing | step(action) → console(error: "net::ERR_FAILED") → step(assert) FAIL |
| Render/hydration bug | API returned correct data but component didn't render it | network(GET, 200, correct data) → step(assert) FAIL — no console errors |
| Environment / infra | Transient 5xx, timeouts, DNS/network instability | network entries with 5xx status; consoleMessages[] with connection errors |
| Locator / UX drift | Selector is valid but brittle against small UI changes | errors[] — locator/selector text in error message |
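The Signal and Timeline Pattern columns can be turned into a rough first-pass heuristic. The sketch below only illustrates the cheap signals and is no substitute for walking the timeline; it assumes the per-attempt fields described in the next phase (`error`, `network[]`, `consoleMessages[]`), and the exact `error` shape is an assumption:

```typescript
// classify-flake.ts: crude first-pass heuristic mirroring the table above (sketch only).
type AttemptLike = {
  error?: string | { message?: string };  // shape assumed; may be a string or an object
  network?: { status?: number; failureText?: string; wasAborted?: boolean }[];
  consoleMessages?: { type: string; text: string }[];
};

function roughClassify(attempt: AttemptLike): string {
  const msg = typeof attempt.error === "string" ? attempt.error : attempt.error?.message ?? "";
  const net = attempt.network ?? [];
  const consoles = attempt.consoleMessages ?? [];

  if (net.some((r) => (r.status ?? 0) >= 500)) return "environment / infra (5xx responses)";
  if (consoles.some((c) => /net::ERR_|CORS|Failed to fetch/i.test(c.text))) return "silent network failure";
  if (/duplicate key|already exists|conflict/i.test(msg)) return "data collision";
  if (/locator|selector|strict mode/i.test(msg)) return "locator / UX drift";
  // Stale backend data, stale frontend cache, and render/hydration bugs need the
  // timeline walk in the next phase; there is no cheap single-field signal for them.
  return "unclassified: walk the timeline";
}
```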
Use attempts[].timeline[] as the primary analysis view. The timeline is a unified, offsetMs-sorted array of all steps, network requests, and console entries. Walk it to reconstruct the exact event sequence around the failure:
step(click "Submit") → network(POST /api/orders, 201) → step(waitForURL /confirmation) → console(error: "Cannot read property...") → step(expect toBeVisible) FAILED
For each timeline entry:
kind: "step" — test action with title, category, durationMs, depth, optional errorkind: "network" — HTTP request with method, url, status, optional durationMs, resourceType, traceId, failureText, wasAbortedkind: "console" — browser message with type (warning/error/pageerror/page-closed/page-crashed) and textAll entries share offsetMs (milliseconds since attempt start), giving a single temporal view.
If you don't have passing and failing attempts for the same test, skip to 3c.
Walk the failed attempt's timeline and the passed attempt's timeline side-by-side to identify the first divergence point:
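One mechanical way to find that point is to reduce each timeline to a comparable signature and report the first index where the two sequences differ. A rough sketch under the same assumptions about the timeline shape as above; real timelines may reorder network events, so treat the result as a starting point, not a verdict:

```typescript
// first-divergence.ts: locate where a failing attempt's timeline departs from a passing one (sketch).
type Entry = { kind: "step" | "network" | "console"; title?: string; method?: string; url?: string; type?: string };

// Reduce an entry to a comparable signature, ignoring offsets and statuses that naturally vary.
function signature(e: Entry): string {
  if (e.kind === "step") return `step:${e.title}`;
  if (e.kind === "network") return `net:${e.method} ${e.url?.split("?")[0]}`;
  return `console:${e.type}`;
}

function firstDivergence(passed: Entry[], failed: Entry[]): void {
  const length = Math.max(passed.length, failed.length);
  for (let i = 0; i < length; i++) {
    const p = passed[i];
    const f = failed[i];
    if (p && f && signature(p) === signature(f)) continue;
    console.log(`First divergence at index ${i}`);
    console.log(`  passed: ${p ? signature(p) : "(no further events)"}`);
    console.log(`  failed: ${f ? signature(f) : "(no further events)"}`);
    return;
  }
  console.log("Timelines have the same shape; compare offsets and durations instead.");
}
```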
Common divergence patterns mirror the timeline patterns in the classification table above: a network request that appears only in the passing attempt, a console error that appears only in the failing one, or an assertion step that fires before the expected network call.
Filter tests[] for entries where status is "failed" or flaky is true. For each:
- `errors[]`: Contains clean error text with extracted assertion diffs and file/line location. This is usually enough to understand what went wrong.
- `location`: Source file, line, and column — jump straight to the code.
- `attempts[]`: Full retry history. Compare attempt outcomes, durations, and errors to see if the failure is consistent or intermittent.

Each attempt includes:

- `status` and `durationMs` — spot timing differences between passing and failing attempts
- `error` — failure reason per attempt (may differ across retries)
- `consoleMessages[]` — browser warnings/errors (only warning, error, pageerror, page-closed, page-crashed entries; capped at 2KB / 50 per attempt)
- `failureArtifacts` — for failed/timed-out/interrupted attempts:
  - `screenshotBase64` — base64-encoded failure screenshot (max 512KB). Decode and inspect this to see exactly what the page showed at failure time — often reveals modals, loading spinners, error banners, or unexpected navigation that the assertion text alone doesn't explain.
  - `videoPath` — path to video recording
- `network[]` — HTTP requests/responses for that attempt
- `timeline[]` — unified sorted event stream

The `network[]` array (on tests or individual attempts) includes:

- `method`, `url`, `status` — identify 4xx/5xx responses
- `timings` — detailed breakdown: dnsMs, connectMs, sslMs, sendMs, waitMs, receiveMs
- `durationMs` — total request duration derived from timing components
- `requestHeaders`, `responseHeaders` — allowlisted headers
- `redirectChain` — full redirect sequence
- `traceId` — Datadog trace ID extracted from the `x-datadog-trace-id` response header. When present near a failure, you must use references/datadog-apm-traces.md for backend correlation to bridge the gap between the frontend test failure and a potential backend root cause.

Network is capped at 200 entries per attempt, prioritized: fetch/xhr and error responses are retained over static assets. Header values are capped at 256 chars. If all 200 entries are static assets (script/stylesheet/font) with no API calls, the capture is saturated.
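When a failing attempt carries these artifacts, two quick extractions usually pay off: write the screenshot to disk so it can be viewed, and list failed or 4xx/5xx requests with their Datadog trace IDs. A sketch assuming the attempt fields above; the `.png` extension is an assumption, so adjust it to the actual screenshot format:

```typescript
// inspect-attempt.ts: dump the failure screenshot and suspicious requests for one attempt (sketch).
import { writeFileSync } from "node:fs";

type NetworkEntry = { method?: string; url?: string; status?: number; traceId?: string; failureText?: string };
type AttemptLike = {
  failureArtifacts?: { screenshotBase64?: string; videoPath?: string };
  network?: NetworkEntry[];
};

function inspectAttempt(attempt: AttemptLike, outPath = "/tmp/failure-screenshot.png"): void {
  // Decode the base64 screenshot so it can be opened in an image viewer.
  const shot = attempt.failureArtifacts?.screenshotBase64;
  if (shot) {
    writeFileSync(outPath, Buffer.from(shot, "base64"));
    console.log(`screenshot written to ${outPath}`);
  }

  // Surface failed or 4xx/5xx requests along with their Datadog trace IDs.
  for (const req of attempt.network ?? []) {
    if ((req.status ?? 0) < 400 && !req.failureText) continue;
    const trace = req.traceId ? ` (traceId: ${req.traceId})` : "";
    console.log(`${req.method} ${req.url} -> ${req.status ?? req.failureText}${trace}`);
  }
}
```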
`tests[].steps[]` provides a step-by-step breakdown of test actions with timing (`offsetMs`, `durationMs`, `depth`). Prefer the timeline view (3a), which interleaves steps with network and console. Use steps directly when you need the full hierarchy (nested steps via `depth`).
Do not propose a fix without concrete artifacts. At minimum, include:
- `tests[].errors[]` (assertion diff, timeout message) or a trace/log entry
- `tests[].network[]` or `attempts[].network[]` (response status, timing, headers)
- `tests[].location` to jump to the source
- `failureArtifacts.screenshotBase64` showing page state at failure
- `network[].traceId` showing backend behavior for the failing request

Rate your confidence in the root cause on a 1-5 scale. Report this score alongside your evidence.
| Score | Meaning | Criteria |
|---|---|---|
| 5 | Certain | Root cause is directly visible in artifacts (e.g., assertion diff shows stale data, network response confirms 5xx, screenshot shows error banner) |
| 4 | High confidence | Evidence strongly supports the diagnosis but one link in the chain is inferred rather than observed (e.g., timeline shows the right sequence but no Datadog trace to confirm backend behavior) |
| 3 | Moderate confidence | Evidence is consistent with the diagnosis but alternative explanations remain plausible. Flag the alternatives explicitly |
| 2 | Low confidence | Limited evidence, mostly reasoning from code patterns rather than observed artifacts. Recommend gathering more data before committing to a fix |
| 1 | Speculative | No direct evidence for the root cause. The fix is a best guess. Recommend reproducing the failure locally or adding instrumentation before proceeding |
If confidence is 2 or below, do not propose a code fix. Instead, recommend specific instrumentation or reproduction steps to raise confidence.
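If it helps keep findings consistent, the score and its supporting evidence can be captured in a small structure. The shape below is purely illustrative, not something the report format or this skill prescribes:

```typescript
// diagnosis.ts: illustrative shape for reporting a root-cause call with evidence (not prescribed by the skill).
interface Diagnosis {
  test: string;                        // test title or location
  category: string;                    // one of the flake categories from the classification table
  confidence: 1 | 2 | 3 | 4 | 5;       // rubric above; 2 or below means no code fix yet
  evidence: string[];                  // pointers: errors[] excerpt, network entry, screenshot path, traceId
  alternatives?: string[];             // at confidence 3, list the plausible competing explanations
  proposedFix?: "test-harness" | "product" | "both"; // decided in the next phase; omit at confidence <= 2
  nextSteps?: string[];                // instrumentation or reproduction steps when confidence is low
}
```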
Apply fixes in this order of priority:
1. Validate scenario realism first. Is the failure path possible for real users, or is it purely a test-setup artifact? If it is not user-realistic, prioritize test/data/harness fixes over product changes.
2. Test harness fix (when the failure is non-product).
3. Product fix (when real users would hit the same issue).
4. Both, if user impact exists and the tests are fragile.
Lint and type-check touched files
When documenting the fix in a PR or issue, use this structure: