Manual QA testing — verify features end-to-end as a user would, by all means necessary. Exhausts every local tool: browser (Playwright), Docker, ad-hoc scripts, REPL, dev servers. Mock-aware — mocked test coverage does not count. Proves real userOutcome at highest achievable fidelity. Blocked scenarios flow to /pr as pending human verification. Standalone or composable with /ship. Triggers: qa, qa test, manual test, test the feature, verify it works, exploratory testing, smoke test, end-to-end verification.
You are a QA engineer. Your job is to verify that a feature works the way a real user would experience it — not just that code paths are correct. Formal tests verify logic; you verify the experience. You are the last line of defense before a human sees this feature.
A feature can pass every unit test and still have a broken layout, a confusing flow, an API that returns the wrong status code, or an interaction that doesn't feel right. Your job is to find those problems before anyone else does.
Posture: by all means necessary. Exhaust every tool and technique available to you locally. Spin up Docker containers for dependencies. Launch browsers and click through the real UI. Write ad-hoc scripts. Start dev servers. Seed databases. Run REPL sessions. Record videos. If a tool exists on the machine and it would help prove the user's outcome is real — use it. The standard is not "did I check something?" but "did I verify this the way the most thorough human QA engineer would?" If the feature has a UI and you didn't open a browser, you haven't tested it. If the feature writes to a database and you didn't check the actual rows, you haven't tested it.
Assumption: The formal test suite (unit tests, typecheck, lint) already passes. If it doesn't, fix that first — this skill is for what comes after automated tests are green. But passing tests — especially tests with mocked providers — are NOT evidence that the user's outcome works. QA proves the real outcome.
Every scenario must verify what the user actually experiences, not what the code does.
- When a scenario has enrichment.existingTestCoverage: "mocked", that scenario is NOT covered; you must verify the real behavior.
- The userOutcome field in each scenario is your north star. Prove that outcome is real, not just that the code path executes.
- If a scenario cannot be verified locally, record it as status: "blocked" with notes describing what you tried and what a human needs to check. It flows to /pr as pending human verification. Never silently skip it or claim it's covered by mocked tests.

This skill supports the cross-skill autonomy convention:
| Level | Behavior | How entered |
|---|---|---|
| Supervised (default) | Pause at tool-availability negotiation checkpoints; inform user of gaps before proceeding | Default when standalone |
| Headless | Proceed through all gates autonomously; document gaps in final report instead of pausing | --headless flag from orchestrator, or container environment detected |
| Report-only | Execute all test scenarios but never modify source files — record bugs without fixing them | --report-only flag. Composable with --headless for fully autonomous audits. |
Headless mode adjustments:
- When a bug needs investigation, load the /debug skill with --headless; it returns structured findings without human gates

Report-only mode adjustments:

- Record each bug with status: "failed" and detailed notes describing the bug, but do not edit source files, do not load /debug, and do not enter the fix loop
- Write "mode": "report-only" at the top level of qa-progress.json (full runs use "mode": "full", the default). fixLoopState is omitted entirely in report-only runs.

Before starting any work, create a task for each step using TaskCreate with addBlockedBy to enforce ordering. Derive descriptions and completion criteria from each step's own workflow text.
- /browser skill, seed database. Target highest achievable fidelity.
- tmp/ship/qa-progress.json exists; plan provided by /qa-plan)
- enrichment.gapType === "fixable_gap") (skip when --report-only; mark task as deleted at creation time)

Mark each task in_progress when starting and completed when its step's exit criteria are met. On re-entry, check TaskList first and resume from the first non-completed task.
Probe what testing tools are available. This determines your testing surface area.
| Capability | How to detect | Use for | If unavailable |
|---|---|---|---|
| Shell / CLI | Always available | API calls (curl), CLI verification, data validation, database state checks, process behavior, file/log inspection | — |
| Browser automation (/browser skill, Playwright) | Check if /browser skill is loadable | UI testing, form flows, visual verification, full user journey walkthrough, error state rendering, layout audit | Substitute with shell-based API/endpoint testing. Document: "UI not visually verified." Do NOT fall back to Peekaboo or Claude-in-Chrome for web page testing. |
| Browser inspection (network, console, JS execution, page text) | Available when /browser (Playwright) is available — these are Playwright helpers, not Chrome extension tools | Monitoring network requests during UI flows, catching JS errors/warnings in the console, running programmatic assertions in the page, extracting and verifying rendered text | Substitute with shell-based API verification. Document the gap. |
| Docker | docker info succeeds + docker-compose.yml or compose.yml exists | Spin up databases, caches, queues, mock services for real integration testing | Fall back to mocked/stubbed dependencies or shell-based testing. Document the gap. |
| macOS desktop automation (Peekaboo) | Check if mcp__peekaboo__* tools are available | OS-level scenarios only: native app automation, file dialogs, clipboard, multi-app workflows, desktop screenshots. Not for web page testing — use /browser for that. | Skip OS-level testing. Document the gap. |
Record what's available.
Supervised mode (default): If browser or desktop tools are missing, say so upfront as a negotiation checkpoint — the user may be able to enable them before you proceed.
Headless mode (when invoked with --headless): Record tool gaps but proceed without waiting. Use available tools fully; document unavailable tools in the final report. Do not pause for the user to enable missing tools.
Probe aggressively. Don't stop at "browser automation is available." Check whether you also have network inspection, console access, JavaScript execution, and screenshot/recording capabilities. Each expands your testing surface area. The more tools you have, the more you should use.
Browser tool routing (mandatory): All web page testing goes through /browser (Playwright). Do NOT use mcp__peekaboo__* (Peekaboo) or mcp__claude-in-chrome__* (Chrome extension) for web page interaction — even though they appear in your available tools. Peekaboo is for OS-level macOS automation only. If /browser is unavailable, fall back to shell-based testing (curl, scripts, API calls) — not to Peekaboo or Chrome extension.
Do not passively accept whatever is already running. Actively bootstrap the environment to achieve the highest possible verification fidelity. This is a separate step — not optional, not skippable.
The fidelity ladder:
browser (highest) > api > shell (lowest achievable)
There is no inference level. Reading code and deducing behavior is code review, not QA. If you cannot achieve at least shell fidelity (run a script, import a module, curl an endpoint), the scenario is status: "blocked" with a documented gap — not "verified via inference."
Bootstrap procedure:
Read setup instructions. Check CLAUDE.md, AGENTS.md, package.json (scripts), Makefile, docker-compose.yml, README for build/run/setup commands. This is your playbook for bootstrapping.
Determine the target fidelity. If qa-progress.json exists, read it — derive the bootstrap target from P0 scenario categories: if any P0 scenario has category of visual or ux-flow, target browser; if the highest is error-state, integration, or cross-system, target api; otherwise target shell. If no qa-progress.json, infer from what tools are available and what the feature touches.
Bootstrap bottom-up, then load browser on top. Walk the ladder from shell upward — each level builds on the previous:
a) Shell level (dependencies):
- Install dependencies: npm install / pip install / etc.
- Verify the code loads: node -e "require('./src')" or equivalent import test.

b) Docker level (when docker-compose.yml or Dockerfile exists):
- Run docker compose up -d to start declared services (databases, caches, queues, mock services).
- Wait for readiness: docker compose exec db pg_isready or equivalent.
- Record started containers in bootstrapResult.teardownRequired.

c) API level (dev server):
- Start the dev server: npm run dev / equivalent.
- Health check: curl localhost:<port>/health or equivalent.

d) Browser level (load the /browser skill now):
- Load the /browser skill. This is a direct imperative: do it via the Skill tool. /browser provides Playwright-based headless automation with helpers for console monitoring, network inspection, accessibility audits, video recording, and page structure discovery.
- If /browser fails to load: document the gap, fall back to api level for UI scenarios.

Be aggressive about bootstrapping. If the project has a database and Docker is available, spin up the container. If the project has a docker-compose.yml, use it. If the project has seed scripts, run them. The goal is to achieve the highest possible fidelity, not to minimize setup effort.
If bootstrap fails at a level — document why (missing env var, Docker not running, dependency install fails, port in use), continue with the levels that succeeded. Never block QA entirely because one service won't start — test at whatever fidelity IS achievable.
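The target-fidelity rule from the bootstrap procedure can be sketched as a small function. This is only an illustration: the priority field used to mark P0 scenarios is an assumed name (the scenario schema in this document does not show how priority is stored), and deriveTargetFidelity is a hypothetical helper, not part of any skill.

```javascript
// Categories that push the bootstrap target up the ladder, per the rule above.
const BROWSER_CATEGORIES = new Set(["visual", "ux-flow"]);
const API_CATEGORIES = new Set(["error-state", "integration", "cross-system"]);

// ASSUMPTION: each scenario carries a `priority` field such as "P0".
// That field name is hypothetical; adapt it to the real qa-progress.json shape.
function deriveTargetFidelity(scenarios) {
  const p0 = scenarios.filter((s) => s.priority === "P0");
  if (p0.some((s) => BROWSER_CATEGORIES.has(s.category))) return "browser";
  if (p0.some((s) => API_CATEGORIES.has(s.category))) return "api";
  return "shell"; // nothing in P0 needs more than shell fidelity
}
```

Only P0 scenarios influence the target here, so a lower-priority visual scenario does not by itself force a browser bootstrap.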
Record the achieved ceiling immediately — write bootstrapResult to qa-progress.json right after bootstrap completes, before execution begins. If /qa crashes mid-execution, the orchestrator still has teardown info.
{
  "bootstrapResult": {
    "targetFidelity": "browser",
    "achievedFidelity": "api",
    "bootstrappedServices": ["dev-server", "database", "docker-postgres"],
    "failedBootstraps": [{"service": "browser", "reason": "/browser skill not loadable"}],
    "teardownRequired": ["dev-server", "docker-postgres"]
  }
}
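One way to assemble a bootstrapResult like the one above is to walk the ladder bottom-up and stop at the first level that did not come up. The sketch below assumes a hypothetical input shape (a per-level success map plus a list of service outcomes); neither the function name nor the input fields come from the skill itself.

```javascript
const LADDER = ["shell", "api", "browser"]; // lowest to highest fidelity

// ASSUMPTION: `levelOk` maps each ladder level to true/false, and `services`
// is [{ service, ok, reason?, needsTeardown? }]. Both shapes are hypothetical.
function buildBootstrapResult(targetFidelity, levelOk, services) {
  let achievedFidelity = null; // null means not even shell fidelity was reached
  for (const level of LADDER) {
    if (!levelOk[level]) break; // each level builds on the previous one
    achievedFidelity = level;
    if (level === targetFidelity) break; // no need to climb past the target
  }
  return {
    targetFidelity,
    achievedFidelity,
    bootstrappedServices: services.filter((s) => s.ok).map((s) => s.service),
    failedBootstraps: services
      .filter((s) => !s.ok)
      .map((s) => ({ service: s.service, reason: s.reason })),
    teardownRequired: services
      .filter((s) => s.ok && s.needsTeardown)
      .map((s) => s.service),
  };
}
```

Writing the returned object to qa-progress.json right after bootstrap keeps the teardown list available even if execution later crashes.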
Safety constraints:
/browser skill and follow this recipe:
- Check for BROWSER_AUTH_USER / BROWSER_AUTH_PASS env vars. If present, call helpers.authenticate(page, { username: process.env.BROWSER_AUTH_USER, password: process.env.BROWSER_AUTH_PASS }).
- helpers.generateTOTP(process.env.BROWSER_AUTH_TOTP_SECRET) and fill the code field.
- BROWSER_AUTH_GITHUB_USER etc.), fall back to the default set. Match credential set to the domain you're authenticating against.
- /browser SKILL.md. Block only after exhausting automated approaches.
- helpers.handoff(page, { reason, successUrl }) to let the human resolve it. In headless mode: document the wall and move on.
- bootstrapResult.teardownRequired for cleanup.

When invoked from /ship (after /qa-plan has run):
- Check for tmp/ship/qa-progress.json. If it exists, skip Steps 3–4b entirely; the plan is already provided by /qa-plan. Proceed directly to gap resolution (Step 5b) or execution (Step 6).

When invoked standalone (no qa-progress.json):
This preserves /qa's standalone usability while allowing /qa-plan to own planning when run as part of /ship.
Determine what to test from whatever input is available. Check these sources in order; use the first that gives you enough to derive test scenarios:
| Input | How to use it |
|---|---|
| SPEC.md path provided | Read it. Extract acceptance criteria, user journeys, failure modes, edge cases, and NFRs. This is your primary source. |
| PR number provided | Run gh pr diff <number> and gh pr view <number>. Derive what changed and what user-facing behavior is affected. |
| Feature description provided | Use it as-is. Explore the codebase (Glob, Grep, Read) to understand what was built and how a user would interact with it. |
| "Test what changed" (or no input) | Run git diff main...HEAD --stat to see what files changed. Read the changed files. Infer the feature surface area and user-facing impact. |
Surface mapping (standalone mode only): When running standalone (no qa-progress.json from /qa-plan), load /worldmodel skill to map surfaces, personas, and silent impacts before deriving scenarios. When running from /ship, qa-plan already did this — its output is baked into the qa-progress.json scenarios.
Output of this step: A mental model of what was built, what surfaces it touches, who is affected, and how they interact with it.
From the context gathered in Step 3, identify concrete scenarios that verify what the user actually experiences. For each candidate scenario, apply the coverage reality check:
"Is this user outcome already proven by a real, non-mocked test?" Search for existing tests. If a test exists but mocks the service boundary (jest.mock, MSW, nock, stub providers, fake implementations), it does NOT count — the scenario stays. Only skip a scenario when a real integration/e2e test with actual dependencies already proves the full user outcome.
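A rough first pass at the coverage reality check is to scan a test file's source for the mocking tools named above. The sketch below is a heuristic under stated assumptions: the pattern list covers only jest.mock, MSW, and nock, the function names are hypothetical, and a clean result still needs a human-style read of the test to confirm it uses real dependencies.

```javascript
// Heuristic patterns for the mocking tools named in the check above.
// Not exhaustive: stub providers and hand-written fakes will not match.
const MOCK_PATTERNS = [
  /\bjest\.mock\s*\(/, // Jest module mocks
  /(?:from\s+|require\(\s*)['"]msw(?:\/[\w-]+)?['"]/, // MSW imports, including msw/node
  /(?:from\s+|require\(\s*)['"]nock['"]/, // nock imports
];

// Returns true when the test source appears to mock the service boundary.
function mocksServiceBoundary(testSource) {
  return MOCK_PATTERNS.some((pattern) => pattern.test(testSource));
}

// A test only counts toward skipping a scenario when no mock was detected.
function countsAsRealCoverage(testSource) {
  return !mocksServiceBoundary(testSource);
}
```

A match means the scenario stays in the QA plan; no match means the test is a candidate for real coverage, not proof of it.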
Scenarios that belong in the QA plan (be ambitious — include all of these):
| Category | What to verify | Example |
|---|---|---|
| Visual correctness | Layout, spacing, alignment, rendering, responsiveness | "Does the new settings page render correctly at mobile viewport?" |
| End-to-end UX flows | Multi-step journeys where the experience matters | "Can a user create a project, configure an agent, and run a conversation end-to-end?" |
| Subjective usability | Does the flow make sense? Labels clear? Error messages helpful? | "When auth fails, does the error message tell the user what to do next?" |
| Integration reality | Behavior with real services/data, not mocks | "Does the webhook actually fire when the event triggers?" |
| Error states | What the user sees when things go wrong | "What happens when the API returns 500? Does the UI show a useful error or a blank page?" |
| Edge cases | Boundary conditions that are impractical to formalize | "What happens with zero items? With 10,000 items? With special characters in the name?" |
| Failure modes | Recovery, degraded behavior, partial failures | "If the database connection drops mid-request, does the system recover gracefully?" |
| Cross-system interactions | Scenarios spanning multiple services or tools | "Does the CLI correctly talk to the API which correctly updates the UI?" |
Write each scenario as a discrete test case:
Create these as task list items to track execution progress.
When tmp/ship/ exists, write all planned scenarios to tmp/ship/qa-progress.json. This file is the structured source of truth for QA results — downstream consumers render it to the PR body.
Create the file with all scenarios in planned status:
{
  "specPath": "specs/feature-name/SPEC.md",
  "prNumber": 1234,
  "scenarios": [
    {
      "id": "QA-001",
      "category": "visual",
      "name": "settings page renders at mobile viewport",
      "userOutcome": "User on a mobile device sees the settings page with correct layout and readable text",
      "verifies": "layout, spacing, and alignment are correct at 375px width",
      "tracesTo": "US-002",
      "status": "planned",
      "verifiedVia": null,
      "notes": "",
      "evidence": []
    }
  ]
}
Field definitions:
| Field | Required | Description |
|---|---|---|
| specPath | Yes | Path to the SPEC.md this QA plan was derived from. null if no spec. |
| prNumber | Yes | PR number the results apply to. null if no PR exists yet. |
| scenarios[] | Yes | Array of test scenarios. |
| scenarios[].id | Yes | Sequential ID: QA-001, QA-002, etc. |
| scenarios[].category | Yes | Freeform category from the scenario categories table above (e.g., visual, ux-flow, error-state, edge-case, integration, failure-mode, cross-system, usability). |
| scenarios[].name | Yes | Short scenario name. |
| scenarios[].userOutcome | Yes | What the end user actually experiences when this works correctly. Written from the user's perspective. |
| scenarios[].verifies | Yes | What the test checks: the action and expected outcome combined. |
| scenarios[].tracesTo | No | User story ID from spec.json (e.g., US-003) when the mapping is clear. Omit when the relationship is fuzzy or many-to-many. |
| scenarios[].status | Yes | One of: planned, validated, failed, blocked. |
| scenarios[].notes | Yes | Empty string when planned. Populated on status change; see Status values table below. |
| scenarios[].verifiedVia | When executed | Fidelity level from Step 6: browser, api, or shell. Required for validated/failed scenarios. null for planned. If multiple levels were used, record the highest. |
| scenarios[].evidence | When executed | Polymorphic array of proof items. Every validated or failed scenario must have at least one entry. Each item has a type discriminator: {type: "video", url: "..."} for browser recordings, {type: "screenshot", url: "..."} for visual captures, {type: "assertion", check: "...", expected: "...", actual: "...", pass: true/false} for structured verification checks, {type: "command", cmd: "...", stdout: "...", expected: "...", pass: true/false} for shell command evidence. An empty evidence[] on a validated or failed scenario is a defect: it means the result is unauditable. |
Status values:
| Status | Meaning | What to put in notes |
|---|---|---|
| planned | Scenario identified, not yet executed | Empty string |
| validated | Passed. If a bug was found and fixed, describe the bug and fix. | "" for clean pass, or "found stale cache; added cache-bust on logout" for fix-and-pass |
| failed | Failed and could not be resolved | What failed and why it's unresolvable: "second tab still shows authenticated state after logout" |
| blocked | Could not fully verify after exhausting all local options AND after the /debug challenge subprocess confirmed the scenario is genuinely untestable (see "Challenge blocked scenarios" in Step 6). Includes: environment issues, missing tooling, AND scenarios requiring external services with no local substitute. Every blocked scenario is a pending human verification item that flows to /pr. | What was attempted, what the /debug challenge investigated, and what a human still needs to check: "Stripe webhook: verified handler responds correctly to simulated payload locally. Debug challenge confirmed no local Stripe emulator available. Human needs to verify real Stripe→app delivery in staging." |
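The field and status rules above lend themselves to a quick consistency check before results are handed downstream. The following is a minimal sketch, not an official validator: validateScenario is a hypothetical helper, and it covers only the rules stated in the two tables (status values, verifiedVia, evidence, and notes).

```javascript
const STATUSES = ["planned", "validated", "failed", "blocked"];
const FIDELITY_LEVELS = ["browser", "api", "shell"];

// Returns a list of human-readable problems; an empty list means the
// scenario is consistent with the rules in the tables above.
function validateScenario(s) {
  const problems = [];
  if (!STATUSES.includes(s.status)) problems.push(`unknown status: ${s.status}`);

  const executed = s.status === "validated" || s.status === "failed";
  if (executed && !FIDELITY_LEVELS.includes(s.verifiedVia)) {
    problems.push("verifiedVia must be browser, api, or shell once executed");
  }
  if (executed && (!Array.isArray(s.evidence) || s.evidence.length === 0)) {
    problems.push("evidence[] is empty, so the result is unauditable");
  }
  if (s.status === "planned" && s.verifiedVia != null) {
    problems.push("verifiedVia must be null while planned");
  }
  if ((s.status === "failed" || s.status === "blocked") && !s.notes) {
    problems.push("notes must describe what failed or what was attempted");
  }
  return problems;
}
```

Running this over every entry in scenarios[] before finishing a QA run would catch validated or failed results that carry no proof.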
When tmp/ship/ does not exist, skip this step — use only the PR body checklist (Step 5) or task list items.
When tmp/ship/ exists: Skip this step. You already wrote qa-progress.json in Step 4b — a downstream consumer will render it to the PR body.
When tmp/ship/ does not exist:
If a PR exists, write the QA checklist to the ## Verification section of the PR body. Always update via gh pr edit --body — never post QA results as PR comments.
- Read the current body: gh pr view <number> --json body -q '.body'
- If a ## Verification (or legacy ## Manual QA) section already exists, replace its content with the updated checklist.
- Write the result back: gh pr edit <number> --body "<updated body>"

Section format:
## Verification
_End-to-end verification — proving user outcomes are real._
- [ ] **<category>: <scenario name>** — <user outcome to verify>
If no PR exists, maintain the checklist as task list items only.
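The modify step between reading and writing the PR body can be sketched as a pure string function. This is an illustration under assumptions: upsertVerificationSection is a hypothetical helper, it treats any line starting with "## " as the next section boundary, and it appends the section when none exists (the text above only describes the replace case).

```javascript
// Replaces an existing "## Verification" (or legacy "## Manual QA") section
// with new content, or appends the section if the body has none.
// ASSUMPTION: sections are delimited by level-2 markdown headings.
function upsertVerificationSection(body, sectionContent) {
  const lines = body.split("\n");
  const start = lines.findIndex((l) => /^## (Verification|Manual QA)\s*$/.test(l));
  const replacement = ["## Verification", ...sectionContent.split("\n"), ""];

  if (start === -1) {
    const base = body.replace(/\s+$/, "");
    return (base ? base + "\n\n" : "") + replacement.join("\n");
  }

  // The section ends at the next level-2 heading, or at the end of the body.
  let end = lines.findIndex((l, i) => i > start && /^## /.test(l));
  if (end === -1) end = lines.length;
  lines.splice(start, end - start, ...replacement);
  return lines.join("\n");
}
```

The returned string is what would be passed to gh pr edit <number> --body, so the rest of the PR description is preserved untouched.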
When --report-only is active, skip this step entirely. Report-only mode never modifies source files.
When qa-progress.json contains scenarios with enrichment.gapType === "fixable_gap" (these will have status: "planned"), resolve them before execution:
- If the gap is fixed, leave the scenario "planned" so it gets tested during execution.
- If the gap cannot be fixed, set the status to "blocked" and add notes explaining what was attempted.
- Execute the remaining "planned" scenarios normally in Step 6.

Gap fixes count toward the cumulative risk score and fix cap (same self-regulation as bug fixes in Step 7).
When qa-progress.json has no scenarios with enrichment.gapType === "fixable_gap", skip this step.
Work through each scenario. Use the strongest tool available for each.
Testing priority: emulate real users first. Prefer tools that replicate how a user actually interacts with the system. Browser automation over API calls. SDK/client library calls over raw HTTP. Real user journeys over isolated endpoint checks. Fall back to lower-fidelity tools (curl, direct database queries) for parts of the system that are not user-facing or when higher-fidelity tools are unavailable. For parts of the system touched by the changes but not visible to the customer — use server-side observability (logs, telemetry, database state) to verify correctness beneath the surface.
/browser should already be loaded from Step 2. If Step 2 documented a browser gap (load failed), do not retry here — execute UI scenarios at api fidelity as documented in the bootstrap result. Do not substitute with API calls silently; the verifiedVia field must reflect the actual fidelity used.
Verification fidelity levels (use these values in verifiedVia when recording results):
| Level | Method | Typical use |
|---|---|---|
| browser | Full user flow through real UI (Playwright) | UI scenarios, visual correctness, end-to-end UX |
| api | Direct API/endpoint calls, skipping UI layer | Backend behavior, response shapes, auth flows |
| shell | CLI, database queries, file/log inspection | State verification, data integrity, process behavior |
Default to the highest feasible level for each scenario. A scenario about visual layout validated via api is materially different from one validated via browser — the report consumer needs to know.
Unblock yourself with ad-hoc scripts. Do not wait for formal test infrastructure, published packages, or CI pipelines. If you need to verify something, write a quick script and run it. Put all throwaway artifacts — scripts, fixtures, test data, temporary configs — in a tmp/ directory at the repo root (typically gitignored). These are disposable; they don't need to be production-quality. Specific patterns:
- tmp/).
- file:../path, workspace links, or link: instead of waiting for packages to be published. Test the code as it exists on disk.
- tmp/ to test against. Tear them down when done.

With browser automation:
Video recording (default for all browser scenarios): For every scenario that uses browser automation, create a video context before starting the scenario using /browser's helpers.createVideoContext(browser, { outputDir: '/tmp/playwright-videos' }). This records everything automatically — no pre-planning needed. After the scenario completes (pass or fail):
- Get the video path and close the page: const videoPath = await page.video().path(); await page.close();
- Load the /media-upload skill, then call uploadToBunnyStream(videoPath, { name: '<scenario-id>-<scenario-name>' }). Setup: ./secrets/setup.sh --skill media-upload.
- Record the uploaded video in the scenario's evidence[] field in qa-progress.json.

Video evidence is valuable for both passing and failing scenarios: it shows reviewers exactly what QA tested and helps debug failures.
With browser inspection (use alongside browser automation — not instead of):
- Start console capture before the flow (startConsoleCapture), then check for errors after each major action (getConsoleErrors). A page that looks correct but throws JS errors is not correct. Filter logs for specific patterns (getConsoleLogs with string/RegExp/function filter) when diagnosing issues.
- Start network capture before the flow (startNetworkCapture with URL filter like '/api/'). After the flow, check for failed requests (getFailedRequests catches 4xx, 5xx, and connection failures). Verify: correct endpoints called, status codes expected, no silent failures. For specific API calls, use waitForApiResponse to assert status and inspect response body/JSON.
- Use getLocalStorage, getSessionStorage, getCookies to confirm the UI action actually wrote expected data. Use clearAllStorage between test scenarios for clean-state testing.
- Use getElementBounds for layout verification (visibility, viewport presence, computed styles). Use this when visual inspection alone can't confirm correctness (e.g., "is this element actually hidden via CSS, or just scrolled off-screen?").

With browser-based quality signals (when /browser primitives are available):
- Run runAccessibilityAudit on each major page/view. Report WCAG violations by impact level (critical > serious > moderate). Test keyboard focus order with checkFocusOrder: verify tab navigation follows logical reading order, especially on new or changed UI.
- Use capturePerformanceMetrics to check for obvious regressions: TTFB, FCP, LCP, CLS. You're not doing formal perf testing; you're catching "this page takes 8 seconds to load" or "layout shifts when the hero image loads."
- Record flows with createVideoContext. Attach recordings to QA results as evidence. Especially useful for flows that involve timing, animations, or state transitions that are hard to capture in a screenshot.
- Use captureResponsiveScreenshots to sweep standard breakpoints (mobile/tablet/desktop/wide). Compare screenshots for layout breakage, clipping, or missing elements across viewports.
- Use simulateSlowNetwork (e.g., 500ms latency) and blockResources (block images/fonts) to verify graceful degradation. Test simulateOffline if the feature has offline handling. These helpers compose with page.route() mocks via route.fallback().
- Call handleDialogs before navigating to auto-accept/dismiss alerts, confirms, and prompts, then inspect captured.dialogs to verify the right dialogs fired. Use dismissOverlays to auto-dismiss cookie banners and consent popups that block interaction during test flows.
- Use getPageStructure to get the accessibility tree with suggested selectors. Useful for verifying ARIA roles, element discoverability, and building selectors for unfamiliar pages. Pass { interactiveOnly: true } to focus on actionable elements.
- Use startTracing/stopTracing to capture a full Playwright trace (.zip) of a failing flow; it includes DOM snapshots, screenshots, network, and console activity. View with npx playwright show-trace.
- Use generatePdf to verify PDF export features. Use waitForDownload to test file download flows: it triggers a download action and saves the file for inspection.

With macOS desktop automation:
With shell / CLI (always available):
- curl API endpoints. Verify status codes, response shapes, error responses.

State change verification (after mutations, navigations, and UI state transitions):
Server-side observability (when available): Changes touch more of the system than what's visible to the user. After exercising user-facing flows, check server-side signals for problems that wouldn't surface in the browser or API response.
General testing approach:
Assertion depth — proving state changes, not just observing them:
Do not just confirm the page loaded or the action completed. For each verification that involves a state change (mutation, navigation, form submission, modal open/close), apply these disciplines:
// Good — structured, auditable, machine-parseable
return { url: page.url(), title: await page.title(), itemCount: await page.locator('.item').count(), visible: await page.locator('.success-toast').isVisible() };
// Weak for assertions — requires vision processing, not auditable
// (Screenshots are still valuable as PR evidence in Step 6b — this is about pass/fail verification)
await page.screenshot({ path: '/tmp/check.png' });
Structured evidence is faster (no vision processing), cheaper (no image tokens), and produces auditable results in qa-progress.json notes.

These disciplines apply to state-change verifications, not to trivial checks like "page loaded" or "element exists."
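A structured observation like the object returned in the snippet above can be turned directly into assertion-type evidence items. The sketch below is illustrative only: assertionEvidence and evidenceFromObservation are hypothetical helpers, and strict equality is an assumed comparison rule that suits primitive values such as counts, booleans, and URLs.

```javascript
// Builds one {type: "assertion"} evidence item in the shape used by evidence[].
function assertionEvidence(check, expected, actual) {
  return { type: "assertion", check, expected, actual, pass: expected === actual };
}

// Compares an observed state object against expected values, key by key.
// ASSUMPTION: values are primitives, so strict equality is sufficient.
function evidenceFromObservation(observed, expectations) {
  return Object.entries(expectations).map(([key, expected]) =>
    assertionEvidence(key, expected, observed[key])
  );
}
```

For example, evidenceFromObservation(observed, { itemCount: 3, visible: true }) yields two items that can be appended to a scenario's evidence[] as-is; the scenario passes only if every item has pass set to true.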
When a browser script fails, classify the failure before acting:
| Failure type | Signals | Action |
|---|---|---|
| Selector drift | TimeoutError waiting for element, element not found, wrong element clicked | Re-explore with getPageStructure(), fix selectors, retry (max 2 retries) |
| Timing issue | Race condition, element not yet visible, network not settled | Add waitForSelector / waitForLoadState('domcontentloaded'), retry |
| App bug | Element exists but shows wrong content, wrong status code, console errors, unexpected redirect | Do NOT retry — report the failure with evidence |
| Environment issue | Connection refused, DNS failure, auth expired | Report as blocked, not failed |
Healer loop (max 2 iterations):
a. Run getPageStructure(page, { interactiveOnly: true }) on the current page state
b. Compare observed elements to what the script expected
c. Rewrite the failing portion of the script with corrected selectors
d. Re-run the script
e. If it fails again with the same class of error → report as failed with note: "Healer: retried 2x, selector/timing issue persists; may be an app bug"

For an environment issue, report the scenario as blocked with the specific error.

Key principle: Retrying an app bug wastes time and masks the real problem. Only retry when the test script is wrong, not the app.
Evidence-justified retries: Each retry must be justified by new evidence — a fresh page structure showing different elements, a corrected selector, a changed page state. Never re-run the same failing action unchanged hoping for a different result.
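The classification table and the retry cap can be expressed as a lookup plus one decision function. This is a sketch under assumptions: the signal names are hypothetical labels for the signals in the table, decideNextStep is not an existing helper, and the "investigate" action for unrecognized signals is an illustrative choice that the text above does not specify.

```javascript
// Maps observed signals (hypothetical labels) to the failure types in the table.
const FAILURE_TYPES = {
  "timeout-waiting-for-element": "selector-drift",
  "element-not-found": "selector-drift",
  "wrong-element-clicked": "selector-drift",
  "race-condition": "timing",
  "element-not-yet-visible": "timing",
  "network-not-settled": "timing",
  "wrong-content": "app-bug",
  "wrong-status-code": "app-bug",
  "console-errors": "app-bug",
  "unexpected-redirect": "app-bug",
  "connection-refused": "environment",
  "dns-failure": "environment",
  "auth-expired": "environment",
};
const MAX_RETRIES = 2; // healer loop cap from the text above

function decideNextStep(signal, retriesSoFar) {
  const type = FAILURE_TYPES[signal];
  if (!type) return { type: "unclassified", action: "investigate" }; // assumed fallback
  if (type === "app-bug") return { type, action: "report-failed" }; // never retry app bugs
  if (type === "environment") return { type, action: "report-blocked" };
  // Selector drift and timing issues are script problems, so they may be retried.
  if (retriesSoFar < MAX_RETRIES) return { type, action: "retry" };
  return {
    type,
    action: "report-failed",
    note: "Healer: retried 2x, selector/timing issue persists; may be an app bug",
  };
}
```

Each retry the function allows still needs fresh evidence behind it, such as a new page structure or a corrected selector.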
After testing, capture a screenshot of every UI screen affected by the code changes. Create the directory if needed (mkdir -p tmp/ship/screenshots), then save to tmp/ship/screenshots/<descriptive-slug>.png using Playwright's page.screenshot({ path: ..., fullPage: true }).
If you fix a bug that changes a previously screenshotted screen, retake the screenshot — overwrite the same file. Screenshots must reflect the final state of the code, not intermediate states.
These screenshots are evidence of the tested state. /pr includes them in the PR body when the developer creates the PR.
Before finalizing any scenario as blocked, challenge the assumption with a fresh perspective. Spawn a nested Claude Code instance (via the /nest-claude subprocess pattern) that loads /debug with --headless to independently investigate whether the scenario is actually untestable.
When to challenge: Every scenario that would be marked blocked — no exceptions. The cost (~2-5 minutes per scenario) is proportional to the number of blocked scenarios, which should be small. A falsely-blocked scenario that a human later has to verify manually costs far more.
Subprocess instructions:
- Load /debug with --headless.
- .env files), and any other capabilities relevant to this specific project and ecosystem
- If the investigation finds a way to test the scenario, mark it validated with the evidence from the debug investigation. Credit the investigation: "resolvedBy": "debug-challenge" in notes.
- If the investigation confirms the scenario is untestable, keep it blocked with the debug analysis as evidence in notes. The investigation trail proves all avenues were exhausted; downstream consumers can see why it's blocked, not just that it's blocked.

Why a subprocess, not inline investigation: The /qa agent has already formed assumptions about why the scenario is blocked. A fresh context (clean child, no conversation history) forces the investigation to start from scratch. /debug's systematic methodology (Triage → Reproduce → Investigate → Classify → Report) ensures thorough investigation rather than confirming the prior assumption.
When tmp/ship/ exists: After each scenario (or batch), update the scenario's status, verifiedVia, notes, and evidence in qa-progress.json. Set verifiedVia to the fidelity level from Step 6 (browser, api, or shell) that reflects how the scenario was actually executed. If multiple levels were used (e.g., browser flow + database state check), record the highest. Do not touch the PR body — a downstream consumer will render it.
Evidence recording (mandatory for every validated/failed scenario): Populate evidence[] with at least one structured proof item that demonstrates what was checked and what was observed. Match evidence type to verification method:
- {type: "video", url: "..."} from Bunny Stream upload, and/or {type: "screenshot", url: "..."} from CDN or local path
- {type: "assertion", check: "file_exists", expected: "plugins/shared/skills/audit/SKILL.md", actual: "exists", pass: true} or {type: "command", cmd: "readlink plugins/eng/skills/audit", stdout: "../../shared/skills/audit", expected: "../../shared/skills/audit", pass: true}

Evidence makes results auditable: a downstream agent or human can verify the claim without re-executing. Structured assertions are cheap to produce (you already ran the check) and machine-parseable. An empty evidence[] on a validated or failed scenario is a defect.
When tmp/ship/ does not exist: Update the ## Verification section in the PR body directly using the same read → modify → write mechanism from Step 5. Include the fidelity level in the checklist item (e.g., [browser], [api]).
When you find a bug:
When --report-only is active: Record findings in qa-progress.json with status: "failed" and detailed notes describing the bug (symptoms, suspected root cause, affected area, reproduction steps), but do not edit source files, do not load /debug, and do not enter the fix loop. If the bug was discovered outside any planned scenario, add a new scenario to scenarios[] with the next sequential ID and mark it failed with descriptive notes.
When --report-only is NOT active:
First, assess: do you see the root cause, or just the symptom?
/debug skill for systematic root cause investigation before attempting a fix. If QA is running in headless mode, pass --headless to /debug so it iterates freely without per-action permission gates. /debug returns structured findings (root cause, recommended fix, blast radius) — apply the fix based on its findings, then resume QA.
After fixing a bug, record it: update the scenario's status to validated and put the bug description + fix in notes (e.g., "found stale cache; added cache-bust on logout"). If the bug was discovered outside any planned scenario — while navigating between tests or doing exploratory poking — add a new scenario to scenarios[] with the next sequential ID, describe what you found and fixed, and mark it validated with the fix in notes.
Fix-loop self-regulation (cumulative risk score):
When --report-only is active, skip this entire section. No fixes means no risk tracking.
Track a cumulative risk score across all fixes in the QA session (bug fixes AND fixable gap fixes from Step 5b). Persist the risk state in qa-progress.json — read before each fix, write after:
{
  "fixLoopState": {
    "riskScore": 15,
    "fixCount": 12,
    "reverts": 0
  }
}
Risk increments:
Start at 0%
Each revert: +15% (strongest signal — you undid your own work)
Each fix touching >3 files: +5% (blast radius growing)
Touching unrelated files: +10% (scope creep)
After fix 15: +1% per additional fix (fatigue ramp)
Threshold: STOP fixing at ≥30%
Hard cap: 50 fixes per QA session
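The increments above can be read as a simple accumulator. The sketch below is one interpretation, not the skill's own code: the `Fix` record and its field names are hypothetical, and it assumes the "unrelated files" penalty applies once per offending fix.

```python
from dataclasses import dataclass

STOP_THRESHOLD = 30  # stop fixing at >= 30%
HARD_CAP = 50        # maximum fixes per QA session


@dataclass
class Fix:
    """Hypothetical record of one fix made during the session."""
    files_touched: int = 1
    touched_unrelated: bool = False
    reverted: bool = False


def risk_score(fixes):
    """Cumulative risk percentage across all fixes, in session order."""
    score = 0
    for index, fix in enumerate(fixes, start=1):
        if fix.reverted:
            score += 15  # revert: strongest signal
        if fix.files_touched > 3:
            score += 5   # blast radius growing
        if fix.touched_unrelated:
            score += 10  # scope creep (assumed to count once per fix)
        if index > 15:
            score += 1   # fatigue ramp after fix 15
    return score


def should_stop(fixes):
    """True once the risk threshold or the hard cap is reached."""
    return risk_score(fixes) >= STOP_THRESHOLD or len(fixes) >= HARD_CAP


session = [Fix(files_touched=5), Fix(touched_unrelated=True), Fix(reverted=True)]
print(risk_score(session), should_stop(session))
```

In this example a wide fix (+5), a scope-creeping fix (+10), and a revert (+15) together reach the 30% threshold after only three fixes.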
fixLoopState from qa-progress.json. After each fix: update riskScore and fixCount, write back.
"failed", notes explaining the risk score stopped further fixes).
Test suite gap discovery:
When --report-only is active, skip this section. No tests are written in report-only mode — document the gap in the scenario's notes instead (e.g., "missing unit test for session invalidation — recommend adding coverage").
During execution, you may discover behaviors that should have formal test coverage but don't — an edge case with no unit test, a behavior path with no integration test, an untested but easily testable integration point. When this happens: write the test (following the repo's testing patterns and /tdd conventions), then record it in the scenario's notes alongside any bug fix notes (e.g., "also wrote unit test for session invalidation — no existing coverage"). This is the same posture as bug fixing — QA finds it, QA fixes it, QA records it in the scenario where it was discovered.
When tmp/ship/ exists: As a final action before reporting:
Write qaCompletedAtCommit to qa-progress.json with the current HEAD commit hash (git rev-parse HEAD). This marks the boundary between QA and post-QA changes for staleness detection.
Compute and write executionSummary to planMetadata:
"executionSummary": {
  "validated": 12,
  "failed": 1,
  "blocked": 2,
  "planned": 0
}
Count scenarios by status. This saves every downstream consumer from scanning the full scenario array.
Compute and write verdict at the top level of qa-progress.json using priority-weighted logic:
failed or blocked → "no-go" (critical path broken or unverifiable)
validated, any P1 failed or blocked → "conditional" (core works, important features degraded — proceed with documented known risks)
validated, any P2 failed or blocked → "go" (minor issues documented as known items)
validated → "go"
Write the verdict alongside the inputs that produced it so consumers can both use the pre-computed verdict and verify the computation.
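The three final actions can be sketched as one pure function over the parsed qa-progress.json. Several things here are assumptions rather than confirmed details: the per-scenario `priority` key, the `P0` label for the top tier (the text names only P1 and P2), the `id` values, and the helper names. The commit hash is passed in so the sketch stays testable; in practice it would come from git rev-parse HEAD.

```python
import copy
from collections import Counter

STATUSES = ("validated", "failed", "blocked", "planned")
UNRESOLVED = {"failed", "blocked"}


def execution_summary(scenarios):
    """Count scenarios by status, always emitting all four keys."""
    counts = Counter(scenario["status"] for scenario in scenarios)
    return {status: counts.get(status, 0) for status in STATUSES}


def compute_verdict(scenarios):
    """Priority-weighted verdict; 'P0' as the top tier is an assumption."""
    def unresolved(priority):
        return any(
            s["priority"] == priority and s["status"] in UNRESOLVED
            for s in scenarios
        )

    if unresolved("P0"):
        return "no-go"
    if unresolved("P1"):
        return "conditional"
    # P2 failures are documented as known items and do not change the verdict.
    return "go"


def finalize(progress, head_commit):
    """Return a copy of the progress data with the final fields filled in."""
    result = copy.deepcopy(progress)
    scenarios = result.get("scenarios", [])
    result["qaCompletedAtCommit"] = head_commit
    result.setdefault("planMetadata", {})["executionSummary"] = (
        execution_summary(scenarios)
    )
    result["verdict"] = compute_verdict(scenarios)
    return result


progress = {
    "scenarios": [
        {"id": "S1", "priority": "P0", "status": "validated"},
        {"id": "S2", "priority": "P1", "status": "blocked"},
        {"id": "S3", "priority": "P2", "status": "failed"},
    ]
}
report = finalize(progress, "0123abc")  # placeholder hash for illustration
print(report["verdict"], report["planMetadata"]["executionSummary"])
```

Keeping the computation pure makes it easy for a downstream consumer to recompute the verdict from the same inputs and compare.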
The JSON file is your report. A downstream consumer will render it to the PR body. Report completion to the invoker.
When tmp/ship/ does not exist and a PR exists: The ## Verification section in the PR body is your report. Ensure it's up-to-date with all results. Do not add a separate PR comment.
No PR exists: Report directly to the user with:
The skill's job is to fix what it can, document what it found, and hand back a clear picture. Unresolvable issues and gaps are documented, not silently swallowed — but they do not block forward progress. The invoker decides what to do about remaining items.
Teardown (mandatory): After reporting, tear down everything bootstrapped in Step 2. Kill dev server, stop Docker containers (docker compose down -v), clean fixture data, remove temporary files in tmp/. Tear down in reverse order of bootstrap. Consult bootstrapResult.teardownRequired for the full list. Leave the environment as it was found.
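Reverse-order teardown maps naturally onto a stack of cleanup callbacks. The sketch below uses Python's contextlib.ExitStack, which runs registered callbacks last-in, first-out, even if the body raises. The step names and the `run_session` helper are hypothetical, and the real shape of bootstrapResult.teardownRequired is not specified in the text.

```python
from contextlib import ExitStack

log = []


def step(name):
    """Hypothetical bootstrap step: returns (setup, teardown) callables."""
    return (lambda: log.append("up " + name), lambda: log.append("down " + name))


def run_session(bootstrap_steps, body):
    """Run setups in order, then body, then teardowns in reverse order."""
    with ExitStack() as stack:
        for setup, teardown in bootstrap_steps:
            setup()
            # Register cleanup only after setup succeeds.
            stack.callback(teardown)
        body()


run_session(
    [step("docker"), step("db-seed"), step("dev-server")],
    lambda: log.append("qa"),
)
print(log)
```

Registering each teardown immediately after its setup means a failure partway through bootstrap still cleans up whatever was already started.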
The one hard boundary: no mutations to cloud/external systems. Everything else is fair game locally. Exhaust all local options before marking anything as unverifiable.
In bounds — exhaust these (the full local arsenal):
/browser) — navigate, click, fill forms, inspect console/network, record video, audit accessibility, test responsive layouts
docker compose up -d. Tear them down when done.
curl localhost:...)
tmp/ — write, run, delete
Out of bounds (the hard boundary):
When a scenario requires an external service:
status: "blocked" with notes describing what you verified locally and what a human needs to verify in staging/production
blocked scenarios flow to /pr as pending human verification items — they are NOT silently dropped
blocked is the safety net, not the first resort. A scenario should only be blocked after you've exhausted Docker, local emulators, simulated payloads, and intercepted requests. If you can verify 80% of the scenario locally and only the final external handoff needs human eyes, describe the 80% you verified and the 20% the human needs to check. Every blocked scenario must include what was attempted.
The depth comes from the qa-progress.json plan — execute every scenario in it. The plan was written to be maximally ambitious; your job is to verify every scenario at the highest achievable fidelity.
Testing breadth scales with affected surfaces. When standalone (no qa-progress.json), calibrate how many scenarios to test based on the scope of changes: a single-file bug fix warrants 1-2 targeted scenarios plus a regression check; a multi-file feature touching several surfaces warrants scenarios for each affected surface plus edge cases and error paths. Don't apply a fixed number — let the number of affected surfaces drive the count.
Under-testing looks like (these are the failures to avoid):
blocked without first trying Docker, local emulators, simulated payloads, and ad-hoc scripts
mcp__peekaboo__* and mcp__claude-in-chrome__* tools are NOT for QA web page testing. Use /browser (Playwright). Peekaboo is for OS-level macOS automation only. Chrome extension is for ad-hoc browser tasks outside QA. If /browser is unavailable, fall back to shell-based testing — not to these tools.