browserbase-cli

Runs adversarial UI tests in a real browser via the browse CLI. Analyzes git diffs for changes or explores the full app. Verifies functional correctness, accessibility, responsive layout, and UX heuristics.

Install the plugin:

npx claudepluginhub browserbase/skills --plugin browserbase-cli
Test UI changes in a real browser. Your job is to **try to break things**, not confirm they work.
Three workflows:
The main agent coordinates — it plans test strategy, delegates to sub-agents, and merges results. Sub-agents do the actual browser testing.
You MUST complete all three planning rounds yourself and output them before launching any sub-agents. Planning happens in your own response — it is NOT delegated to sub-agents. Do not skip ahead to execution.
Round 1 — Functional: What are the core user flows? What should work? Write out each test as: action → expected result.
Round 2 — Adversarial: Re-read Round 1. What did you miss? Think about: different user types/roles, error paths, empty states, race conditions, edge inputs (empty, huge, special chars, rapid clicks).
Round 3 — Coverage gaps: Re-read Rounds 1–2. What about: accessibility (axe-core, keyboard-only), mobile viewports, console errors, visual consistency with the rest of the app?
Deduplicate: Merge all three rounds into one numbered list of tests. Remove overlaps. Assign each test to a group (e.g. Group A, Group B).
Then execute once — launch one sub-agent per group. Each sub-agent receives its specific list of tests to run, nothing more. Sub-agents do not explore or plan — they execute assigned tests and report results.
Output the three rounds, the merged plan, and the group assignments in your response before calling any Agent tool.
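A minimal sketch of what that output can look like (the test numbering, group names, and budgets are illustrative, not prescriptive):

Merged plan
===========
1. [happy] Valid email submits successfully          → Group A
2. [adversarial] Empty form submission               → Group A
3. [a11y] axe-core audit on /signup, keyboard-only   → Group B
4. [responsive] 375px viewport, no horizontal overflow → Group B

Groups: A → sub-agent 1 (signup form, ~40 steps), B → sub-agent 2 (a11y + responsive, ~25 steps)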
The main agent MUST include an explicit browse step limit in every sub-agent prompt. Sub-agents do not self-limit — they will run until done unless told otherwise.
As a rough heuristic: ~25 steps for a few targeted checks, ~40 for a full page with functional + adversarial + a11y, ~75 for multiple pages or a broad category. Adjust based on what the assigned tests actually require — these are starting points, not rules.
Every sub-agent prompt must include:
You have a budget of N browse steps (each `browse` command = 1 step). Count your steps as you go. When you reach N, stop immediately and report:
- STEP_PASS/STEP_FAIL for every test you completed
- STEP_SKIP|<test-id>|budget reached for every test you didn't get to
Do not retry or continue after hitting the budget.
Run only these tests: [numbered list from the merged plan]
Do not explore beyond the assigned tests.
Do NOT generate an HTML report or write any files. Return only step markers and your findings as text.
The main agent should NOT run browse commands itself (except to verify the dev server is up). All testing happens in sub-agents.
When a sub-agent hits its budget, the main agent accepts the partial results as-is. Do not re-run or retry the sub-agent. Include SKIPPED tests in the final report so the developer knows what wasn't covered.
Every sub-agent reports back with:
Tests: 8 | Passed: 5 | Failed: 2 | Skipped: 1 | Pages visited: 2
The main agent merges into a final report with:
Tests: 20 | Passed: 14 | Failed: 4 | Skipped: 2 | Agents: 3 | Pass rate: 70%
Do not report "steps used" — browse command counts are implementation plumbing, not a meaningful metric for reviewers.
You are an adversarial tester. Your goal is to find bugs, not prove correctness.
Every test step MUST produce a structured assertion. Do not write freeform "this looks good."
For each test step, emit exactly one marker:
STEP_PASS|<step-id>|<evidence>
or
STEP_FAIL|<step-id>|<expected> → <actual>|<screenshot-path>
- step-id: short identifier like homepage-cta, form-validation-error, modal-cancel
- evidence: what you observed that proves the step passed (element ref, text content, URL, eval result)
- expected → actual: what you expected vs what you got
- screenshot-path: path to the saved screenshot (failures only — see Screenshot Capture below)

Every STEP_FAIL MUST have an accompanying screenshot so the developer can see what went wrong visually.
When a test step fails:
# 1. Take a screenshot immediately after observing the failure
browse screenshot --path .context/ui-test-screenshots/<step-id>.png
# If --path is not supported, take the screenshot and save manually:
browse screenshot
# The browse CLI will output the screenshot path — move/copy it:
cp /tmp/browse-screenshot-*.png .context/ui-test-screenshots/<step-id>.png
Setup the screenshot directory at the start of any test run:
mkdir -p .context/ui-test-screenshots
Rules:
- Name each screenshot after its step-id (e.g., double-submit.png, axe-audit.png, modal-focus-trap.png)
- Save all screenshots to .context/ui-test-screenshots/ — this directory is gitignored and accessible to the developer and other agents
- In parallel runs, prefix the filename with the session name: <session>-<step-id>.png (e.g., signup-double-submit.png)
- Baseline screenshots for visual comparison: <step-id>-baseline.png

Assertions:
- browse eval returns structured data you can inspect. Examples: axe-core violation count, document.title, form field value, console error array, element count.
- browse snapshot returns an accessibility tree with element refs like @0-12 button "Save". An element either exists in the tree or it doesn't.

This is the core verification loop. Use it for every interaction:
# 1. BEFORE: capture state
browse snapshot
# Record: what elements exist, their text, their refs
# 2. ACT: perform the interaction
browse click @0-12
# 3. AFTER: capture new state
browse snapshot
# Compare: what changed? What appeared? What disappeared?
# 4. ASSERT: emit marker based on comparison
# If dialog appeared: STEP_PASS|modal-open|dialog "Confirm" appeared at @0-20
# If nothing changed:
browse screenshot --path .context/ui-test-screenshots/modal-open.png
# STEP_FAIL|modal-open|expected dialog to appear → snapshot unchanged|.context/ui-test-screenshots/modal-open.png
Ensure the browse CLI is installed:

which browse || npm install -g @browserbasehq/browse-cli
This skill runs many browse commands (snapshots, clicks, evals). To avoid approving each one, add browse to your allowed commands:
Add both patterns to .claude/settings.json (project-level) or ~/.claude/settings.json (user-level):
{
"permissions": {
"allow": [
"Bash(browse:*)",
"Bash(BROWSE_SESSION=*)"
]
}
}
The first pattern covers plain browse commands. The second covers parallel sessions (BROWSE_SESSION=signup browse open ...). Both are needed to avoid approval prompts.
| Target | Mode | Command | Auth |
|---|---|---|---|
| localhost / 127.0.0.1 | Local | browse env local | None needed (clean isolated local browser by default) |
| Deployed/staging site | Remote | browse env remote | cookie-sync → --context-id |
Rule: If the target URL contains localhost or 127.0.0.1, always use browse env local.
browse env local
browse open http://localhost:3000
browse env local uses a clean isolated local browser by default, which is best for reproducible localhost QA runs.
Use local-mode variants only when needed:
- browse env local --auto-connect — auto-discover existing local Chrome, fall back to isolated. Use this only when the test explicitly needs existing local login/cookies/state.
- browse env local <port|url> — attach to a specific CDP target (explicit local browser attach).

For a deployed or staging site, switch to remote mode and sync auth cookies first:

# Step 1: Sync cookies from local Chrome to Browserbase
node .claude/skills/cookie-sync/scripts/cookie-sync.mjs --domains your-app.com
# Output: Context ID: ctx_abc123
# Step 2: Switch to remote mode
browse env remote
browse open https://staging.your-app.com --context-id ctx_abc123 --persist
browse snapshot
# ... run tests ...
browse stop
Cookie-sync flags: --domains, --context, --stealth, --proxy "City,ST,US"
git diff --name-only HEAD~1 # or: git diff --name-only / git diff --name-only main...HEAD
git diff HEAD~1 -- <file> # read actual changes
Categorize changed files:
| File pattern | UI impact | What to test |
|---|---|---|
| *.tsx, *.jsx, *.vue, *.svelte | Component | Render, interaction, state, edge cases |
| pages/**, app/**, src/routes/** | Route/page | Navigation, page load, content, 404 handling |
| *.css, *.scss, *.module.css | Style | Visual appearance (screenshot), responsive |
| *form*, *input*, *field* | Form | Validation, submission, empty input, long input, special chars |
| *modal*, *dialog*, *dropdown* | Interactive | Open/close, escape, focus trap, cancel vs confirm |
| *nav*, *menu*, *header* | Navigation | Links, active states, routing, keyboard nav |
| Non-UI files only | None | Skip — report "no UI tests needed" |
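As a rough sketch, this triage can start with a quick grep over the diff. The patterns below mirror the table above and are assumptions about the repo layout — adjust them for the project at hand:

# List changed files and keep only those matching likely UI patterns
git diff --name-only HEAD~1 \
  | grep -iE '\.(tsx|jsx|vue|svelte|css|scss)$|^(pages|app|src/routes)/|form|input|modal|dialog|nav|menu' \
  || echo "No UI files changed — no UI tests needed"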
Detect framework: cat package.json | grep -E '"(next|react|vue|nuxt|svelte|@sveltejs|angular|vite)"'
| Framework | Default port | File → URL pattern |
|---|---|---|
| Next.js App Router | 3000 | app/dashboard/page.tsx → /dashboard |
| Next.js Pages Router | 3000 | pages/about.tsx → /about |
| Vite | 5173 | Check router config |
| Nuxt | 3000 | pages/index.vue → / |
| SvelteKit | 5173 | src/routes/+page.svelte → / |
| Angular | 4200 | Check routing module |
Before testing, verify the dev server is serving the code from the diff — not a stale branch.
If testing a PR or specific branch:
# Check what branch is currently checked out
git branch --show-current
# If it's not the PR branch, switch to it
git fetch origin <branch> && git checkout <branch>
# Install deps — the lockfile may differ between branches
yarn install # or npm install / pnpm install
If the dev server was already running on a different branch, restart it after checkout.
Find a running dev server:
for port in 3000 3001 5173 4200 8080 8000 5000; do
s=$(curl -s -o /dev/null -w "%{http_code}" "http://localhost:$port" 2>/dev/null)
if [ "$s" != "000" ]; then echo "Dev server on port $port (HTTP $s)"; fi
done
If nothing found: tell the user to start their dev server.
Verify it actually renders:
After browse open + browse snapshot, check that the accessibility tree contains real page content (navigation, headings, interactive elements) — not just an error overlay or empty body. Next.js dev servers can return HTTP 200 while showing a full-screen build error dialog. If the snapshot is empty or dominated by an error dialog, the server is broken — fix the build before testing.
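A minimal sketch of that sanity check, assuming browse snapshot prints the accessibility tree to stdout (the grep terms are illustrative markers of real content):

browse open http://localhost:3000
browse wait load
# The snapshot should contain real content (headings, nav, buttons), not just an error overlay
browse snapshot | grep -Eiq 'heading|button|navigation' \
  || echo "WARNING: snapshot has no real content — check for a build error overlay"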
For each changed area, plan both happy path AND adversarial tests:
Test Plan (based on git diff)
=============================
Changed: src/components/SignupForm.tsx (added email validation)
1. [happy] Valid email submits successfully
URL: http://localhost:3000/signup
Steps: fill valid email → submit → verify success message appears
2. [adversarial] Invalid email shows error
Steps: fill "not-an-email" → submit → verify error message appears
3. [adversarial] Empty form submission
Steps: click submit without filling anything → verify error, no crash
4. [adversarial] XSS in email field
Steps: fill "<script>alert(1)</script>" → submit → verify sanitized/rejected
5. [adversarial] Rapid double-submit
Steps: click submit twice quickly → verify no duplicate submission
6. [adversarial] Keyboard-only flow
Steps: Tab to email → type → Tab to submit → Enter → verify success
browse stop 2>/dev/null
mkdir -p .context/ui-test-screenshots
# localhost/default QA → clean, reproducible local run
browse env local
For each test, follow the before/after pattern:
# Navigate
browse open http://localhost:3000/path
browse wait load
# BEFORE snapshot
browse snapshot
# Note the current state: elements, refs, text
# ACT
browse click @0-ref
# or: browse fill "selector" "value"
# or: browse type "text"
# or: browse press Enter
# AFTER snapshot
browse snapshot
# Compare against BEFORE: what changed?
# ASSERT with marker
# STEP_PASS|step-id|evidence OR STEP_FAIL|step-id|expected → actual
## UI Test Results
### STEP_PASS|valid-email-submit|status "Thanks!" appeared at @0-42 after submit
- URL: http://localhost:3000/signup
- Before: form with email input @0-3, submit button @0-7
- Action: filled "user@test.com", clicked @0-7
- After: form replaced by status element with "Thanks! We'll be in touch."
### STEP_FAIL|double-submit|expected single submission → form submitted twice|.context/ui-test-screenshots/double-submit.png
- URL: http://localhost:3000/signup
- Before: form with submit button @0-7
- Action: clicked @0-7 twice rapidly
- After: two success toasts appeared, suggesting duplicate submission
- Screenshot: .context/ui-test-screenshots/double-submit.png
- Suggestion: disable submit button after first click, or debounce the handler
---
**Summary: 4/6 passed, 2 failed**
Failed: double-submit, xss-sanitization
Screenshots saved to `.context/ui-test-screenshots/` — open any failed step's screenshot to see the broken state.
Always browse stop when done.
After producing the text report, generate a standalone HTML report that a reviewer can open in a browser. The report embeds screenshots inline (base64) so it works as a single file — no external dependencies.
Why: Text reports are good for the agent conversation, but reviewers (PMs, designers, other engineers) want a visual artifact they can open, scan, and share. Screenshots inline make failures immediately obvious.
| Placeholder | Value |
|---|---|
| {{TITLE}} | Report title for <title> tag (e.g., "UI Test: PR #1234 — OAuth Settings") |
| {{TITLE_HTML}} | Report title for the visible <h1>. If a PR URL is available, wrap the PR reference in an <a> tag so it's clickable (e.g., UI Test: <a href="https://github.com/org/repo/pull/1234">PR #1234</a> — OAuth Settings). If no URL, use plain text same as {{TITLE}}. |
| {{META}} | One-line context: date, app URL, user, branch |
| {{TOTAL_TESTS}} | Total STEP_PASS + STEP_FAIL count |
| {{AGENT_COUNT}} | Number of sub-agents that ran |
| {{PASS_COUNT}} | Number of STEP_PASS |
| {{FAIL_COUNT}} | Number of STEP_FAIL |
| {{PASS_RATE}} | Integer percentage (e.g., "92") |
| {{RATE_CLASS}} | good (≥90%), warn (70–89%), bad (<70%) |
| {{FAILURES_SECTION}} | HTML for failed test cards (see below) |
| {{PASSES_SECTION}} | HTML for passed test cards (see below) |
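A small sketch of how the summary placeholders can be derived from the merged counts (the 14/4 values are illustrative; the thresholds come from the table above):

PASS_COUNT=14; FAIL_COUNT=4                      # from the merged STEP_PASS/STEP_FAIL markers
TOTAL_TESTS=$((PASS_COUNT + FAIL_COUNT))
PASS_RATE=$((100 * PASS_COUNT / TOTAL_TESTS))
if   [ "$PASS_RATE" -ge 90 ]; then RATE_CLASS=good
elif [ "$PASS_RATE" -ge 70 ]; then RATE_CLASS=warn
else RATE_CLASS=bad; fi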
Each test renders as a <details> card. Failed tests should be open by default so reviewers see them immediately:

<!-- Failed test card (open by default) -->
<div class="section">
<h2>Failures <span class="count">{{FAIL_COUNT}}</span></h2>
<details class="test-card fail" open>
<summary>
<span class="badge fail">FAIL</span>
<span class="step-id">step-id-here</span>
<span class="evidence">expected → actual</span>
</summary>
<div class="body">
<dl>
<dt>URL</dt><dd>http://localhost:3000/path</dd>
<dt>Action</dt><dd>What was done</dd>
<dt>Expected</dt><dd>What should have happened</dd>
<dt>Actual</dt><dd>What happened instead</dd>
</dl>
<div class="suggestion">Fix: description of suggested fix</div>
<div class="screenshot">
<img src="data:image/png;base64,..." alt="Screenshot of failure">
<div class="caption">step-id.png — captured at moment of failure</div>
</div>
</div>
</details>
</div>
<!-- Passed test card (collapsed by default) -->
<div class="section">
<h2>Passed <span class="count">{{PASS_COUNT}}</span></h2>
<details class="test-card pass">
<summary>
<span class="badge pass">PASS</span>
<span class="step-id">step-id-here</span>
<span class="evidence">evidence summary</span>
</summary>
<div class="body">
<dl>
<dt>URL</dt><dd>http://localhost:3000/path</dd>
<dt>Evidence</dt><dd>What was observed</dd>
</dl>
</div>
</details>
</div>
# Convert screenshot to base64 data URI
base64 -i .context/ui-test-screenshots/step-id.png | tr -d '\n'
# Use as: src="data:image/png;base64,<output>"
Read each screenshot file referenced in STEP_FAIL markers, base64-encode it, and embed it as an <img src="data:image/png;base64,..."> in the corresponding test card. For STEP_PASS, only embed a screenshot if one was explicitly taken (e.g., baseline screenshots).
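A sketch of embedding every failure screenshot, assuming the macOS-style base64 -i flag used above (on Linux, base64 -w0 <file> is the equivalent):

for f in .context/ui-test-screenshots/*.png; do
  b64=$(base64 -i "$f" | tr -d '\n')
  # Emit an inline <img> tag for the corresponding test card
  printf '<img src="data:image/png;base64,%s" alt="%s">\n' "$b64" "$(basename "$f")"
done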
Write the report to .context/ui-test-report.html:

# Write the generated HTML
cat > .context/ui-test-report.html << 'REPORT_EOF'
<!DOCTYPE html>
...generated report...
REPORT_EOF
# Open it for the reviewer
open .context/ui-test-report.html # macOS
# xdg-open .context/ui-test-report.html # Linux
Tell the user the report is saved to .context/ui-test-report.html and offer to open it.

Rules:
- Failed cards are open by default; passed cards are collapsed

Adversarial patterns: apply these to every interactive element you test. Read references/adversarial-patterns.md for the full pattern library (forms, modals, navigation, error states, keyboard accessibility).
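One minimal sketch of such a check, using only commands shown earlier in this skill — the selector input[name=email] and the ref @0-7 are placeholders for whatever the BEFORE snapshot actually reports:

browse fill "input[name=email]" "<script>alert(1)</script>"
browse click @0-7            # submit button ref from the BEFORE snapshot
browse snapshot              # assert: value rejected or sanitized, no injected dialog, form still usable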
The deterministic checks below produce structured data, not judgment calls. Use them as the strongest form of assertion.
| Check | What it catches | Assertion |
|---|---|---|
| axe-core | WCAG violations | violations.length === 0 |
| Console errors | Runtime exceptions, failed requests | empty error array |
| Broken images | Missing/failed image loads | no images with naturalWidth === 0 |
| Form labels | Inputs without accessible labels | every input has hasLabel: true |
For the exact browse eval recipes, read references/browser-recipes.md.
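For example, the broken-image check from the table might look like this — a sketch assuming browse eval accepts a JavaScript expression; the exact recipes live in references/browser-recipes.md:

# Count images that failed to load; 0 → STEP_PASS, anything else → STEP_FAIL with the count as evidence
browse eval "Array.from(document.images).filter(img => !img.complete || img.naturalWidth === 0).length"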
No diff, no plan — just open the app and try to break it. Use this when the user says "test my app", "find bugs", or "QA this site."
- Check package.json to detect the framework, then open the root URL and snapshot to see what's there

Don't try to be systematic about coverage. Just explore like a user would, but with the intent to break things. The agent is good at this — let it roam.
Check the 404 page (navigate to /does-not-exist) — is it custom or default?

Run independent test groups concurrently using named browse sessions (BROWSE_SESSION=<name>). Each session gets its own browser. Works with both local and remote mode.
Use when testing multiple pages or categories and you want faster wall clock time.
Read references/parallel-testing.md for the full workflow: session setup, agent fan-out, cookie-sync for auth, and result merging.
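A minimal sketch of two named sessions running side by side (the session names are illustrative):

BROWSE_SESSION=signup browse open http://localhost:3000/signup
BROWSE_SESSION=checkout browse open http://localhost:3000/checkout
# Each sub-agent prefixes every command with its own BROWSE_SESSION
BROWSE_SESSION=signup browse stop
BROWSE_SESSION=checkout browse stop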
Check whether changed UI matches the rest of the app visually. Read references/design-consistency.md when doing visual or design checks.
| Category | How | Assertion type |
|---|---|---|
| Accessibility | axe-core + keyboard nav | Deterministic (violation count) |
| Visual Quality | Screenshot + heuristic evaluation | Visual judgment (weakest — note specifics) |
| Responsive | Viewport sweep + screenshots | Visual + deterministic (overflow check) |
| Console Health | Console capture eval | Deterministic (error count) |
| UX Heuristics | Snapshot + Laws of UX + Nielsen's | Structured judgment (cite specific heuristic) |
| Error States | Navigate to empty/error states | Before/after comparison |
| Data Display | Snapshot on tables/dashboards | Element match (column count, formatting) |
| Design Consistency | Screenshot baseline + changed page comparison | Visual judgment (cite specific property) |
| Exploratory | Free navigation + adversarial testing | Before/after + judgment |
Reference guides (load on demand): references/adversarial-patterns.md, references/browser-recipes.md, references/parallel-testing.md, references/design-consistency.md.
For worked examples with exact commands, read EXAMPLES.md if you need to see the assertion protocol in action.
Final checklist:
- Take a browse screenshot immediately on STEP_FAIL and save it to .context/ui-test-screenshots/<step-id>.png
- Prefer browse env local first for reproducible runs; use --auto-connect only when existing local state is required
- Run browse stop when done — for parallel runs, stop every named session

Troubleshooting:
- Stuck or hung session: browse stop, retry. For zombies: pkill -f "browse.*daemon"
- No dev server responding: curl http://localhost:<port> — ask the user to start it
- browse eval with await fails: use .then() instead — browse eval doesn't support top-level await (see the sketch after this list)
- Stale element refs: run browse snapshot again — refs change on page updates
- Page not ready: browse wait load or browse wait selector ".expected" before snapshotting
- Deep link won't load: open / first, then click through
- Remote auth issues: re-run cookie-sync, pass --context <id>, try --stealth
- Parallel sessions: make sure every browse command uses BROWSE_SESSION=<name> — without it, commands go to the default session
- Stop a named session with BROWSE_SESSION=<name> browse stop. For zombies: pkill -f "browse.*<name>.*daemon"
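A sketch of the .then() workaround, assuming browse eval resolves a returned Promise before printing the result; the /api/health endpoint is purely illustrative:

browse eval "fetch('/api/health').then(r => r.status)"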