Automates GUI interactions via screen capture, mouse clicks, typing, scrolling for UI testing, visual verification, and non-browser apps. Bridges Playwright to user browsers using extensions or CDP endpoints.
npx claudepluginhub 3spky5u-oss/handson --plugin handson

This skill uses the workspace's default tool permissions.
You have eyes and hands. Use the HandsOn MCP tools to see the screen, click, type, scroll, and interact with any application. This skill teaches you how to use them effectively.
If Playwright MCP tools are available (mcp__plugin_playwright_playwright__browser_*), use them for all in-browser interactions. HandsOn handles everything outside the browser.
By default, Playwright launches its own browser instance. This means Claude and the user are looking at different browsers — Playwright sees its browser, the user sees theirs. Visual verification with HandsOn will show the user's screen, not what Playwright is working on.
To fix this, Playwright can attach to the user's existing browser instead of launching its own. Two approaches:
Option A — Browser Extension (easiest):
Configure Playwright with --extension:
{ "args": ["@playwright/mcp@latest", "--browser", "msedge", "--extension"] }
Option B — CDP Endpoint:
Launch the browser with msedge --remote-debugging-port=9222, then configure Playwright with --cdp-endpoint:
{ "args": ["@playwright/mcp@latest", "--cdp-endpoint", "ws://localhost:9222"] }
Initial connection: When Playwright is configured with --extension, the browser extension shows a connection prompt ("Allow" / "Reject") that must be accepted before Playwright can work. The 30-second CDP timeout means you can't reliably click it manually.
Recommended: Set the PLAYWRIGHT_MCP_EXTENSION_TOKEN env var in the Playwright plugin config to bypass the dialog entirely:
{
"args": ["@playwright/mcp@latest", "--browser", "msedge", "--extension"],
"env": { "PLAYWRIGHT_MCP_EXTENSION_TOKEN": "<token from extension page>" }
}
The token is shown on the extension's connection prompt page. Once set, connections auto-approve.
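The two bridging configurations above differ only in their args and the optional token. As a minimal sketch, the helper below builds either one as a dictionary that can be dumped to JSON. The function name `playwright_mcp_config` and its parameters are hypothetical; only the emitted args and the PLAYWRIGHT_MCP_EXTENSION_TOKEN variable come from the text above.

```python
import json


def playwright_mcp_config(mode, browser="msedge", token=None,
                          cdp_endpoint="ws://localhost:9222"):
    """Build the Playwright MCP config for one of the two bridging modes.

    mode is "extension" (Option A) or "cdp" (Option B). This helper is
    hypothetical; it only reproduces the JSON fragments shown above.
    """
    args = ["@playwright/mcp@latest"]
    if mode == "extension":
        args += ["--browser", browser, "--extension"]
    elif mode == "cdp":
        args += ["--cdp-endpoint", cdp_endpoint]
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    config = {"args": args}
    # The token only bypasses the extension prompt, so skip it for CDP.
    if mode == "extension" and token:
        config["env"] = {"PLAYWRIGHT_MCP_EXTENSION_TOKEN": token}
    return config


if __name__ == "__main__":
    example = playwright_mcp_config("extension", token="<token from extension page>")
    print(json.dumps(example, indent=2))
```

Keeping both modes behind one function makes it harder to mix flags, for example passing --extension together with --cdp-endpoint.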
Fallback: If the token isn't set and the prompt appears, use HandsOn click_text("Allow") to approve — but this only works if the Playwright call hasn't already timed out.
Once bridged, the interaction loop is:
1. screenshot: see what the user actually sees on screen
2. browser_snapshot: get the DOM/accessibility tree for precise interaction
3. browser_click / browser_type: interact with web elements via DOM
4. screenshot: verify the result on the user's actual screen

This gives you DOM precision (Playwright) with visual ground truth (HandsOn).
Always fall back to HandsOn for browser work if:
Don't retry Playwright if it fails — switch to HandsOn immediately and use the Web Form Pattern below. The bridging setup is optional; HandsOn works fine for browser tasks on its own.
To detect if bridging is active, try browser_snapshot — if it shows the same page the user has open (verify with a HandsOn screenshot), you're bridged. If it shows a different page or blank tab, Playwright is running its own browser.
Establish permissions with the user:
Store the answers and operate within the granted scope for the session.
macOS: Ensure your terminal has Accessibility permissions (System Settings > Privacy & Security > Accessibility). HandsOn checks this on startup and warns if not granted.
Always announce intent and show a visible status when interacting with the screen.
Before your first HandsOn tool call in a sequence:
Create a task with TaskCreate:
- subject: short description of what you're doing, e.g. "Check button placement"
- activeForm: "Hands On — checking button placement" (always prefix with "Hands On — ")

Set it to in_progress with TaskUpdate. This displays a live spinner in the UI (e.g. ⟳ Hands On — checking button placement) so the user always knows you're actively looking at their screen and what you're looking for.
When you're done with screen interaction:
Mark the task completed with TaskUpdate.

If you're doing multiple separate interaction sequences in one session, create a new task each time. Don't reuse completed ones.
Before diving in, understand what you're working with:
- Run detect_framework to identify the app's UI toolkit and get automation hints
- Run list_elements to see what's accessible via the accessibility tree
- Run list_elements(role="Button") to search the full tree for a specific control type

By default, work on the user's current desktop. With accessibility targeting and OCR, you can precisely interact with elements without risk of misclicks. Working on the same desktop lets the user see everything you do in real time.
Use virtual_desktop only when the user requests isolation, or for specific scenarios:
If you do use virtual desktops, the host terminal is automatically pinned to all desktops so the user can always see your output.
Note: On macOS, desktop creation requires manual setup via Mission Control. On Linux, workspace navigation uses existing workspaces.
Set a target window early. Call set_target_window("Browser") (or whatever app you're automating) at the start of a multi-step interaction. This auto-focuses the target before every input action, preventing the terminal from stealing focus between tool calls.
Clear it when switching apps or when done: set_target_window("").
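The set-then-clear discipline above maps naturally onto a context manager. This is a minimal sketch, not part of HandsOn: `targeting` is a hypothetical name, and the tool is passed in as a callable because how set_target_window is exposed to Python is not specified here.

```python
from contextlib import contextmanager


@contextmanager
def targeting(set_target_window, title):
    """Pin input to `title` for the duration of the block, then clear it.

    set_target_window is injected as a callable so this sketch makes no
    assumption about how the HandsOn tool is actually invoked.
    """
    set_target_window(title)
    try:
        yield
    finally:
        # An empty string clears the target, as described above.
        set_target_window("")


if __name__ == "__main__":
    calls = []  # stand-in recorder instead of the real tool
    with targeting(calls.append, "Browser"):
        calls.append("...input actions...")
    print(calls)
```

The finally clause means the target is cleared even if a step inside the block raises, so a failed sequence does not leave focus locked to the wrong app.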
Use send_keys with keyboard shortcuts whenever practical. Shortcuts are faster, more reliable, and don't depend on element coordinates or screen layout:
- New tab: send_keys("ctrl+t") instead of clicking the + button
- Close tab: send_keys("ctrl+w") instead of clicking the X
- Navigate back: send_keys("alt+left") instead of finding the back button
- Address bar: send_keys("ctrl+l") instead of clicking the URL bar
- Save: send_keys("ctrl+s") instead of navigating File > Save
- Select all and copy: send_keys("ctrl+a") then send_keys("ctrl+c")
- Next field: send_keys("tab") instead of clicking each field
- Submit: send_keys("enter") instead of clicking Submit
- Switch windows: send_keys("alt+tab") as an alternative to focus_window
- Open a URL: send_keys("ctrl+t"), then type the URL; don't navigate in an existing tab (you might close the user's content)

macOS shortcuts: On macOS, use cmd instead of ctrl for most shortcuts. For example: cmd+t (new tab), cmd+w (close tab), cmd+l (address bar), cmd+s (save), cmd+a then cmd+c (select all + copy), cmd+v (paste). Use option instead of alt (e.g., option+left to navigate back). HandsOn automatically detects the platform.
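The modifier substitution in the macOS note can be sketched as a small helper. This is an illustration only, not part of HandsOn: `platform_shortcut`, `MAC_MODIFIERS`, and `MAC_OVERRIDES` are hypothetical names. Because the swap holds for "most" shortcuts rather than all, an override table covers exceptions such as the macOS app switcher, which is cmd+tab rather than option+tab.

```python
import sys

# Modifier swaps taken from the macOS note above.
MAC_MODIFIERS = {"ctrl": "cmd", "alt": "option"}
# Exceptions to the blanket swap (illustrative, not exhaustive).
MAC_OVERRIDES = {"alt+tab": "cmd+tab"}


def platform_shortcut(keys, platform=None):
    """Translate a Windows/Linux-style shortcut string for a platform.

    platform defaults to sys.platform; "darwin" means macOS.
    """
    platform = sys.platform if platform is None else platform
    if platform != "darwin":
        return keys
    if keys in MAC_OVERRIDES:
        return MAC_OVERRIDES[keys]
    # Swap each modifier and leave ordinary keys untouched.
    return "+".join(MAC_MODIFIERS.get(part, part) for part in keys.split("+"))


if __name__ == "__main__":
    for shortcut in ("ctrl+t", "alt+left", "alt+tab", "enter"):
        print(shortcut, "->", platform_shortcut(shortcut, "darwin"))
```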
Reserve mouse clicks for elements that have no shortcut (custom buttons, specific list items, canvas content).
Every interaction follows this pattern:
screenshot → analyze → act → wait_for_change → verify → repeat
- Take a screenshot before any interaction. Never act blind.
- Act with click, type_text, send_keys, scroll, drag, or hover.
- Call wait_for_change after actions that modify the UI (clicking buttons, submitting forms, opening menus). Skip it for hover or simple mouse moves.
- Verify: the returned screenshot (from click/drag, or from wait_for_change) confirms whether the action succeeded.

CRITICAL: Screenshots are downscaled to 1280px max width for transport. All tool coordinates use SCREEN coordinates (physical pixels), NOT screenshot pixels.
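To make the mismatch concrete, the sketch below maps a pixel position read off a downscaled screenshot back to screen coordinates. It assumes, without confirmation from the source, that downscaling preserves the aspect ratio and that screens narrower than 1280px are sent unscaled; `screenshot_to_screen` is a hypothetical helper, not a HandsOn tool.

```python
MAX_SCREENSHOT_WIDTH = 1280  # transport limit stated above


def screenshot_to_screen(x, y, screen_width):
    """Map a screenshot pixel to physical screen coordinates.

    Assumes uniform downscaling and no upscaling of small screens;
    both are assumptions, not documented behaviour.
    """
    scale = max(screen_width / MAX_SCREENSHOT_WIDTH, 1.0)
    return round(x * scale), round(y * scale)


if __name__ == "__main__":
    # On a 2560px-wide screen every screenshot pixel covers two screen pixels.
    print(screenshot_to_screen(640, 360, screen_width=2560))
```

On a 2560px-wide display the scale factor is 2, so clicking at raw screenshot coordinates would land at half the intended distance from the top-left corner. Get the real width from get_screen_size rather than guessing it.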
- OCR tools (find_text, click_text) return screen coordinates: use them directly
- Accessibility tools (find_element, click_element) return screen coordinates: use them directly
- Use find_text or find_element to get real coordinates
- screenshot(annotate=True) overlays a grid with screen-coordinate labels for debugging

The accessibility tree only exposes browser chrome (address bar, tabs, toolbar). Web page content (inputs, buttons, links, text fields, contenteditable areas) is NOT in the accessibility tree.
Do NOT use find_element / click_element / list_elements for web page content — they only see browser UI.
1. Navigate: ctrl+l → type URL → enter
2. Find the first field: use find_text to locate it by placeholder/label, then click those OCR coordinates
3. Move between fields: Tab / Shift+Tab. NEVER click contenteditable or textarea elements directly; they often don't respond to mouse clicks. Tab is 100% reliable.
4. Enter text: short text → type_text. Long text → clipboard paste (see below).
5. Submit: send_keys("enter") or click_text("Submit")

type_text is slow and error-prone for anything over ~50 characters. Use clipboard paste:
clipboard(action="write", text="Your long content here...")
# Tab to the target field (don't try to click it)
send_keys("ctrl+v") # On macOS: send_keys("cmd+v")
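The choice between typing and pasting can be expressed as a small planning function. This is a sketch under stated assumptions: `text_entry_plan` and `TYPE_TEXT_LIMIT` are hypothetical names, the cutoff of 50 takes the approximate figure above at face value, and the function only returns the calls to make as (tool, args, kwargs) tuples; it executes nothing.

```python
TYPE_TEXT_LIMIT = 50  # approximate threshold from the text above


def text_entry_plan(text, macos=False):
    """Return the tool calls for entering `text` as (tool, args, kwargs).

    Short text is typed directly; longer text goes through the clipboard.
    The caller is expected to have tabbed to the target field already.
    """
    if len(text) <= TYPE_TEXT_LIMIT:
        return [("type_text", (text,), {})]
    paste = "cmd+v" if macos else "ctrl+v"
    return [
        ("clipboard", (), {"action": "write", "text": text}),
        ("send_keys", (paste,), {}),
    ]


if __name__ == "__main__":
    for step in text_entry_plan("A long paragraph of body text. " * 5):
        print(step[0])
```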
Use scroll(x, y, direction, pages=1) for page-at-a-time scrolling. The pages parameter uses PageDown/PageUp keyboard presses internally, which always scrolls exactly one page regardless of DPI or display scaling.
- pages=1: one full page (default when neither pages nor amount is given)
- pages=0.5: half page (uses arrow key presses)
- pages=2: two pages
- amount=N: raw wheel clicks instead

Important: Mouse wheel scroll (amount) is unreliable on high-DPI displays. The same amount value scrolls different distances depending on DPI scaling. Always prefer pages for predictable scrolling.
| I need to... | Use |
|---|---|
| Targeting | |
| Find a UI element reliably | find_element (by name/role in accessibility tree) |
| Find element with auto-fallback | smart_find (tries UIA first, then OCR automatically) |
| Click a UI element by name | click_element (find + click center) |
| See what's clickable | list_elements (dump accessible element tree) |
| Find all controls of a type | list_elements(role="Button") (full-tree walk filtered by type) |
| Check what has keyboard focus | get_focused_element (verify before typing) |
| Find text on screen (OCR) | find_text (for apps without accessibility) |
| Click visible text (OCR) | click_text (find + click, auto-retries nearby offsets if no response) |
| Identify app framework | detect_framework (returns toolkit + hints) |
| Vision | |
| See what's on screen | screenshot |
| Wait for UI to respond | wait_for_change |
| Debug coordinate issues | screenshot(annotate=True) (grid + crosshair + elements) |
| Get monitor dimensions | get_screen_size |
| Save reference for comparison | screenshot_baseline (capture before a change) |
| See what changed visually | screenshot_diff (highlights diff vs baseline in red overlay) |
| Input | |
| Press a button / select an item | click |
| Enter text in a field | click the field, then type_text |
| Use a keyboard shortcut | send_keys (e.g., "ctrl+s") |
| Navigate a long page/list | scroll |
| Move an element | drag |
| Check a tooltip or hover state | hover |
| Run a click-type-enter sequence | batch_actions (reduces round-trips) |
| Windows & System | |
| Lock focus to one app | set_target_window (auto-focuses before every input) |
| Check current target | get_target_window |
| Switch to a different app | focus_window |
| See what apps are open | list_windows |
| Open an application | launch_app |
| Read text from the UI reliably | clipboard (Ctrl+A, Ctrl+C, then clipboard read) |
| Copy content into the UI | clipboard write, then send_keys "ctrl+v" |
| Work on an isolated desktop | virtual_desktop (create, then close when done) |
| Check current mouse position | get_mouse_position |
| Manage UAC prompts | configure_uac (suppress/restore/status) |
| Monitoring | |
| Watch for new popups/dialogs | start_watcher (background thread polls for new windows) |
| Check what appeared | get_notifications (returns new windows with type + snippet) |
| Stop watching | stop_watcher |
Use this priority order:
1. find_element / click_element: First choice. Uses the accessibility tree. Fast, precise, DPI-aware. Works on most standard widgets.
2. smart_find: When unsure. Tries accessibility first, falls back to OCR automatically. Reports framework hints when both fail.
3. find_text / click_text: OCR fallback. Use when accessibility can't see the element (custom widgets, canvas content, GTK apps).
4. screenshot(region=...) + visual inspection: Last resort. Crop a region, read coordinates visually, click by position.

Never use launch_app("powershell", args="some-command"): the window opens, runs, and immediately closes before you can see output or interact.
Instead:
# Open an interactive shell (stays open):
launch_app("wt") # Windows Terminal (preferred)
launch_app("powershell", args="-NoExit") # PowerShell that stays open
# Run a command AND keep the window open:
launch_app("powershell", args="-NoExit -Command Get-Process")
launch_app("wt", args="powershell -NoExit -Command Get-Process")
# Then type further commands into it:
set_target_window("PowerShell")
type_text("dir\n")
The key is -NoExit — without it, PowerShell runs the command and terminates. Same applies to cmd /k (stays open) vs cmd /c (closes).
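The keep-open rule can be wrapped in a helper that always adds the right flag. This is a minimal sketch: `keep_open_launch` is a hypothetical name, and passing "/k" through launch_app's args for cmd is an assumption extrapolated from the PowerShell examples above.

```python
def keep_open_launch(shell, command=None):
    """Return (app, args) for launch_app so the shell window stays open.

    Supports "powershell" (-NoExit) and "cmd" (/k), per the text above.
    """
    if shell == "powershell":
        args = "-NoExit"
        if command:
            args += f" -Command {command}"
        return "powershell", args
    if shell == "cmd":
        args = f"/k {command}" if command else "/k"
        return "cmd", args
    raise ValueError(f"unsupported shell: {shell!r}")


if __name__ == "__main__":
    app, args = keep_open_launch("powershell", "Get-Process")
    print(f'launch_app("{app}", args="{args}")')
```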
batch_actions([
{"action": "click", "x": 100, "y": 200},
{"action": "type", "text": "hello@example.com"},
{"action": "keys", "keys": "tab"},
{"action": "type", "text": "password123"},
{"action": "keys", "keys": "enter"}
])
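A payload like the one above can be generated rather than written by hand. The sketch below builds the same click, type, Tab, type, Enter sequence for any number of fields; `form_batch` is a hypothetical helper, and the action dictionaries reuse only the keys shown in the example.

```python
def form_batch(x, y, values):
    """Build a batch_actions payload for a simple form.

    Clicks the first field at (x, y), types each value, presses Tab
    between fields, and presses Enter after the last one.
    """
    if not values:
        raise ValueError("values must contain at least one field")
    actions = [{"action": "click", "x": x, "y": y}]
    for index, value in enumerate(values):
        actions.append({"action": "type", "text": value})
        is_last = index == len(values) - 1
        actions.append({"action": "keys", "keys": "enter" if is_last else "tab"})
    return actions


if __name__ == "__main__":
    for action in form_batch(100, 200, ["hello@example.com", "password123"]):
        print(action)
```

Called with the two values from the example, it reproduces that payload exactly.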
get_focused_element → verify it's the right field → type_text
list_elements(role="Spinner") → all numeric inputs
list_elements(role="Edit") → all text fields
list_elements(role="ComboBox") → all dropdowns
- Use focus_window to bring the right window forward.
- Use wait_for_change(timeout=10). If still stuck, report to the user.
- Take a screenshot to re-establish where you are.
- Run detect_framework to understand what toolkit the app uses and check the hints.

Call focus_window(title="Claude") as your last action so the user sees you've finished. Without this, the user has no signal that you're done: the target app stays in the foreground and they'll be waiting.
- Call set_target_window at the start of a session to prevent terminal focus-stealing.
- Use send_keys("ctrl+t"), send_keys("ctrl+l"), etc. whenever a shortcut exists. Faster and more reliable than clicking.
- Open a new tab with ctrl+t before navigating to a URL. The user may have content open in the current tab.
- Prefer smart_find over find_element: it handles the UIA-to-OCR fallback automatically.
- Use batch_actions for sequences: it reduces round-trips for click-type-enter flows.
- find_text("Save As PDF") matches adjacent words automatically.
- Run manage_screenshots(action="cleanup") when you're done with a long session.

You maintain a persistent playbook at ~/.claude/handson/playbook.md that accumulates patterns learned from UI automation sessions. This helps you avoid repeating mistakes and get better over time.
At the beginning of any HandsOn session, check if the playbook exists:
Read ~/.claude/handson/playbook.md using the Read tool.

After completing a multi-step GUI workflow that involved:
Entries must be actionable and terse — one line per pattern, grouped by app:
# HandsOn Playbook
<!-- Auto-maintained by Claude. Max 200 lines. Compact when exceeded. -->
## <App Name> (<Browser/Context>)
- <actionable pattern, 1 line>
## General
- <cross-app pattern>
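The template above is simple enough to update programmatically. As a rough sketch, the helpers below add a one-line pattern under an app heading and check the 200-line limit from the template's header comment. `add_pattern` and `needs_compaction` are hypothetical names, and the playbook is handled as a plain string so no file access is assumed.

```python
MAX_LINES = 200  # limit stated in the template's header comment


def add_pattern(playbook, section, pattern):
    """Append `- pattern` to the end of `## section`, creating it if missing."""
    lines = playbook.splitlines()
    heading = f"## {section}"
    entry = f"- {pattern}"
    if heading not in lines:
        lines += [heading, entry]
    else:
        index = lines.index(heading) + 1
        # Walk past this section's existing entries to its end.
        while index < len(lines) and not lines[index].startswith("## "):
            index += 1
        lines.insert(index, entry)
    return "\n".join(lines) + "\n"


def needs_compaction(playbook):
    """True when the playbook has grown past the line limit."""
    return len(playbook.splitlines()) > MAX_LINES


if __name__ == "__main__":
    text = "# HandsOn Playbook\n## General\n- prefer shortcuts over clicks\n"
    print(add_pattern(text, "General", "call set_target_window first"))
```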
Keep the playbook under 200 lines. When it exceeds 200 lines: