Use when the user wants to live-test a built Windows desktop application (.exe) end-to-end inside the agent-os Windows 11 VM and get a works/doesn't-work + UI/UX report — launching the real binary and driving it, not writing test code. Triggers: "test my Windows app", "QA this .exe", "run my desktop build on agent-os and tell me what breaks", "click through the app", "review the desktop app's UX", "does my exe work", "test the built binary in the Windows VM". Gets the .exe into the VM, launches it, enumerates controls from the UI Automation tree, drives every feature (click/type by element), watches for crashes/hangs/error dialogs, captures screenshots + control-tree dumps, and emits a structured report. Drives the agent-os VM via the Windows-MCP gateway. Sibling of web-app-testing and android-app-testing (shared report format). Does NOT fire for: building/coding a desktop app, the user's personal Windows on steamy (use nircmd), or general agent-os VM driving (use agent-os).
Slash command: /testing:desktop-app-testing
Live, end-to-end testing of a built **Windows `.exe`** inside the agent-os Windows 11 VM: transfer
the build in, launch it, drive every feature, watch for crashes/hangs/error dialogs, review UI/UX,
and emit a structured works/doesn't-work report. Companion to web-app-testing and
android-app-testing — all three share one report format (references/report-format.md).
Drives the agent-os VM (container agent-os-win11, dockur/windows on dookie) through the
agent-os_windows-mcp upstream on the Lab gateway. Builds on the agent-os skill (which is the
general VM driver) but adds the testing harness: build transfer, feature enumeration, failure
taxonomy, evidence pipeline, and the report.
Use a sibling skill instead for:

- agent-os: general driving of the Windows VM (install software, run PowerShell, one-off tasks).
- nircmd: the user's PERSONAL Windows on steamy, not the sandbox. Never target steamy here.

Every drive action (PowerShell, Click, Type, App, Process, …) is destructive=true and
gated by the gateway. An authenticated admin (lab/lab:admin scope) passes; this was fixed in
lab commit e87940c0. If drive calls return confirmation_required: "...destructive=true...", the
gateway predates the fix; rebuild + redeploy (see references/windows-mcp-calls.md). Read-only
Screenshot/Snapshot are never gated, so a UI/UX review works even if the gate isn't lifted.
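As a rough illustration of the gate behaviour described above, the sketch below classifies a drive-call response as gated or open. The response shape (a dict carrying a `confirmation_required` key) and the name `gate_state` are assumptions made for illustration, not the gateway's documented schema; references/windows-mcp-calls.md remains the source for verified call patterns.

```python
from typing import Any, Mapping


def gate_state(response: Mapping[str, Any]) -> str:
    """Classify a drive-call response as 'gated' or 'open'.

    Assumption: a gated call carries a 'confirmation_required' key whose
    message mentions destructive=true. The real payload may differ.
    """
    if response.get("confirmation_required") is not None:
        return "gated"  # gateway predates the fix, or caller lacks admin scope
    return "open"       # the call went through and returned data


# Hypothetical payloads, for illustration only.
gated = {"confirmation_required": "...destructive=true..."}
opened = {"processes": [{"name": "explorer.exe", "pid": 1234}]}

print(gate_state(gated))   # gated
print(gate_state(opened))  # open
```

A "gated" result means stopping to fix the gateway before any drive step; read-only Screenshot/Snapshot calls can still proceed.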
If agent-os_windows-mcp isn't a connected upstream (the Lab gateway is in code-mode and its
execute interface isn't exposed to the session, or the server simply isn't wired in), you can
still do the whole pass over plain ssh agent-os: launch + screenshot the native GUI via a
/it scheduled task in the interactive console session (session 0 over SSH has no window station —
a GUI there crashes with os error 1459 and screenshots come back blank), and iterate the frontend
in a browser without rebuilding. Full recipe: references/ssh-fallback-capture.md.
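To make the interactive-session idea concrete, here is a minimal sketch that builds `schtasks` command lines for a one-shot `/it` task. The task name, user, executable path, and the helper name `interactive_task_commands` are hypothetical placeholders. The verified recipe lives in references/ssh-fallback-capture.md, so treat this only as an illustration of where `/it` fits.

```python
from typing import List


def interactive_task_commands(task_name: str, exe_path: str, user: str) -> List[List[str]]:
    """Build schtasks invocations that start exe_path in the logged-on
    user's interactive session rather than session 0.

    Assumption: `user` is already logged on at the console; /it only runs
    the task interactively when that user has an active session.
    """
    create = [
        "schtasks", "/create",
        "/tn", task_name,
        "/tr", exe_path,        # quote the path yourself if it contains spaces
        "/sc", "once",
        "/st", "00:00",         # placeholder time; the task is started with /run
        "/ru", user,
        "/it",                  # run in the interactive session
        "/f",                   # overwrite an existing task with the same name
    ]
    run = ["schtasks", "/run", "/tn", task_name]
    return [create, run]


# Hypothetical names, for illustration only.
for cmd in interactive_task_commands("app-under-test", r"C:\test\app.exe", "tester"):
    print(" ".join(cmd))
```

Each printed line would be run in the guest over ssh agent-os; the screenshot step then has to run from the same interactive session for the capture to be non-blank.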
Prerequisites:

- A built .exe/installer on this host (or a URL the guest can fetch).
- VM running: ssh dookie 'docker ps --format "{{.Names}}" | grep agent-os-win11'. If absent:
  ssh dookie 'cd /home/jmagar/compose/windows && docker compose up -d' (boots existing install,
  ~5 min cold; Windows-MCP auto-starts via an in-guest scheduled task).
- Windows-MCP ready: Screenshot {}. An image back means ready. (Do NOT TCP-probe :8765; it gives
  a false negative.)
- Gate check (Process {mode:"list"}): if it returns data, the gate is open; if
  confirmation_required, fix the gateway first.

Procedure:

- Run dir: ~/.agents/docs/sessions/<app>-desktop-test/run_<id>/.
- Transfer the build (references/windows-mcp-calls.md): HTTP-pull via PowerShell
  Invoke-WebRequest (verified reachable) or SCP to the agent-os guest
  (scp ... agent-os: / host forward tootie:2222). Unblock-File the copied binary; pre-create a
  firewall allow rule if it binds a port.
- Launch: PowerShell {command:"Start-Process 'C:\\...\\app.exe'; ...return PID"}
  (Start-menu App {name} is unreliable for arbitrary binaries). Confirm a PID came back.
- Enumerate: WaitFor {condition:"active_window", window_name:"<title>"} (short timeout in a
  retry loop), then Snapshot {} to enumerate menus, buttons, tabs, fields from the UI Automation
  tree. Build the feature checklist (merge with any user spec). Write plan.md.
- Drive: Snapshot → pick the target element's integer label → Click {label} / Type {label, text}
  → Snapshot again (UI changed, ids are stale) → repeat. Screenshot between steps for evidence
  (doesn't invalidate ids). Use Shortcut for keyboard ops. Type/Click REQUIRE loc or label; there
  is no implicit-focus typing.
- Watch for failures:
  - Crash: Process {mode:"list", name} shows the PID gone; check
    Get-WinEvent -LogName Application for Error/Critical. → FAIL.
  - Hang: WaitFor times out / Snapshot shows "(Not Responding)". → FAIL.
  - Error dialog: Snapshot surfaces dialog text; screenshot it. → FAIL/PARTIAL.
- Reset between tests: Process {mode:"kill", name} then relaunch, so input/mode state doesn't
  leak across tests.
- Report: report.md + result.json in the run dir, per references/report-format.md. Save
  screenshots to evidence/ and index them.

Gotchas:

- Prefer PowerShell Start-Process over App {name} to launch an arbitrary .exe.
- Snapshot after every UI change; label ids are valid only against the latest Snapshot.
- Snapshot output overflows the Code Mode envelope (~24KB). Filter/slice the tree text in the
  sandbox before returning; don't dump the whole tree.
- When Snapshot is sparse, fall back to Screenshot + coordinate clicks + SendKeys, and flag
  reduced confidence.
- Use WaitFor with a short timeout in a retry loop.
- Apply Unblock-File, set SEE_MASK_NOZONECHECKS=1, and pre-create firewall rules before expecting
  hands-off captures.

References:

- references/windows-mcp-calls.md: verified tool names, params, call patterns, the destructive
  gate, build-transfer recipes, evidence capture.
- references/ssh-fallback-capture.md: SSH-only launch + native capture via a schtasks /it
  interactive-session task when Windows-MCP isn't connected, plus an in-process-vite + Edge
  browser dev-loop for iterating Tauri/web frontends against a real backend.
- references/report-format.md: shared cross-platform report spec, run-dir layout, verdicts.

Install: npx claudepluginhub jmagar/dendrite --plugin testing
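The retry-loop and crash/hang/error-dialog checks described above can be sketched as two small helpers. Everything here is illustrative: `wait_with_retries`, `classify`, their parameters, and the "PASS" verdict are assumptions, and the authoritative verdict names are whatever references/report-format.md defines. The probe callable stands in for a short-timeout WaitFor call.

```python
import time
from typing import Callable, Optional, Tuple


def wait_with_retries(probe: Callable[[], bool], attempts: int = 5,
                      pause_s: float = 1.0) -> bool:
    """Call probe() up to `attempts` times; return True on first success."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(pause_s)  # brief pause before the next short-timeout try
    return False


def classify(pid_alive: bool, window_ready: bool, snapshot_text: str,
             dialog_text: Optional[str] = None) -> Tuple[str, str]:
    """Map observed signals to (failure kind, verdict)."""
    if not pid_alive:
        return ("crash", "FAIL")                  # PID gone from the process list
    if not window_ready or "(Not Responding)" in snapshot_text:
        return ("hang", "FAIL")                   # WaitFor timed out or window frozen
    if dialog_text:
        return ("error_dialog", "FAIL/PARTIAL")   # dialog text surfaced by Snapshot
    return ("ok", "PASS")                         # "PASS" is an assumed verdict name


# Simulated probe: fails twice, then succeeds (no real VM involved).
calls = {"n": 0}

def fake_probe() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3

ready = wait_with_retries(fake_probe, attempts=5, pause_s=0.0)
print(classify(pid_alive=True, window_ready=ready, snapshot_text="Main window"))
```

Checking the crash signal first matters: once the PID is gone, a WaitFor timeout is a symptom of the crash rather than a separate hang.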