From screen-mcp
See and operate the user's real, physical Linux desktop through the mcp-screen MCP — screenshot any monitor and click/type/scroll/drag any visible app. Use this whenever the user points at something on their monitor(s), desktop, or an open window and wants you to read it or act on it: "what's on my second/left monitor", "which app is focused and what does it say", "read this dashboard/dialog/popup/terminal that's up", "a box popped up — click cancel", "open Slack and check if X replied", "log into this site in the Firefox window and check a tab", "did the build in that terminal pass", "see if you can read X". Covers reading on-screen GUI content across multiple monitors and driving ANY app (Slack, browser, terminal, settings) by clicking, typing, logging in, or scrolling. Prefer it over app/web APIs (Slack, Grafana, browser tools) when the user is clearly looking at their own screen. NOT for pasted/attached images, local files, shell commands, or remote SSH/TUI sessions. Encodes the fast reliable loop + hard-won gotchas so you don't relearn them.
How this skill is triggered — by the user, by Claude, or both
Slash command
/screen-mcp:drive-screenThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
A loop, not a single call: **locate → ground → act → confirm**. The tools are `screen_screenshot`, `screen_click`, `screen_type`, `screen_key`, `screen_scroll`, `screen_drag`, `screen_move_mouse`, `screen_read_page`, `screen_do` (batch), `screen_tour`, `screen_wait`, `screen_diag`, `screen_reload`, `screen_session`.
A loop, not a single call: locate → ground → act → confirm. The tools are screen_screenshot, screen_click, screen_type, screen_key, screen_scroll, screen_drag, screen_move_mouse, screen_read_page, screen_do (batch), screen_tour, screen_wait, screen_diag, screen_reload, screen_session.
screen_screenshot() with no args = full multi-monitor overview. Use it ONLY to find where the target is and which monitor. It's the slowest shot — don't loop on it.screen_screenshot(region=[x,y,w,h]) to zoom in crisp, or add annotate=true to get OmniParser-numbered elements with exact desktop(x,y) click coords. A small region shot is the fastest and sharpest read.screen_click/screen_type/etc. Default space='view' uses the coords as seen in the latest screenshot. To click a grounded element, pass element=<id> from the last annotate=true (server resolves exact coords — no guessing). The screenshot is ground truth: whatever is shown at a pixel, a click at that pixel lands there (1:1, verified). A click that seems to "miss" is almost never the app and never a coordinate offset — it's acting on coords from an older screenshot than the one currently in effect (see "Coords belong to ONE screenshot").SENSE line in responses: it tells you what changed, if a modal opened (deal with it first), or "nothing changed" = your action was a no-op/misclick (re-ground and retry). If a static-monitor read looks stale (content you expect isn't there), pass fresh=true to force a current frame — it nudges the pointer, so it's not on by default.region=[x,y,w,h] shots are ~100–300ms and razor-sharp; the full composite is ~1s and downscaled. Locate with full, then work in regions.screen_wait/sleeps after an action before screenshotting — the screenshot already settled the frame (and force-refreshes once if a static monitor didn't emit the post-action frame). Use settle=0 for the raw instantaneous frame. Forcing freshness nudges the pointer (a visible flash), so it's deliberately sparing — reach for fresh=true only when a static-monitor read actually looks stale, not routinely.annotate=true + element=<id> beats eyeballing pixel coords. Re-ground if a SENSE "nothing changed" comes back.screen_read_page auto-scrolls and accumulates — don't hand-loop scroll+screenshot.screen_do=[...] to cut round-trips; survey several screens with screen_tour.screen_diag first whenever capture/clicks/cursor act up — it shows live geo, per-monitor power/live, cursor, and grounding health.Every screenshot prints a view#N in its text and maps view-space coords through that shot's origin/scale. The transform is a single slot that every new screenshot overwrites. So coords you read from view#7 are only valid until the next screenshot rebinds the view to view#8 — apply them afterward and they map through the wrong origin/scale and land somewhere else (a region-zoom vs full-overview mismatch can be off by a thousand pixels, even the wrong monitor). This, not "the app rejected it," is what's behind nearly every click that doesn't land.
view_id=N. When you take more than one screenshot before acting (locate-full then zoom-region, or a tour), pass view_id=<the N from the shot you read> on the click/move/scroll/drag. If a later screenshot has superseded it, the action is rejected with STALE VIEW: … instead of landing wrong — re-screenshot and use fresh coords. This makes "where I clicked" unambiguous.space='desktop' with absolute global px — those are transform-independent and never go stale (the desktop(x,y) coords from annotate=true, or element=<id>, already are absolute).ON but STATIC hint: do ONE interaction ON that monitor — screen_scroll a notch (guaranteed damage, content moves) or screen_click something that visibly repaints (a message, a button — NOT dead whitespace, which doesn't repaint), then screenshot. It's now live. If a region grab still 404s a frame (cold pipeline after a reload/long idle), screen_screenshot(regeo=true) re-probes + rewarms the pipelines and the full composite comes back. live:false in screen_diag = "no frame seen yet," not "off." Only call it genuinely ASLEEP/DPMS (ask the user to wake it) if interaction + regeo BOTH fail — don't jump to that.force=true to take control back — but only when the user has handed control back.view#N (pass view_id and watch for STALE VIEW). A fresh full or region screenshot followed immediately by a click on what you see there works in essentially every app. Only after that's ruled out should you consider an app quirk (e.g. a list row that genuinely needs a double-click, or a digest pane with no row handler) — and then switch to the app's primary navigation (sidebar entry, or its search/quick-switcher like Ctrl+K) rather than re-clicking the same spot.screen_type/screen_key land in whatever window holds keyboard focus, NOT the one you're looking at — a background/static-monitor app gets nothing, your keys go elsewhere. The reliable, works-everywhere fix is click-to-focus: screen_click into the target window (its message area / an empty content spot) right before typing — a real click focuses the window on any Wayland compositor, and also wakes a static monitor (the click is damage). This is the fix for "I typed but nothing happened" — it's a focus problem, not a frame-throttle (idle outputs still receive input). After clicking, screenshot to confirm before the burst.screen_focus(app=..) / focus="app" are conveniences — not required. They activate a window by name (handy when it isn't visible to click). They use an optional GNOME-shell helper if present, else the overview (which RAISES the window but may not reliably keyboard-focus it on a multi-monitor static setup). So if keys still don't land after screen_focus, fall back to the universal path: surface the window (it's now raised), then screen_click into it to take keyboard focus, then type. Never tell the user to install anything or re-login — the tool must work with their session as-is.region shot only shows the slice you asked for. If you read the lower half of a chat/list and scrolling "does nothing," the rest is likely in the half you didn't capture — widen the region to the full window (or capture top and bottom halves) before deciding the content is short or the scroll failed. Pair this with screen_diag geo to know the window/monitor bounds.Super overview (its hover tooltip shows the window title) and activate that. When focusing via search, search the application (e.g. the browser) name, not the page title.x,y (or rely on the last view's center) so it scrolls the pane you mean. After scrolling, the screenshot waits for the frame to actually CHANGE before returning, so you see the post-scroll content (not a stale frame). If a scroll truly doesn't move (rare), re-ground and confirm you're over the scrollable pane, not a fixed header.screen_diag shows uinput.available:true, clicks/keys/scroll go through a kernel-level unified pointer device (exact landing, monitor-state-independent, scroll lands in Electron). Otherwise the portal path is used (less reliable for motion/scroll on static monitors). Needs /dev/uinput writable + python-evdev.screen_reload re-execs the server in place (no /mcp reconnect needed) to pick up changes.Report what's actually on screen. If the target app isn't visible, the monitor's asleep, or a view won't navigate, say so and ask the user for the one physical action only they can do (wake a monitor, foreground an app) — don't guess or invent content.
Requests code review by dispatching a subagent with git diff context. Use after completing tasks, major features, or before merging to catch issues early.
npx claudepluginhub 88plug/claude-code-plugins --plugin screen-mcp