Automate UI interactions across Android, iOS, HarmonyOS, desktop (macOS/Windows/Linux), and browsers using AI vision from screenshots. Execute taps, swipes, typing, app launches, and E2E tests via natural language commands without DOM or accessibility labels.
Vision-driven Android device automation using Midscene. Operates entirely from screenshots — no DOM or accessibility labels required. Can interact with all visible elements on screen regardless of technology stack. Control Android devices with natural language commands via ADB. Perform taps, swipes, text input, app launches, screenshots, and more. Trigger keywords: android, phone, mobile app, tap, swipe, install app, open app on phone, android device, mobile automation, adb, launch app, mobile screen, test android app, verify mobile app, QA on phone, check the app on android, test on device, see if the app works on phone, end-to-end test on android, visual verification on mobile.
Powered by Midscene.js (https://midscenejs.com)
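The Android skill drives devices over ADB, so a device or emulator needs to be visible to adb before the agent can use it. A quick check with the standard ADB command:
adb devices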
Vision-driven browser automation using Midscene. Operates from screenshots — no DOM or accessibility labels needed. Runs in headless Puppeteer — does NOT take over the user's mouse or keyboard. Also supports CDP mode and Bridge mode to connect to an existing Chrome. Use this skill when the user wants to:
- Browse, navigate, or open web pages
- Scrape, extract, or collect data from websites
- Fill out forms, click buttons, or interact with web elements
- Verify, validate, test, or QA frontend UI behavior
- Take screenshots of web pages
- Automate multi-step web workflows
- Test what was just built, see if it works in browser
- Connect to Chrome via CDP, DevTools Protocol, or remote debugging
- Connect to user's Chrome browser, control my browser, operate my Chrome
Powered by Midscene.js (https://midscenejs.com)
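For CDP mode, the skill attaches to a Chrome instance that exposes a DevTools remote-debugging endpoint. One common way to launch such an instance (a standard Chrome flag; the binary name varies by OS, and the skill's own instructions cover the exact setup):
google-chrome --remote-debugging-port=9222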
Vision-driven desktop automation using Midscene. Control your desktop (macOS, Windows, Linux) with natural language commands. Operates entirely from screenshots — no DOM or accessibility labels required. Can interact with all visible elements on screen regardless of technology stack. ⚠️ Takes over the user's real mouse and keyboard. For web apps, prefer "Browser Automation" instead. Only use this for desktop-native apps (Electron, Qt, native macOS/Windows/Linux) that cannot run in a browser. Triggers: open app, press key, desktop, computer, click on screen, type text, screenshot desktop, launch application, switch window, desktop automation, control computer, mouse click, keyboard shortcut, screen capture, find on screen, read screen, verify window, close app, test Electron app.
Powered by Midscene.js (https://midscenejs.com)
Vision-driven HarmonyOS NEXT device automation using Midscene. Operates entirely from screenshots — no DOM or accessibility labels required. Can interact with all visible elements on screen regardless of technology stack. Control HarmonyOS devices with natural language commands via HDC. Perform taps, swipes, text input, app launches, screenshots, and more. Trigger keywords: harmony, harmonyos, 鸿蒙, hdc, huawei device, harmony app, harmony automation, harmony phone, harmony tablet, test harmony app, verify on harmonyos, QA on 鸿蒙, check the app on harmony, test on huawei device, see if the app works on harmony, end-to-end test on harmonyos, visual verification on 鸿蒙.
Powered by Midscene.js (https://midscenejs.com)
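The HarmonyOS skill controls devices through HDC, so the device needs to show up in hdc first. A quick check with the standard HDC command:
hdc list targets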
Vision-driven iOS device automation using Midscene CLI. Operates entirely from screenshots — no DOM or accessibility labels required. Can interact with all visible elements on screen regardless of technology stack. Control iOS devices with natural language commands via WebDriverAgent. Triggers: ios, iphone, ipad, ios app, tap on iphone, swipe, mobile app ios, ios device, ios testing, iphone automation, ipad automation, ios screen, ios navigate, test ios app, verify on iphone, QA on ipad, check the app on ios, test on ios device, see if the app works on iphone, end-to-end test on ios, visual verification on ios.
Powered by Midscene.js (https://midscenejs.com)
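The iOS skill relies on WebDriverAgent running on the device; the skill's own setup notes cover that part, but you can confirm that your Mac sees the device with standard Xcode tooling:
xcrun xctrace list devices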
Enhances Vitest with Midscene for AI-powered UI testing across Web (Playwright), Android (ADB), and iOS (WDA). Scaffolds new projects, converts existing projects, and creates/updates/debugs/runs E2E tests using natural-language UI interactions. Triggers: write test, add test, create test, update test, fix test, debug test, run test, e2e test, midscene test, new project, convert project, init project, 写测试, 加测试, 创建测试, 更新测试, 修复测试, 调试测试, 运行测试, 新建工程, 转化工程.
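The tests this skill creates are regular Vitest suites, so after scaffolding they can be run with Vitest's own CLI (the scaffolded project may also expose this as an npm script):
npx vitest run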
Vision-driven cross-platform automation
skills/browser
skills/chrome-bridge
skills/computer-automation
skills/android-automation
skills/ios-automation
skills/harmony-automation
skills/vitest-midscene-e2e
⚠️ AI-driven UI automation may produce unpredictable results since it can control EVERYTHING on the screen. Please evaluate the risks carefully before use.
Make sure you have Node.js installed.
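A quick way to confirm Node.js is available:
node -v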
Then install the skills:
# General installation
npx skills add web-infra-dev/midscene-skills
# Claude Code
npx skills add web-infra-dev/midscene-skills -a claude-code
# OpenClaw
npx skills add web-infra-dev/midscene-skills -a openclaw
Midscene requires models with strong visual grounding capabilities (accurate UI element localization from screenshots).
Because of this, you need to prepare model access and configuration separately from skill installation.
Make sure these environment variables are available in your system. You can also define them in a .env file in the current directory, and Midscene will load them automatically:
MIDSCENE_MODEL_API_KEY="your-api-key"
MIDSCENE_MODEL_NAME="model-name"
MIDSCENE_MODEL_BASE_URL="https://..."
MIDSCENE_MODEL_FAMILY="family-identifier"
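If you would rather not use a .env file, exporting the same variables in the shell that launches your agent works as well (placeholder values shown):
export MIDSCENE_MODEL_API_KEY="your-api-key"
export MIDSCENE_MODEL_NAME="model-name"
export MIDSCENE_MODEL_BASE_URL="https://..."
export MIDSCENE_MODEL_FAMILY="family-identifier"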
Example: Gemini (Gemini-3-Flash)
MIDSCENE_MODEL_API_KEY="your-google-api-key"
MIDSCENE_MODEL_NAME="gemini-3-flash"
MIDSCENE_MODEL_BASE_URL="https://generativelanguage.googleapis.com/v1beta/openai/"
MIDSCENE_MODEL_FAMILY="gemini"
Example: Qwen3-VL
MIDSCENE_MODEL_API_KEY="your-openrouter-api-key"
MIDSCENE_MODEL_NAME="qwen/qwen3-vl-235b-a22b-instruct"
MIDSCENE_MODEL_BASE_URL="https://openrouter.ai/api/v1"
MIDSCENE_MODEL_FAMILY="qwen3-vl"
Example: Doubao Seed 1.6
MIDSCENE_MODEL_API_KEY="your-doubao-api-key"
MIDSCENE_MODEL_NAME="doubao-seed-1-6-250615"
MIDSCENE_MODEL_BASE_URL="https://ark.cn-beijing.volces.com/api/v3"
MIDSCENE_MODEL_FAMILY="doubao-vision"
Commonly used models: Doubao Seed 1.6, Qwen3-VL, Zhipu GLM-4.6V, Gemini-3-Pro, Gemini-3-Flash.
Model setup docs: see the Midscene documentation at https://midscenejs.com.
In your chatbot or coding agent, you can say:
Use the Midscene computer skill to open the Keynote app and create a new presentation.
Use the Midscene browser skill to open the Google search page and search for "Midscene".
For bug reports, feature requests, and discussions, please visit the main Midscene repository: https://github.com/web-infra-dev/midscene/issues
License: MIT