VLM Run Skills are definitions for visual AI tasks like image understanding, video processing, and document extraction. They are interoperable with Anthropic's Claude Code.
The Skills in this repository follow the standardized Agent Skill format.
## How do Skills work?
In practice, skills are self-contained folders that package instructions, scripts, and resources for an AI agent to use on a specific task. Each folder includes a SKILL.md file with YAML frontmatter (name and description) followed by the guidance the agent follows while the skill is active.
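As a sketch, a minimal SKILL.md might look like the following (the name, description, and body text here are illustrative, not taken from an actual skill in this repository):

```markdown
---
name: image-captioning
description: Describe and caption images using VLM Run's visual AI capabilities.
---

# Image Captioning

When this skill is active, use the bundled instructions and scripts to
generate captions for the images the user provides.
```

The frontmatter identifies the skill to the agent; everything after it is the guidance the agent follows.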
## Features
### Image Intelligence
- Understanding & Captioning: Describe, analyze, and interpret images with state-of-the-art visual intelligence
- Detection & Localization: Detect and locate objects, people, faces, and custom entities with bounding boxes
- Segmentation: Segment objects, scenes, and regions with pixel-level precision
- Generation & Editing: Generate images from text, edit existing images, apply super-resolution, colorize B&W photos
- Tools: Crop, rotate, enhance resolution (4x-8x upscaling), de-oldify (colorization)
- Visual Grounding: Point to and extract specific elements using natural language queries
- UI Parsing: Extract UI elements, layouts, and hierarchies from screenshots
### Video Intelligence
- Understanding & Captioning: Describe video content, generate summaries and detailed scene analysis
- Transcription: Extract audio transcripts with timestamps
- Tools: Trim videos, extract keyframes, sample frames at intervals, detect highlights
- Segmentation: Identify and segment objects across video frames
- Generation & Editing: Generate videos from text prompts, edit existing videos
### Document Intelligence
- Layout Understanding: Detect headers, paragraphs, tables, figures, lists, and structural elements
- Multi-Page Analysis: Process and analyze PDFs with intelligent page-aware extraction
- Markdown Extraction: Convert documents to clean, structured markdown with preserved formatting
- Visual Grounding: Locate and extract specific fields, sections, or data points
- Data Extraction: Extract key information from invoices, receipts, contracts, forms into structured JSON
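As an illustration of the structured-JSON output above, an invoice extraction might return something along these lines (the exact schema depends on the task; every field name below is hypothetical):

```json
{
  "invoice_number": "INV-0042",
  "issue_date": "2024-03-15",
  "vendor": { "name": "Acme Corp" },
  "line_items": [
    { "description": "Widget", "quantity": 2, "unit_price": 9.99 }
  ],
  "total": 19.98
}
```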
### Multi-modal Agents
- Multi-Modal Reasoning: Execute complex multi-step workflows across images, documents, and videos
- Structured Outputs: Get results in validated JSON schemas with automatic retry logic
See docs and technical whitepaper for more information.
## Installation
### Prerequisites
- Get your VLM Run API key from app.vlm.run
- Have uv installed for Python environment management
### Claude Code
- Register the repository as a plugin marketplace:

  ```
  /plugin marketplace add vlm-run/skills
  ```

- To install a skill, run:

  ```
  /plugin install <skill-name>@vlm-run/skills
  ```

  For example:

  ```
  /plugin install vlmrun-cli-skill@vlm-run/skills
  ```
### Configure your API key
Once the skill is installed, configure your API key using the CLI:

```sh
vlmrun config init
vlmrun config set --api-key <your-api-key>
vlmrun config show
```
### Verify Installation
After restarting Claude Code, verify the skill is loaded by asking:

```
What skills are available in the /vlmrun-cli-skill?
```
### Installing in Claude for Desktop