From claude-video-vision
Analyzes video files (.mp4, .mov, .avi, .mkv, .webm) using ffmpeg for scene changes, silence, motion, transcription; extracts targeted frames and audio segments.
Install it from the plugin hub:

```
npx claudepluginhub jordanrendric/claude-video-vision --plugin claude-video-vision
```

This skill uses the workspace's default tool permissions.
You have access to video understanding tools via the claude-video-vision MCP server.
- video_analyze — Analyze video structure with ffmpeg filters (scene changes, silence, motion, etc.). Use this BEFORE extracting frames to plan your strategy.
- video_watch — Extract frames and process audio from a video. Supports variable FPS/resolution per segment.
- video_detail — Drill into specific segments. Separates extraction from viewing: extract many frames, view few at a time.
- video_info — Get video metadata without processing.
- video_configure — Change settings (backend, resolution, enable_index, etc.).
- video_setup — Check/install dependencies.

IMPORTANT: You MUST follow these steps in order. Do NOT skip step 2.
Always start with video_info to get duration, resolution, and audio presence.
REQUIRED for videos > 30s: Call video_analyze BEFORE extracting any frames.
This is NOT optional — it gives you structural data to make smart extraction decisions.
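The required ordering can be sketched as a small plan-builder. The tool names come from this skill; the argument shapes (`path`, `filters`) are illustrative assumptions, not the server's actual schema:

```python
# Sketch of the required call order. Only the tool names are from this
# skill; the argument keys are hypothetical.
def next_tool(duration_s: float) -> str:
    """Videos longer than 30s MUST be analyzed before any frame extraction."""
    return "video_analyze" if duration_s > 30 else "video_watch"

# e.g. video_info reports a 3-minute video, so analysis comes first
call_plan = [
    {"tool": "video_info", "args": {"path": "demo.mp4"}},
    {"tool": next_tool(180.0),
     "args": {"path": "demo.mp4",
              "filters": ["scene_changes", "silence"],
              "transcription": True}},
]
```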
Select filters relevant to the user's question:
| User intent | Filters to select |
|---|---|
| "What happens in this video?" | scene_changes, silence, transcription |
| "Find the scene transitions" | scene_changes, black_intervals |
| "Are there frozen/stuck parts?" | freeze, blur |
| "Is this a talking head or action?" | motion |
| "When does the music start?" | silence, loudness |
| "Analyze the lighting" | exposure |
| "Summarize this lecture" | transcription, scene_changes, silence |
| General / unclear intent | scene_changes, silence, transcription |
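The table above amounts to a simple lookup with a general fallback. A minimal sketch (the dict keys and helper are illustrative; only the filter names come from this skill):

```python
# Illustrative encoding of the intent-to-filter table above.
DEFAULT_FILTERS = ["scene_changes", "silence", "transcription"]

FILTERS_BY_INTENT = {
    "what_happens": DEFAULT_FILTERS,
    "scene_transitions": ["scene_changes", "black_intervals"],
    "frozen_parts": ["freeze", "blur"],
    "talking_head_or_action": ["motion"],
    "music_start": ["silence", "loudness"],
    "lighting": ["exposure"],
    "lecture_summary": ["transcription", "scene_changes", "silence"],
}

def pick_filters(intent: str) -> list[str]:
    # Unclear intent falls back to the general default row.
    return FILTERS_BY_INTENT.get(intent, DEFAULT_FILTERS)
```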
Always include transcription: true when the video has audio — the transcription
tells you WHERE to look visually.
Use the analysis results and transcription to plan your frame extraction strategy:
Call video_watch to extract frames:
- For short videos: fps: "auto" without view_sample. Short videos need full coverage to avoid missing brief moments, and the auto FPS already adapts to duration.
- For longer videos: segments based on analysis data with variable FPS, plus view_sample to limit the initial frame count. You can always drill deeper with video_detail.

Use video_detail to drill into specific moments:

- view_sample: 3 to preview (first, middle, last frame)
- view if you need more detail

When the user asks follow-up questions about the same video, consult the manifest already in your context. Do not re-extract frames you already have at the same resolution, and do not re-request frames that are already in context.
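Putting this together, a video_watch call driven by analysis data might look like the sketch below. The segment boundaries are invented for illustration; the key names follow this skill's parameter docs but are not guaranteed to match the server's exact schema:

```python
# Hypothetical video_watch arguments built from analysis output.
watch_args = {
    "path": "demo.mp4",
    "segments": [
        # Low-motion talking-head intro: sparse, small frames
        {"start_time": 0.0, "end_time": 45.0, "fps": 0.2, "resolution": 384},
        # Motion spike flagged by analysis: denser, larger frames
        {"start_time": 45.0, "end_time": 90.0, "fps": 2.0, "resolution": 768},
    ],
    # Preview a few evenly spaced frames first; drill in with video_detail.
    "view_sample": 6,
}
```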
- fps: "auto" for a general overview. Use the video's original fps (from video_info) for frame-by-frame detail, 5-10 for analyzing specific short moments, and 0.1-0.5 for long videos.
- resolution: 256-512 for quick scans, 512-768 for normal analysis, 1024+ when reading on-screen text or fine details.
- segments: Use when you have analysis data. Each segment can have its own fps and resolution; overrides global fps/start_time/end_time.
- view_sample: Returns N evenly spaced frames from the extracted set. Use this to avoid flooding context with too many images.
- skip_audio: Set to true when you only need visual analysis.
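The guidance above can be condensed into rule-of-thumb pickers. The numeric thresholds restate the doc; the helper functions themselves are illustrative assumptions, not part of the server API:

```python
# Rule-of-thumb parameter pickers restating the guidance above.
def pick_resolution(task: str) -> int:
    # quick scans: 256-512, normal: 512-768, reading text: 1024+
    table = {"quick_scan": 384, "normal": 640, "read_text": 1024}
    return table.get(task, 640)

def pick_fps(duration_s: float, focused_moment: bool = False) -> float:
    if focused_moment:
        return 5.0   # 5-10 for analyzing specific short moments
    if duration_s >= 600:
        return 0.25  # 0.1-0.5 for long videos
    return 1.0       # otherwise prefer fps: "auto"; 1.0 is a stand-in value
```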
You receive the extracted frames, the processed audio, and the analysis results. Combine all sources to form a complete understanding: use analysis + transcription to guide where you look visually. The analysis tells you WHEN things happen; the frames tell you WHAT happens.