From claude-video-vision
Analyzes video files (.mp4, .mov, .avi, .mkv, .webm) using ffmpeg for scene changes, silence, motion, transcription; extracts targeted frames and audio segments.
Install it from the plugin hub:

```
npx claudepluginhub jordanrendric/claude-video-vision --plugin claude-video-vision
```

This skill uses the workspace's default tool permissions.
You have access to video understanding tools via the claude-video-vision MCP server.
- video_analyze — Analyze video structure with ffmpeg filters (scene changes, silence, motion, etc.). Use this BEFORE extracting frames to plan your strategy.
- video_watch — Extract frames and process audio from a video. Supports variable FPS/resolution per segment.
- video_detail — Drill into specific segments. Separates extraction from viewing: extract many frames, view few at a time.
- video_info — Get video metadata without processing.
- video_configure — Change settings (backend, resolution, enable_index, etc.).
- video_setup — Check/install dependencies.

IMPORTANT: You MUST follow these steps in order. Do NOT skip step 2.
Always start with video_info to get duration, resolution, and audio presence.
REQUIRED for videos > 30s: Call video_analyze BEFORE extracting any frames.
This is NOT optional — it gives you structural data to make smart extraction decisions.
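The required ordering can be sketched as a small plan-builder. The tool names come from this skill; the argument shapes (`path`, `filters`) are illustrative assumptions, not the server's actual schema:

```python
# Sketch of the required call order. Only the tool names are from this
# skill; the argument keys are hypothetical.
def next_tool(duration_s: float) -> str:
    """Videos longer than 30s MUST be analyzed before any frame extraction."""
    return "video_analyze" if duration_s > 30 else "video_watch"

# e.g. video_info reports a 3-minute video, so analysis comes first
call_plan = [
    {"tool": "video_info", "args": {"path": "demo.mp4"}},
    {"tool": next_tool(180.0),
     "args": {"path": "demo.mp4",
              "filters": ["scene_changes", "silence"],
              "transcription": True}},
]
```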
Select filters relevant to the user's question:
| User intent | Filters to select |
|---|---|
| "What happens in this video?" | scene_changes, silence, transcription |
| "Find the scene transitions" | scene_changes, black_intervals |
| "Are there frozen/stuck parts?" | freeze, blur |
| "Is this a talking head or action?" | motion |
| "When does the music start?" | silence, loudness |
| "Analyze the lighting" | exposure |
| "Summarize this lecture" | transcription, scene_changes, silence |
| General / unclear intent | scene_changes, silence, transcription |
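The table above amounts to a simple lookup with a general fallback. A minimal sketch (the dict keys and helper are illustrative; only the filter names come from this skill):

```python
# Illustrative encoding of the intent-to-filter table above.
DEFAULT_FILTERS = ["scene_changes", "silence", "transcription"]

FILTERS_BY_INTENT = {
    "what_happens": DEFAULT_FILTERS,
    "scene_transitions": ["scene_changes", "black_intervals"],
    "frozen_parts": ["freeze", "blur"],
    "talking_head_or_action": ["motion"],
    "music_start": ["silence", "loudness"],
    "lighting": ["exposure"],
    "lecture_summary": ["transcription", "scene_changes", "silence"],
}

def pick_filters(intent: str) -> list[str]:
    # Unclear intent falls back to the general default row.
    return FILTERS_BY_INTENT.get(intent, DEFAULT_FILTERS)
```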
Always include transcription: true when the video has audio — the transcription
tells you WHERE to look visually.
Use the analysis results and transcription to plan your frame extraction strategy:
Call video_watch to extract frames:
- For short videos: fps: "auto" without view_sample. Short videos need full coverage to avoid missing brief moments, and the auto FPS already adapts to duration.
- For longer videos: segments based on analysis data with variable FPS, plus view_sample to limit the initial frame count. You can always drill deeper with video_detail.

Use video_detail to drill into specific moments:

- view_sample: 3 to preview (first, middle, last frame)
- view if you need more detail

When the user asks follow-up questions about the same video, consult the manifest already in your context. Do not re-extract frames you already have at the same resolution, and do not re-request frames that are already in context.
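Putting this together, a video_watch call driven by analysis data might look like the sketch below. The segment boundaries are invented for illustration; the key names follow this skill's parameter docs but are not guaranteed to match the server's exact schema:

```python
# Hypothetical video_watch arguments built from analysis output.
watch_args = {
    "path": "demo.mp4",
    "segments": [
        # Low-motion talking-head intro: sparse, small frames
        {"start_time": 0.0, "end_time": 45.0, "fps": 0.2, "resolution": 384},
        # Motion spike flagged by analysis: denser, larger frames
        {"start_time": 45.0, "end_time": 90.0, "fps": 2.0, "resolution": 768},
    ],
    # Preview a few evenly spaced frames first; drill in with video_detail.
    "view_sample": 6,
}
```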
- fps: "auto" for a general overview. Use the video's original fps (from video_info) for frame-by-frame detail, 5-10 for analyzing specific short moments, and 0.1-0.5 for long videos.
- resolution: 256-512 for quick scans, 512-768 for normal analysis, 1024+ when reading on-screen text or fine details.
- segments: Use when you have analysis data. Each segment can have its own fps and resolution; overrides global fps/start_time/end_time.
- view_sample: Returns N evenly spaced frames from the extracted set. Use this to avoid flooding context with too many images.
- skip_audio: Set to true when you only need visual analysis.
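The guidance above can be condensed into rule-of-thumb pickers. The numeric thresholds restate the doc; the helper functions themselves are illustrative assumptions, not part of the server API:

```python
# Rule-of-thumb parameter pickers restating the guidance above.
def pick_resolution(task: str) -> int:
    # quick scans: 256-512, normal: 512-768, reading text: 1024+
    table = {"quick_scan": 384, "normal": 640, "read_text": 1024}
    return table.get(task, 640)

def pick_fps(duration_s: float, focused_moment: bool = False) -> float:
    if focused_moment:
        return 5.0   # 5-10 for analyzing specific short moments
    if duration_s >= 600:
        return 0.25  # 0.1-0.5 for long videos
    return 1.0       # otherwise prefer fps: "auto"; 1.0 is a stand-in value
```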
You receive the extracted frames, the processed audio, and the analysis results. Combine all sources to form a complete understanding: use analysis + transcription to guide where you look visually. The analysis tells you WHEN things happen; the frames tell you WHAT happens.