From video-recap-skills
Analyzes a video into a structured understanding index: scene detection, ASR transcript, per-scene visual analysis, silence windows, fused timeline, and narration brief. Use to index, summarize, or prepare video content for downstream narration.
Slash command
/video-recap-skills:video-understanding
Turns a source video into an **understanding index** an agent (or a downstream stage) can read:
- scenes.json (cut points, durations) + junk-scene filtering.
- asr_result.json (timestamped dialogue) via MiMo mimo-v2.5-asr.
- silence_periods.json (quiet windows, has_speech flag).
- vlm_analysis.json (per-scene description, depth analysis, frame_facts).
- timeline_fusion.json, asr_writing_chunks.json, agent_narration_brief.md.

Stateless: reusable stages are skipped only when their output and provenance sidecar match
the current source video plus output-affecting settings. --force recomputes.
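The skip rule above can be sketched as a fingerprint comparison. This is only an illustration under stated assumptions: the sidecar file name (`<output>.provenance.json`), its `fingerprint` field, and the use of file size and modification time as the video's identity are all hypothetical, not the skill's actual format.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(video: Path, settings: dict) -> str:
    """Hash the source video's identity plus the output-affecting settings."""
    stat = video.stat()
    payload = {
        "path": str(video.resolve()),
        "size": stat.st_size,
        "mtime_ns": stat.st_mtime_ns,
        "settings": settings,
    }
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def sidecar_for(output: Path) -> Path:
    # Hypothetical naming convention: scenes.json -> scenes.json.provenance.json
    return output.with_name(output.name + ".provenance.json")


def record(output: Path, video: Path, settings: dict) -> None:
    """Write the sidecar after a stage has produced its output."""
    data = {"fingerprint": fingerprint(video, settings)}
    sidecar_for(output).write_text(json.dumps(data), encoding="utf-8")


def can_skip(output: Path, video: Path, settings: dict, force: bool = False) -> bool:
    """True only if the output exists and its sidecar matches the current inputs."""
    sidecar = sidecar_for(output)
    if force or not output.exists() or not sidecar.exists():
        return False
    try:
        recorded = json.loads(sidecar.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return False  # an unreadable sidecar is treated as a mismatch
    return isinstance(recorded, dict) and recorded.get("fingerprint") == fingerprint(video, settings)
```

Requiring both the output and a matching sidecar means a change to any output-affecting setting, such as the scene threshold, invalidates the cached stage without the caller having to remember to pass --force.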
# ffmpeg: brew install ffmpeg | apt install ffmpeg | choco install ffmpeg
export MIMO_API_KEY=*** # one key drives ASR (mimo-v2.5-asr) + VLM (mimo-v2.5)
ASR uses MiMo mimo-v2.5-asr; pass --skip-asr to skip dialogue transcription. The full understanding run still requires MIMO_API_KEY for VLM scene analysis.
Optional MiMo scene-chunk video understanding: --mimo-video-overview.
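A small preflight check can catch the two prerequisites above before a long run starts. The sketch below is a hypothetical helper, not part of the skill; it assumes ffmpeg is expected on PATH and that the key is read from the MIMO_API_KEY environment variable.

```python
import os
import shutil


def preflight(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready to run."""
    env = os.environ if env is None else env
    problems = []
    if which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    # The key is needed even with --skip-asr, because VLM scene analysis uses it too.
    if not env.get("MIMO_API_KEY"):
        problems.append("MIMO_API_KEY is not set")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print(f"preflight: {problem}")
```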
If work_dir/background_research.json exists (story research the agent did first, see
references/research-guide.md), its synopsis and named characters are folded into the VLM
context, so scene descriptions can name people and read scenes with plot knowledge. Combine with
--context for a quick inline hint.
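As a rough illustration of how research and an inline hint might be combined into one context string, consider the sketch below. The JSON keys `synopsis` and `characters` (a list of names) are assumptions made for the example; the real schema of background_research.json is whatever references/research-guide.md defines, and the actual prompt format is not shown in this document.

```python
import json
from pathlib import Path


def build_context(work_dir: Path, inline_hint: str = "") -> str:
    """Merge optional background research with an optional --context hint."""
    parts = []
    research_path = work_dir / "background_research.json"
    if research_path.exists():
        research = json.loads(research_path.read_text(encoding="utf-8"))
        synopsis = research.get("synopsis", "")      # hypothetical key
        characters = research.get("characters", [])  # hypothetical key: list of names
        if synopsis:
            parts.append(f"Synopsis: {synopsis}")
        if characters:
            parts.append("Characters: " + ", ".join(characters))
    if inline_hint:
        parts.append(f"Hint: {inline_hint}")
    return "\n".join(parts)
```

When the research file is absent, the result falls back to the inline hint alone, which mirrors the optional nature of both inputs.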
python3 scripts/understand.py <video> --work-dir <work_dir> \
[--context "show title/character names"] [--scene-threshold 0.1] [--skip-asr] [--mimo-video-overview] [--force]
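When driving the script from another program, the same invocation can be assembled as an argument list. This is a minimal sketch: `build_command` is a hypothetical wrapper, and it uses only the flags shown in the usage line above.

```python
import subprocess


def build_command(video, work_dir, context=None, scene_threshold=None,
                  skip_asr=False, mimo_video_overview=False, force=False):
    """Assemble the understand.py argument list from the documented options."""
    cmd = ["python3", "scripts/understand.py", str(video), "--work-dir", str(work_dir)]
    if context:
        cmd += ["--context", context]
    if scene_threshold is not None:
        cmd += ["--scene-threshold", str(scene_threshold)]
    if skip_asr:
        cmd.append("--skip-asr")
    if mimo_video_overview:
        cmd.append("--mimo-video-overview")
    if force:
        cmd.append("--force")
    return cmd


def run_understanding(video, work_dir, **options):
    """Run the stage and raise CalledProcessError if the script exits non-zero."""
    return subprocess.run(build_command(video, work_dir, **options), check=True)
```

Passing a list rather than a shell string avoids quoting problems when the context hint contains spaces or non-ASCII characters.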
| File | Content |
|---|---|
| scenes.json | scene cut list (start/end/duration) |
| asr_result.json | [{start, end, text}] timestamped transcript |
| vlm_analysis.json | per-scene description / depth / frame_facts |
| silence_periods.json | [{start, end, duration, has_speech}] quiet windows |
| timeline_fusion.json | VLM + ASR + silence overlap, unified timeline |
| asr_writing_chunks.json | ASR split at sentence boundaries, scene-aligned |
| agent_narration_brief.md | the human/agent-facing writing brief (read this first) |
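To make the idea of overlap-based fusion concrete, here is a minimal sketch that attaches each transcript segment to the scene it overlaps most and picks out quiet windows with no speech. It assumes the data has already been loaded as lists of dicts with the fields listed in the table, and that `has_speech` is false when a window contains no dialogue; the function names and the `min_duration` parameter are hypothetical, and the skill's own fusion logic may differ.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_dialogue(scenes, segments):
    """Attach each ASR segment's text to the scene it overlaps the most."""
    fused = [{**scene, "dialogue": []} for scene in scenes]
    for seg in segments:
        best, best_overlap = None, 0.0
        for entry in fused:
            shared = overlap(seg["start"], seg["end"], entry["start"], entry["end"])
            if shared > best_overlap:
                best, best_overlap = entry, shared
        if best is not None:  # segments that overlap no scene are dropped
            best["dialogue"].append(seg["text"])
    return fused


def narration_windows(silences, min_duration=1.0):
    """Quiet windows without speech that are long enough to hold narration."""
    return [
        s for s in silences
        if not s["has_speech"] and s["duration"] >= min_duration
    ]
```

Assigning by largest overlap, rather than by start time, keeps a line of dialogue that straddles a cut with the scene where most of it is spoken.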
Downstream, video-script reads the brief + index to write narration.json.
- references/research-guide.md (writes background_research.json).
- references/data-schema.md.

npx claudepluginhub worldwonderer/video-recap-skills