Skill: video-understanding

Analyzes a video into a structured understanding index: scene detection, ASR transcript, per-scene visual analysis, silence windows, fused timeline, and narration brief. Use to index, summarize, or prepare video content for downstream narration.

Tags: Python · FFmpeg · ai-ml

Popularity: 287 stars

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/video-recap-skills:video-understanding

Not user invocable

Model invocable

Inline context

Default effort

Context Preview (the summary Claude sees in its skill listing, used to decide when to auto-load this skill)

Turns a source video into an **understanding index** an agent (or a downstream stage) can read:

Supporting Files

references/data-schema.md
references/prompt-templates.md
references/research-guide.md
scripts/asr.py
scripts/brief.py
scripts/consolidate.py
scripts/detect.py
scripts/extract.py
scripts/lib.py
scripts/storyboard.py
scripts/understand.py
scripts/vlm.py

SKILL.md (71 lines, ~886 tokens)

Stats: Language Python · Stars 287 · Forks 49 · Maintenance Excellent · Last commit Jun 19, 2026


What this does

Turns a source video into an understanding index an agent (or a downstream stage) can read:

Scene detection — scenes.json (cut points, durations) + junk-scene filtering.
Frame extraction — sampled frames for the visual analysis.
ASR — asr_result.json (timestamped dialogue) via MiMo mimo-v2.5-asr.
Silence detection — silence_periods.json (quiet windows, has_speech flag).
VLM analysis — vlm_analysis.json (per-scene description, depth analysis, frame_facts).
Timeline fusion + brief — timeline_fusion.json, asr_writing_chunks.json, agent_narration_brief.md.
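The scene-detection stage above can be sketched as follows. This is a minimal illustration under assumptions, not the code in scripts/detect.py: it assumes cut points come from FFmpeg's `select` filter scene score plus `showinfo` (for example `ffmpeg -i input.mp4 -vf "select='gt(scene,0.1)',showinfo" -f null -`, which logs a `pts_time:` value to stderr for every frame that passes), and it uses a hypothetical `min_duration` rule as the junk-scene filter. The helper names `parse_showinfo_cuts` and `build_scenes` are illustrative.

```python
import re
from typing import Dict, List, Sequence

# showinfo prints one line per selected frame, e.g. "... pts_time:12.345 ..."
_PTS_TIME = re.compile(r"pts_time:\s*([0-9.]+)")


def parse_showinfo_cuts(stderr_text: str) -> List[float]:
    """Extract cut timestamps (seconds) from FFmpeg showinfo log output."""
    return [float(m.group(1)) for m in _PTS_TIME.finditer(stderr_text)]


def build_scenes(cuts: Sequence[float], video_duration: float,
                 min_duration: float = 0.5) -> List[Dict[str, float]]:
    """Turn cut timestamps into start/end/duration scene records.

    min_duration is a hypothetical junk threshold: scenes shorter than it
    (e.g. flash frames between two close cuts) are dropped.
    """
    # Boundaries: start of video, every cut strictly inside it, end of video.
    inner = sorted(t for t in cuts if 0.0 < t < video_duration)
    bounds = [0.0] + inner + [video_duration]
    scenes = []
    for start, end in zip(bounds, bounds[1:]):
        duration = end - start
        if duration < min_duration:
            continue  # assumed junk-scene rule
        scenes.append({"start": round(start, 3), "end": round(end, 3),
                       "duration": round(duration, 3)})
    return scenes
```

Dropping a short scene leaves a gap in the timeline rather than merging it into a neighbour; a real implementation may prefer merging, so treat this as one possible policy.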

Stateless (all state lives in work_dir): a reusable stage is skipped only when its output file and its provenance sidecar both match the current source video and the settings that affect that output. --force recomputes regardless.
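The skip rule can be illustrated with a small cache check. This is a sketch under assumptions: the sidecar file name (`<output>.provenance.json`), its `fingerprint` key, and the choice of SHA-256 over the video bytes plus JSON-serialized settings are all hypothetical, not the skill's actual format.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Dict


def fingerprint(video: Path, settings: Dict[str, Any]) -> str:
    """Hash the source video bytes plus the output-affecting settings."""
    h = hashlib.sha256()
    with video.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    # sort_keys makes the hash independent of dict insertion order
    h.update(json.dumps(settings, sort_keys=True).encode("utf-8"))
    return h.hexdigest()


def sidecar_path(output: Path) -> Path:
    # Hypothetical naming convention: scenes.json -> scenes.json.provenance.json
    return output.with_name(output.name + ".provenance.json")


def can_reuse(output: Path, fp: str, force: bool = False) -> bool:
    """True only if the output exists and its sidecar records the same fingerprint."""
    if force or not output.exists():
        return False
    side = sidecar_path(output)
    if not side.exists():
        return False
    try:
        meta = json.loads(side.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return False  # a corrupt sidecar is treated as a cache miss
    return meta.get("fingerprint") == fp


def write_sidecar(output: Path, fp: str) -> None:
    """Record the fingerprint after a stage finishes writing its output."""
    sidecar_path(output).write_text(json.dumps({"fingerprint": fp}), encoding="utf-8")
```

Hashing the full file is slow for long videos; a cheaper variant could hash size and modification time instead, at the cost of weaker guarantees.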

Requirements

# ffmpeg: brew install ffmpeg | apt install ffmpeg | choco install ffmpeg
export MIMO_API_KEY=***          # one key drives ASR (mimo-v2.5-asr) + VLM (mimo-v2.5)

ASR uses MiMo mimo-v2.5-asr; pass --skip-asr to skip dialogue transcription. A full understanding run still requires MIMO_API_KEY even with --skip-asr, because VLM scene analysis also uses MiMo. To add optional MiMo video understanding over scene chunks, pass --mimo-video-overview.
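A preflight check for these requirements might look like this. It is a hypothetical helper, not part of the skill's scripts; the `which` parameter exists only so the lookup can be swapped out in tests.

```python
import os
import shutil
from typing import Callable, List, Mapping, Optional


def preflight(env: Optional[Mapping[str, str]] = None,
              which: Callable[[str], Optional[str]] = shutil.which) -> List[str]:
    """Return human-readable problems; an empty list means ready to run."""
    env = os.environ if env is None else env
    problems: List[str] = []
    if which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    if not env.get("MIMO_API_KEY"):
        # The key is needed for VLM scene analysis even when ASR is skipped.
        problems.append("MIMO_API_KEY is not set (required for VLM scene analysis, even with --skip-asr)")
    return problems
```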

If work_dir/background_research.json exists (story research the agent performs beforehand; see references/research-guide.md), its synopsis and named characters are folded into the VLM context, so scene descriptions can name people and interpret scenes with knowledge of the plot. For a quick inline hint, combine it with --context.
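One way to fold research and an inline hint into a VLM prompt preamble is sketched below. The key names `synopsis` and `characters`, and the tolerance for characters given either as strings or as `{"name": ...}` objects, are assumptions; the real shape is documented in references/data-schema.md. `build_vlm_context` is a hypothetical name.

```python
import json
from pathlib import Path
from typing import List, Optional


def build_vlm_context(work_dir: Path, inline_context: Optional[str] = None) -> str:
    """Assemble a text preamble for VLM prompts (illustrative only)."""
    parts: List[str] = []
    if inline_context:
        parts.append(f"Context hint: {inline_context}")
    research = work_dir / "background_research.json"
    if research.exists():
        data = json.loads(research.read_text(encoding="utf-8"))
        # "synopsis" and "characters" are assumed key names
        synopsis = data.get("synopsis")
        if synopsis:
            parts.append(f"Synopsis: {synopsis}")
        names = []
        for c in data.get("characters", []):
            # accept either plain strings or {"name": ...} objects
            name = c if isinstance(c, str) else c.get("name", "")
            if name:
                names.append(name)
        if names:
            parts.append("Characters: " + ", ".join(names))
    return "\n".join(parts)
```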

Run

python3 scripts/understand.py <video> --work-dir <work_dir> \
  [--context "show title / character names"] [--scene-threshold 0.1] [--skip-asr] [--mimo-video-overview] [--force]
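To drive the command above from another Python process, a thin wrapper can assemble the documented flags. This is a sketch: `build_understand_cmd` and `run_understand` are hypothetical helpers, the script path is assumed relative to the skill directory, and `sys.executable` stands in for `python3`.

```python
import subprocess
import sys
from typing import List, Optional


def build_understand_cmd(video: str, work_dir: str, context: Optional[str] = None,
                         scene_threshold: Optional[float] = None, skip_asr: bool = False,
                         mimo_video_overview: bool = False, force: bool = False,
                         script: str = "scripts/understand.py") -> List[str]:
    """Build the argv list for scripts/understand.py using only the documented flags."""
    cmd = [sys.executable, script, video, "--work-dir", work_dir]
    if context:
        cmd += ["--context", context]
    if scene_threshold is not None:
        cmd += ["--scene-threshold", str(scene_threshold)]
    if skip_asr:
        cmd.append("--skip-asr")
    if mimo_video_overview:
        cmd.append("--mimo-video-overview")
    if force:
        cmd.append("--force")
    return cmd


def run_understand(**kwargs) -> None:
    # check=True raises CalledProcessError if the script exits non-zero
    subprocess.run(build_understand_cmd(**kwargs), check=True)
```

Passing an argv list (rather than a shell string) avoids quoting problems with non-ASCII `--context` values.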

Output contract

File	Content
`scenes.json`	scene cut list (start/end/duration)
`asr_result.json`	`[{start, end, text}]` timestamped transcript
`vlm_analysis.json`	per-scene description / depth / `frame_facts`
`silence_periods.json`	`[{start, end, duration, has_speech}]` quiet windows
`timeline_fusion.json`	VLM + ASR + silence overlap, unified timeline
`asr_writing_chunks.json`	ASR split at sentence boundaries, scene-aligned
`agent_narration_brief.md`	the human/agent-facing writing brief (read this first)
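The fusion step in the table can be pictured as an interval-overlap join. The sketch below is illustrative only: it reuses the `start`/`end`/`text` and `start`/`end`/`duration` fields listed above, but the output keys (`scene_index`, `dialogue`, `silences`) are hypothetical and may differ from the real timeline_fusion.json (see references/data-schema.md). It also shows one plausible way to derive `has_speech`: a quiet window is flagged when any ASR segment overlaps it.

```python
from typing import Any, Dict, List


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def mark_has_speech(silences: List[Dict[str, Any]],
                    asr: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Copy each quiet window, flagging it if any ASR segment overlaps it."""
    return [
        {**s, "has_speech": any(overlap(s["start"], s["end"], a["start"], a["end"]) > 0
                                for a in asr)}
        for s in silences
    ]


def fuse(scenes: List[Dict[str, Any]], asr: List[Dict[str, Any]],
         silences: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Attach overlapping dialogue and quiet windows to every scene."""
    timeline = []
    for i, sc in enumerate(scenes):
        lines = [a["text"] for a in asr
                 if overlap(sc["start"], sc["end"], a["start"], a["end"]) > 0]
        quiet = [s for s in silences
                 if overlap(sc["start"], sc["end"], s["start"], s["end"]) > 0]
        timeline.append({"scene_index": i, "start": sc["start"], "end": sc["end"],
                         "dialogue": " ".join(lines), "silences": quiet})
    return timeline
```

A segment that straddles a cut is attached to both scenes here; splitting ASR at sentence boundaries, as asr_writing_chunks.json does, is one way to reduce that duplication.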

Downstream, video-script reads the brief + index to write narration.json.

References

Background research before writing: references/research-guide.md (writes background_research.json).
Output JSON shapes: references/data-schema.md.

What this skill does NOT do

Does NOT write or score narration (commentary voice-over text); that is the job of video-script.
Does NOT cut, edit, voice, or render video.
Does NOT invent plot the signal doesn't support — it emits a substrate warning when ASR/VLM are thin, rather than fabricating.
Does NOT publish or schedule anything; it writes artifacts to work_dir and stops.
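The substrate warning mentioned above could be driven by simple coverage ratios. This sketch is a guess at the idea, not the skill's logic: the thresholds `min_speech_ratio` and `min_described_ratio`, the `description` field lookup, and the function name are hypothetical, and ASR segments are assumed not to overlap.

```python
from typing import Any, Dict, List, Optional


def substrate_warning(asr: List[Dict[str, Any]], vlm: List[Dict[str, Any]],
                      video_duration: float, min_speech_ratio: float = 0.05,
                      min_described_ratio: float = 0.5) -> Optional[str]:
    """Return a warning string when ASR/VLM signal looks too thin, else None."""
    # Total speech time; assumes ASR segments do not overlap each other.
    speech = sum(max(0.0, s["end"] - s["start"]) for s in asr)
    speech_ratio = speech / video_duration if video_duration > 0 else 0.0
    described = sum(1 for v in vlm if str(v.get("description", "")).strip())
    described_ratio = described / len(vlm) if vlm else 0.0
    reasons = []
    if speech_ratio < min_speech_ratio:
        reasons.append(f"ASR covers only {speech_ratio:.0%} of the video")
    if described_ratio < min_described_ratio:
        reasons.append(f"only {described_ratio:.0%} of scenes have a VLM description")
    return "Thin substrate: " + "; ".join(reasons) if reasons else None
```

Emitting a warning instead of failing lets the downstream writer decide whether to proceed, which matches the "do not fabricate" stance rather than silently padding thin input.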
