From content-vault
Edits video by conversation: transcribe, cut, color grade, overlay animations, burn subtitles. For talking heads, montages, tutorials, travel, interviews.
How this skill is triggered — by the user, by Claude, or both
Slash command
/content-vault:video-useThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
1. **LLM reasons from raw transcript + on-demand visuals.** The only derived artifact that earns its keep is a packed phrase-level transcript (`takes_packed.md`). Everything else — filler tagging, retake detection, shot classification, emphasis scoring — you derive at decision time.
README.mdhelpers/grade.pyhelpers/pack_transcripts.pyhelpers/render.pyhelpers/timeline_view.pyhelpers/transcribe.pyhelpers/transcribe_batch.pyinstall.mdposter.htmlpyproject.tomlskills/manim-video/README.mdskills/manim-video/references/animation-design-thinking.mdskills/manim-video/references/animations.mdskills/manim-video/references/camera-and-3d.mdskills/manim-video/references/decorations.mdskills/manim-video/references/equations.mdskills/manim-video/references/graphs-and-data.mdskills/manim-video/references/mobjects.mdskills/manim-video/references/paper-explainer.mdskills/manim-video/references/production-quality.mdtakes_packed.md). Everything else — filler tagging, retake detection, shot classification, emphasis scoring — you derive at decision time.These are the things where deviation produces silent failures or broken output. They are not taste, they are correctness. Memorize them.
-c copy concat, not single-pass filtergraph. Otherwise you double-encode every segment when overlays are added.afade=t=in:st=0:d=0.03,afade=t=out:st={dur-0.03}:d=0.03). Otherwise audible pops at every cut.setpts=PTS-STARTPTS+T/TB to shift the overlay's frame 0 to its window start. Otherwise you see the middle of the animation during the overlay window.output_time = word.start - segment_start + segment_offset. Otherwise captions misalign after segment concat.Agent tool; total wall time ≈ slowest one.<videos_dir>/edit/. Never write inside the video-use/ project directory.Everything else in this document is a worked example. Deviate whenever the material calls for it.
The skill lives in video-use/. User footage lives wherever they put it. All session outputs go into <videos_dir>/edit/.
<videos_dir>/
├── <source files, untouched>
└── edit/
├── project.md ← memory; appended every session
├── takes_packed.md ← phrase-level transcripts, the LLM's primary reading view
├── edl.json ← cut decisions
├── transcripts/<name>.json ← cached raw Scribe JSON
├── animations/slot_<id>/ ← per-animation source + render + reasoning
├── clips_graded/ ← per-segment extracts with grade + fades
├── master.srt ← output-timeline subtitles
├── downloads/ ← yt-dlp outputs
├── verify/ ← debug frames / timeline PNGs
├── preview.mp4
└── final.mp4
First-time install lives in install.md (clone, deps, ffmpeg, skill registration, API key). Don't re-run it every session; on cold start just verify:
ELEVENLABS_API_KEY resolves — either in the environment or in .env at the video-use repo root. If missing, ask the user to paste one and write it to .env (never to the user's <videos_dir>).ffmpeg + ffprobe on PATH.uv sync or pip install -e . inside the repo).yt-dlp, manim, Remotion installed only on first use.skills/manim-video/. Read its SKILL.md when building a Manim slot.Helpers (helpers/transcribe.py, helpers/render.py, etc.) live alongside this SKILL.md. Resolve their paths relative to the directory containing this file — the skill is typically symlinked at ~/.claude/skills/video-use/ or ~/.codex/skills/video-use/.
transcribe.py <video> — single-file Scribe call. --num-speakers N optional. Cached.transcribe_batch.py <videos_dir> — 4-worker parallel transcription. Use for multi-take.pack_transcripts.py --edit-dir <dir> — transcripts/*.json → takes_packed.md (phrase-level, break on silence ≥ 0.5s).timeline_view.py <video> <start> <end> — filmstrip + waveform PNG. On-demand visual drill-down. Not a scan tool — use it at decision points, not constantly.render.py <edl.json> -o <out> — per-segment extract → concat → overlays (PTS-shifted) → subtitles LAST. --preview for 720p fast. --build-subtitles to generate master.srt inline.grade.py <in> -o <out> — ffmpeg filter chain grade. Presets + --filter '<raw>' for custom.For animations, create <edit>/animations/slot_<id>/ with Bash and spawn a sub-agent via the Agent tool.
Inventory. ffprobe every source. transcribe_batch.py on the directory. pack_transcripts.py to produce takes_packed.md. Sample one or two timeline_views for a visual first impression.
Pre-scan for problems. One pass over takes_packed.md to note verbal slips, obvious mis-speaks, or phrasings to avoid. Plain list, feed into the editor brief.
Converse. Describe what you see in plain English. Ask questions shaped by the material. Collect: content type, target length/aspect, aesthetic/brand direction, pacing feel, must-preserve moments, must-cut moments, animation and grade preferences, subtitle needs. Do not use a fixed checklist — the right questions are different every time.
Propose strategy. 4–8 sentences: shape, take choices, cut direction, animation plan, grade direction, subtitle style, length estimate. Wait for confirmation.
Execute. Produce edl.json via the editor sub-agent brief. Drill into timeline_view at ambiguous moments. Build animations in parallel sub-agents. Apply grade per-segment. Compose via render.py.
Preview. render.py --preview.
Self-eval (before showing the user). Run timeline_view on the rendered output (not the sources) at every cut boundary (±1.5s window). Check each image for:
Also sample: first 2s, last 2s, and 2–3 mid-points — check grade consistency, subtitle readability, overall coherence. Run ffprobe on the output to verify duration matches the EDL expectation.
If anything fails: fix → re-render → re-eval. Cap at 3 self-eval passes — if issues remain after 3, flag them to the user rather than looping forever. Only present the preview once the self-eval passes.
Iterate + persist. Natural-language feedback, re-plan, re-render. Never re-transcribe. Final render on confirmation. Append to project.md.
(laughs), (sighs), (applause) mark beats. Extend past them.pack_transcripts.py reads all transcripts/*.json and produces one markdown file where each take is a list of phrase-level lines, each prefixed with its [start-end] time range. Phrases break on any silence ≥ 0.5s OR speaker change. This is the artifact the editor sub-agent reads to pick cuts — it gives word-boundary precision from text alone at 1/10 the tokens of raw JSON.
Example line:
## C0103 (duration: 43.0s, 8 phrases)
[002.52-005.36] S0 Ninety percent of what a web agent does is completely wasted.
[006.08-006.74] S0 We fixed this.
When the task is "pick the best take of each beat across many clips," spawn a dedicated sub-agent with a brief shaped like this. The structure is load-bearing; the pitch-shape example is not.
You are editing a <type> video. Pick the best take of each beat and
assemble them chronologically by beat, not by source clip order.
INPUTS:
- takes_packed.md (time-annotated phrase-level transcripts of all takes)
- Product/narrative context: <2 sentences from the user>
- Speaker(s): <name, role, delivery style note>
- Expected structure: <pick an archetype or invent one>
- Verbal slips to avoid: <list from the pre-scan pass>
- Target runtime: <seconds>
Common structural archetypes (pick, adapt, or invent):
- Tech launch / demo: HOOK → PROBLEM → SOLUTION → BENEFIT → EXAMPLE → CTA
- Tutorial: INTRO → SETUP → STEPS → GOTCHAS → RECAP
- Interview: (QUESTION → ANSWER → FOLLOWUP) repeat
- Travel / event: ARRIVAL → HIGHLIGHTS → QUIET MOMENTS → DEPARTURE
- Documentary: THESIS → EVIDENCE → COUNTERPOINT → CONCLUSION
- Music / performance: INTRO → VERSE → CHORUS → BRIDGE → OUTRO
- Or invent your own.
RULES:
- Start/end times must fall on word boundaries from the transcript.
- Pad cut boundaries (working window 30–200ms).
- Prefer silences ≥ 400ms as cut targets.
- Unavoidable slips are kept if no better take exists. Note them in "reason".
- If over budget, revise: drop a beat or trim tails. Report total and self-correct.
OUTPUT (JSON array, no prose):
[{"source": "C0103", "start": 2.42, "end": 6.85, "beat": "HOOK",
"quote": "...", "reason": "..."}, ...]
Return the final EDL and a one-line total runtime check.
Your job is to reason about the image, not apply a preset. Look at a frame (via timeline_view), decide what's wrong, adjust one thing, look again.
Mental model is ASC CDL. Per channel: out = (in * slope + offset) ** power, then global saturation. slope → highlights, offset → shadows, power → midtones.
Example filter chains (grade.py has --list-presets; use them as starting points or mix your own):
warm_cinematic — retro/technical, subtle teal/orange split, desaturated. Shipped in a real launch video. Safe for talking heads.neutral_punch — minimal corrective: contrast bump + gentle S-curve. No hue shifts.none — straight copy. Default when the user hasn't asked.For anything else — portraiture, nature, product, music video, documentary — invent your own chain. grade.py --filter '<raw ffmpeg>' accepts any filter string.
Hard rules: apply per-segment during extraction (not post-concat, which re-encodes twice). Never go aggressive without testing skin tones.
Subtitles have three dimensions worth reasoning about: chunking (1/2/3/sentence per line), case (UPPER/Title/Natural), and placement (margin from bottom). The right combo depends on content.
Worked styles — pick, adapt, or invent:
bold-overlay — short-form tech launch, fast-paced social. 2-word chunks, UPPERCASE, break on punctuation, Helvetica 18 Bold, white-on-outline, MarginV=35. render.py ships with this as SUB_FORCE_STYLE.
FontName=Helvetica,FontSize=18,Bold=1,
PrimaryColour=&H00FFFFFF,OutlineColour=&H00000000,BackColour=&H00000000,
BorderStyle=1,Outline=2,Shadow=0,
Alignment=2,MarginV=35
natural-sentence (if you invent this mode) — narrative, documentary, education. 4–7 word chunks, sentence case, break on natural pauses, MarginV=60–80, larger font for readability, slightly wider max-width. No shipped force_style — design one if you need it.
Invent a third style if neither fits. Hard rules: subtitles LAST (Rule 1), output-timeline offsets (Rule 5).
Animations match the content and the brand. Get the palette, font, and visual language from the conversation — never assume a default. If the user hasn't told you, propose a palette in the strategy phase and wait for confirmation before building anything.
Tool options:
skills/manim-video/SKILL.md and its references for depth.None is mandatory. Invent hybrids if useful (e.g., PIL background with a Remotion layer on top).
Duration rules of thumb, context-dependent:
narration_length + 1s (universal).Animation payoff timing (rule for sync-to-narration): get the payoff word's timestamp. Start the overlay reveal_duration seconds earlier so the landing frame coincides with the spoken payoff word. Without this sync the animation feels disconnected.
Easing (universal — never linear, it looks robotic):
def ease_out_cubic(t): return 1 - (1 - t) ** 3
def ease_in_out_cubic(t):
if t < 0.5: return 4 * t ** 3
return 1 - (-2 * t + 2) ** 3 / 2
ease_out_cubic for single reveals (slow landing). ease_in_out_cubic for continuous draws.
Typing text anchor trick: center on the FULL string's width, not the partial-string width — otherwise text slides left during reveal.
Example palette (the launch video — one aesthetic among infinite):
(10, 10, 10) near-black#FF5A00 / (255, 90, 0) orange(110, 110, 110) dim gray/System/Library/Fonts/Menlo.ttc (index 1)This is one style. If the brand is warm and serif, use that. If it's colorful and playful, use that. If the user handed you a style guide, follow it. If they didn't, propose one and confirm.
Parallel sub-agent brief — each animation is one sub-agent spawned via the Agent tool. Each prompt is self-contained (sub-agents have no parent context). Include:
<edit>/animations/slot_<id>/render.mp4)One sub-agent = one file (unique filenames, parallel agents don't overwrite each other).
Match the source unless the user asked for something specific. Common targets: 1920×1080@24 cinematic, 1920×1080@30 screen content, 1080×1920@30 vertical social, 3840×2160@24 4K cinema, 1080×1080@30 square. render.py defaults the scale to 1080p from any source; pass --filter or edit the extract command for other targets. Worth asking the user which delivery format matters.
{
"version": 1,
"sources": {"C0103": "/abs/path/C0103.MP4", "C0108": "/abs/path/C0108.MP4"},
"ranges": [
{"source": "C0103", "start": 2.42, "end": 6.85,
"beat": "HOOK", "quote": "...", "reason": "Cleanest delivery, stops before slip at 38.46."},
{"source": "C0108", "start": 14.30, "end": 28.90,
"beat": "SOLUTION", "quote": "...", "reason": "Only take without the false start."}
],
"grade": "warm_cinematic",
"overlays": [
{"file": "edit/animations/slot_1/render.mp4", "start_in_output": 0.0, "duration": 5.0}
],
"subtitles": "edit/master.srt",
"total_duration_s": 87.4
}
grade is a preset name or raw ffmpeg filter. overlays are rendered animation clips. subtitles is optional and applied LAST.
project.mdAppend one section per session at <edit>/project.md:
## Session N — YYYY-MM-DD
**Strategy:** one paragraph describing the approach
**Decisions:** take choices, cuts, grades, animations + why
**Reasoning log:** one-line rationale for non-obvious decisions
**Outstanding:** deferred items
On startup, read project.md if it exists and summarize the last session in one sentence before asking whether to continue.
Things that consistently fail regardless of style:
npx claudepluginhub timscheuerai/content-vaultAI-assisted video editing workflows that cut, structure, and augment real footage. Covers the full pipeline from raw capture through FFmpeg, Remotion, ElevenLabs, fal.ai, and final polish in Descript or CapCut.
AI-assisted video editing workflow for cutting, structuring, and enhancing real footage using FFmpeg, Remotion, ElevenLabs, and fal.ai. Activates when users want to edit video, trim clips, make vlogs, or build video content.
Turns raw talking-head/screen-share recordings into finished edited videos with transcript-synced overlays, then generates blog/social/YouTube content.