Convert written documents to narrated video scripts with TTS audio and word-level timing. Use when preparing essays, blog posts, or articles for video narration. Outputs scene files, audio, and VTT with precise word timestamps. Keywords: narration, voiceover, TTS, scenes, audio, timing, video script, spoken.
Convert written documents into narrated video scripts with precise word-level timing.
The agent interprets; the document guides. Rather than rigid template-based splits, this skill uses agent judgment to find where the content naturally breathes, argues, and transitions. The document's argument flow determines scene breaks, not a predetermined structure.
Use this skill when:
Do NOT use this skill when:
tts/model/ (not in git due to size - see Model Setup below)

There are two approaches: per-scene (legacy) and full narration (recommended).
Generates a single audio file for consistent volume and pacing:
Document (.md)
↓ [agent interprets scene breaks]
Scene .txt files (01-scene-name.txt, 02-scene-name.txt, ...)
↓ [TTS via narrate-full.py - SINGLE PASS]
full-narration.wav (one consistent audio file)
↓ [Whisper via transcribe-full.py]
full-narration.json + full-narration.vtt (word-level timing)
↓ [extract-scene-boundaries.py]
Scene timing boundaries for video composition
Generates separate audio per scene - can cause volume inconsistencies:
Scene .txt files
↓ [TTS via narrate-scenes.py - MULTIPLE PASSES]
Scene .wav files (volume may vary between scenes)
↓ [concatenate]
Combined audio (may have clipping at boundaries)
Warning: Per-scene TTS generates audio with different volume levels and pacing. When concatenated, this causes audible jumps and clipping. Use the full narration pipeline instead.
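If per-scene audio is unavoidable, normalizing loudness per file before concatenation reduces (though doesn't eliminate) the jumps. A minimal sketch using ffmpeg's EBU R128 loudnorm filter; paths are illustrative:

# Normalize each scene WAV toward a common loudness target
for f in ./output/scenes/*.wav; do
  ffmpeg -i "$f" -af loudnorm=I=-16:TP=-1.5:LRA=11 "${f%.wav}-norm.wav"
done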
cd .claude/skills/document-to-narration
source tts/.venv/bin/activate
# 1. Split document into scenes (manual or scripted)
deno run --allow-read --allow-write scripts/split-to-scenes.ts input.md --output ./output/
# 2. Generate single audio file
python scripts/narrate-full.py ./output/scenes/
# 3. Transcribe with word-level timestamps
python scripts/transcribe-full.py ./output/full-narration.wav
# 4. Extract scene boundaries for video timing
python scripts/extract-scene-boundaries.py ./output/scenes/ ./output/full-narration.json --typescript
# 1. Split document into scenes
deno run --allow-read --allow-write scripts/split-to-scenes.ts input.md --output ./output/
# 2. Generate audio per scene (may have volume inconsistencies)
source tts/.venv/bin/activate
python scripts/narrate-scenes.py ./output/scenes/
# 3. Transcribe (DEPRECATED: transcribe-scenes.ts requires whisper-cpp)
# Use transcribe-full.py instead after concatenating audio
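The concatenation step can be done with ffmpeg's concat demuxer, assuming all scene WAVs share the same format (a sketch; file names follow the output layout shown elsewhere in this document):

# Build a file list and concatenate scene WAVs losslessly
ls ./output/scenes/*.wav | sed "s/.*/file '&'/" > ./output/concat.txt
ffmpeg -f concat -safe 0 -i ./output/concat.txt -c copy ./output/full-narration.wav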
cd .claude/skills/document-to-narration/tts
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
The fine-tuned voice model (~7.8GB) is not included in git due to size.
Place your Qwen3-TTS model files in tts/model/:
tts/model/
├── config.json
├── generation_config.json
├── model.safetensors # Main model weights
├── tokenizer_config.json
├── vocab.json
├── merges.txt
└── speech_tokenizer/
└── ...
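Before running TTS, a quick check that the expected files are in place (paths from the tree above):

ls tts/model/model.safetensors tts/model/config.json tts/model/speech_tokenizer/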
The @remotion/install-whisper-cpp package handles this:
import { installWhisperCpp, downloadWhisperModel } from '@remotion/install-whisper-cpp';
await installWhisperCpp({ to: './whisper-cpp', version: '1.5.5' });
await downloadWhisperModel({ model: 'medium', folder: './whisper-cpp' });
The skill works best with:
deno run -A scripts/full-pipeline.ts /path/to/essay.md --output ./output/essay-name/
output/essay-name/
├── scenes/
│ ├── 01-opening-hook.txt # Scene script
│ ├── 01-opening-hook.wav # Generated audio
│ ├── 01-opening-hook.vtt # Word-level captions
│ ├── 02-core-argument.txt
│ ├── 02-core-argument.wav
│ ├── 02-core-argument.vtt
│ └── ...
└── manifest.json # Complete timing data
The agent identifies scene breaks using these heuristics:
Pattern: Breaking at every paragraph or heading mechanically. Problem: Ignores argument flow. Scenes feel choppy and disconnected. Fix: Look for rhetorical units, not structural units. Multiple paragraphs often form one scene.
Pattern: Keeping entire sections as single scenes. Problem: Creates TTS audio that's too long. Loses natural breathing room. Fix: Target 100-300 words. Find the natural pause point within sections (see the word-count check after these anti-patterns).
Pattern: Copying written text exactly without spoken adaptation. Problem: Written conventions don't work when spoken. Parentheticals, complex punctuation, and nested clauses confuse TTS and listeners. Fix: Apply adaptation rules. Read it aloud mentally.
Pattern: Rewriting content so heavily it loses the author's voice. Problem: The result doesn't sound like the original author. Fix: Preserve voice, adjust mechanics. If the author uses rhetorical questions, keep them.
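To catch the scene-length anti-pattern early, a small Deno check over the generated scene files (a sketch; the scenes/ path follows the output layout above):

// Flag scene scripts outside the 100-300 word target
const dir = './output/scenes';
for await (const entry of Deno.readDir(dir)) {
  if (!entry.name.endsWith('.txt')) continue;
  const text = await Deno.readTextFile(`${dir}/${entry.name}`);
  const words = text.trim().split(/\s+/).length;
  if (words < 100 || words > 300) {
    console.warn(`${entry.name}: ${words} words (outside 100-300 target)`);
  }
}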
Parse a markdown document and output scene text files.
deno run --allow-read --allow-write scripts/split-to-scenes.ts input.md --output ./output/
deno run --allow-read --allow-write scripts/split-to-scenes.ts input.md --output ./output/ --adapt
deno run --allow-read scripts/split-to-scenes.ts input.md --dry-run
Options:
- --output - Directory for scene files (created if it doesn't exist)
- --adapt - Apply spoken adaptation rules
- --dry-run - Preview scene breaks without writing files

Output: Numbered .txt files and initial manifest.json
Generate a single TTS audio file from all scene files. Produces consistent volume and pacing.
python scripts/narrate-full.py ./output/scenes/
python scripts/narrate-full.py ./output/scenes/ --force
python scripts/narrate-full.py ./output/scenes/ --speaker jwynia
python scripts/narrate-full.py ./output/scenes/ --output ./custom/path/audio.wav
Options:
- --force - Regenerate even if output exists
- --speaker - Speaker name (default: jwynia)
- --output - Custom output path (default: ../full-narration.wav)

Output: Single full-narration.wav in the parent directory of scenes
Generate TTS audio for each scene file separately. Not recommended - can cause volume inconsistencies when concatenated.
python scripts/narrate-scenes.py ./output/scenes/
python scripts/narrate-scenes.py ./output/scenes/ --force
python scripts/narrate-scenes.py ./output/scenes/ --speaker jwynia
Options:
- --force - Regenerate even if output exists
- --speaker - Speaker name (default: jwynia)

Output: .wav files alongside each .txt file
Transcribe audio with word-level timestamps using Python's openai-whisper.
python scripts/transcribe-full.py ./output/full-narration.wav
python scripts/transcribe-full.py ./output/full-narration.wav --model large-v3
python scripts/transcribe-full.py ./output/full-narration.wav --output-dir ./captions/
Options:
- --model - Whisper model: tiny, base, small, medium, large, large-v2, large-v3 (default: medium)
- --output-dir - Output directory (default: same as audio file)

Output:

- .vtt file with word-level timestamps
- .json file with captions array for Remotion

Dependencies: Requires openai-whisper in the Python environment:
pip install openai-whisper
Extract scene timing boundaries from transcript by matching scene opening phrases.
# Human-readable table
python scripts/extract-scene-boundaries.py ./output/scenes/ ./output/full-narration.json
# JSON output
python scripts/extract-scene-boundaries.py ./output/scenes/ ./output/full-narration.json --json
# TypeScript for Video.tsx
python scripts/extract-scene-boundaries.py ./output/scenes/ ./output/full-narration.json --typescript
Options:
- --json - Output as JSON array
- --typescript - Output as TypeScript code for the Video.tsx scenes array

Output: Scene numbers, slugs, start times, and durations
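For orientation, the --typescript output is an array you can paste into Video.tsx. Its exact shape comes from the script; the sketch below only assumes the fields listed above (slug, start time, duration), so treat names and values as illustrative:

// Illustrative only - check the actual generated code
export const scenes = [
  { slug: 'popcorn-opening', startSeconds: 0.0, durationSeconds: 55.2 },
  { slug: 'core-argument', startSeconds: 55.2, durationSeconds: 48.7 },
];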
Deprecated: Requires the whisper-cpp binary, which may not be installed. Use transcribe-full.py instead.
Transcribe per-scene audio files using whisper-cpp.
deno run --allow-read --allow-write --allow-run scripts/transcribe-scenes.ts ./output/scenes/
Output: .vtt files with word-level timestamps
Orchestrate the complete pipeline.
deno run -A scripts/full-pipeline.ts input.md --output ./output/project-name/
Options:
- --output - Output directory (required)
- --adapt - Apply spoken adaptation
- --skip-tts - Skip audio generation (text only)
- --skip-transcribe - Skip Whisper transcription

Example manifest.json:

{
"source": "appliance-vs-trade-tool-draft.md",
"created_at": "2024-01-15T10:30:00Z",
"total_scenes": 9,
"total_duration_seconds": 420,
"scenes": [
{
"number": 1,
"slug": "popcorn-opening",
"word_count": 185,
"audio_duration_seconds": 55.2,
"files": {
"text": "scenes/01-popcorn-opening.txt",
"audio": "scenes/01-popcorn-opening.wav",
"captions": "scenes/01-popcorn-opening.vtt"
},
"captions": [
{ "text": "Two", "startMs": 0, "endMs": 180, "confidence": 0.98 },
{ "text": "people", "startMs": 180, "endMs": 450, "confidence": 0.97 }
]
}
]
}
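For TypeScript consumers, the manifest maps onto types like these (a sketch derived from the fields shown above, not something the skill emits):

type Caption = { text: string; startMs: number; endMs: number; confidence: number };

type Scene = {
  number: number;
  slug: string;
  word_count: number;
  audio_duration_seconds: number;
  files: { text: string; audio: string; captions: string };
  captions: Caption[];
};

type Manifest = {
  source: string;
  created_at: string;
  total_scenes: number;
  total_duration_seconds: number;
  scenes: Scene[];
};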
WEBVTT
00:00.000 --> 00:00.180
Two
00:00.180 --> 00:00.450
people
00:00.450 --> 00:00.720
walk
00:00.720 --> 00:01.100
into
When --adapt is enabled, the skill transforms written conventions to spoken equivalents:
| Written | Spoken |
|---|---|
| Parenthetical asides | Em-dash or separate sentence |
| "e.g." | "for example" |
| "i.e." | "that is" |
| Long nested clauses | Split into multiple sentences |
| Semicolons | Periods |
| *emphasis* | Context-appropriate stress |
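The simplest substitutions in this table can be done mechanically; a minimal sketch (the real --adapt pass is agent-driven and goes well beyond regex):

// Mechanical subset of the written-to-spoken rules above.
// Naive: the semicolon rule does not recapitalize the following word.
const rules: Array<[RegExp, string]> = [
  [/\be\.g\.,?\s*/g, 'for example, '],
  [/\bi\.e\.,?\s*/g, 'that is, '],
  [/;\s*/g, '. '],
];

const adaptForSpeech = (text: string): string =>
  rules.reduce((t, [pattern, replacement]) => t.replace(pattern, replacement), text);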
Preserve:
import { Audio, Sequence, staticFile, useVideoConfig } from 'remotion';
import manifest from './output/manifest.json';

// Use scene durations from the manifest for Sequence timing.
// CaptionRenderer stands in for your own word-timed caption component.
export const NarratedVideo = () => {
  const { fps } = useVideoConfig();
  let accumulatedFrames = 0;
  return (
    <>
      {manifest.scenes.map((scene) => {
        const from = accumulatedFrames;
        const durationInFrames = Math.round(scene.audio_duration_seconds * fps);
        accumulatedFrames += durationInFrames;
        return (
          <Sequence key={scene.number} from={from} durationInFrames={durationInFrames}>
            <Audio src={staticFile(scene.files.audio)} />
            <CaptionRenderer captions={scene.captions} />
          </Sequence>
        );
      })}
    </>
  );
};
Whisper requires 16kHz mono WAV. The pipeline handles conversion automatically:
ffmpeg -i input.wav -ar 16000 -ac 1 output_16khz.wav
The fine-tuned voice model (~7.8GB) lives at tts/model/ (not tracked in git; see Model Setup above). It uses Qwen3-TTS with a custom speaker embedding.