AI-driven screen recording and demo production pipeline for macOS. Records screen + cursor + window bounds, then uses AI vision to analyze the recording, create a zoom script targeting specific UI elements, generate voiceover narration, and produce a polished demo video. Use when: (1) creating product demo videos, (2) recording and polishing UI walkthroughs, (3) turning raw screen recordings into narrated presentations, (4) re-processing existing recordings with different zoom/voiceover.
npx claudepluginhub abhattacherjee/claude-code-skills --plugin smart-screen-recorder

This skill uses the workspace's default tool permissions.
AI-driven demo video production from raw screen recordings. Records your screen, then uses Claude's vision to analyze the recording, identify narrative moments, create a zoom script targeting actual UI elements, generate voiceover, and produce a polished demo video. Works with any product or UI — the user provides a brief description of what they're demoing and the AI crafts the narrative around it.
# Install dependencies
~/.claude/skills/smart-screen-recorder/scripts/install-deps.sh
# Record (Ctrl+C to stop) — captures screen + cursor + window bounds
~/.claude/skills/smart-screen-recorder/scripts/record.sh
# Then tell Claude: "process my recording into a demo video"
Record ──→ Extract ──→ Voice ──→ Demo Director ──→ TTS ──→ Integrated ──→ Post-Prod
  │        Frames      Select    (AI agent)        │       Timeline       Review
MKV +      as PNGs     (user)    Writes zoom +     OpenAI  Renderer       (AI agent)
cursor     + manifest            voiceover +       TTS     Interleaves    PASS or
.jsonl                           hold_frames       nova    PLAY + HOLD    NEEDS_FIXES
Key architecture (v4.0): narration-first integrated timeline. The video and narration are built TOGETHER, not separately. This replaces the old approach, where voiceover was overlaid on a continuously playing video that rushed past the content being described.
Before starting, use TaskCreate to build this checklist so the user always knows where we are. Mark each task in_progress when starting and completed when done; the user sees the checklist as a live progress indicator.
| # | Task | Description |
|---|---|---|
| 1 | Record screen | Capture raw video + cursor data |
| 2 | Extract frames | Pull key frames as PNGs for AI analysis |
| 3 | Voice & context | Select TTS voice + gather product description |
| 4 | Brainstorm narrative | Present demo theme options, get user direction |
| 5 | Demo Director | AI analyzes frames, creates zoom + voiceover scripts |
| 6 | Verify zoom targets | QA verifier corrects bounding boxes at full resolution |
| 7 | Generate TTS | Create audio segments from voiceover script |
| 8 | Build timeline | Construct integrated PLAY + HOLD segment sequence |
| 9 | Preview & feedback | Serve HTML preview, iterate on user feedback |
| 10 | Render video | Full 4K render from approved timeline |
| 11 | Mix audio | Place TTS segments at precise output timestamps |
| 12 | Post-production | Quality gate: PASS / NEEDS_FIXES / RESHOOT |
Update rules:
- Mark a task in_progress immediately before starting it, and completed immediately after it succeeds.
- Keep a task in_progress while waiting on a long-running step; never leave one dangling as in_progress, and delete tasks that become obsolete.
- Skip tasks that don't apply (e.g., skip Step 1 if the user provides an existing recording).
~/.claude/skills/smart-screen-recorder/scripts/record.sh --raw-only -o ~/Desktop
Output: {name}-raw.mp4 + {name}-cursor.jsonl. Uses MKV internally (survives
Ctrl+C interruption), then remuxes to MP4.
python3 ~/.claude/skills/smart-screen-recorder/scripts/extract-frames.py \
recording-raw.mp4 cursor.jsonl -o ~/Desktop/zoom-analysis
Before generating anything, ask the user their voice preference. Present these options:
What voice style would you like for the narration?
1. OpenAI TTS (natural, human-like) — requires OpenAI API key
a) nova — warm, engaging female (recommended for product demos)
b) alloy — neutral, versatile
c) echo — deeper male voice
d) shimmer — soft, gentle female
e) onyx — authoritative male
f) fable — expressive, storytelling
2. macOS Native (free, more synthetic)
a) Samantha — standard US female
b) Reed — US male
c) Flo — casual female
Your choice:
If OpenAI is selected:
- Verify OPENAI_API_KEY is set in the environment; if not, ask the user to run export OPENAI_API_KEY=sk-...
- Or, if the user prefers browser auth, open https://platform.openai.com/api-keys
- Save the preference in voiceover-script.json so re-processing uses the same voice.
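For example, a minimal pre-flight check (a sketch; the skill's actual check may differ):

```python
import os
import sys

# Fail fast if OpenAI TTS was chosen but no key is available.
if not os.environ.get("OPENAI_API_KEY"):
    sys.exit("OPENAI_API_KEY is not set. Run: export OPENAI_API_KEY=sk-...")
```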
Before launching the Demo Director, ask the user:
What product/feature is this demo showing? Give me a brief description:
- What does the product do?
- Who is the target audience?
- What's the key narrative? (e.g., "show how easy it is to create a project")
This context is passed to the Demo Director so it can craft narration that accurately
describes the product, rather than guessing from screenshots alone. The user's description
becomes the product_context field in the Demo Director prompt.
Before the Demo Director runs, launch the Demo Storyteller agent to craft theme options.
This is a two-part step: the AI proposes, the user chooses.
Part 1: Launch Demo Storyteller agent
Launch a general-purpose sub-agent using the Demo Storyteller persona
(~/.claude/agents/demo-storyteller.md). Pass it the extracted frames and the user's product context.
The agent reads ALL frames, identifies compelling moments, and writes narrative-themes.json with 3 distinct narrative approaches. Each theme includes a name, tone, opening line, narrative arc, and emphasis/de-emphasis lists (see the narrative_brief example below).
Part 2: Present to user and capture choice
Present the 3 themes to the user using the Storyteller's formatted summary. The user picks one theme, mixes elements, or provides their own direction.
Capture the result as narrative_brief — a structured object passed to the Demo Director:
{
"chosen_theme": "A",
"theme_name": "The Journey",
"tone": "warm, personal, storytelling",
"opening_line": "Meet the Escape Planner...",
"narrative_arc": "Follow a first-time user from curiosity to delight",
"emphasis": ["questionnaire flow", "AI generation reveal", "tiny home match scores"],
"de_emphasis": ["scrolling between sections", "loading states"],
"user_notes": "Any additional direction from the user"
}
Why an agent instead of hardcoded options: The Storyteller reads the actual frames, so its themes reference real UI elements and screens — not generic templates. A recording of a code editor gets different themes than a recording of a vacation planner.
Launch a general-purpose sub-agent as a Senior Product Demo Director persona.
Pass the user's product description as context.
The agent must read ALL extracted frames and produce zoom-script.json + voiceover-script.json.
zoom-script.json format:
{
"trim": {"start": 27, "end": 125},
"video_resolution": {"w": 6016, "h": 3384},
"default_zoom": 1.0,
"events": [
{
"description": "What UI element and why it matters narratively",
"start": 44, "end": 51,
"zoom": 1.5,
"target_box": {"x": 1750, "y": 250, "w": 2000, "h": 1550},
"target_element": "Interest selection grid with colorful tag pills",
"transition_in": 2.5, "transition_out": 2.0
}
]
}
Demo Director rules:
- Provide bounding boxes (target_box) that encompass the ENTIRE UI element.
- Write a target_element description so the verification step knows what to look for.

The Demo Director's bounding boxes are estimates from thumbnail-sized frames. They are often wrong. The verification step below is mandatory.
Launch a second general-purpose sub-agent as a Zoom QA Verifier. For each zoom event in the script, it:
- locates the target_element described in the zoom event
- corrects the target_box coordinates in the zoom script

The verifier should extract a full-resolution frame at trim.start + event.start seconds for each event and check the box against what is actually on screen.

Why this step exists: The Demo Director sees 1920px-wide thumbnails but the video is 6016px wide. Even small estimation errors at thumbnail scale become 200-300px misalignment at full resolution, causing the zoom to target the wrong area.
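A minimal sketch of that verification loop, assuming opencv-python and the zoom-script.json schema shown above (the file names and the verify/ output directory are illustrative):

```python
import json
import os
import cv2

script = json.load(open("zoom-script.json"))
cap = cv2.VideoCapture("recording-raw.mp4")
os.makedirs("verify", exist_ok=True)

for i, ev in enumerate(script["events"]):
    # Seek in absolute source time: trim.start + event.start
    t = script["trim"]["start"] + ev["start"]
    cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000)
    ok, frame = cap.read()
    if not ok:
        continue
    # Draw the claimed target_box at full resolution so any
    # thumbnail-scale estimation error is immediately visible.
    b = ev["target_box"]
    cv2.rectangle(frame, (b["x"], b["y"]),
                  (b["x"] + b["w"], b["y"] + b["h"]), (0, 0, 255), 8)
    cv2.imwrite(f"verify/event_{i:02d}.png", frame)
```

The agent then reads each annotated frame, compares the box against the target_element description, and writes corrected coordinates back into the script.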
This is the core of v4.0. Instead of overlaying audio on a continuously-playing video, build an integrated timeline that interleaves PLAY and HOLD segments:
timeline = [
{"type": "play", "source_start": 0.0, "source_end": 5.0, "duration": 1.5},
{"type": "hold_narrate", "source_time": 5.0, "hold_duration": 5.3,
"narration": "Meet the app. Here's what it does...", "tts_file": "seg_00.mp3"},
{"type": "play", "source_start": 6.0, "source_end": 10.0, "duration": 1.5},
{"type": "hold_narrate", "source_time": 10.0, "hold_duration": 4.8,
"narration": "It starts with a simple setup flow...", "tts_file": "seg_02.mp3"},
...
]
How to build the timeline:
- Place each hold at its voiceover segment's start_time (source timeline).
- Cover the gaps between holds with time-compressed play segments.

Target output duration: source duration + ~40% for holds. A 98s source → ~135s output. A sketch of the interleaving follows.
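A minimal sketch, assuming each voiceover segment carries start_time (source seconds), text, tts_file, and a measured tts_duration (field names mirror the timeline example above; play_speed is an illustrative compression factor):

```python
def build_timeline(segments, source_duration, play_speed=3.0):
    timeline, cursor = [], 0.0
    for seg in segments:
        if seg["start_time"] > cursor:
            # Play (time-compressed) up to the moment the narration describes.
            timeline.append({
                "type": "play",
                "source_start": cursor,
                "source_end": seg["start_time"],
                "duration": (seg["start_time"] - cursor) / play_speed,
            })
        # Freeze that frame for exactly as long as the narration runs.
        timeline.append({
            "type": "hold_narrate",
            "source_time": seg["start_time"],
            "hold_duration": seg["tts_duration"],
            "narration": seg["text"],
            "tts_file": seg["tts_file"],
        })
        cursor = seg["start_time"]
    if cursor < source_duration:
        timeline.append({
            "type": "play",
            "source_start": cursor,
            "source_end": source_duration,
            "duration": (source_duration - cursor) / play_speed,
        })
    return timeline
```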
Generate TTS segments (OpenAI or macOS), then render the integrated timeline:
- play: read source frames and write them to the encoder
- hold_narrate: read ONE source frame, write it N times (freeze), record TTS placement

The renderer tracks tts_placement — a list of {file, output_time, duration} entries that tell ffmpeg where to place each audio segment in the final mix.
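A sketch of how the render loop could accumulate tts_placement while walking the timeline (write_play and write_freeze are hypothetical helpers standing in for the actual frame I/O):

```python
def render(timeline, write_play, write_freeze):
    tts_placement, output_time = [], 0.0
    for seg in timeline:
        if seg["type"] == "play":
            write_play(seg["source_start"], seg["source_end"], seg["duration"])
            output_time += seg["duration"]
        elif seg["type"] == "hold_narrate":
            # The hold starts now in OUTPUT time: that is where the
            # audio segment must land in the final mix.
            tts_placement.append({
                "file": seg["tts_file"],
                "output_time": output_time,
                "duration": seg["hold_duration"],
            })
            write_freeze(seg["source_time"], seg["hold_duration"])
            output_time += seg["hold_duration"]
    return tts_placement
```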
Before spending 5+ minutes on a full render, generate an HTML preview.
python3 ~/.claude/skills/smart-screen-recorder/scripts/preview-timeline.py \
raw.mp4 zoom-script.json integrated-timeline.json tts/ -o preview/
This opens a localhost page (http://localhost:8111) showing each timeline segment with its frame, narration text, and timing.
The user reviews and provides feedback (e.g., "Hold 3 narration mentions tiny homes but the frame shows excursions — move to Hold 6"). Adjust the zoom-script.json and voiceover-script.json based on feedback, then rebuild the timeline and re-preview.
Only proceed to full render after the user approves the preview.
Launch the Post-Production Editor agent (demo-post-production-editor) to review the final output. The editor checks the rendered video against the zoom and voiceover scripts and returns a verdict: PASS, NEEDS_FIXES, or RESHOOT.
If NEEDS_FIXES: apply the editor's recommended changes:
- adjust_zoom → update zoom-script.json bounding boxes, re-run Step 6
- rewrite_voiceover → update voiceover-script.json, regenerate TTS
- shift_timing → adjust zoom start/end times
- add_pause → insert silence beats in the voiceover

If RESHOOT: the recording itself is inadequate. Inform the user and offer to re-record with guidance on what to demo differently.
If PASS: The demo is ready to ship.
Both scripts are editable JSON. To re-process without re-recording:
# Edit zoom-script.json or voiceover-script.json
python3 ~/.claude/skills/smart-screen-recorder/scripts/apply-zoom-script.py \
raw.mp4 zoom-script.json -o output.mp4 --resolution 3840x2160
| Agent | File | Model | Purpose |
|---|---|---|---|
| Demo Storyteller | ~/.claude/agents/demo-storyteller.md | sonnet | Analyzes frames, proposes 3 narrative themes for user to choose from |
| Demo Director | ~/.claude/agents/demo-director.md | opus | Analyzes all frames + narrative brief, creates zoom-script.json + voiceover-script.json |
| Zoom QA Verifier | ~/.claude/agents/zoom-qa-verifier.md | opus | Extracts full-res frames at zoom timestamps, corrects bounding boxes |
| Voiceover Timing Fixer | ~/.claude/agents/voiceover-timing-fixer.md | sonnet | Detects TTS audio overlaps, rebuilds sequential timestamps |
| Post-Production Editor | ~/.claude/agents/demo-post-production-editor.md | opus | Reviews final output for quality, requests re-cuts if needed |
All agents are NOT user-invocable — spawned by the skill orchestrator.
Sub-Agent Registry:
| Phase | Agent | Concurrency | Input | Output |
|---|---|---|---|---|
| Step 3.7 | Demo Storyteller | Sequential | Frames + product context | narrative-themes.json (3 options) |
| Step 4 | Demo Director | Sequential (after user picks theme) | Frames + narrative brief | zoom-script.json, voiceover-script.json |
| Step 5 | Zoom QA Verifier | Sequential (after Step 4) | zoom-script.json + raw video | Corrected zoom-script.json |
| Step 7b | Voiceover Timing Fixer | Sequential (after Step 7) | TTS audio files + manifest | Fixed manifest with 0 overlaps |
| Step 8 | Post-Production Editor | Sequential (after merge) | Final video + zoom/VO scripts | PASS/NEEDS_FIXES/RESHOOT verdict |
When Agent Teams are enabled (CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1), the pipeline can use persistent teammates instead of one-shot sub-agents.
The standard pipeline launches 5 sequential sub-agents — each starts fresh with no memory of prior phases. Teams change this:
Persistent creative team: Instead of destroying agents between phases, teammates persist across the session. The Director can refer back to the Storyteller's themes. The QA Verifier can ask the Director about intent behind a zoom target. The Post-Production Editor can request the Timing Fixer to adjust specific segments — all without re-explaining context.
Parallel iteration: During the preview-feedback cycle (Step 9), multiple teammates can work simultaneously — one regenerating TTS clips for segments the user flagged, while another adjusts zoom targets, and a third rewrites narration for a different section.
Cross-phase communication: When the Post-Production Editor returns NEEDS_FIXES, it can message the Director directly about which narration segments need rewriting, rather than the orchestrator relaying instructions.
TeamCreate("demo-production")
├── storyteller — brainstorms themes, stays available for creative reference
├── director — creates zoom + voiceover scripts, iterates on feedback
├── qa-verifier — validates zoom targets, can ask director about intent
└── Lead orchestrates phases, manages user feedback, coordinates iteration
The Voiceover Timing Fixer and Post-Production Editor roles are handled by the lead or existing teammates, since their work is tightly coupled with the Director's output.
| Scenario | Recommendation |
|---|---|
| Single recording, no iteration | Sub-agents (simpler) |
| Multiple recordings in one session | Teams (reuse creative direction) |
| Heavy preview-feedback iteration | Teams (parallel fixes) |
| User wants to re-process with different narrative | Teams (Storyteller remembers previous themes) |
| Script | Purpose |
|---|---|
| record.sh | Record screen + cursor + window bounds (MKV → MP4) |
| cursor-tracker.py | Track cursor, clicks, active window via Quartz API |
| extract-frames.py | Extract key frames as PNGs for AI analysis |
| apply-zoom-script.py | Apply zoom script with trim, bounding boxes, 4K output |
| generate-tts.py | Generate OpenAI/macOS TTS audio from voiceover script |
| build-timeline.py | Build integrated PLAY+HOLD timeline from zoom + TTS |
| render-timeline.py | Render video from integrated timeline with zoom |
| mix-audio.py | Mix TTS audio segments into rendered video at timestamps |
| smart-zoom.py | Legacy heuristic zoom modes (focus/click/velocity) |
| install-deps.sh | Install ffmpeg, pyobjc, opencv, numpy |
| Dependency | Install | Purpose |
|---|---|---|
| ffmpeg | brew install ffmpeg | Screen capture + video encoding |
| pyobjc-framework-Quartz | pip3 install pyobjc-framework-Quartz | Cursor + window tracking |
| opencv-python | pip3 install opencv-python | Frame extraction + processing |
| numpy | (with opencv) | Array operations |
| OPENAI_API_KEY (optional) | export OPENAI_API_KEY=sk-... | Natural TTS voices |
macOS only. Requires Screen Recording permission for Terminal. Click tracking requires Accessibility permission.
- ffmpeg screen capture on macOS requires -pixel_format nv12 — the default yuv420p isn't supported as input. Capture in nv12, encode to yuv420p on output.
- The nova voice is excellent for demos — dramatically better than macOS say. But always ask the user their preference first.
- TTS voices differ in pace: fable speaks ~15% slower than nova. Always measure actual audio duration after generation and rebuild timestamps sequentially to prevent overlaps. Never trust the Demo Director's estimated start_time values — they're based on estimated duration, not actual.
- The zoom crop is derived from the target_box with padding, adjusted to the output aspect ratio. This naturally centers the element and picks the right zoom level.
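The timestamp rebuild is simple to sketch (assumes ffprobe from the ffmpeg install; the manifest field names and the 0.4s gap are illustrative):

```python
import json
import subprocess

def audio_duration(path):
    # Ask ffprobe for the real duration of a generated TTS file.
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "json", path],
        capture_output=True, text=True, check=True)
    return float(json.loads(out.stdout)["format"]["duration"])

def rebuild_timestamps(segments, gap=0.4):
    t = 0.0
    for seg in segments:
        seg["start_time"] = t                        # replace the estimate
        t += audio_duration(seg["tts_file"]) + gap   # measured, not estimated
    return segments
```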
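And the crop derivation, as a sketch (the padding ratio and clamping behavior are assumptions, not apply-zoom-script.py's exact parameters):

```python
def crop_for_box(box, video_w, video_h, out_aspect=16 / 9, pad=0.15):
    # Pad the target box, then grow one dimension to match the output aspect.
    w = box["w"] * (1 + 2 * pad)
    h = box["h"] * (1 + 2 * pad)
    if w / h < out_aspect:
        w = h * out_aspect   # too tall: widen
    else:
        h = w / out_aspect   # too wide: heighten
    # Center on the element, clamped so the crop stays inside the frame.
    cx = box["x"] + box["w"] / 2
    cy = box["y"] + box["h"] / 2
    x = min(max(cx - w / 2, 0), max(video_w - w, 0))
    y = min(max(cy - h / 2, 0), max(video_h - h, 0))
    return {"x": int(x), "y": int(y), "w": int(w), "h": int(h)}
```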