Search everything...

Skill

explainer

~60-80s explainer video for any URL — GitHub repo, product page, docs site, blog post, or launch. CANONICAL workflow for URL walkthroughs. Use when the user asks to "explain this URL / repo / website / product", "make a walkthrough video for [url]", "demo this site", "Loom-style explainer of [url]", "explainer for github.com/...", or "explain this product link". Drives a real browser through the URL, generates an avatar lipsync, and composites in a 1280×800 macOS Sonoma frame with a 246-pixel bottom-left avatar circle. GitHub URLs activate a repo-aware mode (README scan + live-demo detection); other URLs use a generic page-walkthrough flow.

npx claudepluginhub pika-labs/pika-plugins --plugin pika

Tool Access

This skill uses the workspace's default tool permissions.

Preview

Generate a ~60–80s URL explainer video: drive a real browser through the URL along a beat-sheet timeline, generate an avatar lipsync of the narration, and composite it all in a 1280×800 macOS Sonoma frame with a 240-pixel inner avatar (246-pixel outer including 3px white stroke ring) at canvas (20, 476) and element-targeted zoom on every mid-section beat. Works on any URL — product pages, docs ...

SKILL.md

Similar Skills

walkthrough-video

774

Generates MP4 walkthrough videos from app screenshots or live sites using Remotion. Adds smooth transitions, zoom effects, text overlays, progress bars, optional voiceover narration for demos, showcases, docs.

1 file6 tools

frontend

webconsulting-create-documentation

Creates product documentation: React help pages, AI-generated screenshots, Remotion videos with TTS narration and music, GitHub README enhancements. Use for docs, help pages, product tours, screenshots, user guides.

2 files

craft-workspace-webconsulting-skills

vid-sizzle-reel

223

Generates high-energy sizzle reel MP4 videos from brand assets and key messages via HyperFrames using GSAP animations, headless Chromium rendering, and FFmpeg encoding. For hype videos, event promos, or investor pitches.

3 files

opendirectory-gtm-skills

Stats

Stars31

Forks4

Last CommitMay 1, 2026

Actions

View Source View Plugin View on GitHub View README

Help us improve

Share bugs, ideas, or general feedback.

explainer | pika | ClaudePluginHub

Back to Skills

Skill

explainer

From pika

npx claudepluginhub pika-labs/pika-plugins --plugin pika

Tool Access

This skill uses the workspace's default tool permissions.

Preview

SKILL.md

/pika:explainer

Usage: /pika:explainer <url> [--focus "angles"] [--avatar <url>] [--voice <id>] [--lipsync-provider pika|kling] [--preview] [--live-url <url>]

Behavior

Defaults — fire fast, no mid-flow confirmation

Use identity-store defaults silently for avatar / voice. Never ask "should I use your avatar?" or "which voice?" before firing. Honor explicit overrides (--avatar, --voice) when supplied; otherwise resolve via identity_avatar_url / identity_voice_id and proceed. See Step 1 for the full resolution waterfall (including the silent fallback when identity returns null).
No mid-flow "type yes to proceed" gates by default. Step 5 preview is opt-in via --preview (for power users testing new avatar/voice combos before the long-pole render); the default flow runs end-to-end without pausing.
Do not solicit --focus either. Make a confident first attempt from page structure; users re-run with --focus "X" if the angle missed.

These defaults match industry standard for media-gen tools (Midjourney / Sora / Runway / HeyGen / Pika.art): submit → render → return. Account credit balance + provider failover (Step 9) are the canonical guardrails.

Local avatar images on Claude Desktop

Claude Desktop can't pass inline-pasted images to MCP tools yet (Anthropic-side limitation). If the user pastes a photo inline, or mentions a local file they want as --avatar, pause Step 1 and kindly send them this — something like:

Heads up — pasted images don't reach MCP tools on Claude Desktop yet (Anthropic limitation). Two easy options for your avatar:

Paste a URL if it's already hosted (Imgur, S3, your site) — fastest

Zip it and attach the .zip — right-click → Compress (macOS) / Send to → Compressed folder (Windows) / zip pic.zip pic.png (Linux). I'll take it from there.

When a .zip arrives, unzip it via Bash, call upload_asset for a presigned PUT URL, push the bytes with curl -X PUT, then use the returned public_url as --avatar <url> — all before Step 1. Already-hosted https://... URLs work as-is and skip this entirely. If no avatar is supplied at all, the identity-store default fires.

Step 0 — Resolve URL (empty-args menu)

Strip flags (--focus, --avatar, --voice, --live-url, --lipsync-provider, --no-captions, --preview, --skip-preview, --yes) and key=value parameters from $ARGUMENTS. If what remains contains no https://... URL (or is empty / whitespace-only), print this menu verbatim as your full response, then stop and wait for the user's next message — do NOT call any tool, do NOT proceed to Step 1, do NOT invent a URL. If $ARGUMENTS already carries a URL, skip this step silently and proceed to Step 1.

Which URL would you like me to walk through? Works on any of:

A GitHub repo — e.g. https://github.com/anthropics/claude-code (activates repo-aware mode: README scan + live-demo detection)

A product page / launch page — e.g. https://pika.art

A docs site — e.g. https://docs.anthropic.com

A blog post / article URL

Output: 1280×800 macOS Sonoma frame with a bottom-left avatar lipsync and element-targeted zoom on every mid-section beat. Default flow runs end-to-end with no confirmation gates — pass --preview if you want a 3-second lipsync sanity check first.

Reply with the URL and I'll start.

Tip: you don't need to type /pika:explainer — just say things like "walk me through ", "make a demo video of ", or "explain this repo: " and I'll fire this skill automatically.

When the user replies with a URL, treat it as the resolved input and proceed to Step 1. Do not re-prompt.

Step 1 — Parse input + detect mode

Required: url (must be https://...). Optional: --avatar <url> (overrides identity-store default), --voice <minimax-voice-id>, --focus "..." (editorial guidance woven into vo_text), --live-url <url> (force-supply live demo URL — GitHub mode only), --lipsync-provider <pika|kling> (defaults to pika — parrot a2v, ~2-5 min wall-clock, slightly more dramatic head motion. Pass kling for tighter face-centered output at ~5-30 min wall-clock — Kling produces minimal-head-motion presenter shots but is the long-pole stage; reserve for high-stakes renders), --no-captions (skip the Step 11 caption burn — default is captions on), --preview (opt-in to the Step 5 preview gate — ~3s lipsync of "Hi, I'm your presenter" for testing new avatar/voice combos before the long-pole render; default is no preview). --skip-preview and --yes are accepted as no-ops for backward compatibility.

Mode detection:

GitHub mode — URL host is github.com AND path matches /{owner}/{repo} (no further path segments past the repo root). Activates the repo-aware extras: README scan, live-demo detection, GitHub-specific selectors.
Generic-URL mode — anything else (a product page, docs site, blog post, deeper GitHub path like /blob/HEAD/path). Skips the GitHub extras; uses generic CSS selectors and walks through the URL itself.

Avatar resolution (silent — never ask the user):

If --avatar <url> was passed, use it.
Else call mcp__pika__identity_avatar_url. If non-null, use it.
Else (fresh user, no identity avatar set yet) call mcp__pika__generate_image once with prompt "professional presenter, friendly tech narrator, studio portrait, 1:1, natural lighting" and use the returned URL. Do not ask the user "should I generate one?" — just generate silently.

Voice resolution (silent — never ask the user):

If --voice <id> was passed, use it.
Else call mcp__pika__identity_voice_id. If non-null, use it.
Else pick a casual MiniMax speech-2.8-hd preset matching the resolved avatar's apparent gender:
- Female-coded avatar → English_PlayfulGirl (warm, casual, clearly female-voiced — verified)
- Male-coded avatar → English_Jovialman (warm, casual male)
- Unclear / gender-neutral → English_Jovialman (default)
Determine gender from mcp__pika__identity_persona_read (look for a gender / pronouns field) when identity exists; otherwise infer from the resolved avatar image. Do not call analyze_media for this — it's not worth the extra ~30s round-trip. Do not ask the user.

Do NOT use English_FriendlyPerson — despite being categorized under "female" in MiniMax's catalog, its display name is "Friendly Guy" and it reads as male in playback. English_PlayfulGirl is the canonical casual-female pick. Other verified-female alternates: English_Upbeat_Woman, English_LovelyGirl, English_radiant_girl.

The flow below is annotated per step: GitHub-only, Generic-only, or Both modes.

Step 2 — Read source (no MCP call)

Both modes: use Claude's WebFetch on the input URL to pull the page's main content (h1, hero section, headings, primary copy).

GitHub mode additions: also fetch top-level file tree, (best-effort) package.json / pyproject.toml, and GitHub API repo metadata via gh api repos/{owner}/{repo} for homepage, description, language, topics. Detect a candidate live_url in this priority:

User-supplied --live-url.
GitHub API meta.homepage field — set when the maintainer configured the repo's homepage in GitHub settings (matches tarball repo_analyzer.py:66-77).
package.json "homepage" field.
First match in README of https?://[^\s)\"'<>]+(?:vercel\.app|netlify\.app|github\.io|fly\.dev|railway\.app|render\.com|herokuapp\.com|surge\.sh)[^\s)\"'<>]*.
Any other URL in README that the badge area / "Live Demo" / "Project Page" / "Demo" text points at. The allowlist regex above misses arbitrary custom domains (e.g. <project>-project-page.com); when the README explicitly designates a project page, prefer that over the github.io fallback.
GitHub Pages convention https://{owner}.github.io/{repo} — but only if the deep tree contains a frontend signal (one of index.html, App.tsx, App.jsx, App.vue, app.py, main.py).

If no candidate resolves, the beat sheet skips beats 6–7.

Generic-URL mode: the input URL itself is the only URL the beats walk through — no live_url inference, no extra metadata fetches. Skip Step 2.5 and Step 3.0; jump straight to Step 3.

Step 2.5 — Verify `live_url` reachability (GitHub mode only, no MCP call)

If a candidate live_url was selected, verify it serves real content before authoring beats 6–7. Use WebFetch on the candidate and check the response:

If the response status is 4xx / 5xx, drop live_url to None and skip beats 6–7. The github.io fallback in particular is reachable as a hostname but often returns 404 ("There isn't a GitHub Pages site here") for repos that haven't enabled Pages — recording that 404 page wastes ~12s of the explainer on wrong content.
If the response renders the GitHub Pages "404 — There isn't a GitHub Pages site here." template (heuristic: response body contains "There isn't a GitHub Pages site here"), drop live_url and skip beats 6–7.
Otherwise, keep live_url for beats 6–7.

This mirrors the original tarball's requests.head(live_url, timeout=6, allow_redirects=True) reachability gate.

Step 2.6 — Generic-URL pre-flight (Generic-URL mode only, no MCP call)

Before authoring beats for a non-GitHub URL, WebFetch the input URL and inspect the response. This step prevents three common Generic-URL failure modes: (a) recording a captcha / bot-block page instead of content, (b) the cookie/consent banner eating the first ~3 seconds of video, (c) generic CSS selectors missing the page's actual hero / sections.

A. Bot-block / captcha detection — abort if matched:

If the response body contains any of:

"Verify you are human" / "verify you are not a robot"
"captcha" / "CAPTCHA" / "reCAPTCHA"
"403 Forbidden" / "Access Denied"
"Just a moment" + cf-chl-bypass (Cloudflare challenge)
"We're sorry, something went wrong" (Amazon-style bot block)
A <title> or h1 of just "Robot Check" / "Are you a robot?"

→ ABORT with a clear error to the user: "Generic-URL mode can't render this site — the page is showing a bot-detection / captcha challenge under headless Chrome. Try a different URL, or run a real-user version of the page first to verify it loads cleanly."

B. Cookie / consent-banner detection — defuse with extra_css + optional click:

Scan the response for these patterns (case-insensitive):

IDs / classes starting with onetrust-, truste-, cookie-banner, cookie-consent, gdpr-, consent-, cmp-
Buttons matching (?i)accept (all )?cookies / (?i)agree.{0,10}cookies / (?i)i (accept|agree)
Apple-specific banner: id ac-gdpr-banner or class as-globalfooter-curtain
Google consent: [role="dialog"] with text "Before you continue"

If detected, set cookie_banner_present = true. Defense in depth — the recording uses BOTH:

CSS injection (extra_css) in the capture_website call to hide common banners universally — even if the click below misses, the banner is visually gone.
A click timed_action at at_s: 0.0 against the most likely dismissal selector (extracted from the WebFetch DOM, e.g. #onetrust-accept-btn-handler, [aria-label*="Accept all" i], button[id*="accept"]).

The extra_css payload (use this verbatim — covers ~80% of consent platforms):

#onetrust-banner-sdk, #onetrust-pc-sdk, #onetrust-consent-sdk { display: none !important; }
#truste-consent-track, #truste-consent-content, .truste_box_overlay { display: none !important; }
[id*="gdpr-cookie"], [id*="cookie-consent"], [id*="cookie-banner"] { display: none !important; }
[class*="cookie-banner"], [class*="cookie-consent"], [class*="consent-banner"] { display: none !important; }
[class*="CookieBanner"], [class*="CookieConsent"], [class*="ConsentBanner"] { display: none !important; }
#ac-gdpr-banner, .as-globalfooter-curtain { display: none !important; }  /* Apple */
[role="dialog"][aria-label*="cookie" i], [role="dialog"][aria-label*="consent" i] { display: none !important; }
.cmp-container, .cmp-modal, .cmp-banner { display: none !important; }

C. Real-DOM element identification — emit concrete selectors:

Generic CSS selectors (h1, [class*="hero"], section h2) work on semantic / well-marked-up sites but miss obfuscated class names on big-name corporate sites (apple.com uses tile-headline / as-headline-section-title, not hero-*). For each beat, prefer the actual DOM elements observed in the WebFetch:

Read the rendered HTML/markdown WebFetch returned. Note the page's actual primary <h1> text and class.
Note the page's section structure (h2 headings + their parent containers).
Note any prominent CTA / signup / pricing element.
Emit zoom_target.selector using the actual class or id observed, falling back to semantic structure (main > section:nth-of-type(N) h2) when class names look auto-generated (Tailwind _1a2b3c, CSS modules module__hero___xYz).

D. SPA / lazy-render detection — bump initial wait:

If the WebFetch response has fewer than 3 visible headings / minimal text content, the page may be SPA-rendered post-domcontentloaded. Emit a longer initial wait action ({type: "wait", at_s: 0.0, ms: 2500}) before any beat fires, instead of the default 600ms settle.

E. --focus is honored when supplied (do not solicit):

Without --focus, select beats from generic structure cues — proceed silently with a confident first attempt. Do not ask the user "what should I focus on?" before firing; users iterate by re-running with --focus "the X feature" if the first pass misses the angle they wanted. With --focus supplied, anchor beat selection on the phrase: uses concrete page sections that match it, ignores irrelevant marketing chrome.

Step 3.0 — Required README section scan (GitHub mode only, no MCP call)

Before authoring the beat sheet, scan the README (case-insensitive, full-text) for any of these section names. If a match is found, you must add a dedicated beat for that section in Step 3, replacing one of the generic beats 4–5 if necessary:

README contains...	Required beat
`how it works`	scroll_to that heading; zoom `article h2:has(#user-content-how-it-works)`
`audio layer` / `audio timeline`	scroll_to the audio-layer diagram; zoom on the rendered figure or its surrounding heading
`claude code` / `mcp integration`	scroll_to that section; zoom `article pre` or `.highlight` (terminal screenshot / code block)
`architecture` / `system design`	scroll_to that section; zoom `article h2:has(#user-content-architecture)`
`features` (when prominent at top)	scroll_to that heading; zoom `article h2:has(#user-content-features)`
`getting started` / `quick start` / `installation`	scroll_to that heading; zoom `article h2:has(#user-content-installation)` (or the matching slug) — falls back to `article pre` if you want the install code block instead
`usage` / `examples`	scroll_to that heading; zoom `article h2:has(#user-content-usage)` (or the matching slug) — or the first code block under it

GitHub heading slug rule: lowercase, spaces → dashes, strip non-[a-z0-9-] characters. So "How it works" → #user-content-how-it-works, "Quick Start" → #user-content-quick-start. GitHub injects the <a id="user-content-{slug}"> anchor inside each rendered <hN>, so hN:has(#user-content-{slug}) reliably grabs the heading element across any GitHub README.

Selector contract: bbox_selector MUST be vanilla CSS that resolves via document.querySelector (capture_website runs the post-action smooth-scroll JS via page.evaluate, which uses the browser's native selector engine). Do NOT use Playwright extensions like :has-text("..."), text=..., or :visible — those resolve in Playwright's page.query_selector (so the bbox capture finds the element) but silently fail in the smooth-scroll's document.querySelector (so the page never scrolls to the target, and bbox.y ends up at document-Y instead of top - 60 px, which trips Step 8b's bbox.y > recording_viewport.h degenerate filter and falls back to default-position zoom). CSS Level 4 :has(...) IS vanilla and supported in modern Chromium.

These sections are the highest-information visuals in most explainer-worthy repos. Missing them produces a generic walkthrough; including them gives the explainer a concrete "show, don't tell" beat. The original tarball SKILL.md flagged the first four with SPECIAL rules in the Gemini prompt; this Step 3.0 promotes them from incidental guidance to a hard requirement and adds three more high-signal headings common in OSS READMEs.

Step 3 — Author beat sheet (main thread, no MCP call)

Write a JSON array of 8–10 beats, with a hard total duration of 65–80 seconds and a hard total word count of 165–200 words (assuming a speaking rate of 2.5 words/sec). Each beat:

{
  "t_start": 0.0,
  "t_end": 7.5,
  "action": { "type": "navigate" | "scroll_to" | "hover", "url": "...", "selector": "..." },
  "zoom_target": { "selector": "...", "description": "..." },
  "vo_text": "exact words to speak — 1 to 2 conversational sentences"
}

Hard constraints (validate before emitting the beat sheet — reject the draft if any fails):

Every beat MUST have all five fields: t_start, t_end, action (with type and url), zoom_target (with selector), vo_text. Missing fields ⇒ reject and re-author. (Mirrors tarball's github_explainer.py:183-190 validation pass.)
t_start of beat 0 = 0.0; t_end[i] == t_start[i+1] (continuity).
len(vo_text.split()) / 2.5 ≈ t_end - t_start per beat. Aim for ±10% of this estimate; if your draft is denser than 2.5 wps, tighten the vo_text until it fits.
Total t_end of last beat ≤ 80 seconds. (Reference output is 86.5s including intro; lipsync audio is ~83s. Kling avatar/image2video stalls reliably past ~90s of audio under current load — going over 80s risks a 20-min Kling timeout.)
Total spoken word count between 165 and 200 words.
Every beat's zoom_target.selector MUST be a valid CSS selector for the page that beat lands on. GitHub mode prefers GitHub-specific selectors: h1.f1, #readme, article h2, .blob-code-inner, .highlight, .octicon-star, nav. Generic-URL mode prefers robust generic selectors: h1, [role="main"], main, header, nav, .hero, .feature, section h2, [class*="cta"], [class*="hero"], button, a[href]. Selectors must resolve on the rendered page after the beat's action settles — verify against the DOM you can see via WebFetch before emitting.
vo_text is 1-2 conversational sentences. Dev voice. No stage directions. No markdown.
action.url is a valid https://... URL when action.type == "navigate"; required.

Self-check before Step 4: verify total_words is in [165, 200] AND total_seconds (= beats[-1].t_end) is in [65, 80]. If either misses bounds, re-author the beat sheet — do not proceed to TTS. (No need to "print" anywhere — this is an internal draft validation; just reject the draft and re-author until it passes.)

Structural skeleton — GitHub mode (load-bearing for the visual contract — match origin, but Step 3.0 overrides if applicable):

Beat 1: navigate repo root, zoom h1.f1 (repo title), hook sentence.
Beats 2–3: navigate to specific source files (https://github.com/{owner}/{repo}/blob/HEAD/<path>), zoom .blob-code-inner or .highlight. Pick files that match the narration's claim — don't navigate to a file you won't talk about.
Beats 4–5: scroll_to README sections, zoom article h2 or #readme. If Step 3.0 surfaced required sections, replace these slots with the required ones.
Beats 6–7 (only if live_url survived Step 2.5): navigate to live_url, zoom nav / h1 / .hero / main / button / .feature.
Beat 8: back to repo root, zoom .octicon-star, outro.

Structural skeleton — Generic-URL mode:

Beat 1: navigate to the input URL, zoom h1 or [class*="hero"] h1 (the page's primary headline), hook sentence.
Beats 2–3: scroll_to the page's hero / value-prop / first feature section. Zoom .hero, [class*="hero"], [class*="feature"], or section:nth-of-type(1) h2. Pick visible elements the narration references.
Beats 4–5: scroll_to deeper sections — feature lists, screenshots, pricing, social proof. Zoom section h2, [class*="feature"] img, [class*="testimonial"], [class*="pricing"], or any prominent semantic element on the page.
Beats 6–7: scroll_to CTA / signup / demo embed. Zoom [class*="cta"], button, a[class*="button"], or [id*="signup"]. (No live-demo navigation in generic mode — the input URL IS the demo.)
Beat 8: scroll_to footer / closing element, zoom footer h2, footer, or back to top with h1. Outro sentence.

If --focus is supplied, weave its angles into vo_text without mutating the structural skeleton. Prefer CSS selectors over text_content in zoom_target.selector — bbox capture is selector-only (see Known gaps).

Step 4 — TTS

Call mcp__pika__generate_speech with provider: "minimax-tts", text: <full vo_text join>, optional voice_id. Capture result.audio_url (the dispatcher returns audio under audio_url, not url) and result.duration_seconds. Voice defaults to identity-store injection in plugin mode.

Step 4.5 — Audio length verification

The TTS engine's actual output rate often diverges from the 2.5 wps estimate in Step 3. Verify before committing to the long-pole lipsync step.

If |audio_duration_seconds - beats[-1].t_end| > 10 seconds, abort and re-author the beat sheet with a tighter / looser word budget.
If audio_duration_seconds > 90, abort regardless — Kling avatar/image2video stalls reliably past 90s of audio under current load.

This gate exists to catch length drift before it costs 20 minutes of Kling timeout. Tarball had an equivalent verification table.

Step 5 — Preview gate (opt-in via `--preview`)

Skip Step 5 entirely by default. Proceed directly to Step 6 unless the user explicitly passed --preview — do not generate a preview, do not ask for confirmation. This matches industry standard for media-gen tools (Midjourney / Sora / Runway / HeyGen / Pika.art): submit → render → return; account credit balance + provider failover are the canonical guardrails.

--skip-preview and --yes are accepted as no-ops for backward compatibility — they were the old opt-out flags.

If --preview was supplied:

mcp__pika__generate_speech with text: "Hi, I'm your presenter. Let's explore this repo together." → preview_audio_url.
mcp__pika__generate_lipsync with provider: <resolved_lipsync_provider> (defaults to pika; honor --lipsync-provider kling if supplied), image: <avatar>, audio: preview_audio_url → preview_lipsync_url (bare lipsync, ~3s). Use the same provider here as Step 9 will use for the full audio — the preview's job is to confirm the avatar+voice+provider combo before the long-pole render.
Present to the user verbatim:

Preview ready: <preview_lipsync_url> This confirms the avatar + voice combo. The full render is a long pole (~5–30 min Kling lipsync on the full audio). Reply yes to proceed, or anything else to cancel.
Match ^(yes|go|proceed|confirm|y)$ (case-insensitive). Anything else → STOP, no further MCP calls.

Step 6 — Build `timed_actions` and record

Translate the beat sheet into capture_website timed_actions. One timed_action per beat — set bbox_selector to the beat's zoom_target.selector and capture_website captures the post-action bbox of that element internally (legacy 600 ms settle → smooth-scroll-to-top - 60 px → 1300 ms post-anim → measure, all server-side).

For each beat in order, emit one entry:

navigate beats: {type: "navigate", at_s: <t_start>, url: <action.url>, bbox_selector: <zoom_target.selector>}. The worker navigates, waits to absolute at_s + 0.6 s, scrolls bbox_selector into view, and measures the bbox — all without the caller scheduling a follow-up step.
scroll_to / hover beats: {type: "scroll", at_s: <t_start>, selector: <action.selector or zoom_target.selector>, bbox_selector: <zoom_target.selector>}. The action's own selector drives the page scroll; bbox_selector drives the bbox measurement (it can be the same selector or different — usually the same). (capture_website has no hover; scroll-into-view is the analog.)

Do NOT prepend the eight-step intro scroll-through that the tarball ran. The lipsync audio is timed from t=0 of the beat sheet; a prepended intro shifts the screen recording forward by ~3 s while leaving the audio un-shifted, causing audio/video desync. The capture_website recording begins at t=0 with beat 0's URL already loaded — that's the orientation the tarball's intro scroll provided, minus the desync.

Call mcp__pika__capture_website:

url: <beat 0's action.url>
timed_actions: <the N-element list built above> (one entry per beat)
duration_s: max(ceil(beats[-1].t_end), ceil(audio_duration_seconds)) — covers both the beat budget AND any TTS overrun. MiniMax-TTS commonly produces audio ~5-10% longer than the 2.5 wps estimate (Step 4.5 already gates drift > 10s); using the max ensures the screen recording covers the full lipsync, otherwise edit_pip's shortest=1 would clip the recording's tail and you'd lose the last few seconds of audio with no screen behind it.

Generic-URL mode additions (per Step 2.6 pre-flight):

extra_css: <the cookie-banner-hiding CSS payload from Step 2.6 §B> — defensive: hides common consent platforms via display: none !important; so even if the optional click misses, the banner is invisible in the recording.
Prepend a wait action {type: "wait", at_s: 0.0, ms: 2500} for SPA / lazy-render pages (per Step 2.6 §D); use 1500ms for "normal" pages. This gives time for hero images to lazy-load, fonts to swap, and scroll-triggered animations to be ready before the first beat fires.
If cookie_banner_present from Step 2.6 §B, also prepend a click action {type: "click", at_s: 0.5, selector: <detected dismissal selector from WebFetch DOM>} and shift all beat t_start / t_end values by +1.5s to compensate. Beat 1 navigates / scrolls at t_start: 1.5 (or whatever offset accommodates the dismissal animation). The lipsync audio also needs to start with a 1.5s lead-in pause — easiest to just have beat 1's vo_text begin with a half-second pause-friendly opener like "Alright," or "So,", or pad the audio externally before lipsync.
No cookie banner shifts needed if cookie_banner_present == false; just the prepended wait action.

Capture video_url, recording_viewport, action_bboxes. The result returns recording_viewport: {w, h} and action_bboxes: [{idx, selector, found, bbox: {x,y,w,h}}] alongside video_url.

action_bboxes[].idx semantics: the idx field is the position in the input timed_actions array.

GitHub mode: with one timed_action per beat, idx maps 1:1 to beat index — Step 8 uses entry.idx directly as beat_idx.
Generic-URL mode: the prepended wait (and optional cookie-dismissal click) shift the array by 1 or 2. Compute beat_idx = entry.idx - prepend_count where prepend_count is 1 (wait only) or 2 (wait + click). Skip entries where beat_idx < 0 (those are the prepended setup actions, not beats).

The selector field on each entry reports bbox_selector (i.e. zoom_target.selector), not the action's own selector.

Step 7 — Browser chrome

mcp__pika__edit_browser_frame:

video_url: <Step 6 video_url>
url: (live_url if GitHub-mode and survived Step 2.5 else input_url, truncated to 65 chars)
tab_title: <30-char title> — GitHub mode: (meta.description or repo_name or "")[:30]. Generic-URL mode: the page's <title> (from WebFetch in Step 2) or the URL's hostname, truncated to 30 chars. Guard against None/empty.

Returns framed_url (1280×800 Sonoma + chrome).

Step 8 — Build `zoom_keyframes` and apply

Constants:

INTRO_BEATS = 2 — gates by beat-sheet index. Skips zoom on beat indices 0 and 1 ("Beat 1" and "Beat 2" in the structural skeleton above).
HOLD_GAP = 0.6 — seconds of 1.0× before each zoom-in and after each zoom-out.
MIN_BEAT_DUR = 1.5 — beats shorter than this are skipped (no room for a meaningful zoom).
SCALE = 1.35 (precise element-targeted zoom).
FALLBACK_SCALE = 1.25 (default-position fallback when no usable bbox).
FALLBACK_RAMP = 0.4.

edit_browser_frame's inner-content offsets: CONTENT_X=56, CONTENT_Y=108, CONTENT_W=1168, CONTENT_H=637 (verified against the worker's edit_browser_frame/main.py).

Coord transform (recording px → framed px):

cx_framed = 56  + (bbox.x + bbox.w/2) * (1168 / recording_viewport.w)
cy_framed = 108 + (bbox.y + bbox.h/2) * (637  / recording_viewport.h)

Build the zoom list with a per-beat default + bbox override pattern. The legacy rig followed an "every non-intro beat gets a zoom — bbox-derived if available, default-position otherwise" rule. Reproduce that here:

Step 8a — Pre-fill default-position keyframes for every non-intro, long-enough beat.

Constants for the default position:

DEFAULT_CX = 56 + 1168 // 2 (screen center of the framed canvas)
DEFAULT_CY = 108 + 637 // 3 (upper-third of the content area, where most GitHub UI prominence lives)

Walk the beat sheet from index INTRO_BEATS (= 2) to the end. For each beat:

If t_end - t_start < MIN_BEAT_DUR (1.5s), skip — too short for a meaningful zoom.
Compute the keyframe's interior interval as [t_start + HOLD_GAP, t_end - HOLD_GAP]. If that interval is shorter than 1.0s, skip.
Otherwise pre-fill that beat's slot in a per-beat map (call it zoom_keyframes_by_beat[beat_idx]) with {cx: DEFAULT_CX, cy: DEFAULT_CY, scale: FALLBACK_SCALE (1.25), ramp_s: FALLBACK_RAMP (0.4)} plus the trimmed t_start/t_end.

Step 8b — Override with bbox-derived precise zoom where action_bboxes provided a usable measurement.

For each entry in action_bboxes:

beat_idx = entry.idx (since Step 6 emits one timed_action per beat). If beat_idx < INTRO_BEATS, skip.
If entry.found is false, skip.
If the beat isn't already in zoom_keyframes_by_beat (was filtered out in Step 8a by MIN_BEAT_DUR/1.0s rules), skip.
Filter degenerate bboxes: skip if bbox.y > recording_viewport.h (offscreen capture — page didn't scroll the element into view in time) or bbox.h > recording_viewport.h * 1.5 (full-page <main> element — yields a meaningless zoom center).
Compute cx_framed/cy_framed from the bbox center using the recording-px → framed-px transform shown above. Override the beat's slot with {cx: cx_framed, cy: cy_framed, scale: SCALE (1.35), ramp_s: min(0.5, (t_end - t_start) * 0.15)}.

Final list: sort the values of zoom_keyframes_by_beat by t_start to produce the zoom_keyframes array.

This guarantees every non-intro, long-enough beat gets a zoom — precise when bbox capture worked, default-positioned otherwise. Avoids the "flat video for the whole runtime" failure mode.

If len(zoom_keyframes) > 0, call mcp__pika__edit_animate_zoom with video_url: framed_url, zoom_keyframes. Returns zoomed_url. Otherwise (no qualifying beats — should be rare given Step 3's 65-80s constraint) skip and use framed_url as zoomed_url.

Step 9 — Lipsync the full audio

mcp__pika__generate_lipsync:

provider: <resolved_lipsync_provider> — default: pika (parrot a2v). Honor --lipsync-provider kling if explicitly passed.
image: <avatar>
audio: <Step 4 audio_url>

Provider tradeoffs:

Provider	Wall-clock	Head motion	When to use
`pika` (default)	~2–5 min	Slightly more dramatic, naturalistic	Default for most runs — fast iteration, watchable output, ~10× faster than kling
`kling` (opt-in)	~5–30 min	Minimal, face-centered, presenter-style	High-stakes renders where the avatar must read like a polished presenter; tolerate the long pole

Server-side-await covers the call inline; if the response shape is {task_id, status: "queued"}, poll mcp__pika__task_status in a tight loop (no sleep) until the status reaches a terminal state (done, failed, or cancelled). On done, capture lipsync_url. On failed / cancelled, fall back to the other provider (kling ↔ pika) per the failover note below.

Failover:

If pika fails (rare — parrot a2v is robust at typical explainer audio lengths) → retry once with provider: "kling".
If kling stalls past the worker's 1200s ceiling (visible as repeated processing status with no completion) → fall back to provider: "pika". Step 4.5's audio-length gate should catch the long-audio case before it gets here, but the failover handles the residual risk.

Why pika is the default:

Speed — typical explainer wall-clock drops from ~10–15 min to ~5–7 min total because lipsync is the long pole.
Quality is good enough — parrot a2v is naturalistic; the slight extra head motion reads as engaging rather than distracting in a 60-80s clip with avatar circle PiP.
Kling-mode-pro polish is mostly invisible inside the 246-pixel circle anyway — face area is too small for the minimal-head-motion difference to register on most viewers.

For the canonical "polished presenter" feel of the original tarball reference output, pass --lipsync-provider kling explicitly.

Step 10 — PiP composite

mcp__pika__edit_pip:

main_video_url: <zoomed_url>
overlay_video_url: <lipsync_url>
shape: "circle"
size_px: 246 ← pixel-pinned 246px outer diameter (240 inner avatar + 3+3 stroke ring); matches tarball's CIRCLE_OUT = CIRCLE_SIZE + STROKE * 2
stroke_width_px: 3
stroke_color: "white"
position_px: {x: 20, y: 476} ← 800 − 246 − 78 for dock clearance (matches tarball's H − CIRCLE_OUT − 78)

Do NOT pass size — size_px and size are mutually exclusive. Returns final_url.

Master-duration / audio-source contract (matching tarball github_explainer.py:418-419, 531-533, 578-582): edit_pip uses shortest=1 semantics by default, which means the composite's duration is the shorter of (zoomed screen recording) and (lipsync video). Step 6's duration_s = max(ceil(beats[-1].t_end), ceil(audio_duration_seconds)) ensures the screen recording is ≥ the lipsync, so the composite duration is set by the lipsync. Audio comes from the lipsync video's audio track (the lipsync embeds the original TTS audio); the standalone audio_url is not re-mixed. If the lipsync video is shorter than the screen recording (Kling sometimes trims trailing silence), the screen will get cut off at the lipsync end — accept this; the alternative (looping the screen) is worse for explainer content.

Step 11 — Burn captions

Call mcp__pika__add_captions(video_url=<final_url>, style="classic"). classic renders a bottom subtitle bar — the right register for an explainer video (use tiktok / hormozi / karaoke only when the user explicitly asks for word-level highlight). The audio is extracted server-side from the PiP composite's lipsync track, so transcription matches the narration verbatim. Capture the result as captioned_url.

Skip this step only if the user passed --no-captions (parsed in Step 1) — the default is captions on. (Note: /pika:podcast does not burn captions — narration in an explainer is more transcription-friendly than fast two-host dialogue.)

Step 12 — Return

Emit captioned_url (or final_url if Step 11 was skipped) on one line: Done: <url>.

Known gaps (carried as follow-up server-side work)

Kling avatar mode:"pro" and prompt not exposed. Tarball calls Kling directly with {mode:"pro", prompt:"talking head, face centered, mouth syncs to audio, minimal head movement, professional presenter"}. The Pika MCP generate_lipsync wrapper drops both for the kling provider (schema says prompt is "parrot only"; mode is hardcoded). Real quality lever for reducing dramatic head motion in the lipsync. Server PR follow-up: surface prompt and mode on generate_lipsync for kling.
No white-frame trim on the screen recording. Tarball's recorder.py:165-185 detects mean-brightness > 245 in the first 4s and trims with ffmpeg. capture_website has internal trim heuristics but doesn't expose them to the caller. Visible as a brief white flash at the start of the explainer when the page is still loading. The 800ms wait action at at_s: 0.0 mitigates this somewhat by giving the page time to paint, but doesn't trim already-recorded white frames. Worker enhancement.
No networkidle wait on per-beat navigation. Tarball uses page.goto(url, wait_until="networkidle", timeout=20000) plus wait_for_timeout(600) after every navigate. capture_website settles to domcontentloaded plus the bbox-capture branch's 600 ms post-action settle (server-side, when bbox_selector is set), but SPA blob pages whose final render happens after domcontentloaded can still get bbox'd against unmounted code blocks. Worker enhancement: expose a wait_until knob on timed_actions[].navigate.
No per-step output-size verification gates. Tarball's verify() helper at github_explainer.py:35-39 checked TTS ≥ 50KB, preview ≥ 100KB, screen ≥ 200KB, lipsync ≥ 500KB, final ≥ 1MB after each step. The MCP path returns URLs only; verifying file size would require an extra mcp__pika__analyze_media call per step (~30s overhead each). Worth adding once user-side latency budget allows it. For now, a downstream-failure cascade (e.g. zero-byte TTS → silent lipsync → blank composite) only surfaces at Step 11.
text_content bbox capture not implemented. capture_website v1 returns action_bboxes only for steps with a CSS selector. text_content-only steps produce no entry. Prefer CSS selectors in zoom_target for guaranteed zoom coverage.
Beat-sheet wording is non-deterministic. Running the same input twice produces different vo_text and different zoom positions. Visual kind is the contract, not pixel-exact reproduction.
Generic-URL mode quality varies by site. Modern indie / SaaS landing pages with semantic markup (<h1> + clear <section> + named class hooks) work well. Big-name corporate sites (apple.com, microsoft.com, amazon.com) hit several known limits: (a) bot detection — the page may serve a degraded version under headless Chrome, or a captcha; Step 2.6 §A aborts on these but the heuristics aren't exhaustive; (b) obfuscated class names — tile-headline instead of hero-title defeats generic selectors; Step 2.6 §C's WebFetch DOM scan helps but isn't perfect; (c) scroll-triggered animations don't play — IntersectionObserver-driven hero reveals fire on real user scrolls, not Playwright's scrollIntoView; the recorded frame may be a static placeholder; (d) lazy-loaded images — picture/source elements with loading="lazy" may not have resolved by the 600ms-or-2500ms settle window; the bbox lands on a transparent placeholder. Workarounds: prefer simpler / smaller marketing pages for launch demos, always pass --focus "the X feature" to anchor beat selection, accept that big-name sites need a follow-up server PR (cookie-banner click retry + wait_until=networkidle + animation-trigger via IntersectionObserver polyfill).
Cookie-banner click is single-attempt. Step 2.6 §B emits one click against the dismissal selector extracted from the WebFetch DOM. If the WebFetch's HTML doesn't include the banner (rendered post-JS) or the selector is wrong, the click silently misses — the extra_css payload is the load-bearing defense. Worker enhancement: support a list of fallback selectors per click action so the worker tries each in order.

Auth

If any call returns 401: the user's OAuth token has expired or hasn't been issued. The next authenticated MCP call triggers OAuth automatically (browser opens for @pika.art Google login). For non-interactive environments, set MCP_AUTH_TOKEN.

Examples

GitHub-mode (repo-aware: README scan + live-demo detection):

/pika:explainer https://github.com/leigest519/OpenGame
/pika:explainer https://github.com/anthropics/claude-cookbooks --focus "Claude Code MCP integration"
/pika:explainer https://github.com/openai/whisper --preview (opt-in to the preview gate when testing a new avatar)

Generic-URL mode (any non-GitHub URL — drives through the page directly):

/pika:explainer https://pika.art
/pika:explainer https://linear.app --focus "the cycle planning view"
/pika:explainer https://docs.anthropic.com/en/docs/claude-code/plugins
/pika:explainer https://your-product-page.com --avatar https://cdn.example.com/me.png --preview

Similar Skills

walkthrough-video

774

1 file6 tools

frontend

webconsulting-create-documentation

2 files

craft-workspace-webconsulting-skills

vid-sizzle-reel

223

3 files

opendirectory-gtm-skills

Stats

Stars31

Forks4

Last CommitMay 1, 2026

Actions

View Source View Plugin View on GitHub View README

Help us improve

Share bugs, ideas, or general feedback.

/pika:explainer

Usage: /pika:explainer <url> [--focus "angles"] [--avatar <url>] [--voice <id>] [--lipsync-provider pika|kling] [--preview] [--live-url <url>]

Behavior

Defaults — fire fast, no mid-flow confirmation

Use identity-store defaults silently for avatar / voice. Never ask "should I use your avatar?" or "which voice?" before firing. Honor explicit overrides (--avatar, --voice) when supplied; otherwise resolve via identity_avatar_url / identity_voice_id and proceed. See Step 1 for the full resolution waterfall (including the silent fallback when identity returns null).
No mid-flow "type yes to proceed" gates by default. Step 5 preview is opt-in via --preview (for power users testing new avatar/voice combos before the long-pole render); the default flow runs end-to-end without pausing.
Do not solicit --focus either. Make a confident first attempt from page structure; users re-run with --focus "X" if the angle missed.

Local avatar images on Claude Desktop

Heads up — pasted images don't reach MCP tools on Claude Desktop yet (Anthropic limitation). Two easy options for your avatar:

Paste a URL if it's already hosted (Imgur, S3, your site) — fastest

Zip it and attach the .zip — right-click → Compress (macOS) / Send to → Compressed folder (Windows) / zip pic.zip pic.png (Linux). I'll take it from there.

Step 0 — Resolve URL (empty-args menu)

Which URL would you like me to walk through? Works on any of:

A GitHub repo — e.g. https://github.com/anthropics/claude-code (activates repo-aware mode: README scan + live-demo detection)

A product page / launch page — e.g. https://pika.art

A docs site — e.g. https://docs.anthropic.com

A blog post / article URL

Output: 1280×800 macOS Sonoma frame with a bottom-left avatar lipsync and element-targeted zoom on every mid-section beat. Default flow runs end-to-end with no confirmation gates — pass --preview if you want a 3-second lipsync sanity check first.

Reply with the URL and I'll start.

Tip: you don't need to type /pika:explainer — just say things like "walk me through ", "make a demo video of ", or "explain this repo: " and I'll fire this skill automatically.

When the user replies with a URL, treat it as the resolved input and proceed to Step 1. Do not re-prompt.

Step 1 — Parse input + detect mode

Mode detection:

GitHub mode — URL host is github.com AND path matches /{owner}/{repo} (no further path segments past the repo root). Activates the repo-aware extras: README scan, live-demo detection, GitHub-specific selectors.
Generic-URL mode — anything else (a product page, docs site, blog post, deeper GitHub path like /blob/HEAD/path). Skips the GitHub extras; uses generic CSS selectors and walks through the URL itself.

Avatar resolution (silent — never ask the user):

If --avatar <url> was passed, use it.
Else call mcp__pika__identity_avatar_url. If non-null, use it.
Else (fresh user, no identity avatar set yet) call mcp__pika__generate_image once with prompt "professional presenter, friendly tech narrator, studio portrait, 1:1, natural lighting" and use the returned URL. Do not ask the user "should I generate one?" — just generate silently.

Voice resolution (silent — never ask the user):

If --voice <id> was passed, use it.
Else call mcp__pika__identity_voice_id. If non-null, use it.
Else pick a casual MiniMax speech-2.8-hd preset matching the resolved avatar's apparent gender:
- Female-coded avatar → English_PlayfulGirl (warm, casual, clearly female-voiced — verified)
- Male-coded avatar → English_Jovialman (warm, casual male)
- Unclear / gender-neutral → English_Jovialman (default)
Determine gender from mcp__pika__identity_persona_read (look for a gender / pronouns field) when identity exists; otherwise infer from the resolved avatar image. Do not call analyze_media for this — it's not worth the extra ~30s round-trip. Do not ask the user.

Do NOT use English_FriendlyPerson — despite being categorized under "female" in MiniMax's catalog, its display name is "Friendly Guy" and it reads as male in playback. English_PlayfulGirl is the canonical casual-female pick. Other verified-female alternates: English_Upbeat_Woman, English_LovelyGirl, English_radiant_girl.

The flow below is annotated per step: GitHub-only, Generic-only, or Both modes.

Step 2 — Read source (no MCP call)

Both modes: use Claude's WebFetch on the input URL to pull the page's main content (h1, hero section, headings, primary copy).

User-supplied --live-url.
GitHub API meta.homepage field — set when the maintainer configured the repo's homepage in GitHub settings (matches tarball repo_analyzer.py:66-77).
package.json "homepage" field.
First match in README of https?://[^\s)\"'<>]+(?:vercel\.app|netlify\.app|github\.io|fly\.dev|railway\.app|render\.com|herokuapp\.com|surge\.sh)[^\s)\"'<>]*.
Any other URL in README that the badge area / "Live Demo" / "Project Page" / "Demo" text points at. The allowlist regex above misses arbitrary custom domains (e.g. <project>-project-page.com); when the README explicitly designates a project page, prefer that over the github.io fallback.
GitHub Pages convention https://{owner}.github.io/{repo} — but only if the deep tree contains a frontend signal (one of index.html, App.tsx, App.jsx, App.vue, app.py, main.py).

If no candidate resolves, the beat sheet skips beats 6–7.

Generic-URL mode: the input URL itself is the only URL the beats walk through — no live_url inference, no extra metadata fetches. Skip Step 2.5 and Step 3.0; jump straight to Step 3.

Step 2.5 — Verify `live_url` reachability (GitHub mode only, no MCP call)

If a candidate live_url was selected, verify it serves real content before authoring beats 6–7. Use WebFetch on the candidate and check the response:

If the response status is 4xx / 5xx, drop live_url to None and skip beats 6–7. The github.io fallback in particular is reachable as a hostname but often returns 404 ("There isn't a GitHub Pages site here") for repos that haven't enabled Pages — recording that 404 page wastes ~12s of the explainer on wrong content.
If the response renders the GitHub Pages "404 — There isn't a GitHub Pages site here." template (heuristic: response body contains "There isn't a GitHub Pages site here"), drop live_url and skip beats 6–7.
Otherwise, keep live_url for beats 6–7.

This mirrors the original tarball's requests.head(live_url, timeout=6, allow_redirects=True) reachability gate.

Step 2.6 — Generic-URL pre-flight (Generic-URL mode only, no MCP call)

A. Bot-block / captcha detection — abort if matched:

If the response body contains any of:

"Verify you are human" / "verify you are not a robot"
"captcha" / "CAPTCHA" / "reCAPTCHA"
"403 Forbidden" / "Access Denied"
"Just a moment" + cf-chl-bypass (Cloudflare challenge)
"We're sorry, something went wrong" (Amazon-style bot block)
A <title> or h1 of just "Robot Check" / "Are you a robot?"

B. Cookie / consent-banner detection — defuse with extra_css + optional click:

Scan the response for these patterns (case-insensitive):

IDs / classes starting with onetrust-, truste-, cookie-banner, cookie-consent, gdpr-, consent-, cmp-
Buttons matching (?i)accept (all )?cookies / (?i)agree.{0,10}cookies / (?i)i (accept|agree)
Apple-specific banner: id ac-gdpr-banner or class as-globalfooter-curtain
Google consent: [role="dialog"] with text "Before you continue"

If detected, set cookie_banner_present = true. Defense in depth — the recording uses BOTH:

CSS injection (extra_css) in the capture_website call to hide common banners universally — even if the click below misses, the banner is visually gone.
A click timed_action at at_s: 0.0 against the most likely dismissal selector (extracted from the WebFetch DOM, e.g. #onetrust-accept-btn-handler, [aria-label*="Accept all" i], button[id*="accept"]).

The extra_css payload (use this verbatim — covers ~80% of consent platforms):

#onetrust-banner-sdk, #onetrust-pc-sdk, #onetrust-consent-sdk { display: none !important; }
#truste-consent-track, #truste-consent-content, .truste_box_overlay { display: none !important; }
[id*="gdpr-cookie"], [id*="cookie-consent"], [id*="cookie-banner"] { display: none !important; }
[class*="cookie-banner"], [class*="cookie-consent"], [class*="consent-banner"] { display: none !important; }
[class*="CookieBanner"], [class*="CookieConsent"], [class*="ConsentBanner"] { display: none !important; }
#ac-gdpr-banner, .as-globalfooter-curtain { display: none !important; }  /* Apple */
[role="dialog"][aria-label*="cookie" i], [role="dialog"][aria-label*="consent" i] { display: none !important; }
.cmp-container, .cmp-modal, .cmp-banner { display: none !important; }

C. Real-DOM element identification — emit concrete selectors:

Read the rendered HTML/markdown WebFetch returned. Note the page's actual primary <h1> text and class.
Note the page's section structure (h2 headings + their parent containers).
Note any prominent CTA / signup / pricing element.
Emit zoom_target.selector using the actual class or id observed, falling back to semantic structure (main > section:nth-of-type(N) h2) when class names look auto-generated (Tailwind _1a2b3c, CSS modules module__hero___xYz).

D. SPA / lazy-render detection — bump initial wait:

E. --focus is honored when supplied (do not solicit):

Step 3.0 — Required README section scan (GitHub mode only, no MCP call)

README contains...	Required beat
`how it works`	scroll_to that heading; zoom `article h2:has(#user-content-how-it-works)`
`audio layer` / `audio timeline`	scroll_to the audio-layer diagram; zoom on the rendered figure or its surrounding heading
`claude code` / `mcp integration`	scroll_to that section; zoom `article pre` or `.highlight` (terminal screenshot / code block)
`architecture` / `system design`	scroll_to that section; zoom `article h2:has(#user-content-architecture)`
`features` (when prominent at top)	scroll_to that heading; zoom `article h2:has(#user-content-features)`
`getting started` / `quick start` / `installation`	scroll_to that heading; zoom `article h2:has(#user-content-installation)` (or the matching slug) — falls back to `article pre` if you want the install code block instead
`usage` / `examples`	scroll_to that heading; zoom `article h2:has(#user-content-usage)` (or the matching slug) — or the first code block under it

Step 3 — Author beat sheet (main thread, no MCP call)

Write a JSON array of 8–10 beats, with a hard total duration of 65–80 seconds and a hard total word count of 165–200 words (assuming a speaking rate of 2.5 words/sec). Each beat:

{
  "t_start": 0.0,
  "t_end": 7.5,
  "action": { "type": "navigate" | "scroll_to" | "hover", "url": "...", "selector": "..." },
  "zoom_target": { "selector": "...", "description": "..." },
  "vo_text": "exact words to speak — 1 to 2 conversational sentences"
}

Hard constraints (validate before emitting the beat sheet — reject the draft if any fails):

Every beat MUST have all five fields: t_start, t_end, action (with type and url), zoom_target (with selector), vo_text. Missing fields ⇒ reject and re-author. (Mirrors tarball's github_explainer.py:183-190 validation pass.)
t_start of beat 0 = 0.0; t_end[i] == t_start[i+1] (continuity).
len(vo_text.split()) / 2.5 ≈ t_end - t_start per beat. Aim for ±10% of this estimate; if your draft is denser than 2.5 wps, tighten the vo_text until it fits.
Total t_end of last beat ≤ 80 seconds. (Reference output is 86.5s including intro; lipsync audio is ~83s. Kling avatar/image2video stalls reliably past ~90s of audio under current load — going over 80s risks a 20-min Kling timeout.)
Total spoken word count between 165 and 200 words.
Every beat's zoom_target.selector MUST be a valid CSS selector for the page that beat lands on. GitHub mode prefers GitHub-specific selectors: h1.f1, #readme, article h2, .blob-code-inner, .highlight, .octicon-star, nav. Generic-URL mode prefers robust generic selectors: h1, [role="main"], main, header, nav, .hero, .feature, section h2, [class*="cta"], [class*="hero"], button, a[href]. Selectors must resolve on the rendered page after the beat's action settles — verify against the DOM you can see via WebFetch before emitting.
vo_text is 1-2 conversational sentences. Dev voice. No stage directions. No markdown.
action.url is a valid https://... URL when action.type == "navigate"; required.

Structural skeleton — GitHub mode (load-bearing for the visual contract — match origin, but Step 3.0 overrides if applicable):

Beat 1: navigate repo root, zoom h1.f1 (repo title), hook sentence.
Beats 2–3: navigate to specific source files (https://github.com/{owner}/{repo}/blob/HEAD/<path>), zoom .blob-code-inner or .highlight. Pick files that match the narration's claim — don't navigate to a file you won't talk about.
Beats 4–5: scroll_to README sections, zoom article h2 or #readme. If Step 3.0 surfaced required sections, replace these slots with the required ones.
Beats 6–7 (only if live_url survived Step 2.5): navigate to live_url, zoom nav / h1 / .hero / main / button / .feature.
Beat 8: back to repo root, zoom .octicon-star, outro.

Structural skeleton — Generic-URL mode:

Beat 1: navigate to the input URL, zoom h1 or [class*="hero"] h1 (the page's primary headline), hook sentence.
Beats 2–3: scroll_to the page's hero / value-prop / first feature section. Zoom .hero, [class*="hero"], [class*="feature"], or section:nth-of-type(1) h2. Pick visible elements the narration references.
Beats 4–5: scroll_to deeper sections — feature lists, screenshots, pricing, social proof. Zoom section h2, [class*="feature"] img, [class*="testimonial"], [class*="pricing"], or any prominent semantic element on the page.
Beats 6–7: scroll_to CTA / signup / demo embed. Zoom [class*="cta"], button, a[class*="button"], or [id*="signup"]. (No live-demo navigation in generic mode — the input URL IS the demo.)
Beat 8: scroll_to footer / closing element, zoom footer h2, footer, or back to top with h1. Outro sentence.

Step 4 — TTS

Step 4.5 — Audio length verification

The TTS engine's actual output rate often diverges from the 2.5 wps estimate in Step 3. Verify before committing to the long-pole lipsync step.

If |audio_duration_seconds - beats[-1].t_end| > 10 seconds, abort and re-author the beat sheet with a tighter / looser word budget.
If audio_duration_seconds > 90, abort regardless — Kling avatar/image2video stalls reliably past 90s of audio under current load.

This gate exists to catch length drift before it costs 20 minutes of Kling timeout. Tarball had an equivalent verification table.

Step 5 — Preview gate (opt-in via `--preview`)

--skip-preview and --yes are accepted as no-ops for backward compatibility — they were the old opt-out flags.

If --preview was supplied:

mcp__pika__generate_speech with text: "Hi, I'm your presenter. Let's explore this repo together." → preview_audio_url.
mcp__pika__generate_lipsync with provider: <resolved_lipsync_provider> (defaults to pika; honor --lipsync-provider kling if supplied), image: <avatar>, audio: preview_audio_url → preview_lipsync_url (bare lipsync, ~3s). Use the same provider here as Step 9 will use for the full audio — the preview's job is to confirm the avatar+voice+provider combo before the long-pole render.
Present to the user verbatim:

Preview ready: <preview_lipsync_url> This confirms the avatar + voice combo. The full render is a long pole (~5–30 min Kling lipsync on the full audio). Reply yes to proceed, or anything else to cancel.
Match ^(yes|go|proceed|confirm|y)$ (case-insensitive). Anything else → STOP, no further MCP calls.

Step 6 — Build `timed_actions` and record

For each beat in order, emit one entry:

navigate beats: {type: "navigate", at_s: <t_start>, url: <action.url>, bbox_selector: <zoom_target.selector>}. The worker navigates, waits to absolute at_s + 0.6 s, scrolls bbox_selector into view, and measures the bbox — all without the caller scheduling a follow-up step.
scroll_to / hover beats: {type: "scroll", at_s: <t_start>, selector: <action.selector or zoom_target.selector>, bbox_selector: <zoom_target.selector>}. The action's own selector drives the page scroll; bbox_selector drives the bbox measurement (it can be the same selector or different — usually the same). (capture_website has no hover; scroll-into-view is the analog.)

Call mcp__pika__capture_website:

url: <beat 0's action.url>
timed_actions: <the N-element list built above> (one entry per beat)
duration_s: max(ceil(beats[-1].t_end), ceil(audio_duration_seconds)) — covers both the beat budget AND any TTS overrun. MiniMax-TTS commonly produces audio ~5-10% longer than the 2.5 wps estimate (Step 4.5 already gates drift > 10s); using the max ensures the screen recording covers the full lipsync, otherwise edit_pip's shortest=1 would clip the recording's tail and you'd lose the last few seconds of audio with no screen behind it.

Generic-URL mode additions (per Step 2.6 pre-flight):

extra_css: <the cookie-banner-hiding CSS payload from Step 2.6 §B> — defensive: hides common consent platforms via display: none !important; so even if the optional click misses, the banner is invisible in the recording.
Prepend a wait action {type: "wait", at_s: 0.0, ms: 2500} for SPA / lazy-render pages (per Step 2.6 §D); use 1500ms for "normal" pages. This gives time for hero images to lazy-load, fonts to swap, and scroll-triggered animations to be ready before the first beat fires.
If cookie_banner_present from Step 2.6 §B, also prepend a click action {type: "click", at_s: 0.5, selector: <detected dismissal selector from WebFetch DOM>} and shift all beat t_start / t_end values by +1.5s to compensate. Beat 1 navigates / scrolls at t_start: 1.5 (or whatever offset accommodates the dismissal animation). The lipsync audio also needs to start with a 1.5s lead-in pause — easiest to just have beat 1's vo_text begin with a half-second pause-friendly opener like "Alright," or "So,", or pad the audio externally before lipsync.
No cookie banner shifts needed if cookie_banner_present == false; just the prepended wait action.

Capture video_url, recording_viewport, action_bboxes. The result returns recording_viewport: {w, h} and action_bboxes: [{idx, selector, found, bbox: {x,y,w,h}}] alongside video_url.

action_bboxes[].idx semantics: the idx field is the position in the input timed_actions array.

GitHub mode: with one timed_action per beat, idx maps 1:1 to beat index — Step 8 uses entry.idx directly as beat_idx.
Generic-URL mode: the prepended wait (and optional cookie-dismissal click) shift the array by 1 or 2. Compute beat_idx = entry.idx - prepend_count where prepend_count is 1 (wait only) or 2 (wait + click). Skip entries where beat_idx < 0 (those are the prepended setup actions, not beats).

The selector field on each entry reports bbox_selector (i.e. zoom_target.selector), not the action's own selector.

Step 7 — Browser chrome

mcp__pika__edit_browser_frame:

video_url: <Step 6 video_url>
url: (live_url if GitHub-mode and survived Step 2.5 else input_url, truncated to 65 chars)
tab_title: <30-char title> — GitHub mode: (meta.description or repo_name or "")[:30]. Generic-URL mode: the page's <title> (from WebFetch in Step 2) or the URL's hostname, truncated to 30 chars. Guard against None/empty.

Returns framed_url (1280×800 Sonoma + chrome).

Step 8 — Build `zoom_keyframes` and apply

Constants:

INTRO_BEATS = 2 — gates by beat-sheet index. Skips zoom on beat indices 0 and 1 ("Beat 1" and "Beat 2" in the structural skeleton above).
HOLD_GAP = 0.6 — seconds of 1.0× before each zoom-in and after each zoom-out.
MIN_BEAT_DUR = 1.5 — beats shorter than this are skipped (no room for a meaningful zoom).
SCALE = 1.35 (precise element-targeted zoom).
FALLBACK_SCALE = 1.25 (default-position fallback when no usable bbox).
FALLBACK_RAMP = 0.4.

edit_browser_frame's inner-content offsets: CONTENT_X=56, CONTENT_Y=108, CONTENT_W=1168, CONTENT_H=637 (verified against the worker's edit_browser_frame/main.py).

Coord transform (recording px → framed px):

cx_framed = 56  + (bbox.x + bbox.w/2) * (1168 / recording_viewport.w)
cy_framed = 108 + (bbox.y + bbox.h/2) * (637  / recording_viewport.h)

Step 8a — Pre-fill default-position keyframes for every non-intro, long-enough beat.

Constants for the default position:

DEFAULT_CX = 56 + 1168 // 2 (screen center of the framed canvas)
DEFAULT_CY = 108 + 637 // 3 (upper-third of the content area, where most GitHub UI prominence lives)

Walk the beat sheet from index INTRO_BEATS (= 2) to the end. For each beat:

If t_end - t_start < MIN_BEAT_DUR (1.5s), skip — too short for a meaningful zoom.
Compute the keyframe's interior interval as [t_start + HOLD_GAP, t_end - HOLD_GAP]. If that interval is shorter than 1.0s, skip.
Otherwise pre-fill that beat's slot in a per-beat map (call it zoom_keyframes_by_beat[beat_idx]) with {cx: DEFAULT_CX, cy: DEFAULT_CY, scale: FALLBACK_SCALE (1.25), ramp_s: FALLBACK_RAMP (0.4)} plus the trimmed t_start/t_end.

Step 8b — Override with bbox-derived precise zoom where action_bboxes provided a usable measurement.

For each entry in action_bboxes:

beat_idx = entry.idx (since Step 6 emits one timed_action per beat). If beat_idx < INTRO_BEATS, skip.
If entry.found is false, skip.
If the beat isn't already in zoom_keyframes_by_beat (was filtered out in Step 8a by MIN_BEAT_DUR/1.0s rules), skip.
Filter degenerate bboxes: skip if bbox.y > recording_viewport.h (offscreen capture — page didn't scroll the element into view in time) or bbox.h > recording_viewport.h * 1.5 (full-page <main> element — yields a meaningless zoom center).
Compute cx_framed/cy_framed from the bbox center using the recording-px → framed-px transform shown above. Override the beat's slot with {cx: cx_framed, cy: cy_framed, scale: SCALE (1.35), ramp_s: min(0.5, (t_end - t_start) * 0.15)}.

Final list: sort the values of zoom_keyframes_by_beat by t_start to produce the zoom_keyframes array.

This guarantees every non-intro, long-enough beat gets a zoom — precise when bbox capture worked, default-positioned otherwise. Avoids the "flat video for the whole runtime" failure mode.

Step 9 — Lipsync the full audio

mcp__pika__generate_lipsync:

provider: <resolved_lipsync_provider> — default: pika (parrot a2v). Honor --lipsync-provider kling if explicitly passed.
image: <avatar>
audio: <Step 4 audio_url>

Provider tradeoffs:

Provider	Wall-clock	Head motion	When to use
`pika` (default)	~2–5 min	Slightly more dramatic, naturalistic	Default for most runs — fast iteration, watchable output, ~10× faster than kling
`kling` (opt-in)	~5–30 min	Minimal, face-centered, presenter-style	High-stakes renders where the avatar must read like a polished presenter; tolerate the long pole

Failover:

If pika fails (rare — parrot a2v is robust at typical explainer audio lengths) → retry once with provider: "kling".
If kling stalls past the worker's 1200s ceiling (visible as repeated processing status with no completion) → fall back to provider: "pika". Step 4.5's audio-length gate should catch the long-audio case before it gets here, but the failover handles the residual risk.

Why pika is the default:

Speed — typical explainer wall-clock drops from ~10–15 min to ~5–7 min total because lipsync is the long pole.
Quality is good enough — parrot a2v is naturalistic; the slight extra head motion reads as engaging rather than distracting in a 60-80s clip with avatar circle PiP.
Kling-mode-pro polish is mostly invisible inside the 246-pixel circle anyway — face area is too small for the minimal-head-motion difference to register on most viewers.

For the canonical "polished presenter" feel of the original tarball reference output, pass --lipsync-provider kling explicitly.

Step 10 — PiP composite

mcp__pika__edit_pip:

main_video_url: <zoomed_url>
overlay_video_url: <lipsync_url>
shape: "circle"
size_px: 246 ← pixel-pinned 246px outer diameter (240 inner avatar + 3+3 stroke ring); matches tarball's CIRCLE_OUT = CIRCLE_SIZE + STROKE * 2
stroke_width_px: 3
stroke_color: "white"
position_px: {x: 20, y: 476} ← 800 − 246 − 78 for dock clearance (matches tarball's H − CIRCLE_OUT − 78)

Do NOT pass size — size_px and size are mutually exclusive. Returns final_url.

Step 11 — Burn captions

Step 12 — Return

Emit captioned_url (or final_url if Step 11 was skipped) on one line: Done: <url>.

Known gaps (carried as follow-up server-side work)

Kling avatar mode:"pro" and prompt not exposed. Tarball calls Kling directly with {mode:"pro", prompt:"talking head, face centered, mouth syncs to audio, minimal head movement, professional presenter"}. The Pika MCP generate_lipsync wrapper drops both for the kling provider (schema says prompt is "parrot only"; mode is hardcoded). Real quality lever for reducing dramatic head motion in the lipsync. Server PR follow-up: surface prompt and mode on generate_lipsync for kling.
No white-frame trim on the screen recording. Tarball's recorder.py:165-185 detects mean-brightness > 245 in the first 4s and trims with ffmpeg. capture_website has internal trim heuristics but doesn't expose them to the caller. Visible as a brief white flash at the start of the explainer when the page is still loading. The 800ms wait action at at_s: 0.0 mitigates this somewhat by giving the page time to paint, but doesn't trim already-recorded white frames. Worker enhancement.
No networkidle wait on per-beat navigation. Tarball uses page.goto(url, wait_until="networkidle", timeout=20000) plus wait_for_timeout(600) after every navigate. capture_website settles to domcontentloaded plus the bbox-capture branch's 600 ms post-action settle (server-side, when bbox_selector is set), but SPA blob pages whose final render happens after domcontentloaded can still get bbox'd against unmounted code blocks. Worker enhancement: expose a wait_until knob on timed_actions[].navigate.
No per-step output-size verification gates. Tarball's verify() helper at github_explainer.py:35-39 checked TTS ≥ 50KB, preview ≥ 100KB, screen ≥ 200KB, lipsync ≥ 500KB, final ≥ 1MB after each step. The MCP path returns URLs only; verifying file size would require an extra mcp__pika__analyze_media call per step (~30s overhead each). Worth adding once user-side latency budget allows it. For now, a downstream-failure cascade (e.g. zero-byte TTS → silent lipsync → blank composite) only surfaces at Step 11.
text_content bbox capture not implemented. capture_website v1 returns action_bboxes only for steps with a CSS selector. text_content-only steps produce no entry. Prefer CSS selectors in zoom_target for guaranteed zoom coverage.
Beat-sheet wording is non-deterministic. Running the same input twice produces different vo_text and different zoom positions. Visual kind is the contract, not pixel-exact reproduction.
Generic-URL mode quality varies by site. Modern indie / SaaS landing pages with semantic markup (<h1> + clear <section> + named class hooks) work well. Big-name corporate sites (apple.com, microsoft.com, amazon.com) hit several known limits: (a) bot detection — the page may serve a degraded version under headless Chrome, or a captcha; Step 2.6 §A aborts on these but the heuristics aren't exhaustive; (b) obfuscated class names — tile-headline instead of hero-title defeats generic selectors; Step 2.6 §C's WebFetch DOM scan helps but isn't perfect; (c) scroll-triggered animations don't play — IntersectionObserver-driven hero reveals fire on real user scrolls, not Playwright's scrollIntoView; the recorded frame may be a static placeholder; (d) lazy-loaded images — picture/source elements with loading="lazy" may not have resolved by the 600ms-or-2500ms settle window; the bbox lands on a transparent placeholder. Workarounds: prefer simpler / smaller marketing pages for launch demos, always pass --focus "the X feature" to anchor beat selection, accept that big-name sites need a follow-up server PR (cookie-banner click retry + wait_until=networkidle + animation-trigger via IntersectionObserver polyfill).
Cookie-banner click is single-attempt. Step 2.6 §B emits one click against the dismissal selector extracted from the WebFetch DOM. If the WebFetch's HTML doesn't include the banner (rendered post-JS) or the selector is wrong, the click silently misses — the extra_css payload is the load-bearing defense. Worker enhancement: support a list of fallback selectors per click action so the worker tries each in order.

Auth

Examples

GitHub-mode (repo-aware: README scan + live-demo detection):

/pika:explainer https://github.com/leigest519/OpenGame
/pika:explainer https://github.com/anthropics/claude-cookbooks --focus "Claude Code MCP integration"
/pika:explainer https://github.com/openai/whisper --preview (opt-in to the preview gate when testing a new avatar)

Generic-URL mode (any non-GitHub URL — drives through the page directly):

/pika:explainer https://pika.art
/pika:explainer https://linear.app --focus "the cycle planning view"
/pika:explainer https://docs.anthropic.com/en/docs/claude-code/plugins
/pika:explainer https://your-product-page.com --avatar https://cdn.example.com/me.png --preview

explainer

Tool Access

Preview

SKILL.md

Similar Skills

Help us improve

Help us improve

explainer

Tool Access

Preview

SKILL.md

/pika:explainer

Behavior

Defaults — fire fast, no mid-flow confirmation

Local avatar images on Claude Desktop

Step 0 — Resolve URL (empty-args menu)

Step 1 — Parse input + detect mode

Step 2 — Read source (no MCP call)

Step 2.5 — Verify live_url reachability (GitHub mode only, no MCP call)

Step 2.6 — Generic-URL pre-flight (Generic-URL mode only, no MCP call)

Step 3.0 — Required README section scan (GitHub mode only, no MCP call)

Step 3 — Author beat sheet (main thread, no MCP call)

Step 4 — TTS

Step 4.5 — Audio length verification

Step 5 — Preview gate (opt-in via --preview)

Step 6 — Build timed_actions and record

Step 7 — Browser chrome

Step 8 — Build zoom_keyframes and apply

Step 9 — Lipsync the full audio

Step 10 — PiP composite

Step 11 — Burn captions

Step 12 — Return

Known gaps (carried as follow-up server-side work)

Auth

Examples

Similar Skills

Help us improve

/pika:explainer

Behavior

Defaults — fire fast, no mid-flow confirmation

Local avatar images on Claude Desktop

Step 0 — Resolve URL (empty-args menu)

Step 1 — Parse input + detect mode

Step 2 — Read source (no MCP call)

Step 2.5 — Verify live_url reachability (GitHub mode only, no MCP call)

Step 2.6 — Generic-URL pre-flight (Generic-URL mode only, no MCP call)

Step 3.0 — Required README section scan (GitHub mode only, no MCP call)

Step 3 — Author beat sheet (main thread, no MCP call)

Step 4 — TTS

Step 4.5 — Audio length verification

Step 5 — Preview gate (opt-in via --preview)

Step 6 — Build timed_actions and record

Step 7 — Browser chrome

Step 8 — Build zoom_keyframes and apply

Step 9 — Lipsync the full audio

Step 10 — PiP composite

Step 11 — Burn captions

Step 12 — Return

Known gaps (carried as follow-up server-side work)

Auth

Examples

Step 2.5 — Verify `live_url` reachability (GitHub mode only, no MCP call)

Step 5 — Preview gate (opt-in via `--preview`)

Step 6 — Build `timed_actions` and record

Step 8 — Build `zoom_keyframes` and apply

Step 2.5 — Verify `live_url` reachability (GitHub mode only, no MCP call)

Step 5 — Preview gate (opt-in via `--preview`)

Step 6 — Build `timed_actions` and record

Step 8 — Build `zoom_keyframes` and apply