From pika
~60-80s explainer video for any URL — GitHub repo, product page, docs site, blog post, or launch. CANONICAL workflow for URL walkthroughs. Use when the user asks to "explain this URL / repo / website / product", "make a walkthrough video for [url]", "demo this site", "Loom-style explainer of [url]", "explainer for github.com/...", or "explain this product link". Drives a real browser through the URL, generates an avatar lipsync, and composites in a 1280×800 macOS Sonoma frame with a 246-pixel bottom-left avatar circle. GitHub URLs activate a repo-aware mode (README scan + live-demo detection); other URLs use a generic page-walkthrough flow.
npx claudepluginhub pika-labs/pika-plugins --plugin pika
This skill uses the workspace's default tool permissions.
Generate a ~60–80s URL explainer video: drive a real browser through the URL along a beat-sheet timeline, generate an avatar lipsync of the narration, and composite it all in a 1280×800 macOS Sonoma frame with a 240-pixel inner avatar (246-pixel outer including 3px white stroke ring) at canvas (20, 476) and element-targeted zoom on every mid-section beat. Works on any URL — product pages, docs sites, blog posts, launches. GitHub URLs activate a repo-aware mode (README scan + live-demo detection); all other URLs use a generic page-walkthrough flow.
Usage: /pika:explainer <url> [--focus "angles"] [--avatar <url>] [--voice <id>] [--lipsync-provider pika|kling] [--preview] [--live-url <url>]
- Avatar and voice: use the explicit flags (--avatar, --voice) when supplied; otherwise resolve via identity_avatar_url / identity_voice_id and proceed. See Step 1 for the full resolution waterfall (including the silent fallback when identity returns null).
- Preview: runs only with --preview (for power users testing new avatar/voice combos before the long-pole render); the default flow runs end-to-end without pausing.
- Focus: do not solicit --focus either. Make a confident first attempt from page structure; users re-run with --focus "X" if the angle missed.

These defaults match the industry standard for media-gen tools (Midjourney / Sora / Runway / HeyGen / Pika.art): submit → render → return. Account credit balance + provider failover (Step 9) are the canonical guardrails.
Claude Desktop can't pass inline-pasted images to MCP tools yet (Anthropic-side limitation). If the user pastes a photo inline, or mentions a local file they want as --avatar, pause Step 1 and kindly send them this — something like:
Heads up — pasted images don't reach MCP tools on Claude Desktop yet (Anthropic limitation). Two easy options for your avatar:
- Paste a URL if it's already hosted (Imgur, S3, your site) — fastest
- Zip it and attach the .zip: right-click → Compress (macOS) / Send to → Compressed folder (Windows) / zip pic.zip pic.png (Linux). I'll take it from there.
When a .zip arrives, unzip it via Bash, call upload_asset for a presigned PUT URL, push the bytes with curl -X PUT, then use the returned public_url as --avatar <url> — all before Step 1. Already-hosted https://... URLs work as-is and skip this entirely. If no avatar is supplied at all, the identity-store default fires.
Strip flags (--focus, --avatar, --voice, --live-url, --lipsync-provider, --no-captions, --preview, --skip-preview, --yes) and key=value parameters from $ARGUMENTS. If what remains contains no https://... URL (or is empty / whitespace-only), print this menu verbatim as your full response, then stop and wait for the user's next message — do NOT call any tool, do NOT proceed to Step 1, do NOT invent a URL. If $ARGUMENTS already carries a URL, skip this step silently and proceed to Step 1.
Which URL would you like me to walk through? Works on any of:
- A GitHub repo, e.g. https://github.com/anthropics/claude-code (activates repo-aware mode: README scan + live-demo detection)
- A product page / launch page, e.g. https://pika.art
- A docs site, e.g. https://docs.anthropic.com
- A blog post / article URL

Output: 1280×800 macOS Sonoma frame with a bottom-left avatar lipsync and element-targeted zoom on every mid-section beat. Default flow runs end-to-end with no confirmation gates; pass --preview if you want a 3-second lipsync sanity check first.

Reply with the URL and I'll start.

Tip: you don't need to type /pika:explainer. Just say things like "walk me through <url>", "make a demo video of <url>", or "explain this repo: <url>" and I'll fire this skill automatically.
When the user replies with a URL, treat it as the resolved input and proceed to Step 1. Do not re-prompt.
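The argument check that decides whether to show the menu can be sketched roughly as below. This is a minimal illustration, not the skill's actual parser: the function name `extract_url` and the split of flags into value-taking and boolean sets are assumptions inferred from the Usage line.

```python
import shlex
from typing import Optional

# Assumption: these flags consume the following token as their value.
VALUE_FLAGS = {"--focus", "--avatar", "--voice", "--live-url", "--lipsync-provider"}
# Assumption: these flags are boolean switches.
BOOL_FLAGS = {"--no-captions", "--preview", "--skip-preview", "--yes"}


def extract_url(arguments: str) -> Optional[str]:
    """Return the first https:// URL left after stripping flags, else None."""
    tokens = shlex.split(arguments)
    remaining = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token in VALUE_FLAGS:
            i += 2  # drop the flag and its value (e.g. --avatar <url>)
            continue
        if token in BOOL_FLAGS:
            i += 1
            continue
        if "=" in token and not token.startswith("https://"):
            i += 1  # key=value parameter
            continue
        remaining.append(token)
        i += 1
    for token in remaining:
        if token.startswith("https://"):
            return token
    return None
```

A `None` result corresponds to the "print the menu and stop" branch. Note that `shlex.split` raises `ValueError` on unbalanced quotes, so a real implementation would need to handle free-form text more leniently.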
Required: url (must be https://...).
Optional: --avatar <url> (overrides identity-store default), --voice <minimax-voice-id>, --focus "..." (editorial guidance woven into vo_text), --live-url <url> (force-supply live demo URL — GitHub mode only), --lipsync-provider <pika|kling> (defaults to pika — parrot a2v, ~2-5 min wall-clock, slightly more dramatic head motion. Pass kling for tighter face-centered output at ~5-30 min wall-clock — Kling produces minimal-head-motion presenter shots but is the long-pole stage; reserve for high-stakes renders), --no-captions (skip the Step 11 caption burn — default is captions on), --preview (opt-in to the Step 5 preview gate — ~3s lipsync of "Hi, I'm your presenter" for testing new avatar/voice combos before the long-pole render; default is no preview). --skip-preview and --yes are accepted as no-ops for backward compatibility.
Mode detection:
- GitHub mode: the host is github.com AND the path matches /{owner}/{repo} (no further path segments past the repo root). Activates the repo-aware extras: README scan, live-demo detection, GitHub-specific selectors.
- Generic-URL mode: any other URL (including deeper GitHub paths such as /blob/HEAD/path). Skips the GitHub extras; uses generic CSS selectors and walks through the URL itself.

Avatar resolution (silent; never ask the user):
If --avatar <url> was passed, use it.
Else call mcp__pika__identity_avatar_url. If non-null, use it.
Else call mcp__pika__generate_image once with prompt "professional presenter, friendly tech narrator, studio portrait, 1:1, natural lighting" and use the returned URL. Do not ask the user "should I generate one?"; just generate silently.

Voice resolution (silent; never ask the user):
If --voice <id> was passed, use it.
Else call mcp__pika__identity_voice_id. If non-null, use it.
Else pick a casual MiniMax speech-2.8-hd preset matching the resolved avatar's apparent gender:
- Female: English_PlayfulGirl (warm, casual, clearly female-voiced; verified)
- Male: English_Jovialman (warm, casual male)
- Unclear: English_Jovialman (default)

Determine gender from mcp__pika__identity_persona_read (look for a gender / pronouns field) when identity exists; otherwise infer from the resolved avatar image. Do not call analyze_media for this; it's not worth the extra ~30s round-trip. Do not ask the user.
Do NOT use English_FriendlyPerson — despite being categorized under "female" in MiniMax's catalog, its display name is "Friendly Guy" and it reads as male in playback. English_PlayfulGirl is the canonical casual-female pick. Other verified-female alternates: English_Upbeat_Woman, English_LovelyGirl, English_radiant_girl.
The flow below is annotated per step: GitHub-only, Generic-only, or Both modes.
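The mode-detection rule from Step 1 can be sketched as follows. This is an illustrative reading of the rule, not the skill's code: the function name `detect_mode` and the choice to match only the exact host github.com (not www.github.com) are assumptions.

```python
from urllib.parse import urlparse


def detect_mode(url: str) -> str:
    """Return "github" for a repo-root URL, "generic" for everything else."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    # Non-empty path segments; a trailing slash does not add a segment.
    segments = [part for part in parsed.path.split("/") if part]
    if host == "github.com" and len(segments) == 2:
        return "github"  # exactly /{owner}/{repo}
    return "generic"
```

Deeper GitHub paths such as /blob/HEAD/... fall through to the generic flow, which matches the "no further path segments past the repo root" condition.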
Both modes: use Claude's WebFetch on the input URL to pull the page's main content (h1, hero section, headings, primary copy).
GitHub mode additions: also fetch top-level file tree, (best-effort) package.json / pyproject.toml, and GitHub API repo metadata via gh api repos/{owner}/{repo} for homepage, description, language, topics. Detect a candidate live_url in this priority:
1. --live-url.
2. meta.homepage field: set when the maintainer configured the repo's homepage in GitHub settings (matches tarball repo_analyzer.py:66-77).
3. package.json "homepage" field.
4. A README URL matching the hosted-platform pattern https?://[^\s)\"'<>]+(?:vercel\.app|netlify\.app|github\.io|fly\.dev|railway\.app|render\.com|herokuapp\.com|surge\.sh)[^\s)\"'<>]*.
5. A project page named in the README (e.g. <project>-project-page.com); when the README explicitly designates a project page, prefer that over the github.io fallback.
6. https://{owner}.github.io/{repo}, but only if the deep tree contains a frontend signal (one of index.html, App.tsx, App.jsx, App.vue, app.py, main.py).

If no candidate resolves, the beat sheet skips beats 6–7.
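A minimal sketch of the first few rungs of this live_url waterfall, assuming the README text and metadata fields have already been fetched. The regex is the hosted-platform pattern quoted above; the function names `first_hosted_url` and `pick_live_url` are hypothetical, and the project-page and github.io fallbacks are deliberately left out.

```python
import re

# Hosted-platform pattern as given in the priority list.
HOSTED_URL_RE = re.compile(
    r"https?://[^\s)\"'<>]+"
    r"(?:vercel\.app|netlify\.app|github\.io|fly\.dev|railway\.app"
    r"|render\.com|herokuapp\.com|surge\.sh)"
    r"[^\s)\"'<>]*"
)


def first_hosted_url(readme_text):
    """Return the first README URL on a known hosting platform, else None."""
    match = HOSTED_URL_RE.search(readme_text)
    return match.group(0) if match else None


def pick_live_url(cli_live_url, api_homepage, pkg_homepage, readme_text):
    """Walk the explicit sources in priority order, then scan the README."""
    for candidate in (cli_live_url, api_homepage, pkg_homepage):
        if candidate:
            return candidate
    return first_hosted_url(readme_text or "")
```

Whatever this returns is still only a candidate; it has to pass the reachability check before beats 6–7 are authored.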
Generic-URL mode: the input URL itself is the only URL the beats walk through — no live_url inference, no extra metadata fetches. Skip Step 2.5 and Step 3.0; jump straight to Step 3.
live_url reachability (GitHub mode only, no MCP call)
If a candidate live_url was selected, verify it serves real content before authoring beats 6–7. Use WebFetch on the candidate and check the response:
- If the fetch fails, set live_url to None and skip beats 6–7. The github.io fallback in particular is reachable as a hostname but often returns 404 ("There isn't a GitHub Pages site here") for repos that haven't enabled Pages; recording that 404 page wastes ~12s of the explainer on wrong content.
- If the body is a placeholder page (e.g. "There isn't a GitHub Pages site here"), drop live_url and skip beats 6–7.
- Otherwise, keep live_url for beats 6–7.

This mirrors the original tarball's requests.head(live_url, timeout=6, allow_redirects=True) reachability gate.
Before authoring beats for a non-GitHub URL, WebFetch the input URL and inspect the response. This step prevents three common Generic-URL failure modes: (a) recording a captcha / bot-block page instead of content, (b) the cookie/consent banner eating the first ~3 seconds of video, (c) generic CSS selectors missing the page's actual hero / sections.
A. Bot-block / captcha detection — abort if matched:
If the response body contains any of:
- "Verify you are human" / "verify you are not a robot"
- "captcha" / "CAPTCHA" / "reCAPTCHA"
- "403 Forbidden" / "Access Denied"
- "Just a moment" + cf-chl-bypass (Cloudflare challenge)
- "We're sorry, something went wrong" (Amazon-style bot block)
- <title> or h1 of just "Robot Check" / "Are you a robot?"

→ ABORT with a clear error to the user: "Generic-URL mode can't render this site — the page is showing a bot-detection / captcha challenge under headless Chrome. Try a different URL, or run a real-user version of the page first to verify it loads cleanly."
B. Cookie / consent-banner detection — defuse with extra_css + optional click:
Scan the response for these patterns (case-insensitive):
- id / class fragments: onetrust-, truste-, cookie-banner, cookie-consent, gdpr-, consent-, cmp-
- Text matching (?i)accept (all )?cookies / (?i)agree.{0,10}cookies / (?i)i (accept|agree)
- Apple: id ac-gdpr-banner or class as-globalfooter-curtain
- [role="dialog"] with text "Before you continue"

If detected, set cookie_banner_present = true. Defense in depth: the recording uses BOTH:
- CSS injection (extra_css) in the capture_website call to hide common banners universally; even if the click below misses, the banner is visually gone.
- An optional click timed_action at at_s: 0.0 against the most likely dismissal selector (extracted from the WebFetch DOM, e.g. #onetrust-accept-btn-handler, [aria-label*="Accept all" i], button[id*="accept"]).

The extra_css payload (use this verbatim; covers ~80% of consent platforms):
#onetrust-banner-sdk, #onetrust-pc-sdk, #onetrust-consent-sdk { display: none !important; }
#truste-consent-track, #truste-consent-content, .truste_box_overlay { display: none !important; }
[id*="gdpr-cookie"], [id*="cookie-consent"], [id*="cookie-banner"] { display: none !important; }
[class*="cookie-banner"], [class*="cookie-consent"], [class*="consent-banner"] { display: none !important; }
[class*="CookieBanner"], [class*="CookieConsent"], [class*="ConsentBanner"] { display: none !important; }
#ac-gdpr-banner, .as-globalfooter-curtain { display: none !important; } /* Apple */
[role="dialog"][aria-label*="cookie" i], [role="dialog"][aria-label*="consent" i] { display: none !important; }
.cmp-container, .cmp-modal, .cmp-banner { display: none !important; }
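Checks A and B can be approximated with plain string and regex matching over the fetched HTML. The sketch below is an assumption-laden illustration: the function names are hypothetical, and it searches the whole response body rather than only id/class attributes, so it will over-match on pages that merely mention these words.

```python
import re

# Check A phrases, lowercased; "captcha" also covers CAPTCHA / reCAPTCHA.
BLOCK_PHRASES = (
    "verify you are human",
    "verify you are not a robot",
    "captcha",
    "403 forbidden",
    "access denied",
    "we're sorry, something went wrong",
)
ROBOT_TITLES = {"robot check", "are you a robot?"}

# Check B fragments and text patterns, as listed in the step.
BANNER_FRAGMENTS = (
    "onetrust-", "truste-", "cookie-banner", "cookie-consent",
    "gdpr-", "consent-", "cmp-",
)
BANNER_TEXT = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (r"accept (all )?cookies", r"agree.{0,10}cookies", r"i (accept|agree)")
]


def looks_bot_blocked(html: str) -> bool:
    """Check A: True when the body looks like a captcha / bot-block page."""
    lowered = html.lower()
    if any(phrase in lowered for phrase in BLOCK_PHRASES):
        return True
    if "just a moment" in lowered and "cf-chl-bypass" in lowered:
        return True  # Cloudflare challenge
    for tag in ("title", "h1"):
        for match in re.finditer(rf"<{tag}[^>]*>(.*?)</{tag}>", lowered, re.S):
            if match.group(1).strip() in ROBOT_TITLES:
                return True
    return False


def detect_cookie_banner(html: str) -> bool:
    """Check B: True when a consent banner is likely present."""
    lowered = html.lower()
    if any(fragment in lowered for fragment in BANNER_FRAGMENTS):
        return True
    if any(pattern.search(html) for pattern in BANNER_TEXT):
        return True
    if "ac-gdpr-banner" in lowered or "as-globalfooter-curtain" in lowered:
        return True  # Apple
    return 'role="dialog"' in lowered and "before you continue" in lowered
```

A `True` from `looks_bot_blocked` maps to the abort path; a `True` from `detect_cookie_banner` maps to setting cookie_banner_present and sending the extra_css payload.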
C. Real-DOM element identification — emit concrete selectors:
Generic CSS selectors (h1, [class*="hero"], section h2) work on semantic / well-marked-up sites but miss obfuscated class names on big-name corporate sites (apple.com uses tile-headline / as-headline-section-title, not hero-*). For each beat, prefer the actual DOM elements observed in the WebFetch:
- Note the <h1> text and class.
- Set zoom_target.selector using the actual class or id observed, falling back to semantic structure (main > section:nth-of-type(N) h2) when class names look auto-generated (Tailwind _1a2b3c, CSS modules module__hero___xYz).

D. SPA / lazy-render detection (bump initial wait):
If the WebFetch response has fewer than 3 visible headings / minimal text content, the page may be SPA-rendered post-domcontentloaded. Emit a longer initial wait action ({type: "wait", at_s: 0.0, ms: 2500}) before any beat fires, instead of the default 600ms settle.
E. --focus is honored when supplied (do not solicit):
Without --focus, select beats from generic structure cues and proceed silently with a confident first attempt. Do not ask the user "what should I focus on?" before firing; users iterate by re-running with --focus "the X feature" if the first pass misses the angle they wanted. With --focus supplied, anchor beat selection on the phrase: use concrete page sections that match it and ignore irrelevant marketing chrome.
Before authoring the beat sheet, scan the README (case-insensitive, full-text) for any of these section names. If a match is found, you must add a dedicated beat for that section in Step 3, replacing one of the generic beats 4–5 if necessary:
| README contains... | Required beat |
|---|---|
| how it works | scroll_to that heading; zoom article h2:has(#user-content-how-it-works) |
| audio layer / audio timeline | scroll_to the audio-layer diagram; zoom on the rendered figure or its surrounding heading |
| claude code / mcp integration | scroll_to that section; zoom article pre or .highlight (terminal screenshot / code block) |
| architecture / system design | scroll_to that section; zoom article h2:has(#user-content-architecture) |
| features (when prominent at top) | scroll_to that heading; zoom article h2:has(#user-content-features) |
| getting started / quick start / installation | scroll_to that heading; zoom article h2:has(#user-content-installation) (or the matching slug); use article pre instead if you want the install code block |
| usage / examples | scroll_to that heading; zoom article h2:has(#user-content-usage) (or the matching slug), or the first code block under it |
GitHub heading slug rule: lowercase, spaces → dashes, strip non-[a-z0-9-] characters. So "How it works" → #user-content-how-it-works, "Quick Start" → #user-content-quick-start. GitHub injects the <a id="user-content-{slug}"> anchor inside each rendered <hN>, so hN:has(#user-content-{slug}) reliably grabs the heading element across any GitHub README.
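The slug rule as stated here is small enough to sketch directly. The helper names below are hypothetical, and the code implements only the three-part rule from this section; GitHub's real slugger has extra cases (for example, duplicate headings get numeric suffixes) that this does not cover.

```python
import re


def github_heading_slug(heading: str) -> str:
    """Lowercase, spaces to dashes, strip anything outside [a-z0-9-]."""
    slug = heading.lower().replace(" ", "-")
    return re.sub(r"[^a-z0-9-]", "", slug)


def heading_selector(heading: str, level: int = 2) -> str:
    """Build the vanilla-CSS :has() selector for a rendered README heading."""
    return f"article h{level}:has(#user-content-{github_heading_slug(heading)})"
```

For instance, `heading_selector("Quick Start")` yields a selector of the same shape as the table entries above, which keeps zoom_target.selector resolvable through document.querySelector.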
Selector contract: bbox_selector MUST be vanilla CSS that resolves via document.querySelector (capture_website runs the post-action smooth-scroll JS via page.evaluate, which uses the browser's native selector engine). Do NOT use Playwright extensions like :has-text("..."), text=..., or :visible — those resolve in Playwright's page.query_selector (so the bbox capture finds the element) but silently fail in the smooth-scroll's document.querySelector (so the page never scrolls to the target, and bbox.y ends up at document-Y instead of top - 60 px, which trips Step 8b's bbox.y > recording_viewport.h degenerate filter and falls back to default-position zoom). CSS Level 4 :has(...) IS vanilla and supported in modern Chromium.
These sections are the highest-information visuals in most explainer-worthy repos. Missing them produces a generic walkthrough; including them gives the explainer a concrete "show, don't tell" beat. The original tarball SKILL.md flagged the first four with SPECIAL rules in the Gemini prompt; this Step 3.0 promotes them from incidental guidance to a hard requirement and adds three more high-signal headings common in OSS READMEs.
Write a JSON array of 8–10 beats, with a hard total duration of 65–80 seconds and a hard total word count of 165–200 words (assuming a speaking rate of 2.5 words/sec). Each beat:
{
"t_start": 0.0,
"t_end": 7.5,
"action": { "type": "navigate" | "scroll_to" | "hover", "url": "...", "selector": "..." },
"zoom_target": { "selector": "...", "description": "..." },
"vo_text": "exact words to speak — 1 to 2 conversational sentences"
}
Hard constraints (validate before emitting the beat sheet — reject the draft if any fails):
- Required fields: t_start, t_end, action (with type and url), zoom_target (with selector), vo_text. Missing fields ⇒ reject and re-author. (Mirrors tarball's github_explainer.py:183-190 validation pass.)
- t_start of beat 0 = 0.0; t_end[i] == t_start[i+1] (continuity).
- len(vo_text.split()) / 2.5 ≈ t_end - t_start per beat. Aim for ±10% of this estimate; if your draft is denser than 2.5 wps, tighten the vo_text until it fits.
- t_end of last beat ≤ 80 seconds. (Reference output is 86.5s including intro; lipsync audio is ~83s. Kling avatar/image2video stalls reliably past ~90s of audio under current load, so going over 80s risks a 20-min Kling timeout.)
- zoom_target.selector MUST be a valid CSS selector for the page that beat lands on. GitHub mode prefers GitHub-specific selectors: h1.f1, #readme, article h2, .blob-code-inner, .highlight, .octicon-star, nav. Generic-URL mode prefers robust generic selectors: h1, [role="main"], main, header, nav, .hero, .feature, section h2, [class*="cta"], [class*="hero"], button, a[href]. Selectors must resolve on the rendered page after the beat's action settles; verify against the DOM you can see via WebFetch before emitting.
- vo_text is 1-2 conversational sentences. Dev voice. No stage directions. No markdown.
- action.url is a valid https://... URL when action.type == "navigate"; required.

Self-check before Step 4: verify total_words is in [165, 200] AND total_seconds (= beats[-1].t_end) is in [65, 80]. If either misses bounds, re-author the beat sheet; do not proceed to TTS. (No need to "print" anywhere: this is an internal draft validation; just reject the draft and re-author until it passes.)
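The hard constraints lend themselves to a mechanical pre-check. The following is a rough sketch under stated assumptions, not the tarball's validation pass: `validate_beats` is a hypothetical name, url is only enforced for navigate beats, and the soft ±10% pacing target and selector validity are not checked.

```python
REQUIRED = ("t_start", "t_end", "action", "zoom_target", "vo_text")


def validate_beats(beats):
    """Return a list of problems; an empty list means the draft passes."""
    errors = []
    if not 8 <= len(beats) <= 10:
        errors.append(f"expected 8-10 beats, got {len(beats)}")
    for i, beat in enumerate(beats):
        missing = [key for key in REQUIRED if key not in beat]
        if missing:
            errors.append(f"beat {i}: missing {missing}")
            continue
        action = beat["action"]
        if "type" not in action:
            errors.append(f"beat {i}: action has no type")
        elif action["type"] == "navigate" and not str(action.get("url", "")).startswith("https://"):
            errors.append(f"beat {i}: navigate needs an https:// url")
        if not beat["zoom_target"].get("selector"):
            errors.append(f"beat {i}: zoom_target has no selector")
        # Continuity: beat 0 starts at 0.0, later beats start where the previous ended.
        expected_start = 0.0 if i == 0 else beats[i - 1].get("t_end")
        if beat["t_start"] != expected_start:
            errors.append(f"beat {i}: t_start breaks continuity")
    if errors:
        return errors
    total_words = sum(len(beat["vo_text"].split()) for beat in beats)
    total_seconds = beats[-1]["t_end"]
    if not 165 <= total_words <= 200:
        errors.append(f"total_words {total_words} outside [165, 200]")
    if not 65 <= total_seconds <= 80:
        errors.append(f"total_seconds {total_seconds} outside [65, 80]")
    return errors
```

Any non-empty result corresponds to "reject the draft and re-author"; only an empty list should proceed to TTS.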
Structural skeleton — GitHub mode (load-bearing for the visual contract — match origin, but Step 3.0 overrides if applicable):
- Beat 1: navigate repo root, zoom h1.f1 (repo title), hook sentence.
- Beats 2–3: navigate to specific source files (https://github.com/{owner}/{repo}/blob/HEAD/<path>), zoom .blob-code-inner or .highlight. Pick files that match the narration's claim; don't navigate to a file you won't talk about.
- Beats 4–5: scroll_to README sections, zoom article h2 or #readme. If Step 3.0 surfaced required sections, replace these slots with the required ones.
- Beats 6–7 (only if live_url survived Step 2.5): navigate to live_url, zoom nav / h1 / .hero / main / button / .feature.
- Final beat: zoom .octicon-star, outro.

Structural skeleton (Generic-URL mode):
- Beat 1: navigate to the input URL, zoom h1 or [class*="hero"] h1 (the page's primary headline), hook sentence.
- Beats 2–3: scroll_to the page's hero / value-prop / first feature section. Zoom .hero, [class*="hero"], [class*="feature"], or section:nth-of-type(1) h2. Pick visible elements the narration references.
- Beats 4–5: scroll_to deeper sections such as feature lists, screenshots, pricing, social proof. Zoom section h2, [class*="feature"] img, [class*="testimonial"], [class*="pricing"], or any prominent semantic element on the page.
- Beats 6–7: scroll_to CTA / signup / demo embed. Zoom [class*="cta"], button, a[class*="button"], or [id*="signup"]. (No live-demo navigation in generic mode; the input URL IS the demo.)
- Final beat: scroll_to footer / closing element, zoom footer h2, footer, or back to top with h1. Outro sentence.

If --focus is supplied, weave its angles into vo_text without mutating the structural skeleton. Prefer CSS selectors over text_content in zoom_target.selector; bbox capture is selector-only (see Known gaps).
Call mcp__pika__generate_speech with provider: "minimax-tts", text: <full vo_text join>, optional voice_id. Capture result.audio_url (the dispatcher returns audio under audio_url, not url) and result.duration_seconds. Voice defaults to identity-store injection in plugin mode.
The TTS engine's actual output rate often diverges from the 2.5 wps estimate in Step 3. Verify before committing to the long-pole lipsync step.
- If |audio_duration_seconds - beats[-1].t_end| > 10 seconds, abort and re-author the beat sheet with a tighter / looser word budget.
- If audio_duration_seconds > 90, abort regardless: Kling avatar/image2video stalls reliably past 90s of audio under current load.

This gate exists to catch length drift before it costs 20 minutes of Kling timeout. The tarball had an equivalent verification table.
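The two abort conditions of this duration gate reduce to a pair of comparisons. A minimal sketch, with a hypothetical function name and return convention (None for pass, a reason string for abort):

```python
def tts_duration_gate(audio_duration_seconds, last_beat_t_end):
    """Return None when the audio passes the gate, else the abort reason."""
    if audio_duration_seconds > 90:
        return "abort: audio exceeds 90s"
    if abs(audio_duration_seconds - last_beat_t_end) > 10:
        return "abort: drift over 10s, re-author the beat sheet"
    return None
```

Checking the 90-second ceiling first means an over-long audio is rejected even when it happens to sit close to the beat sheet's own length.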
Preview gate (opt-in via --preview)
Skip Step 5 entirely by default. Proceed directly to Step 6 unless the user explicitly passed --preview: do not generate a preview, do not ask for confirmation. This matches the industry standard for media-gen tools (Midjourney / Sora / Runway / HeyGen / Pika.art): submit → render → return; account credit balance + provider failover are the canonical guardrails.
--skip-preview and --yes are accepted as no-ops for backward compatibility — they were the old opt-out flags.
If --preview was supplied:
mcp__pika__generate_speech with text: "Hi, I'm your presenter. Let's explore this repo together." → preview_audio_url.
mcp__pika__generate_lipsync with provider: <resolved_lipsync_provider> (defaults to pika; honor --lipsync-provider kling if supplied), image: <avatar>, audio: preview_audio_url → preview_lipsync_url (bare lipsync, ~3s). Use the same provider here as Step 9 will use for the full audio — the preview's job is to confirm the avatar+voice+provider combo before the long-pole render.
Present to the user verbatim:
Preview ready:
<preview_lipsync_url>
This confirms the avatar + voice combo. The full render is a long pole (~5–30 min Kling lipsync on the full audio). Reply yes to proceed, or anything else to cancel.
Match ^(yes|go|proceed|confirm|y)$ (case-insensitive). Anything else → STOP, no further MCP calls.
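The confirmation match is a single anchored regex. In the sketch below, the function name is hypothetical and trimming surrounding whitespace before matching is an assumption the text does not state.

```python
import re

CONFIRM_RE = re.compile(r"^(yes|go|proceed|confirm|y)$", re.IGNORECASE)


def is_confirmation(reply: str) -> bool:
    """True only when the whole reply is one of the accepted words."""
    return CONFIRM_RE.match(reply.strip()) is not None
```

Because the pattern is anchored at both ends, replies like "yes please" fall into the "anything else" branch and stop the run.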
Build timed_actions and record
Translate the beat sheet into capture_website timed_actions. One timed_action per beat: set bbox_selector to the beat's zoom_target.selector and capture_website captures the post-action bbox of that element internally (legacy 600 ms settle → smooth-scroll-to-top - 60 px → 1300 ms post-anim → measure, all server-side).
For each beat in order, emit one entry:
- navigate beats: {type: "navigate", at_s: <t_start>, url: <action.url>, bbox_selector: <zoom_target.selector>}. The worker navigates, waits to absolute at_s + 0.6 s, scrolls bbox_selector into view, and measures the bbox, all without the caller scheduling a follow-up step.
- scroll_to / hover beats: {type: "scroll", at_s: <t_start>, selector: <action.selector or zoom_target.selector>, bbox_selector: <zoom_target.selector>}. The action's own selector drives the page scroll; bbox_selector drives the bbox measurement (it can be the same selector or different; usually the same). (capture_website has no hover; scroll-into-view is the analog.)

Do NOT prepend the eight-step intro scroll-through that the tarball ran. The lipsync audio is timed from t=0 of the beat sheet; a prepended intro shifts the screen recording forward by ~3 s while leaving the audio un-shifted, causing audio/video desync. The capture_website recording begins at t=0 with beat 0's URL already loaded; that's the orientation the tarball's intro scroll provided, minus the desync.
Call mcp__pika__capture_website:
- url: <beat 0's action.url>
- timed_actions: <the N-element list built above> (one entry per beat)
- duration_s: max(ceil(beats[-1].t_end), ceil(audio_duration_seconds)), which covers both the beat budget AND any TTS overrun. MiniMax-TTS commonly produces audio ~5-10% longer than the 2.5 wps estimate (Step 4.5 already gates drift > 10s); using the max ensures the screen recording covers the full lipsync. Otherwise edit_pip's shortest=1 would cut the composite at the end of the recording and drop the last few seconds of narration.

Generic-URL mode additions (per Step 2.6 pre-flight):
- extra_css: <the cookie-banner-hiding CSS payload from Step 2.6 §B>. Defensive: hides common consent platforms via display: none !important; so even if the optional click misses, the banner is invisible in the recording.
- Prepend a wait action {type: "wait", at_s: 0.0, ms: 2500} for SPA / lazy-render pages (per Step 2.6 §D); use 1500ms for "normal" pages. This gives time for hero images to lazy-load, fonts to swap, and scroll-triggered animations to be ready before the first beat fires.
- If cookie_banner_present from Step 2.6 §B, also prepend a click action {type: "click", at_s: 0.5, selector: <detected dismissal selector from WebFetch DOM>} and shift all beat t_start / t_end values by +1.5s to compensate. Beat 1 navigates / scrolls at t_start: 1.5 (or whatever offset accommodates the dismissal animation). The lipsync audio also needs to start with a 1.5s lead-in pause; the easiest route is to have beat 1's vo_text begin with a half-second pause-friendly opener like "Alright," or "So,", or to pad the audio externally before lipsync.
- No click or beat shift when cookie_banner_present == false; just the prepended wait action.

Capture video_url, recording_viewport, action_bboxes. The result returns recording_viewport: {w, h} and action_bboxes: [{idx, selector, found, bbox: {x,y,w,h}}] alongside video_url.
action_bboxes[].idx semantics: the idx field is the position in the input timed_actions array.
- GitHub mode: idx maps 1:1 to beat index; Step 8 uses entry.idx directly as beat_idx.
- Generic-URL mode: the prepended wait (and optional cookie-dismissal click) shift the array by 1 or 2. Compute beat_idx = entry.idx - prepend_count where prepend_count is 1 (wait only) or 2 (wait + click). Skip entries where beat_idx < 0 (those are the prepended setup actions, not beats).

The selector field on each entry reports bbox_selector (i.e. zoom_target.selector), not the action's own selector.
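The idx-to-beat mapping is a small offset calculation. A sketch with a hypothetical helper name, assuming each action_bboxes entry is a dict carrying at least an idx key:

```python
def map_bboxes_to_beats(action_bboxes, prepend_count=0):
    """Key each action_bboxes entry by beat index, dropping setup actions."""
    by_beat = {}
    for entry in action_bboxes:
        beat_idx = entry["idx"] - prepend_count
        if beat_idx < 0:
            continue  # prepended wait / click, not a beat
        by_beat[beat_idx] = entry
    return by_beat
```

With prepend_count left at 0 the mapping is the identity, which is the GitHub-mode case; passing 1 or 2 covers the Generic-URL prepends.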
mcp__pika__edit_browser_frame:
- video_url: <Step 6 video_url>
- url: (live_url if GitHub-mode and survived Step 2.5 else input_url, truncated to 65 chars)
- tab_title: <30-char title>. GitHub mode: (meta.description or repo_name or "")[:30]. Generic-URL mode: the page's <title> (from WebFetch in Step 2) or the URL's hostname, truncated to 30 chars. Guard against None/empty.

Returns framed_url (1280×800 Sonoma + chrome).
Build zoom_keyframes and apply
Constants:
- INTRO_BEATS = 2: gates by beat-sheet index. Skips zoom on beat indices 0 and 1 ("Beat 1" and "Beat 2" in the structural skeleton above).
- HOLD_GAP = 0.6: seconds of 1.0× before each zoom-in and after each zoom-out.
- MIN_BEAT_DUR = 1.5: beats shorter than this are skipped (no room for a meaningful zoom).
- SCALE = 1.35 (precise element-targeted zoom).
- FALLBACK_SCALE = 1.25 (default-position fallback when no usable bbox).
- FALLBACK_RAMP = 0.4.
- edit_browser_frame's inner-content offsets: CONTENT_X=56, CONTENT_Y=108, CONTENT_W=1168, CONTENT_H=637 (verified against the worker's edit_browser_frame/main.py).
Coord transform (recording px → framed px):
cx_framed = 56 + (bbox.x + bbox.w/2) * (1168 / recording_viewport.w)
cy_framed = 108 + (bbox.y + bbox.h/2) * (637 / recording_viewport.h)
Build the zoom list with a per-beat default + bbox override pattern. The legacy rig followed an "every non-intro beat gets a zoom — bbox-derived if available, default-position otherwise" rule. Reproduce that here:
Step 8a — Pre-fill default-position keyframes for every non-intro, long-enough beat.
Constants for the default position:
- DEFAULT_CX = 56 + 1168 // 2 (screen center of the framed canvas)
- DEFAULT_CY = 108 + 637 // 3 (upper-third of the content area, where most GitHub UI prominence lives)

Walk the beat sheet from index INTRO_BEATS (= 2) to the end. For each beat:
- If t_end - t_start < MIN_BEAT_DUR (1.5s), skip: too short for a meaningful zoom.
- Trim the zoom window to [t_start + HOLD_GAP, t_end - HOLD_GAP]. If that interval is shorter than 1.0s, skip.
- Pre-fill the beat's slot (zoom_keyframes_by_beat[beat_idx]) with {cx: DEFAULT_CX, cy: DEFAULT_CY, scale: FALLBACK_SCALE (1.25), ramp_s: FALLBACK_RAMP (0.4)} plus the trimmed t_start/t_end.

Step 8b: Override with bbox-derived precise zoom where action_bboxes provided a usable measurement.
For each entry in action_bboxes:
- beat_idx = entry.idx (since Step 6 emits one timed_action per beat; in Generic-URL mode subtract prepend_count as described in Step 6). If beat_idx < INTRO_BEATS, skip.
- If entry.found is false, skip.
- If the beat has no slot in zoom_keyframes_by_beat (was filtered out in Step 8a by MIN_BEAT_DUR/1.0s rules), skip.
- Skip degenerate bboxes: bbox.y > recording_viewport.h (offscreen capture: page didn't scroll the element into view in time) or bbox.h > recording_viewport.h * 1.5 (full-page <main> element, which yields a meaningless zoom center).
- Otherwise compute cx_framed/cy_framed from the bbox center using the recording-px → framed-px transform shown above. Override the beat's slot with {cx: cx_framed, cy: cy_framed, scale: SCALE (1.35), ramp_s: min(0.5, (t_end - t_start) * 0.15)}.

Final list: sort the values of zoom_keyframes_by_beat by t_start to produce the zoom_keyframes array.
This guarantees every non-intro, long-enough beat gets a zoom — precise when bbox capture worked, default-positioned otherwise. Avoids the "flat video for the whole runtime" failure mode.
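Steps 8a and 8b together can be sketched as one pass that pre-fills defaults and then overrides them. This is an illustrative reading of the rules above, not the worker's code: the function name is hypothetical, beats and bbox entries are assumed to be plain dicts, and ramp_s is computed from the untrimmed beat duration, which the text leaves ambiguous.

```python
INTRO_BEATS = 2
HOLD_GAP = 0.6
MIN_BEAT_DUR = 1.5
SCALE = 1.35
FALLBACK_SCALE = 1.25
FALLBACK_RAMP = 0.4
CONTENT_X, CONTENT_Y, CONTENT_W, CONTENT_H = 56, 108, 1168, 637
DEFAULT_CX = CONTENT_X + CONTENT_W // 2  # 640, horizontal center of the frame
DEFAULT_CY = CONTENT_Y + CONTENT_H // 3  # 320, upper third of the content area


def build_zoom_keyframes(beats, action_bboxes, viewport, prepend_count=0):
    """Default-position keyframe per eligible beat, overridden by usable bboxes."""
    by_beat = {}
    # Step 8a: pre-fill defaults for every non-intro, long-enough beat.
    for idx in range(INTRO_BEATS, len(beats)):
        beat = beats[idx]
        if beat["t_end"] - beat["t_start"] < MIN_BEAT_DUR:
            continue
        t0, t1 = beat["t_start"] + HOLD_GAP, beat["t_end"] - HOLD_GAP
        if t1 - t0 < 1.0:
            continue
        by_beat[idx] = {
            "t_start": t0, "t_end": t1,
            "cx": DEFAULT_CX, "cy": DEFAULT_CY,
            "scale": FALLBACK_SCALE, "ramp_s": FALLBACK_RAMP,
        }
    # Step 8b: override with the measured bbox center where it is usable.
    for entry in action_bboxes:
        idx = entry["idx"] - prepend_count
        if idx < INTRO_BEATS or not entry.get("found") or idx not in by_beat:
            continue
        bbox = entry["bbox"]
        if bbox["y"] > viewport["h"] or bbox["h"] > viewport["h"] * 1.5:
            continue  # offscreen or full-page element: keep the default
        beat = beats[idx]
        by_beat[idx].update(
            cx=CONTENT_X + (bbox["x"] + bbox["w"] / 2) * (CONTENT_W / viewport["w"]),
            cy=CONTENT_Y + (bbox["y"] + bbox["h"] / 2) * (CONTENT_H / viewport["h"]),
            scale=SCALE,
            ramp_s=min(0.5, (beat["t_end"] - beat["t_start"]) * 0.15),
        )
    return sorted(by_beat.values(), key=lambda kf: kf["t_start"])
```

An empty return value corresponds to the "skip edit_animate_zoom and use framed_url" branch.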
If len(zoom_keyframes) > 0, call mcp__pika__edit_animate_zoom with video_url: framed_url, zoom_keyframes. Returns zoomed_url. Otherwise (no qualifying beats — should be rare given Step 3's 65-80s constraint) skip and use framed_url as zoomed_url.
mcp__pika__generate_lipsync:
- provider: <resolved_lipsync_provider>. Default: pika (parrot a2v). Honor --lipsync-provider kling if explicitly passed.
- image: <avatar>
- audio: <Step 4 audio_url>

Provider tradeoffs:
| Provider | Wall-clock | Head motion | When to use |
|---|---|---|---|
| pika (default) | ~2–5 min | Slightly more dramatic, naturalistic | Default for most runs: fast iteration, watchable output, ~10× faster than kling |
| kling (opt-in) | ~5–30 min | Minimal, face-centered, presenter-style | High-stakes renders where the avatar must read like a polished presenter; tolerate the long pole |
Server-side-await covers the call inline; if the response shape is {task_id, status: "queued"}, poll mcp__pika__task_status in a tight loop (no sleep) until the status reaches a terminal state (done, failed, or cancelled). On done, capture lipsync_url. On failed / cancelled, fall back to the other provider (kling ↔ pika) per the failover note below.
Failover:
- If pika fails (rare; parrot a2v is robust at typical explainer audio lengths) → retry once with provider: "kling".
- If kling stalls past the worker's 1200s ceiling (visible as repeated processing status with no completion) → fall back to provider: "pika". Step 4.5's audio-length gate should catch the long-audio case before it gets here, but the failover handles the residual risk.

Why pika is the default:
For the canonical "polished presenter" feel of the original tarball reference output, pass --lipsync-provider kling explicitly.
mcp__pika__edit_pip:
- main_video_url: <zoomed_url>
- overlay_video_url: <lipsync_url>
- shape: "circle"
- size_px: 246 ← pixel-pinned 246px outer diameter (240 inner avatar + 3+3 stroke ring); matches tarball's CIRCLE_OUT = CIRCLE_SIZE + STROKE * 2
- stroke_width_px: 3
- stroke_color: "white"
- position_px: {x: 20, y: 476} ← 800 − 246 − 78 for dock clearance (matches tarball's H − CIRCLE_OUT − 78)

Do NOT pass size: size_px and size are mutually exclusive. Returns final_url.
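The pixel-pinned values follow from simple arithmetic on the frame and circle dimensions. The sketch below reproduces that arithmetic and assembles the edit_pip arguments named above; `pip_args` and the constant name DOCK_CLEARANCE are illustrative, not taken from the tarball.

```python
FRAME_H = 800        # framed canvas height (1280x800)
CIRCLE_SIZE = 240    # inner avatar diameter
STROKE = 3           # white stroke ring width
DOCK_CLEARANCE = 78  # gap kept below the circle (illustrative name)

CIRCLE_OUT = CIRCLE_SIZE + STROKE * 2          # outer diameter: 246
PIP_Y = FRAME_H - CIRCLE_OUT - DOCK_CLEARANCE  # top edge of the circle: 476


def pip_args(zoomed_url: str, lipsync_url: str) -> dict:
    """Assemble the edit_pip arguments; size is omitted because size_px is set."""
    return {
        "main_video_url": zoomed_url,
        "overlay_video_url": lipsync_url,
        "shape": "circle",
        "size_px": CIRCLE_OUT,
        "stroke_width_px": STROKE,
        "stroke_color": "white",
        "position_px": {"x": 20, "y": PIP_Y},
    }
```

Deriving the numbers from named constants makes it easier to see what has to change together if the frame height or stroke width is ever adjusted.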
Master-duration / audio-source contract (matching tarball github_explainer.py:418-419, 531-533, 578-582): edit_pip uses shortest=1 semantics by default, which means the composite's duration is the shorter of (zoomed screen recording) and (lipsync video). Step 6's duration_s = max(ceil(beats[-1].t_end), ceil(audio_duration_seconds)) ensures the screen recording is ≥ the lipsync, so the composite duration is set by the lipsync. Audio comes from the lipsync video's audio track (the lipsync embeds the original TTS audio); the standalone audio_url is not re-mixed. If the lipsync video is shorter than the screen recording (Kling sometimes trims trailing silence), the screen will get cut off at the lipsync end — accept this; the alternative (looping the screen) is worse for explainer content.
Call mcp__pika__add_captions(video_url=<final_url>, style="classic"). classic renders a bottom subtitle bar — the right register for an explainer video (use tiktok / hormozi / karaoke only when the user explicitly asks for word-level highlight). The audio is extracted server-side from the PiP composite's lipsync track, so transcription matches the narration verbatim. Capture the result as captioned_url.
Skip this step only if the user passed --no-captions (parsed in Step 1) — the default is captions on. (Note: /pika:podcast does not burn captions — narration in an explainer is more transcription-friendly than fast two-host dialogue.)
Emit captioned_url (or final_url if Step 11 was skipped) on one line: Done: <url>.
- mode:"pro" and prompt not exposed. Tarball calls Kling directly with {mode:"pro", prompt:"talking head, face centered, mouth syncs to audio, minimal head movement, professional presenter"}. The Pika MCP generate_lipsync wrapper drops both for the kling provider (schema says prompt is "parrot only"; mode is hardcoded). This is the real quality lever for reducing dramatic head motion in the lipsync. Server PR follow-up: surface prompt and mode on generate_lipsync for kling.
- Tarball's recorder.py:165-185 detects mean-brightness > 245 in the first 4s and trims with ffmpeg. capture_website has internal trim heuristics but doesn't expose them to the caller. The gap is visible as a brief white flash at the start of the explainer when the page is still loading. The 800ms wait action at at_s: 0.0 mitigates this somewhat by giving the page time to paint, but doesn't trim already-recorded white frames. This needs a worker enhancement.
- networkidle wait on per-beat navigation. Tarball uses page.goto(url, wait_until="networkidle", timeout=20000) plus wait_for_timeout(600) after every navigate. capture_website settles to domcontentloaded plus the bbox-capture branch's 600 ms post-action settle (server-side, when bbox_selector is set), but SPA blob pages whose final render happens after domcontentloaded can still get bbox'd against unmounted code blocks. Worker enhancement: expose a wait_until knob on timed_actions[].navigate.
- The verify() helper at github_explainer.py:35-39 checked TTS ≥ 50KB, preview ≥ 100KB, screen ≥ 200KB, lipsync ≥ 500KB, final ≥ 1MB after each step. The MCP path returns URLs only; verifying file size would require an extra mcp__pika__analyze_media call per step (~30s overhead each). Worth adding once the user-side latency budget allows it. For now, a downstream-failure cascade (e.g. zero-byte TTS → silent lipsync → blank composite) only surfaces at Step 11.
- text_content bbox capture not implemented. capture_website v1 returns action_bboxes only for steps with a CSS selector; text_content-only steps produce no entry. Prefer CSS selectors in zoom_target for guaranteed zoom coverage.
- Simple pages with semantic markup (<h1> + clear <section> + named class hooks) work well. Big-name corporate sites (apple.com, microsoft.com, amazon.com) hit several known limits: (a) bot detection: the page may serve a degraded version under headless Chrome, or a captcha; Step 2.6 §A aborts on these but the heuristics aren't exhaustive; (b) obfuscated class names: tile-headline instead of hero-title defeats generic selectors; Step 2.6 §C's WebFetch DOM scan helps but isn't perfect; (c) scroll-triggered animations don't play: IntersectionObserver-driven hero reveals fire on real user scrolls, not Playwright's scrollIntoView, so the recorded frame may be a static placeholder; (d) lazy-loaded images: picture/source elements with loading="lazy" may not have resolved by the 600ms-or-2500ms settle window, so the bbox lands on a transparent placeholder. Workarounds: prefer simpler / smaller marketing pages for launch demos, always pass --focus "the X feature" to anchor beat selection, and accept that big-name sites need a follow-up server PR (cookie-banner click retry + wait_until=networkidle + animation-trigger via IntersectionObserver polyfill).
- Cookie-banner dismissal relies on a click against the dismissal selector extracted from the WebFetch DOM. If the WebFetch's HTML doesn't include the banner (rendered post-JS) or the selector is wrong, the click silently misses; the extra_css payload is the load-bearing defense. Worker enhancement: support a list of fallback selectors per click action so the worker tries each in order.
If any call returns 401: the user's OAuth token has expired or hasn't been issued. The next authenticated MCP call triggers OAuth automatically (browser opens for @pika.art Google login). For non-interactive environments, set MCP_AUTH_TOKEN.
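The per-artifact size floors quoted for the tarball's verify() helper could be expressed as a small check like the one below. This is a hedged sketch, not the tarball code: check_artifact_size is a hypothetical name, 1KB is assumed to mean 1024 bytes, and the caller is assumed to already know each artifact's size in bytes, which the MCP path does not currently provide.

```python
# Minimum plausible sizes per artifact, from the thresholds quoted above.
# Assumption: KB = 1024 bytes and MB = 1024 * 1024 bytes.
MIN_BYTES = {
    "tts": 50 * 1024,
    "preview": 100 * 1024,
    "screen": 200 * 1024,
    "lipsync": 500 * 1024,
    "final": 1024 * 1024,
}


def check_artifact_size(kind: str, size_bytes: int) -> None:
    """Raise early if an artifact is too small to be a real render."""
    floor = MIN_BYTES[kind]
    if size_bytes < floor:
        raise ValueError(
            f"{kind} artifact is {size_bytes} bytes, below the {floor}-byte floor"
        )
```

Failing at the step that produced the undersized file would surface a zero-byte TTS immediately instead of at the final composite.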
GitHub-mode (repo-aware: README scan + live-demo detection):
/pika:explainer https://github.com/leigest519/OpenGame
/pika:explainer https://github.com/anthropics/claude-cookbooks --focus "Claude Code MCP integration"
/pika:explainer https://github.com/openai/whisper --preview (opt-in to the preview gate when testing a new avatar)
Generic-URL mode (any non-GitHub URL; drives through the page directly):
/pika:explainer https://pika.art
/pika:explainer https://linear.app --focus "the cycle planning view"
/pika:explainer https://docs.anthropic.com/en/docs/claude-code/plugins
/pika:explainer https://your-product-page.com --avatar https://cdn.example.com/me.png --preview