From pika
Multi-cut jump-cut UGC product ad — HOOK + 3 JUMP CUTs + OUTRO, 15s, 9:16 vertical (3:4 optional, seedance only), POV first-person talking-head selfie, every beat has spoken dialogue with native lip-sync, 5-act narrative arc (set → name → reveal → twist → punchline). Six category essences (HAUL / APP / FOOD / BEAUTY / FITNESS / TECH) auto-picked from the input URL. Different from `/pika:short-ads`: that one is a polished single-clip brand commercial with logo reveal + BGM; this one is creator-style raw UGC talking-head with multi-beat conversational dialogue. Use when the user asks to "make a UGC ad", "jump-cut product ad", "POV product reveal", "creator-style ad", "haul-style ad", "unboxing ad", "TikTok-style product video", "talking-head ad about [URL]".
`npx claudepluginhub pika-labs/pika-plugins --plugin pika`

This skill uses the workspace's default tool permissions.
| Param | Default | Notes |
|---|---|---|
| url | required | product URL — drives category detection and beat substitution |
| avatar_url | built-in fallback | persona portrait URL; fed as @Image1 reference. When omitted, the skill uses a pre-generated Pixar-style female creator portrait |
| provider | seedance | seedance: strong at UGC selfie / talking-head POV with native lip-sync, multi-segment in single prompt, supports 3:4. kling: explicit shots[], 9:16/16:9 only |
| aspect_ratio | 9:16 | 3:4 is seedance-only (kling rejects 3:4) |
| category | auto | HAUL / APP / FOOD / BEAUTY / FITNESS / TECH; auto-picked from URL |
| captions | true | TikTok-style word-chunked captions burned on top of the final video |
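The parameter table above implies a small amount of validation before anything runs (notably the 3:4/seedance constraint). A minimal sketch, assuming a hypothetical `resolve_params` helper — not part of the actual skill:

```python
# Illustrative only: apply the documented defaults and reject the one
# invalid combination called out in the table (3:4 on kling).
SEEDANCE_ONLY_RATIOS = {"3:4"}

def resolve_params(url, avatar_url=None, provider="seedance",
                   aspect_ratio="9:16", category="auto", captions=True):
    """Apply documented defaults; raise on invalid combinations."""
    if provider not in ("seedance", "kling"):
        raise ValueError(f"unknown provider: {provider}")
    # 3:4 is seedance-only; kling rejects it.
    if aspect_ratio in SEEDANCE_ONLY_RATIOS and provider != "seedance":
        raise ValueError("aspect_ratio 3:4 is seedance-only")
    return {
        "url": url,
        "avatar_url": avatar_url,  # None -> built-in fallback portrait
        "provider": provider,
        "aspect_ratio": aspect_ratio,
        "category": category,
        "captions": captions,
    }
```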
Typical end-to-end run: 6–12 minutes. Breakdown:
- `generate_reference_video`: ~3–5 min for seedance, ~5–7 min for kling
- `add_captions` call: ~30s–5 min (transcribe + burn in one shot)

If the run exceeds 15 min without progress, something is wrong — check the errors returned by `task_status`.
WebFetch the URL: pull product_name, value prop, brand color, product form, packaging, hero copy, target user, category, and the primary language of the page. Use category= if passed; else trust the WebFetch signal; fall back to HAUL for physical, APP for digital.
If `avatar_url` was passed → use it as-is. Otherwise use the built-in fallback portrait:
https://cdn.pika.art/v2/files/agent/17d62bf9-0edb-49e4-9ba9-2c5419fa518f/seedream-1777624057811.jpeg
This is a pre-generated 3D animated Pixar-style portrait of a young female creator — pre-cartoonized so seedance moderation accepts it directly, neutral enough to fit any category. Note in the final summary that the fallback was used so the caller knows to supply their own portrait for persona consistency next time.

Call `capture_website` with `mode: "screenshot"`. Use `mobile=true` for handheld-product categories (APP / FITNESS / BEAUTY) so the captured page renders as a portrait phone screen; `mobile=false` for desktop-context categories (HAUL / TECH / FOOD).
If the call fails (timeout, browser pool down), retry once. If still failing, proceed without the screenshot — the skill is degraded but functional. The close-up beat then describes the page from prose only and Beat 2's reference_images is just [avatar_url].
Capture URL → screenshot_url (or null).
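The mobile-vs-desktop decision above is a pure function of the category. A sketch under that assumption (the helper name and category sets are taken from the text; `capture_mode` itself is illustrative):

```python
# Which categories render the page as a portrait phone screen vs desktop.
HANDHELD = {"APP", "FITNESS", "BEAUTY"}   # mobile=true
DESKTOP = {"HAUL", "TECH", "FOOD"}        # mobile=false

def capture_mode(category):
    """Return the mobile flag to pass to capture_website."""
    return category in HANDHELD
```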
The full prompt is a single multi-beat string passed to one generate_reference_video call. Structural prose (not markdown bullets). Every beat has a Says: "..." line for lip-sync. Pacing target ~5.5–6 words per second across the whole 15-second ad (≈85–90 words total). @Image1 is the avatar, @Image2 is the screenshot when available.
Write all Says: "..." lines in the language detected from step 1's WebFetch. Both seedance and kling lip-sync handle multilingual; if the product page is Chinese / Japanese / Spanish / etc., the dialogue should be in that language. Hook archetypes from step 5 are language-agnostic — adapt the rhetorical move to the language's natural register.
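The pacing target above (~5.5–6 words per second over 15 seconds, ≈85–90 words total) can be sanity-checked mechanically. A minimal sketch, assuming a hypothetical `pacing_ok` helper:

```python
# Illustrative check that the combined Says lines hit the word budget.
def pacing_ok(says_lines, duration_s=15, wps_range=(5.5, 6.0)):
    """Return (within_budget, total_words) for the dialogue lines."""
    total = sum(len(line.split()) for line in says_lines)
    lo = wps_range[0] * duration_s  # 82.5 words at 5.5 wps
    hi = wps_range[1] * duration_s  # 90.0 words at 6.0 wps
    return lo <= total <= hi, total
```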
HOOK (0–3 sec) <visual setting + creator framing + face/body cue>. Says to camera, fast and energetic: "<hook line>". <style anchor — POV handheld, authentic, raw>.
JUMP CUT 1 (3–6 sec) <wide POV — creator's body language, product partially in frame edge>. <face cue>, says fast: "<setup line>".
JUMP CUT 2 (6–9 sec) <next visual beat — could be the screen close-up showing @Image2 OR another reaction beat, depending on which beat the dialogue arc puts the reveal>. Says (or voice continues over the shot if it's a screen close-up), fast and confident: "<reveal line>".
JUMP CUT 3 (9–12 sec) <next visual beat — same logic; one of the JUMP CUTs is the screen close-up, the others are wide-POV reaction shots>. Says, fast: "<insight twist line>".
OUTRO (12–15 sec) <selfie POV, mid-chest framing, same setting>. Says to camera, fast: "<punchline line>".
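The five beat templates above are joined into the single multi-beat string passed to one `generate_reference_video` call. A sketch of that stitching (beat bodies are placeholders; `build_prompt` is illustrative, not part of the skill):

```python
# The five-beat structure from the template: HOOK + 3 JUMP CUTs + OUTRO.
BEAT_LABELS = [
    "HOOK (0-3 sec)",
    "JUMP CUT 1 (3-6 sec)",
    "JUMP CUT 2 (6-9 sec)",
    "JUMP CUT 3 (9-12 sec)",
    "OUTRO (12-15 sec)",
]

def build_prompt(beat_bodies):
    """Join the five beat bodies into one structural-prose prompt string."""
    assert len(beat_bodies) == len(BEAT_LABELS), "exactly five beats"
    return " ".join(f"{label} {body}"
                    for label, body in zip(BEAT_LABELS, beat_bodies))
```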
The avatar is @Image1; the asset (screenshot or product photo) is @Image2.
Screen-close-up beat — exactly one across the ad, position is dialogue-driven:
The beat shows @Image2 exactly as-is and includes ONE finger-point gesture (a single finger entering from the frame edge, pointing at the hero text or product — no tap, no swipe, no scroll, no hover-on-CTA). The point gesture is the only screen interaction in the entire ad.

Trust @Image2 — when the product page is shown, reference the image; do NOT describe its UI in prose. Describing UI triggers the model to invent extra panels / dropdowns / sidebars / animations. Reference the image; trust it.
Each essence is the brief you read before composing the 5 beats. Pick one from category in step 1 and write the actual Says: "..." lines tailored to the real product.
@Image2 is a product photo (or brand-site mobile view); the single finger-point lands on a hardware detail (chain, clasp, embossed logo).

(mobile=true in step 3).

@Image2 is typically a real photo of the device's screen at its key UI moment (first measurement, paired status, hero feature open); if the device has no screen, a clean hero photo of it mid-use.

Right before the generate call, fetch the user's voice sample URL: call `identity_voice_sample_url`. This returns a short-lived download URL (mp3/wav) backing the user's registered voice, OR null if no voice is on file.
Capture `voice_sample_url` and pass it in the next call's reference_audio array. Both seedance and kling accept reference_audio (seedance up to 3 samples, ≤15s combined; kling up to 8). The model uses the sample to clone the speaker's timbre for the lip-sync.

Always get this URL fresh right before step 7 — do NOT cache or reuse a stale URL across runs.
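The per-provider reference_audio limits stated above can be expressed as a simple validation. A sketch, assuming a hypothetical `audio_refs_ok` helper and caller-known sample durations:

```python
# Documented limits: seedance up to 3 samples, <=15 s combined; kling up to 8.
LIMITS = {
    "seedance": {"max_samples": 3, "max_total_s": 15},
    "kling":    {"max_samples": 8, "max_total_s": None},
}

def audio_refs_ok(provider, sample_durations_s):
    """True if the reference_audio samples fit the provider's limits."""
    lim = LIMITS[provider]
    if len(sample_durations_s) > lim["max_samples"]:
        return False
    if lim["max_total_s"] is not None and sum(sample_durations_s) > lim["max_total_s"]:
        return False
    return True
```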
Always attempt the call first with the avatar resolved in step 2 (caller-supplied or built-in fallback) exactly as-is. The skill does not pre-process or pre-judge it. Only when seedance rejects the call do we restyle.
7a. First attempt — avatar as-is
Call generate_reference_video:
- `provider`: seedance (default) or kling if the user passed provider=kling
- `aspect_ratio`: 9:16 (default); 3:4 allowed only on seedance
- `resolution`: 720p (seedance only)
- `duration`: 15
- `reference_images`: [avatar_url, screenshot_url] (drop screenshot_url if step 3 failed)
- `reference_audio`: [voice_sample_url] (omit the param entirely if step 6 returned null)
- `prompt`: the multi-beat string from step 4
- `sound`: true (default — ambient + lip-sync produced by the model)

For provider=kling: convert the multi-beat prose into shots: [{prompt, duration}, ...] (5 shots × 3s = 15s sum), plus a top-level prompt summarizing the ad. References use <<<image_1>>> / <<<image_2>>> instead of @Image1 / @Image2.
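The kling conversion above (prose beats → explicit shots, reference-token swap) can be sketched mechanically; `to_kling_shots` is an illustrative name, and the beat bodies are assumed inputs:

```python
# Convert five beat bodies into kling's shots[] form: 5 shots x 3 s = 15 s,
# swapping @Image1/@Image2 for kling's <<<image_1>>>/<<<image_2>>> tokens.
def to_kling_shots(beat_bodies, shot_duration=3):
    shots = []
    for body in beat_bodies:
        body = (body.replace("@Image1", "<<<image_1>>>")
                    .replace("@Image2", "<<<image_2>>>"))
        shots.append({"prompt": body, "duration": shot_duration})
    return shots
```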
If the call returns { task_id, status: "queued" }, poll task_status(task_id) in a tight loop (no Bash, no sleep) until done. Capture result.url → video_url and proceed to step 8.
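In the agent context the polling is done as tool calls (no Bash, no sleep); as a plain-code sketch with a stand-in `task_status` callable — not the tool's actual API — the loop and the 15-minute budget look like:

```python
import time

def poll_until_done(task_id, task_status, timeout_s=900, interval_s=5):
    """Poll task_status(task_id) until done/failed, or give up at timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = task_status(task_id)
        if status.get("status") == "done":
            return status["result"]["url"]       # -> video_url
        if status.get("status") == "failed":
            raise RuntimeError(status.get("error", "task failed"))
        time.sleep(interval_s)
    raise TimeoutError("no progress within budget - check task_status errors")
```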
7b. On rejection — auto-cartoonize the avatar
If 7a returns 422 content_policy_violation on image_urls / reference_images (seedance + fal-queue moderation flags portraits that read as too photorealistic — even some Pixar-style 3D avatars get flagged), restyle the avatar in-place:
Call generate_image:
- `provider`: "seedream" (native Pixar/3D-animated look)
- `reference_image`: <avatar_url>
- `aspect_ratio`: same as the ad's aspect ratio
- `resolution`: "1K"
- `prompt`: "Stylized 3D game character render — Unreal Engine 5 / Overwatch / Valorant / Apex Legends visual style. Anatomically grounded facial proportions with subtle stylization: slightly larger expressive eyes, defined sculpted cheekbone planes, smooth skin shader (smoother than photoreal, no micropore detail), idealized but believable features. PBR materials with subtle subsurface scattering, strand-based hair simulation, crisp cloth shader. Cinematic three-point studio lighting with strong rim light. Clearly a stylized AAA-game-character render — NOT photorealistic person, NOT Pixar plastic-toy cartoon, NOT exaggerated big-head proportions. Same person, same glasses, same outfit, same accessories. Centered medium portrait, neutral indoor background."

Capture the returned URL → avatar_url_cartoon.
7c. Retry seedance with the cartoonized avatar
Re-run the exact same generate_reference_video call from 7a, swapping the avatar reference: reference_images: [avatar_url_cartoon, screenshot_url] (or [avatar_url_cartoon] if step 3 failed). All other params unchanged. Capture result.url → video_url.
7d. Final fallback — still rejected
If 7c also returns content_policy_violation, stop. Tell the user: the avatar reads as too realistic for seedance moderation even after auto-restyling; ask them to either supply a more stylized portrait themselves or rerun with provider=kling (kling has a separate moderation pipeline that accepts realistic avatars).
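The 7a → 7d chain is a two-attempt fallback. A sketch with stand-in callables for the real tool calls (`generate_video`, `cartoonize`) and an illustrative exception type standing in for the 422 content_policy_violation response:

```python
class ContentPolicyViolation(Exception):
    """Stand-in for the 422 content_policy_violation moderation rejection."""

def generate_with_fallback(avatar_url, generate_video, cartoonize):
    try:
        return generate_video(avatar_url)        # 7a: avatar as-is
    except ContentPolicyViolation:
        cartoon_url = cartoonize(avatar_url)     # 7b: restyle via seedream
        try:
            return generate_video(cartoon_url)   # 7c: retry with cartoon
        except ContentPolicyViolation:
            # 7d: stop; user must supply a stylized portrait or use kling
            raise RuntimeError(
                "avatar too realistic for seedance moderation even after "
                "restyling; supply a stylized portrait or rerun with "
                "provider=kling")
```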
Skip if captions=false. Use one add_captions call instead of chaining edit_text_overlay per chunk — much faster (≤5 min single call vs 5–8 min sequential), and the styles position captions correctly out of the box.
Call add_captions:
- `video_url`: video_url from step 7
- `style`: "tiktok" (default — word-by-word purple highlight, Bebas Neue, all caps, rendered at the bottom of the frame; classic TikTok-creator look that keeps the face and screen clear). Alternatives: "hormozi" (lower-middle yellow highlight, more aggressive — overlays part of the phone-in-hand close-up beat), "classic" (plain bottom subtitle bar, safest), "karaoke" (progressive color fill, also bottom).
- `font_size`: 60 — overrides the per-style default; tuned for 9:16 readability without dominating the frame.
- `language`: pass the BCP-47 code for the page language detected in step 1 ("en", "zh", "ja", "es", etc.) — skips auto-detect and avoids misrouting CJK to a Latin-only font path.

Capture the returned URL → final_url.
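Assembling those arguments is mostly default-filling plus a style check. A sketch mirroring the documented params (the payload shape is an assumption, not a verified tool schema; `captions_payload` is illustrative):

```python
# The four caption styles named in the notes above.
STYLES = {"tiktok", "hormozi", "classic", "karaoke"}

def captions_payload(video_url, language, style="tiktok", font_size=60):
    """Build an add_captions argument dict per the documented defaults."""
    if style not in STYLES:
        raise ValueError(f"unknown caption style: {style}")
    return {
        "video_url": video_url,
        "style": style,          # default "tiktok": bottom-of-frame look
        "font_size": font_size,  # 60: tuned for 9:16 readability
        "language": language,    # BCP-47 code; skips auto-detect
    }
```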
Return final_url on one line, plus a one-line summary: which category ran, whether the avatar was caller-supplied / built-in fallback / cartoonize-recovered, whether the screenshot was used or fell back to prose, whether the user's voice sample was used or default, the provider chosen, the language detected for dialogue, and whether captions were burned on.
- `/pika:ugc-ads https://pika.me avatar_url=https://cdn/face.png` → APP_REVEAL, 9:16, seedance, real screenshot, captions on
- `/pika:ugc-ads https://maisonbrune.com avatar_url=https://cdn/face.png aspect_ratio=3:4` → HAUL_UNBOX, 3:4, seedance
- `/pika:ugc-ads https://pika.me avatar_url=https://cdn/face.png provider=kling captions=false` → APP_REVEAL, 9:16, kling shots[], no captions
- `/pika:ugc-ads https://pika.me` → no avatar_url → uses the built-in fallback Pixar-style female creator portrait, runs end-to-end