Prompting techniques for AI video generation models on Replicate. Use when writing prompts for video models or building video generation features.
Distilled from Replicate's blog posts on prompting video models (2025-2026). Techniques are model-agnostic and focus on transferable principles. For model selection, pricing, and feature comparison, see the [compare-models](../compare-models/SKILL.md) skill.
A good video prompt is a scene description, not a caption. Write what happens, where, and how it looks.
Vague: "A car chase"
Specific: "A high-speed car chase on a rain-drenched highway at night. Two muscle cars weave through heavy traffic at 140mph, headlights slicing through the downpour. One car clips a semi-truck sending sparks showering across six lanes. Tires hydroplane on standing water. Neon highway signs blur overhead."
Modern video models handle long, dense prompts well. Don't write "a man on the phone." Write "a desperate man in a weathered green trench coat picks up a rotary phone mounted on a gritty brick wall, bathed in the eerie glow of a green neon sign." Every concrete detail you add gives the model less room to improvise poorly.
Use descriptive phrases like "the woman in the red jacket" or "the bearded man in flannel." Avoid pronouns, which are ambiguous to video models just as they are to image models.
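When you move from prose to code, the detailed scene description is simply the model's `prompt` input. Here is a minimal sketch using Replicate's Python client, assuming `REPLICATE_API_TOKEN` is set; the model slug is a placeholder, and each model's page lists its actual input parameters:

```python
import replicate

prompt = (
    "A high-speed car chase on a rain-drenched highway at night. "
    "Two muscle cars weave through heavy traffic at 140mph, headlights "
    "slicing through the downpour. One car clips a semi-truck, sending "
    "sparks showering across six lanes. Tires hydroplane on standing water. "
    "Neon highway signs blur overhead."
)

# Placeholder slug: substitute whichever video model you are actually using.
output = replicate.run("google/veo-3", input={"prompt": prompt})
print(output)  # usually a URL or file-like object pointing to the generated video
```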
Video models understand filmmaking language. Use it to direct the shot rather than hoping for good framing.
Use standard shot terminology to control framing: wide shot, medium shot, close-up, extreme close-up.
Describe how the camera moves: static camera, slow push-in, pull-out, pan, orbit, tracking shot.
Specify the camera's height and angle: eye level, low angle, high angle, overhead.
A natural progression for short clips is wide > medium > close-up > extreme close-up. This maps well onto 8-15 second clips and gives the model clear structure.
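For example, the progression can be composed shot by shot before it is sent to a model. A minimal sketch in plain Python; the scene itself is made up for illustration:

```python
# Each entry pairs a shot type with what happens in it; the scene is illustrative.
shots = [
    ("Wide shot", "a lone hiker crests a ridge above a fog-filled valley at sunrise"),
    ("Medium shot", "the hiker unclips their pack and unfolds a battered paper map"),
    ("Close-up", "wind-chapped hands trace a route across the map"),
    ("Extreme close-up", "the hiker's eyes narrow as something moves in the valley below"),
]

prompt = " ".join(f"{shot_type}: {action}." for shot_type, action in shots)
print(prompt)
```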
Many video models generate audio natively alongside the visuals. If you don't prompt for the audio you want, the model will guess, and it often guesses wrong.
If you skip ambient audio, models may hallucinate inappropriate sounds. A common failure mode is adding a "live studio audience" laughing in the background. Prevent this by describing the soundscape explicitly: "sounds of distant bands, noisy crowd, ambient background of a busy festival field."
There are two approaches: write the exact lines you want spoken, or describe what the characters are talking about and let the model improvise the wording.
Explicit dialogue should be short enough to fit the clip duration. Packing too much dialogue into an 8-second clip produces unnaturally fast speech. Too little dialogue can produce awkward silence or AI gibberish.
Many video models were trained on videos with baked-in subtitles and will add them to outputs. To prevent this, introduce dialogue with a colon rather than quotation marks, add "(no subtitles)" to the prompt, and repeat the instruction if subtitles still appear.
If a model mispronounces a name or word, spell it phonetically in the prompt. For example, write "foh-fur" instead of "fofr" or "Shreedar" instead of "Shridhar."
In multi-character scenes, the model can mix up who says what. Tie dialogue to distinctive visual descriptions: "The woman wearing pink says: ..." and "The man with glasses replies: ..."
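Putting those audio rules together, a dialogue prompt might be assembled like this: colon-introduced lines tied to distinctive character descriptions, an explicit soundscape, and a no-subtitles instruction. The scene and wording are illustrative, not taken from any particular model's documentation:

```python
prompt = (
    "Handheld medium shot, two friends at a crowded music festival at dusk. "
    "The woman in the pink jacket says: They moved the headline set up an hour. "
    "The man with glasses replies: Then we should start walking now. "
    "Sounds of distant bands, noisy crowd, ambient background of a busy festival field. "
    "(no subtitles)"
)
```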
Some models support generating multiple shots within a single clip (up to ~15 seconds). You can direct each shot individually using time codes.
Write timestamps directly into the prompt:
[0-4s]: Wide establishing shot, static camera, misty bamboo forest at dawn
[4-9s]: Medium shot, slow push-in, the fighter steps forward
[9-15s]: Close-up, orbit shot, the fighter strikes, slow motion
Each shot should specify its time range, shot type, camera movement, and what the subject does.
Use explicit transition instructions between shots, such as "hard cut to" or "seamless pull-out to".
Without explicit transitions, the model improvises, which may or may not match your intent.
(0-3s) Macro shot of a luxury perfume bottle among scattered pink peonies,
shallow depth of field, petals floating in warm afternoon light,
soft ambient music.
(3-7s) Camera glides closer, a feminine hand enters frame from the right,
fingers gently touch the glass bottle, the sound of silk rustling.
(7-12s) Hard cut to slow-motion spray, golden mist diffuses through the air,
particles catching rim light against a dark background,
the hiss of the atomizer.
(12-15s) Seamless pull-out to hero frame, product centered, volumetric
lighting, minimal cream background, elegant silence.
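If you generate timecoded clips often, a small helper keeps the format consistent. A sketch that reuses the bamboo-forest shot list from above; the bracketed-timestamp style mirrors the examples in this section, but confirm the exact convention your chosen model expects:

```python
def timecoded_prompt(shots):
    """Join (start, end, description) tuples into a single multi-shot prompt."""
    return "\n".join(f"[{start}-{end}s]: {desc}" for start, end, desc in shots)

prompt = timecoded_prompt([
    (0, 4, "Wide establishing shot, static camera, misty bamboo forest at dawn"),
    (4, 9, "Medium shot, slow push-in, the fighter steps forward"),
    (9, 15, "Close-up, orbit shot, the fighter strikes, slow motion"),
])
print(prompt)
```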
Many video models accept images, video clips, or audio files as reference inputs alongside a text prompt. This shifts the workflow from "prompting" to something closer to "directing."
Feed a starting image and describe the motion. The model animates from that frame.
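With Replicate's Python client, the starting image is passed as a file (or URL) input alongside the motion description. A sketch with a placeholder model slug; the input key for the start frame is model-dependent (`image` here, but some models use `first_frame_image` or `start_image`), so check the model's schema:

```python
import replicate

with open("storefront.jpg", "rb") as start_frame:
    output = replicate.run(
        "some-owner/image-to-video-model",  # placeholder slug
        input={
            "image": start_frame,  # model-dependent input name for the start frame
            "prompt": (
                "Slow dolly-in toward the storefront as rain begins to fall, "
                "neon sign flickering on, puddles rippling in the foreground."
            ),
        },
    )
print(output)
```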
Some models accept both a starting and ending image. The model generates the transition between them. This is useful for controlled transitions between two known frames, and for seamless loops when the start and end images match.
Some models accept reference images of characters, products, or objects and maintain their appearance in the generated video. This is useful for keeping a character consistent across clips or for showing a specific product accurately.
When referencing input assets, many models use a bracket syntax like [Image1] or [Audio1] in the prompt to specify which reference maps to which role: "[Image2] is in the interior of [Image1]."
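A sketch of how that bracket syntax might line up with multiple reference inputs. The slug and the `reference_images` input name are hypothetical, and real models expose their references under different keys, but the prompt shows the mapping:

```python
import replicate

output = replicate.run(
    "some-owner/reference-to-video-model",  # placeholder slug
    input={
        # Hypothetical input name; references map to [Image1], [Image2], ... by position.
        "reference_images": [
            "https://example.com/warehouse_interior.jpg",   # [Image1]
            "https://example.com/red_vintage_bicycle.jpg",  # [Image2]
        ],
        "prompt": (
            "[Image2] is in the interior of [Image1]. The camera slowly orbits "
            "the bicycle, dust motes drifting in shafts of window light."
        ),
    },
)
```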
Some models accept audio files and sync the generated video to the audio. The model can match lip movement to speech, and time motion or cuts to the rhythm of music.
When using audio references, it helps to also transcribe the audio content in the text prompt itself, and match the video duration to the audio length.
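As a sketch of that pattern (the audio file as an input, the same words transcribed into the prompt, and the clip length matched to the audio), with hypothetical input names (`audio`, `duration`), since audio-conditioned models differ in what they accept:

```python
import replicate

transcript = "Welcome back to the workshop. Today we are building something a little different."

with open("voiceover.wav", "rb") as voiceover:
    output = replicate.run(
        "some-owner/audio-driven-video-model",  # placeholder slug
        input={
            "audio": voiceover,  # hypothetical input name
            "duration": 8,       # seconds; match this to the length of the audio file
            "prompt": (
                "A presenter at a cluttered workbench speaks to camera, warm key light. "
                f"She says: {transcript} "
                "Quiet room tone, no music. (no subtitles)"
            ),
        },
    )
```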
The most powerful results come from combining multiple reference types: for example, a character reference image plus an audio clip of their dialogue, or a style frame plus a product reference.
Video models understand style labels. Include them directly in your prompt: claymation, anime, stop-motion, photorealistic, and similar labels all steer the output.
Style labels affect not just the visual look but also how characters move and interact. A claymation style produces jerky, stop-motion movement. An anime style produces fluid, exaggerated motion.
Phrases like "hyper-realistic, 8k" or "cinematic" push models toward their highest fidelity output. Use them when you want photorealistic results.
Reference specific genres or filmmaking styles for mood and tone, such as film noir, a gritty 1980s action movie, or a nature documentary.
Rather than describing a style verbally, generate an image with the exact aesthetic you want using an image model, then pass it to the video model. This gives you pixel-level control over the look. The video model preserves the style, color grading, and composition while adding motion.
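A sketch of that two-step flow with the Replicate Python client: generate a style frame with an image model, then feed it to an image-to-video model as the starting frame. Both slugs and the `image` input name are placeholders to swap for the models you actually use:

```python
import replicate

style_prompt = (
    "A rain-slicked alley in heavy film-noir style: hard shadows, venetian-blind "
    "light across a brick wall, desaturated teal-and-amber grade, 35mm grain."
)

# Step 1: generate a still with the exact aesthetic you want (placeholder slug).
style_frame = replicate.run("some-owner/image-model", input={"prompt": style_prompt})
if isinstance(style_frame, list):  # some image models return a list of outputs
    style_frame = style_frame[0]

# Step 2: animate from that frame; the video model keeps the look and adds motion.
video = replicate.run(
    "some-owner/image-to-video-model",  # placeholder slug
    input={
        "image": style_frame,  # model-dependent input name for the start frame
        "prompt": (
            "Slow push-in down the alley, steam drifting from a grate, "
            "a figure in a trench coat steps into the pool of light."
        ),
    },
)
print(video)
```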
Adding "slightly grainy, film-like" or "VHS aesthetic" pushes output away from the too-clean AI look and makes videos feel more organic.
When generating multiple clips with the same character, use identical character descriptions across prompts. Create a "character sheet" with exact wording:
"John, a man in his 40s with short brown hair, wearing a blue jacket and glasses, looking thoughtful"
Paste this description into every prompt where John appears. The more specific and unique the description, the more consistent the results.
When placing a consistent character in different scenarios, change only the action, location, and camera work. Keep the character description word-for-word identical.
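In code this is just a reused constant: define the character sheet once and interpolate it verbatim into every prompt, varying only the action, location, and camera work. A short sketch using the description above:

```python
JOHN = (
    "John, a man in his 40s with short brown hair, wearing a blue jacket "
    "and glasses, looking thoughtful"
)

prompts = [
    f"Medium shot, {JOHN}, reading a letter at a rain-streaked kitchen window, soft morning light.",
    f"Wide shot, {JOHN}, walking across an empty parking lot at dusk, slow tracking shot.",
    f"Close-up, {JOHN}, looking up from a phone call, handheld camera, warm tungsten light.",
]
```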
If the model supports subject reference images, use a clear photo of the character as input. This is more reliable than text descriptions alone, especially for maintaining facial features across clips.
Not describing audio: If you skip audio prompting, models hallucinate ambient sounds. A common failure is adding inappropriate laughter or a "live studio audience." Always describe the soundscape.
Too much dialogue for the clip length: An 8-second clip can hold roughly 2-3 short sentences. Packing in a paragraph produces unnaturally fast speech or truncated output.
Too little dialogue for the clip length: If you only provide a few words for a long clip, the model fills silence with gibberish or awkward pauses. Match dialogue length to clip duration.
Not specifying what to keep unchanged: When using reference images or editing, always state what should stay the same. Without explicit instructions, models may change anything.
Expecting variation from identical prompts: Unlike image models, some video models produce very similar outputs for the same prompt (even with different seeds). If you want variety, change the prompt, don't just rerun it.
Not prompting camera motion: Without camera direction, you get either static shots or unpredictable movement. Describe the camera explicitly.
Subtitle contamination: Many models were trained on videos with baked-in subtitles. Use colons for dialogue (not quotes), add "(no subtitles)", and repeat if necessary.
Vague prompts for complex scenes: Modern video models handle long, detailed prompts. A prompt with 12+ specific requirements (camera moves, lighting, sound design, subject actions, environmental details) can work if each requirement is stated clearly. Don't undersell what you want.
Ignoring aspect ratio and resolution: Most video models have specific resolutions they support (480p, 720p, 1080p). Check what the model supports and choose the right resolution for your use case. If you need vertical video and the model only outputs landscape, you may need to reframe with a separate tool.
Forgetting that video models don't have internet access: No video model has live information. They work from training data. Don't expect them to know about current events or real-time information.
All techniques in this skill are sourced from Replicate's blog.