From kling-ai-prompt-generator
Generates videos from text prompts or images, animates still images, and creates talking avatars from photos with audio using Kling AI models (VIDEO 3.0, Avatar 2.0, etc.). Handles multi-shot storyboards, character consistency, and prompt engineering.
npx claudepluginhub maciejdzierzek/kling-ai-prompt-generator --plugin kling-ai-prompt-generator
How this skill is triggered: by the user, by Claude, or both
Slash command
/kling-ai-prompt-generator:kling-ai
The summary Claude sees in its skill listing, used to decide when to auto-load this skill
Source-of-truth for the facts in this skill: the official Kling release notes at [kling.ai/release-note](https://kling.ai/release-note). When numbers, durations, credit costs, or language lists matter for a deliverable, verify them there before quoting.
Primary interface: app.klingai.com/global (or kling.ai/app)
Alternative platforms with Kling integration: Higgsfield, Pollo.ai, Fal.ai, Media.io, Artlist, Vidful.ai, Scenario, BasedLabs, LetzAI, PiAPI, kie.ai
Subtle breeze moves through hair. Eyes blink naturally. Camera static.
That's it. For model selection, advanced prompting, avatars, and multi-shot workflows - read on.
| Model | Best For | Resolution | Audio | Max Duration | Released |
|---|---|---|---|---|---|
| VIDEO 3.0 | Cinematic storytelling, multi-shot, native audio, multilingual dialogue | up to 4K | Yes (5 langs) | 15s | 2026-01-31 |
| VIDEO 3.0 Omni | VIDEO 3.0 capabilities + video element reference + element voice control | up to 4K | Yes (5 langs) | 15s | 2026-01-31 |
| VIDEO 3.0 Motion Control | Motion transfer with high facial consistency, including occlusions and multi-angle | 1080p | Optional | 30s | 2026-03-04 |
| Avatar 2.0 | Talking avatars from 1 image + 1 audio file, up to 5 minutes | 1080p / 48fps | Lip-sync to provided audio | 5 min | 2025-12-04 |
| Kling 2.6 | Older Native Audio pipeline (EN+ZH), good fast/budget option | 1080p | EN + ZH | 10s | 2025-12-03 |
| Kling 2.5 Turbo | Fastest, simplest scenes, draft work | 1080p | No | 10s | earlier |
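The duration and audio columns of the table can be turned into a quick pre-filter. The following is a minimal sketch, not an official selector: `ModelSpec`, `MODELS`, and `candidates` are hypothetical names, the two-letter language codes are shorthand chosen here, and Motion Control and Avatar 2.0 are left out because they take different inputs (a motion reference and an audio file). The numbers are copied from the table and should be re-checked against the release notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    name: str
    max_seconds: int
    audio_langs: frozenset  # languages the model can voice natively


# Values copied from the table above; verify before relying on them.
FIVE_LANGS = frozenset({"zh", "en", "ja", "ko", "es"})
MODELS = [
    ModelSpec("VIDEO 3.0", 15, FIVE_LANGS),
    ModelSpec("VIDEO 3.0 Omni", 15, FIVE_LANGS),
    ModelSpec("Kling 2.6", 10, frozenset({"en", "zh"})),
    ModelSpec("Kling 2.5 Turbo", 10, frozenset()),
]


def candidates(seconds, dialogue_lang=None):
    """Return the names of models whose table row covers the requested clip."""
    names = []
    for model in MODELS:
        if seconds > model.max_seconds:
            continue  # clip is longer than this model's max duration
        if dialogue_lang is not None and dialogue_lang not in model.audio_langs:
            continue  # model cannot voice dialogue in this language
        names.append(model.name)
    return names
```

Calling `candidates(12, "es")` keeps only the two 3.0 models, since Kling 2.6 and 2.5 Turbo top out at 10s and do not list Spanish audio.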
Important clarification on the 3.0 Series: "Kling 3.0" is a series name, not a single model. It contains VIDEO 3.0 (upgrade path from VIDEO 2.6) and VIDEO 3.0 Omni (upgrade path from the older VIDEO O1). Both share the new unified multimodal training framework; Omni adds video element reference and element voice control. Third-party reviews sometimes merge them into one - the official release notes do not.
Choose VIDEO 3.0 when:
Choose VIDEO 3.0 Omni when:
Choose VIDEO 3.0 Motion Control when:
Choose Avatar 2.0 when:
Choose Kling 2.6 when:
Choose 2.5 Turbo when:
The headline feature of Avatar 2.0: one image + one audio file → talking avatar with synchronized expressions, body language, and hand gestures. Up to 5 minutes of continuous output for any scenario (knowledge sharing, song performance, advertising, storytelling).
What 2.0 improved over 1.0:
Note on languages: Kling officially lists multilingual support for Avatar 2.0 as English, Japanese, Korean, Chinese. However, the model lip-syncs to whatever audio file you provide - it uses your audio as the reference, not just trained-language detection. Field-tested confirmation: Polish audio (e.g., ElevenLabs-generated) works and produces correct lip-sync. Other non-listed languages will likely work too. Practical guidance: don't tell the user "your language isn't supported" - if they have a good audio file, try it. Worst case the sync is slightly off and a paid generation is wasted; best case (most common) it works perfectly.
When NOT to use Avatar 2.0: when you need full scene control (complex cinematography, environment, camera moves), use the VIDEO 3.0 talking-head workflow instead. Avatar 2.0 is purpose-built for face/voice content with relatively static framing.
Multi-shot was introduced in the 3.0 generation - VIDEO 2.6 did not support it. Instead of one continuous clip, direct a complete scene sequence in a single generation pass.
Two ways to use it:
Example custom multi-shot prompt structure:
Shot 1 (3s): Wide establishing shot of rain-slicked Tokyo street at night, neon reflections on pavement. Camera: static.
Shot 2 (4s): Medium shot - young woman in red coat emerges from subway exit, looks around. Camera: slow push in.
Shot 3 (3s): Close-up on her face, raindrops on cheek, determined expression. Camera: static.
Shot 4 (5s): She walks toward camera into the crowd. Camera: tracking shot from behind.
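The shot list above follows a regular pattern, so it can be generated from structured data. This is a minimal sketch: `Shot` and `build_multishot_prompt` are hypothetical helpers, and the check that shot durations sum to at most 15s is an assumption based on the VIDEO 3.0 max duration in the model table (the example shots add up to exactly 15s).

```python
from dataclasses import dataclass

# Assumption: all shots must fit in one generation, capped at the
# VIDEO 3.0 max duration from the model table. Verify before use.
MAX_TOTAL_SECONDS = 15


@dataclass
class Shot:
    seconds: int
    description: str  # full sentence(s), ending with a period
    camera: str       # e.g. "static", "slow push in"


def build_multishot_prompt(shots):
    """Render shots in the 'Shot N (Xs): ... Camera: ...' layout used above."""
    total = sum(s.seconds for s in shots)
    if total > MAX_TOTAL_SECONDS:
        raise ValueError(
            f"shots total {total}s, over the {MAX_TOTAL_SECONDS}s limit"
        )
    lines = [
        f"Shot {i} ({s.seconds}s): {s.description} Camera: {s.camera}."
        for i, s in enumerate(shots, start=1)
    ]
    return "\n".join(lines)
```

Failing early on the total duration avoids spending credits on a storyboard that cannot fit in a single pass.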
Canvas Agent launched 2026-01-29. It is an AI orchestrator on top of the Kling generation models that automates the storyboard production pipeline:
When the user needs more than a single clip - a complete short film with consistent characters, or a multi-asset deliverable - Canvas Agent is the right starting point.
Define exactly where a video starts and ends visually.
Use this to create near-seamless loops by matching the end frame composition to the start.
Kling's official mechanism for keeping subjects, items, and scenes consistent across shots is the Element system (sometimes called Element Library or Element Reference). Third-party reviews call this "Character LoRA" - the user-facing feature in Kling is named Element.
How it works (VIDEO 3.0 / Omni):
@protagonist)
@element_name syntax
For commercial work where identity must stay locked across many shots, this is the recommended approach.
Voice Control launched in VIDEO 2.6 on 2025-12-16 as a feature separate from generic Native Audio. It extracts the timbre from your uploaded audio sample into a reusable Voice Embedding, then binds it to characters via the [Character] @VoiceName syntax.
Key facts (verify before quoting):
[Subject] @VoiceName in the prompt - e.g., [Livestream Host] @Sweet Female Voice: "This top is a trending must-have!"
@VoiceName
This is what you want when you need a consistent IP voice or brand persona across many videos. For one-off talking-head content with audio you already have, Avatar 2.0 is a different tool entirely.
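When many lines share the same bound voice, a small formatter keeps the [Character] @VoiceName pattern consistent. This is a sketch only: `voice_line` is a hypothetical helper that reproduces the layout of the example above, and it says nothing about how Kling parses the line.

```python
def voice_line(character, voice_name, dialogue):
    """Format one line as [Character] @VoiceName: "dialogue"."""
    if not voice_name.strip():
        raise ValueError("voice_name must not be empty")
    return f'[{character}] @{voice_name}: "{dialogue}"'


# Reproduces the example line from the text above.
print(voice_line("Livestream Host", "Sweet Female Voice",
                 "This top is a trending must-have!"))
```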
The current Motion Control model is VIDEO 3.0 Motion Control (launched 2026-03-04), an upgrade over VIDEO 2.6 Motion Control (Dec 2025).
What 3.0 Motion Control adds over 2.6: facial element binding for high facial consistency across angles, emotions, and even occlusions (hands, props in front of face). Element Binding requires the character orientation to match the video orientation.
Always write Kling prompts in English, regardless of the language the user is writing in. Kling was trained predominantly on English and Chinese; English prompts produce significantly better and more predictable results than prompts in less-represented languages.
The workflow is:
If the user writes their prompt in Polish or another language, translate it to English before presenting the final version they should paste into Kling. Explain this briefly if it's not obvious.
Exception: when using Native Audio features (VIDEO 3.0), dialogue text inside the [Speaker: Name] "text" syntax should be in the target language. VIDEO 3.0 supports Chinese, English, Japanese, Korean, and Spanish, with dialects and accents. The surrounding prompt structure should still be in English.
For Avatar 2.0: language is determined by the audio file you upload, not by the prompt. The framing prompt itself stays English.
Kling 3.0 understands cinematic intent. The key shift: stop writing prompts like image captions, start directing like a DoP (Director of Photography). Think of each prompt as a mini-screenplay:
Scene setting - Camera direction - Subject action
[Scene/Environment] + [Characters/Subjects] + [Sequential Actions] + [Camera Movement] + [Audio & Style]
VIDEO 3.0 handles sequential actions reliably: "First she looks up, then turns toward the window, finally smiles." Use this for any action that has phases.
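The five-part formula and the "First ..., then ..., finally ..." phrasing can be combined in a few lines. This is a minimal sketch with hypothetical helper names (`sequence`, `build_prompt`); it only concatenates text in the order the formula gives and does not call any Kling API.

```python
def sequence(actions):
    """Join action phases as 'First ..., then ..., finally ...'."""
    if not actions:
        raise ValueError("need at least one action")
    if len(actions) == 1:
        only = actions[0]
        return only[0].upper() + only[1:] + "."
    # One connector per action: First, then (repeated), finally.
    connectors = ["First"] + ["then"] * (len(actions) - 2) + ["finally"]
    return ", ".join(f"{c} {a}" for c, a in zip(connectors, actions)) + "."


def build_prompt(scene, subjects, actions, camera, audio_style):
    """Assemble the parts in formula order, skipping any empty part."""
    parts = [scene, subjects, sequence(actions), camera, audio_style]
    return " ".join(p.strip() for p in parts if p.strip())
```

`sequence(["she looks up", "turns toward the window", "smiles"])` yields the sentence quoted above without the surrounding quotation marks.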
| Model | Prompt Length | Max Elements | Key Pattern |
|---|---|---|---|
| VIDEO 3.0 / 3.0 Omni | 100-200 words | 6-7 | Sequential actions, multi-shot, native audio |
| Avatar 2.0 | 20-50 words (framing only) | 1-2 | Tone/energy descriptor; audio does the work |
| 2.6 | 50-80 words | 5-7 | Include audio instructions (EN/ZH) |
| 2.5 Turbo | 40-60 words | 3-4 | Keep it simple |
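A prompt can be checked against the length ranges in the table before it is pasted into Kling. The sketch below is illustrative: `PROMPT_WORDS` and `check_length` are hypothetical names, and counting words by splitting on whitespace is an assumption, since the table does not say how length is measured.

```python
# Word ranges copied from the table above.
PROMPT_WORDS = {
    "VIDEO 3.0": (100, 200),
    "VIDEO 3.0 Omni": (100, 200),
    "Avatar 2.0": (20, 50),
    "2.6": (50, 80),
    "2.5 Turbo": (40, 60),
}


def check_length(model, prompt):
    """Report whether the whitespace word count fits the model's range."""
    low, high = PROMPT_WORDS[model]
    n = len(prompt.split())
    if n < low:
        verdict = "under"
    elif n > high:
        verdict = "over"
    else:
        verdict = "within"
    return f"{n} words: {verdict} the {low}-{high} range for {model}"
```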
All [element] must remain absolutely fixed and unchanged throughout.
[element] stays completely static. No movement on [element].
For critical logos or text: generate without text, add as a post-production overlay.
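The two stability templates above can be expanded for each element that must not move. This is a small sketch with a hypothetical helper, `lock_elements`; capitalizing the first letter of the second sentence is a cosmetic choice made here, not part of the templates.

```python
def lock_elements(elements):
    """Emit both stability templates for each element that must not move."""
    clauses = []
    for raw in elements:
        e = raw.strip()
        if not e:
            continue  # ignore blank entries
        clauses.append(
            f"All {e} must remain absolutely fixed and unchanged throughout."
        )
        clauses.append(
            f"{e[0].upper() + e[1:]} stays completely static. No movement on {e}."
        )
    return " ".join(clauses)
```

Emitting both templates per element deliberately repeats "fixed", "unchanged", and "static".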
Subtle [motion type], gentle oscillation, returns to starting position.
Breathing effect, slow pulse, cyclical movement.
With 3.0: use first/last frame control with matching compositions. Post-process with 0.3-0.5s cross-dissolve.
See references/prompt-templates.md for ready-to-use templates organized by use case (image-to-video, multi-shot, text-to-video, Avatar 2.0, audio).
| Issue | Cause | Solution |
|---|---|---|
| Generation stuck at 99% | Open-ended motion | Add endpoint: "then stops", "returns to start" |
| Burning credits on bad prompts | Going to 4K too early | Always iterate at 1080p first; only render 4K once prompt is locked |
| Unwanted camera movement | No camera instruction | Add "static camera" or specific movement |
| Text/elements changing | AI interpretation | Repeat "fixed", "unchanged", "static" multiple times |
| Character morphing across shots | Identity drift | Use VIDEO 3.0 Omni with Element Reference (multi-image/video). Rewrite prompt from scratch rather than retrying with same settings. |
| Artifacts on hands/faces | Model limitation | Simplify scene, reduce duration, use 4K with 3.0 |
| Avatar 2.0 lip-sync looks off | Difficult audio (noise, multi-speaker, distorted) | Use a clean single-speaker audio source; check current docs for language support |
| Avatar 2.0 body too static / too active | No framing prompt | Add a one-line tone/energy descriptor: "calm presenter, minimal gestures" or "energetic performer, full body movement" |
| Voice Control sounds wrong | Bad reference audio | Use 5-30s clean clip, single speaker, neutral emotion, low background noise |
| Multi-shot feels disjointed | No shared style anchor | Define visual style once at start, use Element Reference for people/items |
| Audio desync | Missing speaker attribution | Use [Speaker: Name] "dialogue" format, or with Voice Control [Character] @VoiceName |
Duration:
Aspect Ratios: 16:9 (landscape), 9:16 (vertical/mobile), 1:1 (square)
Quality / Output:
See references/pricing.md for plan structure, per-feature inclusions, credit costs per model, and third-party API providers (fal.ai, PiAPI, kie.ai). Pricing changes frequently.
Key facts (verified against the official pricing page 2026-05-21):
Languages supported (per official 3.0 release note): Chinese, English, Japanese, Korean, and Spanish, with dialects and accents. The release also covers multi-character coreference (3+ characters) and per-character voice precision.
Speaker attribution syntax (critical for accurate lip-sync in T2V/I2V workflows):
[Speaker: Character Name] "[dialogue]" in a [tone/emotion] [accent] voice.
Add [sound: footsteps / rain / door closing] when [action occurs].
Background ambient: [environment description].
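The three template lines above can be filled in programmatically. This is a minimal sketch with hypothetical helper names. It treats the square brackets in [Speaker: ...] and [sound: ...] as literal prompt syntax and everything else in brackets as a placeholder; the reading of [sound: ...] as literal is an assumption.

```python
def dialogue_line(speaker, text, tone, accent=""):
    """Fill the speaker-attribution template; accent is optional."""
    voice = " ".join(part for part in (tone, accent) if part)
    return f'[Speaker: {speaker}] "{text}" in a {voice} voice.'


def sound_cue(sound, action):
    """Fill the sound-effect template (brackets kept literal: assumption)."""
    return f"Add [sound: {sound}] when {action}."


def ambient(environment):
    """Fill the background-ambient template."""
    return f"Background ambient: {environment}."


def audio_block(lines):
    """Join already formatted audio lines, one per line."""
    return "\n".join(lines)
```

A typical call chain is `audio_block([dialogue_line(...), sound_cue(...), ambient(...)])`, appended after the English scene description.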
Kling 2.6 Native Audio: Chinese and English only. Same speaker-attribution syntax.
Voice Control: a separate, more advanced feature. See the Voice Control section above. Currently Chinese-English only; +2 credits/sec.
Avatar 2.0: uses your provided audio directly - no voice cloning or attribution syntax needed. Lip-sync is automatic.
When facts in this skill matter for a deliverable (credit pricing, language lists, duration limits), verify against the current Kling state: