Slash command: /comfy:wan-t2v-video
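WAN 2.2 T2V generates videos from text prompts using a 14B-parameter MoE (Mixture of Experts) architecture split across two specialized models: a HighNoise expert, which runs the early (high-noise) sampling steps, and a LowNoise expert, which finishes the later (low-noise) steps.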
This dual-model technique is the same as FLF/I2V (see wan-flf-video skill) but without image conditioning nodes.
Key difference from I2V/FLF: T2V does NOT use CLIPVisionEncode, WanFirstLastFrameToVideo, or any image input. It uses EmptyHunyuanLatentVideo for latent initialization and text-only conditioning.

| Model | Loader | Notes |
|---|---|---|
| Wan2_2-T2V-A14B_HIGH_fp8_e4m3fn_scaled_KJ.safetensors | UNETLoader | HighNoise expert, 14.3GB FP8 |
| Wan2_2-T2V-A14B-LOW_fp8_e4m3fn_scaled_KJ.safetensors | UNETLoader | LowNoise expert, 14.3GB FP8 |

| Component | Node | Model | Notes |
|---|---|---|---|
| CLIP (T5) | CLIPLoader (type=wan) | umt5_xxl_fp8_e4m3fn_scaled.safetensors | UMT5-XXL fp8, in clip/ |

| Component | Node | Model |
|---|---|---|
| VAE | VAELoader | wan_2.1_vae.safetensors |

| Model | Size | Notes |
|---|---|---|
| Wan2_2_Fun_VACE_module_A14B_HIGH_bf16.safetensors | 5.8GB | HighNoise VACE module |
| Wan2_2_Fun_VACE_module_A14B_LOW_bf16.safetensors | 5.8GB | LowNoise VACE module |

VACE modules add reference image / pose / depth conditioning to T2V. See WanVideoWrapper section below.

| LoRA | Applies To | Path |
|---|---|---|
| wan2.2_t2v_lightx2v_4steps_lora_v1.1_high_noise | HighNoise UNET | Unknown/no tags/ |
| wan2.2_t2v_lightx2v_4steps_lora_v1.1_low_noise | LowNoise UNET | Unknown/no tags/ |

| LoRA | Path |
|---|---|
| Wan2.2_HN_T2V_Lightning_4steps-lora-rank64-Seko_V2.0_HIGH | Root loras/ |
| Wan2.2_HN_T2V_Lightning_4steps-lora-rank64-Seko_V2.0_LOW | Root loras/ |

| LoRA | Path | Notes |
|---|---|---|
| lightx2v_T2V_14B_cfg_step_distill_v2_lora_rank128_bf16 | Root loras/ | CFG+step distilled, use with more steps |

Lightning (4-step, with Lightning LoRAs) KSamplerAdvanced settings:

| Parameter | Pass 1 (Hi) | Pass 2 (Lo) |
|---|---|---|
| model | Hi + Hi Lightning LoRA | Lo + Lo Lightning LoRA |
| add_noise | enable | disable |
| steps | 4 | 4 |
| cfg | 1.0 | 1.0 |
| sampler_name | euler | euler |
| scheduler | simple | simple |
| start_at_step | 0 | 2 |
| end_at_step | 2 | 4 |
| return_with_leftover_noise | enable | disable |

Standard (20-step) KSamplerAdvanced settings:

| Parameter | Pass 1 (Hi) | Pass 2 (Lo) |
|---|---|---|
| model | Hi + ModelSamplingSD3 (shift=8) | Lo + ModelSamplingSD3 (shift=8) |
| add_noise | enable | disable |
| steps | 20 | 20 |
| cfg | 3.5 | 3.5 |
| sampler_name | euler | euler |
| scheduler | simple | simple |
| start_at_step | 0 | 10 |
| end_at_step | 10 | 20 |
| return_with_leftover_noise | enable | disable |
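
Both tables above follow the same pattern: the Hi pass adds noise and runs the first half of the steps, then hands its still-noisy latent to the Lo pass, which finishes the remaining steps without adding noise. A minimal Python sketch of that pattern follows. The helper name `dual_pass_settings` and its `split` argument are illustrative and not part of ComfyUI; the midpoint default is simply what the two tables use.

```python
def dual_pass_settings(steps, cfg, split=None, sampler_name="euler", scheduler="simple"):
    """Return (hi, lo) KSamplerAdvanced input dicts for a two-pass run.

    `split` is the step at which the Hi pass hands over to the Lo pass.
    It defaults to the midpoint, which is what both tables above use.
    """
    if split is None:
        split = steps // 2
    if not 0 < split < steps:
        raise ValueError("split must fall strictly between 0 and steps")

    shared = {"steps": steps, "cfg": cfg,
              "sampler_name": sampler_name, "scheduler": scheduler}
    # Hi pass: starts from fresh noise and keeps the leftover noise in its output.
    hi = dict(shared, add_noise="enable", start_at_step=0, end_at_step=split,
              return_with_leftover_noise="enable")
    # Lo pass: continues from the Hi latent, so it must not add noise again.
    lo = dict(shared, add_noise="disable", start_at_step=split, end_at_step=steps,
              return_with_leftover_noise="disable")
    return hi, lo


lightning_hi, lightning_lo = dual_pass_settings(steps=4, cfg=1.0)
standard_hi, standard_lo = dual_pass_settings(steps=20, cfg=3.5)
```

The returned dicts hold only the sampler settings; the model, positive, negative, latent_image and noise_seed inputs still have to be wired per pass.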
ModelSamplingSD3 is required for WAN 2.2 flow matching. Apply it to BOTH models:
{
"class_type": "ModelSamplingSD3",
"inputs": { "model": ["<unet>", 0], "shift": 8 }
}
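T2V shift values used in this skill: 5 with the Lightning LoRAs (as in the complete workflow JSON), 8 for standard 20-step sampling.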
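
To give a feel for what the shift parameter does, here is a small Python sketch of the time-shift remap commonly used for flow-matching schedules. The formula is an assumption based on my understanding of how ComfyUI's flow-matching model sampling works; it is not stated in this skill, and `shift_sigma` is an illustrative name.

```python
def shift_sigma(sigma, shift):
    """Remap a noise level in [0, 1] with a flow-matching time shift.

    Assumed form: shift * sigma / (1 + (shift - 1) * sigma).
    shift == 1 leaves the schedule unchanged; larger values push
    intermediate noise levels toward 1, so more of the schedule is
    spent at high noise.
    """
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


for shift in (1, 5, 8):
    remapped = [round(shift_sigma(s, shift), 3) for s in (0.25, 0.5, 0.75)]
    print(f"shift={shift}: {remapped}")
```

Under this assumption, a higher shift gives the high-noise part of the schedule more weight, which is relevant here because the HighNoise expert only runs during the first half of the steps.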
EmptyHunyuanLatentVideo creates the initial video latent for T2V (no image input):
{
"class_type": "EmptyHunyuanLatentVideo",
"inputs": {
"width": 832,
"height": 480,
"length": 81,
"batch_size": 1
}
}
This replaces WanFirstLastFrameToVideo (which is for FLF/I2V only). The latent goes directly to KSamplerAdvanced Pass 1.
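
As a rough illustration of what this node hands to the sampler, the sketch below computes the shape of the empty latent. The compression factors (16 latent channels, 8x spatial, 4x temporal with the first frame kept separately) are assumptions based on my reading of ComfyUI's node, not something this skill states, and `wan_latent_shape` is an illustrative helper.

```python
def wan_latent_shape(width, height, length, batch_size=1):
    """Shape of the empty video latent: (batch, channels, frames, h, w).

    Assumes 16 latent channels, 8x spatial compression, and 4x temporal
    compression with the first frame kept on its own.
    """
    latent_frames = (length - 1) // 4 + 1
    return (batch_size, 16, latent_frames, height // 8, width // 8)


print(wan_latent_shape(832, 480, 81))  # (1, 16, 21, 60, 104)
```

Under these assumptions an 81-frame clip becomes 21 latent frames, so changing length has a much larger effect on memory than the raw frame count suggests.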
Negative prompt: The tones are vibrant, overexposed, static, details are unclear, subtitles, style, work, painting, image, still, overall grayish, worst quality, low quality, JPEG compression artifacts, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, distorted limbs, merged fingers, motionless image, cluttered background, three legs, many people in the background, walking backwards
UNETLoader (HIGH T2V) → ModelSamplingSD3 (shift) → LoraLoaderModelOnly (Hi Lightning) → MODEL_HI
UNETLoader (LOW T2V) → ModelSamplingSD3 (shift) → LoraLoaderModelOnly (Lo Lightning) → MODEL_LO
CLIPLoader (wan) → CLIP
├─ CLIPTextEncode (positive) → CONDITIONING
└─ CLIPTextEncode (negative) → CONDITIONING
VAELoader → VAE
EmptyHunyuanLatentVideo (832x480, 81 frames) → LATENT
KSamplerAdvanced (Hi: MODEL_HI, steps 0-2, add_noise=enable, return_leftover=enable)
→ noisy LATENT
KSamplerAdvanced (Lo: MODEL_LO, steps 2-4, add_noise=disable, return_leftover=disable)
→ final LATENT
VAEDecode → IMAGE → VHS_VideoCombine → MP4
{
"1": { "class_type": "UNETLoader", "inputs": { "unet_name": "Wan2_2-T2V-A14B_HIGH_fp8_e4m3fn_scaled_KJ.safetensors", "weight_dtype": "default" }, "_meta": { "title": "UNET HighNoise T2V" }},
"2": { "class_type": "UNETLoader", "inputs": { "unet_name": "Wan2_2-T2V-A14B-LOW_fp8_e4m3fn_scaled_KJ.safetensors", "weight_dtype": "default" }, "_meta": { "title": "UNET LowNoise T2V" }},
"3": { "class_type": "ModelSamplingSD3", "inputs": { "model": ["1", 0], "shift": 5 }, "_meta": { "title": "Hi Shift" }},
"4": { "class_type": "ModelSamplingSD3", "inputs": { "model": ["2", 0], "shift": 5 }, "_meta": { "title": "Lo Shift" }},
"5": { "class_type": "LoraLoaderModelOnly", "inputs": {
"model": ["3", 0],
"lora_name": "Unknown\\no tags\\wan2.2_t2v_lightx2v_4steps_lora_v1.1_high_noise.safetensors",
"strength_model": 1.0
}, "_meta": { "title": "Hi Lightning" }},
"6": { "class_type": "LoraLoaderModelOnly", "inputs": {
"model": ["4", 0],
"lora_name": "Unknown\\no tags\\wan2.2_t2v_lightx2v_4steps_lora_v1.1_low_noise.safetensors",
"strength_model": 1.0
}, "_meta": { "title": "Lo Lightning" }},
"7": { "class_type": "CLIPLoader", "inputs": { "clip_name": "umt5_xxl_fp8_e4m3fn_scaled.safetensors", "type": "wan" }},
"8": { "class_type": "VAELoader", "inputs": { "vae_name": "wan_2.1_vae.safetensors" }},
"9": { "class_type": "CLIPTextEncode", "inputs": { "clip": ["7", 0], "text": "<positive prompt describing the video scene and motion>" }, "_meta": { "title": "Positive" }},
"10": { "class_type": "CLIPTextEncode", "inputs": { "clip": ["7", 0], "text": "The tones are vibrant, overexposed, static, details are unclear, subtitles, worst quality, low quality, motionless image" }, "_meta": { "title": "Negative" }},
"11": { "class_type": "EmptyHunyuanLatentVideo", "inputs": {
"width": 832, "height": 480, "length": 81, "batch_size": 1
}},
"12": { "class_type": "KSamplerAdvanced", "inputs": {
"model": ["5", 0],
"positive": ["9", 0],
"negative": ["10", 0],
"latent_image": ["11", 0],
"add_noise": "enable", "noise_seed": 0, "steps": 4, "cfg": 1,
"sampler_name": "euler", "scheduler": "simple",
"start_at_step": 0, "end_at_step": 2, "return_with_leftover_noise": "enable"
}, "_meta": { "title": "Hi Pass" }},
"13": { "class_type": "KSamplerAdvanced", "inputs": {
"model": ["6", 0],
"positive": ["9", 0],
"negative": ["10", 0],
"latent_image": ["12", 0],
"add_noise": "disable", "noise_seed": 0, "steps": 4, "cfg": 1,
"sampler_name": "euler", "scheduler": "simple",
"start_at_step": 2, "end_at_step": 4, "return_with_leftover_noise": "disable"
}, "_meta": { "title": "Lo Pass" }},
"14": { "class_type": "VAEDecode", "inputs": { "samples": ["13", 0], "vae": ["8", 0] }},
"15": { "class_type": "VHS_VideoCombine", "inputs": {
"images": ["14", 0], "frame_rate": 16, "loop_count": 0,
"filename_prefix": "wan_t2v", "format": "video/h264-mp4",
"pingpong": false, "save_output": true,
"pix_fmt": "yuv420p", "crf": 19, "save_metadata": true, "trim_to_audio": false
}}
}
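
A workflow in this API format can be sanity-checked and queued from a script. The Python sketch below checks that every link points at a node that exists, then builds the request for ComfyUI's /prompt endpoint. The server address assumes a default local install, and the helper names are illustrative.

```python
import json
import urllib.request


def check_links(workflow):
    """Return a list of problems: links that point at node ids not in the graph."""
    problems = []
    for node_id, node in workflow.items():
        for name, value in node.get("inputs", {}).items():
            # A link is a two-item list: [source_node_id, output_index].
            is_link = (isinstance(value, list) and len(value) == 2
                       and isinstance(value[0], str) and isinstance(value[1], int))
            if is_link and value[0] not in workflow:
                problems.append(f"node {node_id}.{name} -> missing node {value[0]}")
    return problems


def build_prompt_request(workflow, server="http://127.0.0.1:8188"):
    """Build (but do not send) the POST request that queues a workflow."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        f"{server}/prompt", data=body,
        headers={"Content-Type": "application/json"}, method="POST")


def queue_workflow(path, server="http://127.0.0.1:8188"):
    """Load an API-format workflow JSON file, check it, and queue it."""
    with open(path, encoding="utf-8") as f:
        workflow = json.load(f)
    problems = check_links(workflow)
    if problems:
        raise ValueError("; ".join(problems))
    with urllib.request.urlopen(build_prompt_request(workflow, server)) as resp:
        return json.load(resp)
```

With the JSON above saved to a file (the name is up to you), calling `queue_workflow` on that path would submit it to a running ComfyUI instance. The link check also catches leftover placeholders such as `"<unet>"`.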
Same structure as above, but replace the LoRA and sampler settings:

- Set shift to 8 in the ModelSamplingSD3 nodes
- steps: 20, cfg: 3.5
- Pass 1 (Hi): start_at_step: 0, end_at_step: 10
- Pass 2 (Lo): start_at_step: 10, end_at_step: 20

For more control, use the WanVideoWrapper custom node pack. Key differences from native:

- WanVideoModelLoader → WANVIDEOMODEL type
- WanVideoSampler with built-in shift parameter

WanVideoModelLoader (T2V model) → WANVIDEOMODEL
WanVideoVAELoader → WANVAE
WanVideoTextEncode (positive + negative prompts) → WANVIDEOTEXTEMBEDS
WanVideoImageToVideoEncode (no images — creates empty embeds for T2V)
→ WANVIDIMAGE_EMBEDS
WanVideoSampler (model, image_embeds, text_embeds, steps, cfg, shift, scheduler)
→ LATENT
WanVideoDecode → IMAGE → VHS_VideoCombine → MP4
| Parameter | Standard | Lightning | Notes |
|---|---|---|---|
| steps | 30 | 4 | |
| cfg | 6.0 | 1.0 | |
| shift | 5.0 | 5.0 | Flow matching shift |
| scheduler | unipc | euler | |
| force_offload | true | true | |
Located in loras/Wan Video 2.2 T2V-A14B/:

- concept/PussyLoRA_HighNoise_Wan2.2_HearmemanAI.safetensors + LowNoise pair

Apply concept LoRAs the same way as lightning LoRAs: match hi/lo to the correct model pass. Use LoraLoaderModelOnly with strength 0.5–1.0.
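
Matching hi/lo LoRAs to the correct pass amounts to adding one LoraLoaderModelOnly node per expert and re-pointing each sampler at the new node. A minimal Python sketch on an API-format workflow dict follows; `add_lora_pair` is an illustrative helper, and the file names passed to it are up to the caller.

```python
def add_lora_pair(workflow, hi_ref, lo_ref, hi_lora, lo_lora, strength=0.8):
    """Append one LoraLoaderModelOnly node per expert and return the new model refs.

    hi_ref / lo_ref are links such as ["5", 0] pointing at the current
    HighNoise / LowNoise model outputs. The workflow must be non-empty
    and use numeric string ids, as the workflow JSON in this skill does.
    """
    if not 0.5 <= strength <= 1.0:
        raise ValueError("strength outside the 0.5-1.0 range suggested above")

    next_id = max(int(k) for k in workflow) + 1
    new_refs = []
    for model_ref, lora_name in ((hi_ref, hi_lora), (lo_ref, lo_lora)):
        workflow[str(next_id)] = {
            "class_type": "LoraLoaderModelOnly",
            "inputs": {"model": model_ref, "lora_name": lora_name,
                       "strength_model": strength},
        }
        new_refs.append([str(next_id), 0])
        next_id += 1
    return new_refs[0], new_refs[1]
```

The two returned links replace the model inputs of the Hi and Lo KSamplerAdvanced nodes respectively, so the concept LoRA sits after the Lightning LoRA in each chain.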
| Aspect | Resolution | Notes |
|---|---|---|
| Landscape 16:9 | 832x480 | Default, recommended |
| Portrait 9:16 | 480x832 | |
| 720p landscape | 1280x720 | Higher quality, more VRAM |
| 720p portrait | 720x1280 | |
Width and height must be divisible by 16.
Frame count (length) must have the form 4n + 1 (for example 81, as in the workflows above).
Standard: 16 fps for WAN 2.2 output.
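
These size and frame-count rules are easy to check before queueing a job. The Python sketch below validates a width/height/length combination against the divisible-by-16 and 4n + 1 rules and converts between frame count and duration at 16 fps. Both helper names are illustrative.

```python
def check_video_dims(width, height, length, fps=16):
    """Validate the size and frame count, and return the clip duration in seconds."""
    if width % 16 or height % 16:
        raise ValueError(f"{width}x{height}: width and height must be divisible by 16")
    if length < 1 or (length - 1) % 4:
        raise ValueError(f"length={length}: frame count must have the form 4n + 1")
    return length / fps


def nearest_valid_length(seconds, fps=16):
    """Frame count of the form 4n + 1 closest to the requested duration."""
    n = max(0, round((seconds * fps - 1) / 4))
    return 4 * n + 1


print(check_video_dims(832, 480, 81))  # 5.0625
print(nearest_valid_length(5))         # 81
```

The default 81 frames therefore corresponds to just over five seconds of video at 16 fps.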
| Config | VRAM | Notes |
|---|---|---|
| Dual FP8 models + UMT5 fp8 | ~22-24GB | Tight on RTX 4090 |
| Single FP8 model (no dual) | ~14-16GB | Lower quality but safer |
| With VACE modules | +5.8GB per module | Very tight, may need block swap |

- Use clear_vram before switching to WAN T2V from another model family

Describe motion and temporal progression, not just a scene:
Good: "A beautiful young woman slowly walks through a blooming cherry blossom garden, petals drifting in the breeze, soft sunlight filtering through branches, cinematic slow motion, 4K quality"
Bad: "woman in garden"
Include motion cues: "slowly walks", "camera pans", "wind blowing", "gradually reveals"
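
One way to keep prompts in this shape is to assemble them from named parts so that a motion cue is never left out. The Python sketch below is only an illustration of that idea. `build_t2v_prompt` and its argument names are hypothetical, and the ordering (subject and action, then environment, camera, style) is a convention chosen here, not a WAN requirement.

```python
def build_t2v_prompt(subject, action, environment=(), camera=None, style=()):
    """Join prompt parts in a fixed order: who does what, where, how it is shot."""
    if not action:
        raise ValueError("a T2V prompt needs a motion cue, not just a scene")
    parts = [f"{subject} {action}", *environment]
    if camera:
        parts.append(camera)
    parts.extend(style)
    return ", ".join(parts)


prompt = build_t2v_prompt(
    subject="A beautiful young woman",
    action="slowly walks through a blooming cherry blossom garden",
    environment=["petals drifting in the breeze",
                 "soft sunlight filtering through branches"],
    style=["cinematic slow motion", "4K quality"],
)
print(prompt)
```

Called this way, the helper reproduces the "Good" example above word for word.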
| Feature | T2V | I2V/FLF |
|---|---|---|
| Input | Text only | Text + start/end images |
| Latent init | EmptyHunyuanLatentVideo | WanFirstLastFrameToVideo |
| CLIPVision | Not used | Required |
| Models | T2V-specific (HIGH/LOW) | I2V-specific (HIGH/LOW) |
| Lightning LoRAs | T2V-specific | I2V-specific |
| Creativity | Full creative freedom | Constrained by input frames |
| Use case | Original content | Transitions, animations |