Orchestrates story-to-video pipeline: breaks text into scenes, generates consistent Z-Image hero/refs + Qwen Edit frames, WAN FLF clips, ffmpeg concatenation.
From artokun/comfyui-mcp (--plugin comfy). This skill uses the workspace's default tool permissions.
The Director skill orchestrates a complete short film production from a text story. It breaks the story into scenes, generates start/end frames for each, creates video clips from frame pairs, and concatenates everything into a final video.
Pipeline: Story Planning → Z-Image Hero + Character Refs → Qwen Edit Chain (all frames) → WAN 2.2 FLF Video Clips → ffmpeg Concatenation
Key architectural decisions:
- clear_vram between every model family switch.

Independent Z-Image generations per scene produce different-looking characters. This was the #1 problem discovered during testing. The solution:
Phase 1: Story Planning → Break story into scenes (Claude reasoning, no ComfyUI)
Phase 2: Hero + Refs → Z-Image: 1 hero frame + character ref portraits + background ref
Phase 3: Hero Review → Visual verify hero and refs, user approves
Phase 4: Edit Chain → Qwen Edit: chain ALL scene frames from hero (with char refs in slots 2-3)
Phase 5: Frame Review → Visual verify all frames, approve/reject/retry
Phase 6: Video Clips → WAN 2.2 FLF dual Hi-Lo (one clip per scene)
Phase 7: Video Review → Preview each clip
Phase 8: Final Assembly → ffmpeg concat all clips into one MP4
Project state is saved at ~/code/comfyui-mcp/workflows/director_state_{project_id}.json and updated after every edit or phase completion.
{
"project_id": "story_20260216_143022",
"created": "2026-02-16T14:30:22Z",
"story": "Original user story text",
"current_phase": 4,
"orientation": "portrait",
"hero_frame": { "file": "director_hero_00001_.png", "seed": 428571, "approved": true },
"character_refs": {
"man": "director_ref_man.png",
"cat": "director_ref_cat.png",
"woman": "director_ref_woman.png",
"background": "director_ref_bedroom.png"
},
"scenes": [
{
"id": 1,
"description": "Brief scene description",
"edit_prompt_start": "Qwen Edit instruction to create start frame from source",
"edit_prompt_end": "Qwen Edit instruction to create end frame from source",
"edit_source_start": "hero",
"edit_source_end": "hero",
"video_prompt": "WAN motion description",
"start_frame": { "file": "director_s1_start_00001_.png", "seed": 12345, "approved": true },
"end_frame": { "file": "director_hero_00001_.png", "seed": null, "approved": true },
"video_clip": { "file": "director_s1_00001.mp4", "seed": 11111, "approved": false },
"status": "video_pending"
}
],
"final_video": null,
"settings": {
"start_frame_resolution": [832, 1472],
"video_resolution": [480, 720],
"video_frames": 81,
"video_fps": 16
}
}
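Because the state file is plain JSON, resume and progress logic stays trivial. A minimal sketch, assuming hypothetical helper names (only the file path and field layout come from this skill):

```python
import json
from pathlib import Path

# Location defined by this skill's state-file convention
STATE_DIR = Path.home() / "code" / "comfyui-mcp" / "workflows"

def load_state(project_id):
    """Read the director state file for a project."""
    return json.loads((STATE_DIR / f"director_state_{project_id}.json").read_text())

def save_state(state):
    """Persist state after every edit or phase completion."""
    path = STATE_DIR / f"director_state_{state['project_id']}.json"
    path.write_text(json.dumps(state, indent=2))

def next_pending_scene(state):
    """Return the first scene whose video clip is missing or unapproved."""
    for scene in state["scenes"]:
        clip = scene.get("video_clip")
        if clip is None or not clip.get("approved"):
            return scene
    return None
```

`next_pending_scene` is what a post-compaction resume would call to decide where Phase 6 left off.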
| Phase | Model Family | Key Models | VRAM |
|---|---|---|---|
| 2: Hero + Refs | Z-Image | redcraftRedzimageUpdatedJAN30_redzibDX1.safetensors | ~17GB |
| 4: Edit Chain | Qwen Edit | qwen_image_edit_2511_bf16.safetensors + Lightning LoRA | ~17-18GB |
| 6: Video Clips | WAN 2.2 I2V | Remix NSFW Hi+Lo (built-in lightning) | ~22-24GB |
CRITICAL: clear_vram between every model family switch.
Break the story into 2-6 scenes. For each scene, identify a brief description, Qwen Edit prompts for its start/end frames (with their edit sources), and a WAN motion prompt.
Identify a hero frame — the single most representative scene image that establishes the main character and setting. This hero will anchor all other frames via Qwen Edit.
Also identify which character reference images are needed (portraits of each character, key props, background).
The end frame of Scene N must be the EXACT same image file as the start frame of Scene N+1. Do NOT create separate Qwen-edited start frames for subsequent scenes — this causes visible jumps at scene boundaries when the videos are concatenated.
The frame chain for video generation:
Scene 1: S1_start (unique) → hero (end)
Scene 2: hero (= S1 end) → S2_end
Scene 3: S2_end (= S2 end) → S3_end
Scene 4: S3_end (= S3 end) → S4_end
Scene 5: S4_end (= S4 end) → S5_end
Only Scene 1 needs a unique start frame. All other scenes inherit their start from the previous scene's end.
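The chaining rule above can be expressed mechanically. A sketch (the `frame_pairs` helper is hypothetical; the labels follow the chain shown above):

```python
def frame_pairs(scene_count):
    """Per-scene (start, end) frame labels under the chaining rule:
    only Scene 1 gets a unique start frame; every later scene starts
    on the previous scene's end frame, so concat boundaries are seamless."""
    pairs = []
    for n in range(1, scene_count + 1):
        if n == 1:
            start = "S1_start"     # the only uniquely edited start frame
            end = "hero"           # hero doubles as Scene 1's end frame
        else:
            start = pairs[-1][1]   # inherit the previous scene's end
            end = f"S{n}_end"
        pairs.append((start, end))
    return pairs

# frame_pairs(3) → [("S1_start", "hero"), ("hero", "S2_end"), ("S2_end", "S3_end")]
```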
The edit chain produces only end frames (plus Scene 1's unique start frame). Map which end frame derives from which source:
Example chain:
Hero (man+cat on bed)
├─ S1 Start: edit hero → remove cat, man alone
├─ S2 End: edit hero → replace cat with woman
│ └─ S3 End: edit S2End → both sit up, man startled
│ └─ S4 End: edit S3End → sitting close, warm smiles
│ └─ S5 End: edit S4End → warm embrace
Generate with Z-Image RedCraft DX1 (10 steps, CFG 1, euler/simple):
Add to negative prompts for character refs to exclude wrong subjects (e.g., "woman, female" when generating man portrait).
{
"1": { "class_type": "CheckpointLoaderSimple", "inputs": { "ckpt_name": "redcraftRedzimageUpdatedJAN30_redzibDX1.safetensors" }},
"2": { "class_type": "CLIPTextEncode", "inputs": { "clip": ["1", 1], "text": "<hero_prompt>" }, "_meta": { "title": "Positive" }},
"3": { "class_type": "CLIPTextEncode", "inputs": { "clip": ["1", 1], "text": "3D, ai generated, semi realistic, illustrated, drawing, comic, digital painting, 3D model, blender, video game screenshot, render, smooth textures, CGI, text, writing, subtitle, watermark, logo, blurry, low quality, jpeg artifacts, grainy" }, "_meta": { "title": "Negative" }},
"4": { "class_type": "EmptyLatentImage", "inputs": { "width": 832, "height": 1472, "batch_size": 1 }},
"5": { "class_type": "KSampler", "inputs": {
"model": ["1", 0], "positive": ["2", 0], "negative": ["3", 0], "latent_image": ["4", 0],
"seed": 42, "steps": 10, "cfg": 1, "sampler_name": "euler", "scheduler": "simple", "denoise": 1
}},
"6": { "class_type": "VAEDecode", "inputs": { "samples": ["5", 0], "vae": ["1", 2] }},
"7": { "class_type": "SaveImage", "inputs": { "images": ["6", 0], "filename_prefix": "director_hero" }}
}
Queue hero + all refs while Z-Image checkpoint is loaded (same checkpoint, different prompts).
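Since hero and refs share the loaded checkpoint, queuing them is just re-submitting the same graph with a few fields swapped. A sketch of the per-image patch (node numbers match the workflow above; the helper name is hypothetical):

```python
import copy

def patch_zimage_workflow(base, prompt, negative_extra, prefix, seed):
    """Clone the Z-Image graph and swap only the per-image fields:
    positive prompt (node 2), extra negatives (node 3), seed (node 5),
    and output filename prefix (node 7)."""
    wf = copy.deepcopy(base)
    wf["2"]["inputs"]["text"] = prompt
    if negative_extra:  # e.g. "woman, female" when generating the man portrait
        wf["3"]["inputs"]["text"] += ", " + negative_extra
    wf["5"]["inputs"]["seed"] = seed
    wf["7"]["inputs"]["filename_prefix"] = prefix
    return wf
```

Each ref then becomes one call, e.g. `patch_zimage_workflow(base, man_portrait_prompt, "woman, female", "director_ref_man", 1001)`, queued while the checkpoint stays resident.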
Show hero frame and all character refs. User approves or requests regeneration with new seed.
{
"1": { "class_type": "UNETLoader", "inputs": { "unet_name": "qwen_image_edit_2511_bf16.safetensors", "weight_dtype": "default" }},
"2": { "class_type": "LoraLoaderModelOnly", "inputs": { "model": ["1", 0], "lora_name": "Qwen-Image-Edit-2511-Lightning-4steps-V1.0-bf16.safetensors", "strength_model": 1 }},
"3": { "class_type": "CLIPLoader", "inputs": { "clip_name": "qwen_2.5_vl_7b_fp8_scaled.safetensors", "type": "qwen_image" }},
"4": { "class_type": "VAELoader", "inputs": { "vae_name": "qwen_image_vae.safetensors" }},
"5": { "class_type": "LoadImage", "inputs": { "image": "<source_scene.png>" }, "_meta": { "title": "Source Scene" }},
"5b": { "class_type": "LoadImage", "inputs": { "image": "<character_ref.png>" }, "_meta": { "title": "Character Ref" }},
"5c": { "class_type": "LoadImage", "inputs": { "image": "<background_ref.png>" }, "_meta": { "title": "Background Ref" }},
"6": { "class_type": "TextEncodeQwenImageEditPlusAdvance_lrzjason", "inputs": {
"clip": ["3", 0], "prompt": "<edit_prompt>", "vae": ["4", 0],
"vl_resize_image1": ["5", 0],
"vl_resize_image2": ["5b", 0],
"vl_resize_image3": ["5c", 0],
"target_size": 1024, "target_vl_size": 384,
"upscale_method": "lanczos", "crop_method": "pad"
}},
"7": { "class_type": "ConditioningZeroOut", "inputs": { "conditioning": ["6", 0] }},
"8": { "class_type": "KSampler", "inputs": {
"model": ["2", 0], "positive": ["6", 0], "negative": ["7", 0], "latent_image": ["6", 1],
"seed": 42, "steps": 4, "cfg": 1, "sampler_name": "euler", "scheduler": "simple", "denoise": 1
}},
"9": { "class_type": "VAEDecode", "inputs": { "samples": ["8", 0], "vae": ["4", 0] }},
"10": { "class_type": "SaveImage", "inputs": { "images": ["9", 0], "filename_prefix": "director_s1_start" }}
}
Key: slots 5b and 5c — feed character reference and background reference into vl_resize_image2 and vl_resize_image3. This helps the vision encoder maintain character appearance across edits.
Edits are sequential — each depends on the previous output:
After each edit completes, upload_image the output so it can serve as the next edit's source. Independent edits (both sourced from the hero) can run in parallel.
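The sequential-vs-parallel rule can be made concrete: an edit may run once its source frame exists. A sketch that groups edits into waves, using a mapping that mirrors the example chain above (`edit_waves` is a hypothetical helper):

```python
def edit_waves(edits):
    """Group Qwen edits into waves. Wave 0 holds edits sourced directly
    from the hero (parallelizable); each chained edit waits for the
    wave that produces its source frame."""
    done = {"hero"}
    pending = dict(edits)  # frame name -> source frame name
    waves = []
    while pending:
        ready = sorted(f for f, src in pending.items() if src in done)
        if not ready:
            raise ValueError("unknown or cyclic edit source")
        waves.append(ready)
        done.update(ready)
        for f in ready:
            del pending[f]
    return waves

# Mirrors the example chain above:
edits = {"S1_start": "hero", "S2_end": "hero", "S3_end": "S2_end",
         "S4_end": "S3_end", "S5_end": "S4_end"}
# edit_waves(edits) → [["S1_start", "S2_end"], ["S3_end"], ["S4_end"], ["S5_end"]]
```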
For each frame, show via Read for visual inspection. User approves or provides feedback. Re-run individual edits without redoing the whole chain.
(Same as wan-flf-video skill — Remix NSFW Hi+Lo, 4-stack LoRA, ImageResizeKJv2, dual KSamplerAdvanced)
Key settings (from the project state): 480×720 video resolution, 81 frames, 16 fps.
For transformation scenes (e.g., cat→woman), add morph LoRA to Hi/Lo Common stacks:
- wan2.2_i2v_magical_morph_highnoise.safetensors → Hi Common slot 1 (strength 1.0)
- wan2.2_i2v_magical_morph_lownoise.safetensors → Lo Common slot 1 (strength 1.0)

Use 1.0 strength — tested without sparkle issues. Lower values (0.7-0.85) produce weaker morph effects that may look like a dissolve rather than a true morph.
Swap per scene: start/end image filenames, positive prompt text, noise_seed, filename_prefix.
All 5 clips can be queued at once — they run sequentially in ComfyUI, sharing loaded models.
Report each clip's filename. User previews externally.
cd "<ComfyUI_output_dir>"
printf "file 'director_s1_00001.mp4'\nfile 'director_s2_00001.mp4'\n..." > concat_list.txt
ffmpeg -f concat -safe 0 -i concat_list.txt -c copy director_final_{project_id}.mp4
All clips share resolution/codec/framerate — copy-concat works without re-encoding.
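Generating concat_list.txt from the approved clip list avoids typos in a hand-written printf. A sketch (clip filenames would come from the state file's video_clip entries; `concat_list` is a hypothetical helper):

```python
def concat_list(clip_files):
    """Build the ffmpeg concat-demuxer input: one `file '...'` line per
    clip, in scene order. The single quotes are part of the demuxer
    syntax and protect filenames containing spaces."""
    return "".join(f"file '{name}'\n" for name in clip_files)

# Write the result next to the clips, then run:
#   ffmpeg -f concat -safe 0 -i concat_list.txt -c copy final.mp4
```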
After context compaction:
- Read director_session_notes.md if it exists
- Restore current_phase and per-scene status from the state file
- clear_vram before loading the model family for the current phase

| Phase | Per Scene | 5 Scenes |
|---|---|---|
| Hero + Refs (Z-Image) | ~10s each | ~50s (one-time) |
| Edit Chain (Qwen 4-step) | ~35s each | ~280s (8 edits) |
| Video Clip (WAN FLF 81 frames) | ~140s | ~700s |
| VRAM swaps (3x clear_vram) | ~30s each | ~90s |
| Total generation | | ~19 min |
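The total is just the sum of the 5-scene column, worth sanity-checking when scene counts change:

```python
# 5-scene budget in seconds, from the table above
hero_refs = 50      # one-time Z-Image pass
edit_chain = 280    # 8 Qwen edits x ~35 s
video_clips = 700   # 5 WAN clips x ~140 s
vram_swaps = 90     # 3 clear_vram x ~30 s

total_s = hero_refs + edit_chain + video_clips + vram_swaps
print(total_s, round(total_s / 60))  # 1120 s, ~19 min
```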
Use distinctive visual elements that transfer between characters/forms to create narrative connections: