Guides adaptation of Wan-series video diffusion models from NVIDIA CUDA to Huawei Ascend NPU, covering device layer, operator replacement, distributed parallelism, attention optimization, VAE parallelization, and quantization.
npx claudepluginhub ascend-ai-coding/awesome-ascend-skills --plugin migration-ascend-torchnpu-skills

Slash command: /external-gitcode-ascend-skills:wan-ascend-adaptation
Provide a systematic, step-by-step guide for adapting Wan-series (and similar DiT-based) video generation models from NVIDIA CUDA/GPU to Huawei Ascend NPU. The skill encodes 9 major adaptation domains covering every layer of the inference stack, from device initialization to distributed parallelism.
Reference files:
- references/01-device-layer.md
- references/02-operator-replacement.md
- references/03-precision-strategy.md
- references/04-attention-mechanism.md
- references/05-distributed-parallelism.md
- references/06-vae-patch-parallel.md
- references/07-model-quantization.md
- references/08-sparse-attention.md
- references/09-pipeline-integration.md
The adaptation work is organized into 9 domains. Each domain has a dedicated reference file under references/ with detailed instructions, code patterns, and pitfalls.
| # | Domain | Reference File | Priority |
|---|---|---|---|
| 1 | Device Layer Adaptation | references/01-device-layer.md | P0 — Must |
| 2 | Operator Replacement | references/02-operator-replacement.md | P0 — Must |
| 3 | Precision Strategy | references/03-precision-strategy.md | P0 — Must |
| 4 | Attention Mechanism | references/04-attention-mechanism.md | P1 — Critical |
| 5 | Distributed Parallelism | references/05-distributed-parallelism.md | P1 — Critical |
| 6 | VAE Patch Parallel | references/06-vae-patch-parallel.md | P2 — Important |
| 7 | Model Quantization | references/07-model-quantization.md | P2 — Important |
| 8 | Sparse Attention (RainFusion) | references/08-sparse-attention.md | P2 — Important |
| 9 | Inference Pipeline Integration | references/09-pipeline-integration.md | P1 — Critical |
To adapt a Wan-series model to Ascend, follow these steps in order:
Read references/01-device-layer.md for complete guidance.
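As a rough illustration of what the device-layer step involves, the sketch below maps CUDA-style device strings and distributed backends to their Ascend counterparts. The helper names `to_npu_device` and `dist_backend` are hypothetical and are not part of `torch_npu`. The commented imports show the usual entry-point pattern and assume `torch_npu` is installed.

```python
# Entry-point imports on an Ascend host (commented out so the sketch runs anywhere):
# import torch_npu
# from torch_npu.contrib import transfer_to_npu  # redirects many torch.cuda calls to NPU

# Hypothetical lookup table: device type -> torch.distributed backend name.
BACKENDS = {"npu": "hccl", "cuda": "nccl", "cpu": "gloo"}


def to_npu_device(device: str) -> str:
    """Map 'cuda' or 'cuda:N' to 'npu' or 'npu:N'; leave other strings unchanged."""
    if device == "cuda" or device.startswith("cuda:"):
        return "npu" + device[len("cuda"):]
    return device


def dist_backend(device_type: str) -> str:
    """Return the backend name to pass to dist.init_process_group."""
    return BACKENDS[device_type]
```

A helper like this only covers explicit device strings in your own code; implicit CUDA calls inside third-party modules still rely on the `transfer_to_npu` import.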
Key actions:
- Import torch_npu and transfer_to_npu at the entry point
- Replace dist.init_process_group(backend="nccl") with backend="hccl"
- Replace torch.amp.autocast('cuda', ...) with autocast('npu', ...)
- Change device strings from 'cuda' to 'npu'

Read references/02-operator-replacement.md for complete guidance.
Key actions:
- torch_npu.npu_rms_norm()
- .float() type casting
- mindiesd.rotary_position_embedding() fused operator
- mindiesd.fast_layernorm via FAST_LAYERNORM env var
- mindiesd.attention_forward() multi-backend dispatch

Read references/03-precision-strategy.md for complete guidance.
Key actions:
- .float() type conversions in normalization layers
- PRECISION env var to control random number device for cross-platform reproducibility

Read references/04-attention-mechanism.md for complete guidance.
Key actions:
- ALGO env var (0/1/3)
- xFuserLongContextAttention combining Ulysses + Ring Attention
- mindiesd.CacheAgent
- USE_SUB_HEAD env var

Read references/05-distributed-parallelism.md for complete guidance.
Key actions:
- ParallelConfig with 4D parallelism: TP × SP × CFG
- RankGenerator for orthogonal process group assignment
- GroupCoordinator with dual-channel communication (HCCL + Gloo)
- TensorParallelApplicator for automatic model sharding

Read references/06-vae-patch-parallel.md for complete guidance.
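For the distributed-parallelism actions listed above, the idea of orthogonal process groups can be sketched in plain Python. This is not the `RankGenerator` API: `build_groups` is a hypothetical helper, only the three named dimensions (CFG, SP, TP) are used, and the ordering (CFG outermost, TP innermost) is an assumption.

```python
from itertools import product
from typing import Dict, List


def build_groups(sizes: Dict[str, int]) -> Dict[str, List[List[int]]]:
    """Enumerate rank groups for each parallel dimension.

    `sizes` is ordered from outermost to innermost dimension, and a rank is
    the row-major index of its coordinates (assumed layout, for illustration).
    """
    names = list(sizes)
    dims = [sizes[n] for n in names]

    # Row-major strides: the innermost dimension has stride 1.
    strides = [1] * len(dims)
    for i in range(len(dims) - 2, -1, -1):
        strides[i] = strides[i + 1] * dims[i + 1]

    groups: Dict[str, List[List[int]]] = {}
    for i, name in enumerate(names):
        other_ranges = [range(d) for j, d in enumerate(dims) if j != i]
        other_strides = [s for j, s in enumerate(strides) if j != i]
        dim_groups = []
        # Fix every other coordinate, then vary only dimension i.
        for coords in product(*other_ranges):
            base = sum(c * s for c, s in zip(coords, other_strides))
            dim_groups.append([base + k * strides[i] for k in range(dims[i])])
        groups[name] = dim_groups
    return groups


groups = build_groups({"cfg": 2, "sp": 2, "tp": 2})
```

With this layout, a group from one dimension shares at most one rank with a group from any other dimension, which is what keeps the communication domains orthogonal.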
Key actions:
- F.conv3d, F.conv2d, F.interpolate, F.pad for boundary exchange

Read references/07-model-quantization.md for complete guidance.
Key actions:
- msmodelslim for W8A8 dynamic quantization
- mindiesd.quantize() for runtime quantization loading
- patch_cast_buffers_for_float8()

Read references/08-sparse-attention.md for complete guidance.
Key actions:
Read references/09-pipeline-integration.md for complete guidance.
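One action in this step is choosing where the T5 text encoder lives, controlled by `T5_LOAD_CPU`. The sketch below shows one way such a switch could be read. The function name `t5_placement`, the convention that `"1"` enables the flag, and the `npu:<local_rank>` device string are assumptions for illustration.

```python
import os
from typing import Mapping, Tuple


def t5_placement(local_rank: int, env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (encoder_device, embedding_device) for the T5 text encoder.

    Assumes T5_LOAD_CPU=1 keeps the encoder on CPU to save NPU memory, while
    its output embeddings are still moved to the local NPU for the DiT.
    """
    npu_device = f"npu:{local_rank}"
    if env.get("T5_LOAD_CPU", "0") == "1":
        return "cpu", npu_device
    return npu_device, npu_device
```

Keeping the encoder on CPU trades slower text encoding for more free NPU memory during denoising.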
Key actions:
- T5_LOAD_CPU for flexible T5 loading strategy
- freqs_list lifecycle management
- rank < 8
- stream.synchronize()

| Variable | Default | Description |
|---|---|---|
| ALGO | 0 | Attention algorithm: 0=fused_attn_score, 1=ascend_laser_attention, 3=npu_fused_infer |
| FAST_LAYERNORM | 0 | Enable mindiesd fast LayerNorm |
| USE_SUB_HEAD | 0 | Sub-head group size for attention splitting |
| T5_LOAD_CPU | 0 | Load T5 model on CPU to save NPU memory |
| PRECISION | 0 | Generate random numbers on CPU for cross-platform reproducibility |
| OVERLAP | 0 | Enable FA-AllToAll communication overlap |
| PYTORCH_NPU_ALLOC_CONF | - | NPU memory allocation strategy |
| TASK_QUEUE_ENABLE | - | NPU task queue optimization |
| CPU_AFFINITY_CONF | - | CPU affinity configuration |
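The variables in the table above with a documented default can be collected into one typed object at startup. This is a minimal sketch, not how the skill's reference code parses them: `AdaptationEnv` and `read_env` are hypothetical names, and treating `"1"` as "enabled" for the on/off switches is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# ALGO values and names as listed in the table above.
ALGO_NAMES = {0: "fused_attn_score", 1: "ascend_laser_attention", 3: "npu_fused_infer"}


@dataclass(frozen=True)
class AdaptationEnv:
    algo: int
    fast_layernorm: bool
    use_sub_head: int
    t5_load_cpu: bool
    precision: bool
    overlap: bool

    @property
    def algo_name(self) -> str:
        return ALGO_NAMES[self.algo]


def read_env(env: Mapping[str, str] = os.environ) -> AdaptationEnv:
    """Parse the documented variables, falling back to their default of 0."""
    def flag(name: str) -> bool:
        return env.get(name, "0") == "1"  # assumed convention: "1" means enabled

    algo = int(env.get("ALGO", "0"))
    if algo not in ALGO_NAMES:
        raise ValueError(f"ALGO must be one of {sorted(ALGO_NAMES)}, got {algo}")
    return AdaptationEnv(
        algo=algo,
        fast_layernorm=flag("FAST_LAYERNORM"),
        use_sub_head=int(env.get("USE_SUB_HEAD", "0")),
        t5_load_cpu=flag("T5_LOAD_CPU"),
        precision=flag("PRECISION"),
        overlap=flag("OVERLAP"),
    )
```

Validating `ALGO` up front surfaces a typo such as `ALGO=2` before any model weights are loaded.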
| Library | Purpose |
|---|---|
| torch_npu | PyTorch Ascend NPU backend |
| mindiesd | MindIE Stable Diffusion acceleration (FA, RoPE, LayerNorm, quantize) |
| msmodelslim | Huawei model compression toolkit (W8A8 quantization) |
| yunchang | Sequence parallel framework (Ulysses + Ring Attention) |
| torch_atb | Ascend Transformer Boost operators |
| atb_ops | ATB fused matmul-allreduce operators |
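Before starting an adaptation, it can help to check which of the libraries above are importable in the current environment. The sketch below uses only the standard library. It assumes each library's import name matches the name in the table, which may not hold for every package, and `missing_libraries` is a hypothetical helper.

```python
import importlib.util
from typing import Iterable, List

# Import names assumed to match the table entries above.
LIBRARIES = ["torch_npu", "mindiesd", "msmodelslim", "yunchang", "torch_atb", "atb_ops"]


def missing_libraries(names: Iterable[str] = LIBRARIES) -> List[str]:
    """Return the names that cannot be located without importing them."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_libraries()
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All listed libraries are importable.")
```

Using `find_spec` avoids actually importing the packages, so the check stays fast and does not initialize any NPU runtime.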