This Skill guides users through training multimodal understanding (VLM) models on Huawei Ascend NPU using MindSpeed-MM. It uses Qwen2.5VL-3B as the flagship example and covers the end-to-end fine-tuning workflow.
Critical: For most VLMs (Qwen2.5VL, Qwen2VL, InternVL, GLM4V, DeepSeekVL2), follow the manual install flow below. Do NOT use `bash scripts/install.sh` — official MindSpeed-MM docs state it only fully supports Qwen3/Qwen3.5. (For Qwen3VL / Qwen3.5, use the one-click install plus `bash examples/qwen3_5/install_extensions.sh`.)
git clone https://gitcode.com/Ascend/MindSpeed-MM.git /root/workspace/MindSpeed-MM
git clone https://github.com/NVIDIA/Megatron-LM.git /root/workspace/Megatron-LM
cd /root/workspace/Megatron-LM && git checkout core_v0.12.1
cp -r megatron /root/workspace/MindSpeed-MM/
cd /root/workspace/MindSpeed-MM
Always use pre-built wheels from Ascend PyTorch Releases. Do not use `pip install torch==2.7.1` — aarch64 wheels on PyPI are unreliable.
# Download matching wheels from https://gitcode.com/Ascend/pytorch/releases
# Replace cp310 with cp311 if using Python 3.11
pip install torch-2.7.1-cp310-cp310-manylinux_2_28_aarch64.whl
pip install torch_npu-2.7.1rc1-cp310-cp310-manylinux_2_28_aarch64.whl
# For x86_64: same naming pattern with _x86_64 suffix
# Required by torch_npu and other components:
pip install numpy pyyaml scipy attrs decorator psutil
git clone https://gitcode.com/Ascend/MindSpeed.git /root/workspace/MindSpeed
cd /root/workspace/MindSpeed
git checkout 93c45456c7044bacddebc5072316c01006c938f9
pip install -r requirements.txt
pip install -e .
cd /root/workspace/MindSpeed-MM
# Installs base stack from pyproject.toml: transformers==4.57.0, diffusers==0.30.3, peft==0.7.1, etc.
pip install -e .
source /usr/local/Ascend/cann/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
python3 -c "
import torch, torch_npu
assert torch.npu.is_available(), 'NPU not available'
print(f'NPU count: {torch_npu.npu.device_count()}')
import transformers, diffusers, mindspeed
print(f'transformers={transformers.__version__}, diffusers={diffusers.__version__}')
"
Then proceed to Step 0 below for model-specific dependencies. For Qwen2.5VL the base is sufficient; other models need overlays (see Step 0 table).
For detailed troubleshooting and alternative install paths, see mindspeed-mm-env-setup.
| Model | Framework | Entry Script | Sizes | Status |
|---|---|---|---|---|
| Qwen2.5VL | Megatron | pretrain_vlm.py | 3B/7B/32B/72B | Released |
| Qwen2VL | Megatron | pretrain_vlm.py | 2B/7B/72B | Released |
| Qwen3VL | FSDP2 | pretrain_transformers.py | 8B/30B/32B/235B | Released |
| InternVL2.5 | Megatron | pretrain_internvl.py | 4B/78B | Released |
| InternVL3 | Megatron | pretrain_vlm.py | 8B/78B | Released |
| InternVL3.5 | FSDP2 | pretrain_transformers.py | -- | Released |
| GLM4.1V | Megatron | pretrain_vlm.py | 9B | Released |
| GLM4.5V | FSDP2 | pretrain_transformers.py | -- | Released |
| DeepSeekVL2 | Megatron | pretrain_deepseekvl.py | -- | Released |
| DeepSeekOCR | Custom | finetune_ocr.py | -- | Prototype |
| DeepSeekOCR2 | Custom | finetune_ocr2.py | -- | Prototype |
| Ming | Custom | finetune_vl.py | -- | Prototype |
Entry Script Note: VLM models use different entry scripts. Check the shell script in `examples/<model_name>/` for the exact command — do not assume from the model name.
The workflow for training any VLM model in MindSpeed-MM follows a universal pattern:
1. Check `examples/<model_name>/` for available shell scripts (finetune, pretrain, inference, evaluate)
2. Convert weights via `mm-convert` (Megatron models), DCP conversion (FSDP2 models), or use HF weights directly (custom trainer models)
3. Run `bash examples/<model_name>/finetune_<model>_<size>.sh`

For other models, adapt the Qwen2.5VL quick start below. Use the Model Registry to look up the exact entry script, converter, and backend for your target model.
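As a hedged illustration of the same three steps applied to another model, here is an InternVL3-8B sketch (the script name and layer counts are assumptions — verify them in `examples/internvl3/` and the weight-conversion tables before running):

```bash
# Hypothetical InternVL3 adaptation; verify script names and layer counts first.

# 1. Discover the shipped scripts
ls examples/internvl3/

# 2. Convert HF -> MM weights (layer counts must match the model's config)
mm-convert InternVLConverter hf_to_mm \
  --cfg.mm_dir "ckpt/mm_path/InternVL3-8B" \
  --cfg.hf_config.hf_dir "ckpt/hf_path/InternVL3-8B" \
  --cfg.parallel_config.llm_pp_layers [[28]] \
  --cfg.parallel_config.vit_pp_layers [[24]] \
  --cfg.parallel_config.tp_size 1

# 3. Launch the shipped fine-tuning script
bash examples/internvl3/finetune_internvl3_8b.sh
```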
All VLM models in MindSpeed-MM follow one of three framework patterns:
Pattern 1: Megatron (Qwen2.5VL, Qwen2VL, InternVL2.5, InternVL3, GLM4.1V, DeepSeekVL2)
# Uses JSON config files: data.json + model.json
# Weight conversion: mm-convert <Converter> hf_to_mm
torchrun $DISTRIBUTED_ARGS pretrain_vlm.py \
--mm-data examples/<model>/data.json \
--mm-model examples/<model>/model.json \
--load $MM_WEIGHT_PATH ...
Pattern 2: FSDP2 (Qwen3VL, InternVL3.5, GLM4.5V)
# Uses YAML config file; requires CUDA_DEVICE_MAX_CONNECTIONS=2
# Weight conversion: mm-convert <Converter> hf_to_dcp
# MoE models may need: --init-model-with-meta-device
CUDA_DEVICE_MAX_CONNECTIONS=2 torchrun $DISTRIBUTED_ARGS pretrain_transformers.py \
--fsdp2-config-path examples/<model>/fsdp2_config.yaml ...
Pattern 3: Custom Trainers (DeepSeekOCR, DeepSeekOCR2, Ming)
# Standalone entry scripts; uses HF weights directly (no conversion needed)
torchrun $DISTRIBUTED_ARGS examples/<model>/finetune_<model>.py <args>
This is the flagship Megatron-pattern example. For other models, follow the same workflow structure but use the configs and scripts from `examples/<model_name>/`. FSDP2 models use YAML configs instead of JSON; custom trainer models skip weight conversion entirely. See the "Framework Patterns" section above.
Different VLM models have different version requirements for libraries such as transformers. Some of these requirements conflict, so environment isolation is necessary.
| Model | Additional Dependencies | Notes |
|---|---|---|
| Qwen2.5VL | None required | Base environment is sufficient |
| Qwen3VL | pip install -r examples/qwen3vl/requirements.txt | Requires latest transformers; may conflict with other models |
| DeepSeekOCR | pip install -r examples/deepseekocr/requirements.txt | Downgrades transformers to 4.46.3; must use isolated environment |
| InternVL3 | None required | Base environment is sufficient (directory is examples/internvl3/, no extra requirements.txt) |
Important: Qwen3VL and DeepSeekOCR have conflicting transformers version requirements. If you need to support both, use separate Docker containers or conda environments. See per-model-deps.md for details.
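The conda route might look like the sketch below (environment names and Python version are illustrative; each environment repeats the base MindSpeed-MM install from the setup steps above before adding its overlay):

```bash
# Illustrative isolation via conda; run the base MindSpeed-MM install
# inside each environment before applying the per-model overlay.
conda create -n msmm-qwen3vl python=3.10 -y
conda activate msmm-qwen3vl
pip install -r examples/qwen3vl/requirements.txt      # latest transformers

conda create -n msmm-deepseekocr python=3.10 -y
conda activate msmm-deepseekocr
pip install -r examples/deepseekocr/requirements.txt  # transformers==4.46.3
```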
# Download from HuggingFace (adjust or drop the proxy settings per your network)
http_proxy=http://127.0.0.1:58232 https_proxy=http://127.0.0.1:58232 \
huggingface-cli download Qwen/Qwen2.5-VL-3B-Instruct \
--local-dir ckpt/hf_path/Qwen2.5-VL-3B-Instruct
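If no proxy is available, the `HF_ENDPOINT` override of huggingface_hub may work as an alternative (hf-mirror.com is a community mirror; its availability is not guaranteed):

```bash
HF_ENDPOINT=https://hf-mirror.com \
huggingface-cli download Qwen/Qwen2.5-VL-3B-Instruct \
  --local-dir ckpt/hf_path/Qwen2.5-VL-3B-Instruct
```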
MindSpeed-MM uses its own weight format (MM format). Weights must be converted before training.
mm-convert Qwen2_5_VLConverter hf_to_mm \
--cfg.mm_dir "ckpt/mm_path/Qwen2.5-VL-3B-Instruct" \
--cfg.hf_config.hf_dir "ckpt/hf_path/Qwen2.5-VL-3B-Instruct" \
--cfg.parallel_config.llm_pp_layers [[36]] \
--cfg.parallel_config.vit_pp_layers [[32]] \
--cfg.parallel_config.tp_size 1
Key parameter descriptions:
| Parameter | Description |
|---|---|
--cfg.mm_dir | Output directory for converted MM format weights |
--cfg.hf_config.hf_dir | Source HF weight directory |
--cfg.parallel_config.llm_pp_layers | PP partitioning for LLM layers. [[36]] means all 36 layers on one device when PP=1 |
--cfg.parallel_config.vit_pp_layers | PP partitioning for ViT layers. [[32]] means all 32 layers on one device when PP=1 |
--cfg.parallel_config.tp_size | Tensor parallelism degree; must match the training configuration |
Important: `llm_pp_layers` and `vit_pp_layers` must match the `pipeline_num_layers` in model.json. When PP > 1, you need to split the layer counts, e.g., `[[18,18]]` for PP=2.
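For example, a PP=2 conversion of the 3B model might look like this (the `[[18,18]]` LLM split follows the note above; placing all 32 ViT layers on the first stage via `[[32,0]]` is an assumption — it must match your model.json):

```bash
mm-convert Qwen2_5_VLConverter hf_to_mm \
  --cfg.mm_dir "ckpt/mm_path/Qwen2.5-VL-3B-Instruct-pp2" \
  --cfg.hf_config.hf_dir "ckpt/hf_path/Qwen2.5-VL-3B-Instruct" \
  --cfg.parallel_config.llm_pp_layers [[18,18]] \
  --cfg.parallel_config.vit_pp_layers [[32,0]] \
  --cfg.parallel_config.tp_size 1
```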
| Model | Converter Name | Direction | Notes |
|---|---|---|---|
| Qwen2.5VL | Qwen2_5_VLConverter | hf_to_mm / mm_to_hf | See example above |
| Qwen2VL | Qwen2VLConverter | hf_to_mm / mm_to_hf | Same pattern as Qwen2.5VL |
| Qwen3VL | Qwen3VLConverter | hf_to_dcp / dcp_to_hf | FSDP2: uses DCP format, not mm format |
| Qwen3VL (Megatron) | Qwen3VLMegatronConverter | hf_to_mm / mm_to_hf | Alternative Megatron-path converter |
| InternVL2.5/3 | InternVLConverter | hf_to_mm / mm_to_hf | --cfg.parallel_config.vit_pp_layers [[45]] |
| InternVL3.5 | ExpertMergeDcpConverter | hf_to_dcp / dcp_to_hf | FSDP2 path; MoE model uses expert merge converter |
| DeepSeekVL2 | DeepSeekVLConverter | hf_to_mm only | mm_to_hf is unimplemented (stub). Note MoE expert layers |
| GLM4.1V | GlmConverter | hf_to_mm / mm_to_hf | Generic Megatron converter |
| GLM4.5V | ExpertMergeDcpConverter | hf_to_dcp / dcp_to_hf | FSDP2 path; MoE model uses expert merge converter |
| MoE models | ExpertMergeDcpConverter | merge / split | Merge/split MoE expert weights for DCP checkpoints |
| DeepSeekOCR/OCR2 | None | -- | Uses HF weights directly; no conversion needed |
| Ming | None | -- | Uses HF weights directly; no conversion needed |
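After training you will usually want the checkpoint back in HF format for deployment; per the table above, the Qwen converters support `mm_to_hf`. A sketch, assuming the reverse direction reuses the same config arguments (confirm the exact flags via the converter's help output before use):

```bash
# Assumed flag layout mirroring hf_to_mm; verify before use.
mm-convert Qwen2_5_VLConverter mm_to_hf \
  --cfg.mm_dir "save_dir" \
  --cfg.hf_config.hf_dir "ckpt/hf_path/Qwen2.5-VL-3B-Instruct" \
  --cfg.parallel_config.llm_pp_layers [[36]] \
  --cfg.parallel_config.vit_pp_layers [[32]] \
  --cfg.parallel_config.tp_size 1
```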
VLM training uses the MLLM JSON format. This example uses COCO2017 + LLaVA-Instruct-150K.
# 1. Download COCO2017 training images (standard COCO source)
mkdir -p data/COCO2017
wget http://images.cocodataset.org/zips/train2017.zip
unzip train2017.zip -d data/COCO2017/   # extracts to data/COCO2017/train2017/
# 2. Download LLaVA-Instruct-150K to data/
huggingface-cli download --repo-type dataset liuhaotian/LLaVA-Instruct-150K \
  llava_instruct_150k.json --local-dir data/
Important: The conversion script uses hardcoded relative paths. It expects `./data/llava_instruct_150k.json` as input and `./data/COCO2017/train2017/` for image lookup. Run it from the MindSpeed-MM root directory, and ensure data is at `./data/` (or create symlinks).
cd /root/workspace/MindSpeed-MM # Must run from repo root
python examples/qwen2vl/llava_instruct_2_mllm_demo_format.py
Output: data/mllm_format_llava_instruct_data.json
The script converts LLaVA format (conversations/image) to MLLM format (messages/images), which is what MindSpeed-MM expects.
data/
├── COCO2017/train2017/ # Image files
│ ├── 000000000009.jpg
│ ├── 000000000025.jpg
│ └── ...
├── llava_instruct_150k.json # Original data (for conversion)
└── mllm_format_llava_instruct_data.json # MLLM format (for training)
Field name matching: The field names in your data JSON must match the `attr` section in the data.json config. Default mapping: `messages` for conversations, `images` for image paths. Using different field names (e.g., `conversations` instead of `messages`) will cause a `KeyError`.
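A quick pre-flight check in the same `python3 -c` style as the environment verification above confirms the converted file carries the expected field names (this assumes the converted file is a top-level JSON list of records):

```bash
python3 -c "
import json
recs = json.load(open('data/mllm_format_llava_instruct_data.json'))
assert isinstance(recs, list) and recs, 'expected a non-empty JSON list'
missing = [k for k in ('messages', 'images') if k not in recs[0]]
assert not missing, f'missing fields {missing}; check the attr mapping in data.json'
print(f'{len(recs)} records; first record keys: {list(recs[0])}')
"
```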
For detailed MLLM JSON format specifications, see data-format.md.
Qwen2.5VL-3B fine-tuning requires two JSON configuration files:
Path: examples/qwen2.5vl/data_3b.json
{
"dataset_param": {
"dataset_type": "huggingface",
"preprocess_parameters": {
"model_name_or_path": "./ckpt/hf_path/Qwen2.5-VL-3B-Instruct"
},
"basic_parameters": {
"dataset_dir": "./data",
"dataset": "./data/mllm_format_llava_instruct_data.json",
"cache_dir": "./data/cache_dir"
}
}
}
Must edit before running — update these 3 paths to match your environment:
| Field | Description | What to set |
|---|---|---|
model_name_or_path | Original HF weights (not MM-converted) | e.g., /home/weights/Qwen2.5-VL-3B-Instruct |
dataset | MLLM format JSON path | e.g., ./data/mllm_format_llava_instruct_data.json |
cache_dir | Data preprocessing cache (LOCAL path) | e.g., ./data/cache_dir — not NFS/shared mount |
Multi-node Note: `cache_dir` stores the preprocessing cache. During multi-node training, each node must use a local path and must not point to an NFS shared directory, otherwise concurrent write conflicts will occur.
Path: examples/qwen2.5vl/model_3b.json
model.json defines three components of the model architecture:
| Component | Type | Layers | Description |
|---|---|---|---|
image_encoder | qwen2vit | 32 | Vision encoder (ViT) |
vision_projector | lnmlp | -- | Vision-to-text projection layer |
text_decoder | qwen2_5_lm | 36 | Text decoder (LLM) |
Important: The `pipeline_num_layers` field defines the PP partitioning scheme and must exactly match the `llm_pp_layers` / `vit_pp_layers` used during weight conversion.
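A quick way to catch mismatches before launch is to print the partitioning entries straight from the config and compare them by eye with the values passed to mm-convert:

```bash
# Values printed here must match the llm_pp_layers / vit_pp_layers
# used during weight conversion.
grep -n "pipeline_num_layers" examples/qwen2.5vl/model_3b.json
```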
The training launch script is located at examples/qwen2.5vl/finetune_qwen2_5_vl_3b.sh.
LOAD_PATH="ckpt/mm_path/Qwen2.5-VL-3B-Instruct" # Converted MM weights
SAVE_PATH="save_dir" # Training output save path
DATA_CONFIG="examples/qwen2.5vl/data_3b.json" # Data configuration
MODEL_CONFIG="examples/qwen2.5vl/model_3b.json" # Model configuration
TP=1; PP=1; CP=1 # Parallelism configuration
MBS=1; GRAD_ACC_STEP=32 # Batch size configuration
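For orientation, the standard Megatron relation ties these knobs together: global batch size = micro-batch size × gradient-accumulation steps × data-parallel size, with DP = world_size / (TP × PP × CP). A quick sanity computation using the values above (a single 8-NPU node is assumed):

```bash
# Assumed: one node with 8 NPUs; other values from the snippet above.
WORLD_SIZE=8; TP=1; PP=1; CP=1; MBS=1; GRAD_ACC_STEP=32
DP=$((WORLD_SIZE / (TP * PP * CP)))
echo "effective global batch size = $((MBS * GRAD_ACC_STEP * DP))"   # 256
```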
| Parameter | Description | Typical Value |
|---|---|---|
--tensor-model-parallel-size | Tensor parallelism degree (TP) | 1-8 |
--pipeline-model-parallel-size | Pipeline parallelism degree (PP) | 1-4 |
--context-parallel-size | Context parallelism degree (CP) | 1-2 |
--micro-batch-size | Micro-batch size per NPU | 1 |
--global-batch-size | Global batch size | 32 |
--lr | Learning rate | 1e-5 to 2e-5 |
--train-iters | Number of training iterations | 1000-5000 |
--bf16 | BFloat16 precision | Always recommended |
--use-distributed-optimizer | Distributed optimizer | Recommended for multi-device |
--no-load-optim | Do not load optimizer state | Required for fine-tuning |
--no-load-rng | Do not load RNG state | Required for fine-tuning |
Docker users: If your container lacks --ipc=host or --shm-size=16g, set --num-workers 0 in the training script to avoid Bus error from DataLoader workers.
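If you are (re)creating the container, a launch along these lines avoids the shared-memory problem (the image name and device list are placeholders for your environment):

```bash
# Placeholder image and device mounts; the key flags are --ipc=host / --shm-size.
docker run -it --ipc=host --shm-size=16g \
  --device /dev/davinci0 --device /dev/davinci_manager \
  --device /dev/devmm_svm --device /dev/hisi_hdc \
  -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
  <your-ascend-image> bash
```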
torchrun $DISTRIBUTED_ARGS pretrain_vlm.py $GPT_ARGS $MM_ARGS $OUTPUT_ARGS
Note: Qwen2.5VL uses `pretrain_qwen2vl.py` or `pretrain_vlm.py` as the entry point. Different versions of the script may vary; refer to the actual script in your installation.
# Single node, 8 NPUs
MASTER_ADDR=localhost
NNODES=1
NODE_RANK=0
# Multi-node (2 nodes)
# Node 0 (master node)
MASTER_ADDR=192.168.1.100 # Master node IP
NNODES=2
NODE_RANK=0
# Node 1
MASTER_ADDR=192.168.1.100 # Same as above; points to the master node
NNODES=2
NODE_RANK=1
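These variables typically feed a standard torchrun argument string inside the script; a representative sketch (variable names may differ slightly in the shipped script):

```bash
NPUS_PER_NODE=8
MASTER_PORT=6000
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES \
                  --node_rank $NODE_RANK --master_addr $MASTER_ADDR \
                  --master_port $MASTER_PORT"
```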
bash examples/qwen2.5vl/finetune_qwen2_5_vl_3b.sh
Key metrics to monitor in training logs:
- `lm loss`: Language model loss; should decrease steadily
- `grad norm`: Gradient norm; excessively large values may indicate the need to adjust the learning rate
- `iteration time`: Time per training iteration

After training, you can run inference verification. Note: no 3B-specific inference script exists; use the 7B script and adjust the model/config paths:
# Edit inference_qwen2_5_vl_7b.sh to point to your 3B checkpoint, then:
bash examples/qwen2.5vl/inference_qwen2_5_vl_7b.sh
Inference requires an inference JSON configuration (inference_qwen2_5_vl_7b.json) specifying the model path and test images. The inference script uses the inference_vlm.py entry point.
# Similarly, use the 7B evaluation script with 3B paths:
bash examples/qwen2.5vl/evaluate_qwen2_5_vl_7b.sh
Note: Only 7B and 72B inference/evaluation scripts are provided. For 3B or 32B, copy and adapt the 7B scripts with the appropriate model config and checkpoint paths.
| Model Size | NPU Count | Recommended TP | Recommended PP | MBS |
|---|---|---|---|---|
| 3B | 1-8 | 1 | 1 | 1 |
| 7B | 8 | 1-2 | 1-2 | 1 |
| 32B | 8-16 | 4 | 2 | 1 |
| 72B-78B | 16-32 | 8 | 2-4 | 1 |
Qwen2.5VL supports DPO post-training using the posttrain_qwen2vl_dpo.py entry point:
bash examples/qwen2vl/finetune_qwen2vl_72b_dpo.sh
See the relevant files in the examples/qwen2vl/ directory for DPO configuration and data format details.
Q: Training fails with shape mismatch after weight conversion
pipeline_num_layers does not match the llm_pp_layers / vit_pp_layers used during weight conversion. Check that the layer count configuration in model.json matches the parameters passed to the mm-convert command.
Q: Data loading hangs during multi-node training
cache_dir is set to an NFS shared mount path. Change it to a local path (e.g., /tmp/cache_dir) so each node preprocesses data independently.
Q: transformers version conflict
Different VLM models require different versions of transformers. Qwen3VL needs the latest version, while DeepSeekOCR needs 4.46.3. Use separate containers to isolate environments.
Q: Images fail to load during inference
Check that the image paths in the inference JSON are correct and that the image files exist in a supported format (JPEG/PNG).
Q: Training OOM (out of memory)
Reduce --micro-batch-size to 1, enable --use-distributed-optimizer, or increase TP/PP parallelism.
Q: Training fails with Communication_Error_Bind_IP_Port
A stale process from a previous run is holding the port. Fix:
ps aux | grep torchrun | grep -v grep | awk '{print $2}' | xargs kill -9
# Or change MASTER_PORT in the training script