From external-gitcode-ascend-skills
Calculates Model FLOPs Utilization (MFU) for large model training from model config file and training logs. Supports Dense and MoE architectures.
Install:
npx claudepluginhub ascend-ai-coding/awesome-ascend-skills --plugin migration-ascend-torchnpu-skills
Slash command:
/external-gitcode-ascend-skills:training-mfu-calculator
- The user needs to **calculate MFU**: evaluate the hardware utilization of large model training.
from mfu_calculator import MODEL_CONFIGS, MFUCalculator, TrainingConfig
# Use a predefined model configuration
model_config = MODEL_CONFIGS["llama-7b"]
# Training configuration
training_config = TrainingConfig(
batch_size=512,
micro_batch_size=4,
num_gpus=128,
seq_length=2048,
step_time=2.5, # time per step (seconds)
hardware_peak_flops=312, # A100 peak compute (TFLOPS)
hardware_name="A100",
)
# Calculate MFU
calculator = MFUCalculator(model_config, training_config)
print(calculator.generate_report())
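Calculating MFU requires three kinds of information. The following describes how to obtain each one.
Read the model architecture parameters from a HuggingFace-format config.json file: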
{
"hidden_size": 4096,
"num_hidden_layers": 94,
"vocab_size": 151936,
"num_attention_heads": 64,
"num_key_value_heads": 4,
"head_dim": 128,
"intermediate_size": 12288,
"moe_intermediate_size": 1536,
"num_experts_per_tok": 8,
"num_experts": 128
}
Parameter mapping:
| config.json field | MFU parameter | Description |
|---|---|---|
| hidden_size | hidden_size | Hidden dimension |
| num_hidden_layers | layer_num | Number of layers |
| vocab_size | vocab_size | Vocabulary size |
| num_attention_heads | head_num | Number of attention heads |
| num_key_value_heads | kv_head_num | Number of KV heads (GQA) |
| head_dim | head_dim | Per-head dimension |
| intermediate_size | intermediate_size | FFN intermediate size |
| moe_intermediate_size | expert_hidden_size | MoE expert intermediate size |
| num_experts_per_tok | topk | Number of activated experts per token |
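Read the training configuration from the launch script (for example run_train.sh or the arguments passed to train.py):
Common launch script examples:
# Example 1: launching with torchrun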
torchrun --nproc_per_node=8 --nnodes=4 train.py \
--seq_length 8192 \
--global_batch_size 32 \
...
# Example 2: launching with accelerate
accelerate launch --num_processes 32 train.py \
--max_seq_length 8192 \
--train_batch_size 32 \
...
Parameter mapping:
| Launch script argument | MFU parameter | Description |
|---|---|---|
| --seq_length / --max_seq_length | seq_length | Training sequence length |
| --global_batch_size / --train_batch_size | gbs | Global batch size |
| --nproc_per_node × --nnodes / --num_processes | num_gpus | Total number of GPUs/NPUs |
Computing the number of GPUs/NPUs:
# Single node, multiple devices
num_gpus = nproc_per_node
# Multiple nodes, multiple devices
num_gpus = nproc_per_node * nnodes
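The device count can also be read straight off the launch command. The sketch below is one possible way to do that; `count_devices` is a hypothetical helper, not part of `mfu_calculator`, and it assumes the process and node counts are given as plain integers (not values such as `gpu`, `auto`, or an elastic range like `1:4`).

```python
import shlex

def count_devices(launch_command: str) -> int:
    """Derive the total device count from a torchrun or accelerate command line."""
    tokens = shlex.split(launch_command)
    options = {}
    for i, tok in enumerate(tokens):
        if not tok.startswith("--"):
            continue
        # Accept both "--key=value" and "--key value".
        key, sep, value = tok[2:].partition("=")
        if not sep:
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            value = "" if nxt.startswith("--") else nxt
        # Treat --nproc-per-node and --nproc_per_node as the same option.
        options[key.replace("-", "_")] = value
    if "num_processes" in options:  # accelerate launch
        return int(options["num_processes"])
    # torchrun: a missing --nnodes means a single node.
    return int(options["nproc_per_node"]) * int(options.get("nnodes", 1))

print(count_devices("torchrun --nproc_per_node=8 --nnodes=4 train.py --seq_length 8192"))  # 32
```

Pass the command as a single line; shell line continuations are best joined before parsing.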
The user provides the time per step (step_time) directly:
Extract step_time from the training logs:
Log example 1 (Transformers format):
[2024-01-15 10:23:45] step=100 loss=2.345 learning_rate=1e-4 step_time=4.4s
[2024-01-15 10:23:50] step=101 loss=2.342 learning_rate=1e-4 step_time=4.3s
Log example 2 (Megatron format):
iteration 100/ 1000 | elapsed time per iteration (ms): 4400 | ...
iteration 101/ 1000 | elapsed time per iteration (ms): 4350 | ...
Log example 3 (PyTorch format):
Step 100: loss=2.345, time=4.40s, throughput=1861.82 tokens/s/GPU
Extraction method:
import re
def extract_step_time(log_file):
    """Extract the average step time (seconds) from a log file."""
    times = []
    with open(log_file, 'r') as f:
        for line in f:
            # Match "step_time=4.4s" or "time=4.40s".
            # Megatron-style lines report milliseconds and are not matched here.
            match = re.search(r'\b(?:step_)?time[=:]\s*(\d+(?:\.\d+)?)\s*s?', line)
            if match:
                times.append(float(match.group(1)))
    # Average over the steady phase (skip the first 10 warm-up steps)
    return sum(times[10:]) / len(times[10:]) if len(times) > 10 else None
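Megatron-style logs report the iteration time in milliseconds, so they need their own pattern and a unit conversion. The following is a minimal sketch under that assumption; `extract_megatron_step_time` and its `warmup_steps` parameter are hypothetical names, and the pattern is written only against the log example shown earlier.

```python
import re

# Matches e.g. "elapsed time per iteration (ms): 4400"
MEGATRON_PATTERN = re.compile(r"elapsed time per iteration \(ms\):\s*(\d+(?:\.\d+)?)")

def extract_megatron_step_time(lines, warmup_steps=10):
    """Return the mean step time in seconds after warm-up, or None if no samples remain."""
    times = []
    for line in lines:
        match = MEGATRON_PATTERN.search(line)
        if match:
            times.append(float(match.group(1)) / 1000.0)  # ms -> s
    steady = times[warmup_steps:]
    return sum(steady) / len(steady) if steady else None

sample = [
    "iteration 100/ 1000 | elapsed time per iteration (ms): 4400 | ...",
    "iteration 101/ 1000 | elapsed time per iteration (ms): 4350 | ...",
]
print(extract_megatron_step_time(sample, warmup_steps=0))  # 4.375
```

The function takes any iterable of lines, so an open file object can be passed directly.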
Default: Ascend910B2 = 353 TFLOPS
How to determine the hardware type:
echo $ASCEND_DEVICE_TYPE
npu-smi info
nvidia-smi --query-gpu=name --format=csv
agent_skills/mfu-calculator/
├── SKILL.md # This file
├── reference/
│ └── mfu_reference.md # MFU calculation reference
└── scripts/
└── mfu_calculator.py # MFU calculator implementation
Dense models:
MoE models:
from mfu_calculator import MODEL_CONFIGS, MFUCalculator, TrainingConfig
# List the supported predefined models
print(MODEL_CONFIGS.keys())
# Output: dict_keys(['llama-7b', 'llama-13b', 'llama-70b', 'qwen-7b', 'qwen-72b', 'mixtral-8x7b'])
# Select a model
model_config = MODEL_CONFIGS["llama-70b"]
# Configure the training parameters
training_config = TrainingConfig(
batch_size=1024,
micro_batch_size=8,
num_gpus=256,
seq_length=2048,
step_time=3.8,
hardware_peak_flops=989,
hardware_name="H100",
)
# Calculate MFU
calculator = MFUCalculator(model_config, training_config)
print(calculator.generate_report())
from mfu_calculator import ModelConfig, MFUCalculator, TrainingConfig
# Custom dense model
model_config = ModelConfig(
hidden_size=8192,
num_layers=80,
vocab_size=128000,
seq_length=4096,
num_attention_heads=64,
num_key_value_heads=8, # GQA
intermediate_size=22016,
ffn_type="swiglu",
)
# Training configuration
training_config = TrainingConfig(
batch_size=2048,
micro_batch_size=16,
num_gpus=512,
seq_length=4096,
step_time=5.2,
hardware_peak_flops=313,
hardware_name="Ascend910B",
)
# Calculate MFU
calculator = MFUCalculator(model_config, training_config)
mfu = calculator.calculate_mfu()
print(f"MFU: {mfu*100:.2f}%")
from mfu_calculator import ModelConfig, MFUCalculator, TrainingConfig
# MoE model configuration
model_config = ModelConfig(
hidden_size=4096,
num_layers=32,
vocab_size=32000,
seq_length=2048,
num_attention_heads=32,
intermediate_size=14336,
ffn_type="swiglu",
is_moe=True,
num_experts=8,
num_experts_per_tok=2,
expert_intermediate_size=14336,
)
# Training configuration
training_config = TrainingConfig(
batch_size=512,
num_gpus=128,
seq_length=2048,
step_time=4.1,
hardware_peak_flops=312,
hardware_name="A100",
)
# Calculate MFU
calculator = MFUCalculator(model_config, training_config)
print(calculator.generate_report())
import json
from mfu_calculator import ModelConfig, MFUCalculator, TrainingConfig, cal_flops_simple, cal_mfu_simple
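# 1. Read the model configuration from config.json
with open("config.json", "r") as f:
    config = json.load(f)
# 2. Training configuration, taken from the launch script or supplied by the user
seq_length = 8192 # from the --seq_length argument
gbs = 32 # from the --global_batch_size argument
num_gpus = 32 # computed as nproc_per_node * nnodes
step_time = 4.4 # from the training log or supplied by the user
# 3. Calculate MFU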
flops = cal_flops_simple(
hidden_size=config["hidden_size"],
expert_hidden_size=config.get("moe_intermediate_size", config.get("intermediate_size", 4 * config["hidden_size"])),
head_num=config["num_attention_heads"],
kv_head_num=config.get("num_key_value_heads", config["num_attention_heads"]),
sequence_length=seq_length,
layer_num=config["num_hidden_layers"],
vocab_size=config["vocab_size"],
topk=config.get("num_experts_per_tok", 1),
gbs=gbs,
head_dim=config.get("head_dim", config["hidden_size"] // config["num_attention_heads"])
)
# Use the Ascend910B2 peak compute by default
hw_flops = 353 * 1e12
mfu = cal_mfu_simple(
real_flops=flops,
num_gpu=num_gpus,
sec_per_step=step_time,
hw_flops_per_gpu=hw_flops
)
print(f"Total model FLOPs: {flops/1e15:.2f} PFLOPs")
print(f"MFU: {mfu * 100:.2f}%")
import json
from mfu_calculator import cal_flops_simple, cal_mfu_simple
# Qwen3 MoE config.json
config = {
"hidden_size": 4096,
"num_hidden_layers": 94,
"vocab_size": 151936,
"num_attention_heads": 64,
"num_key_value_heads": 4,
"head_dim": 128,
"intermediate_size": 12288,
"moe_intermediate_size": 1536,
"num_experts_per_tok": 8,
"num_experts": 128
}
# Training configuration (read from the launch script)
seq_length = 8192
gbs = 32
num_gpus = 32
step_time = 4.4 # supplied by the user or read from the log
# Calculate FLOPs
flops = cal_flops_simple(
hidden_size=config["hidden_size"],
expert_hidden_size=config["moe_intermediate_size"],
head_num=config["num_attention_heads"],
kv_head_num=config["num_key_value_heads"],
sequence_length=seq_length,
layer_num=config["num_hidden_layers"],
vocab_size=config["vocab_size"],
topk=config["num_experts_per_tok"],
gbs=gbs,
head_dim=config["head_dim"]
)
# Calculate MFU (default: Ascend910B2)
hw_flops = 353 * 1e12
mfu = cal_mfu_simple(
real_flops=flops,
num_gpu=num_gpus,
sec_per_step=step_time,
hw_flops_per_gpu=hw_flops
)
print(f"Total model FLOPs: {flops/1e15:.2f} PFLOPs")
print(f"MFU: {mfu * 100:.2f}%")
# Output:
# Total model FLOPs: 9.83 PFLOPs
# MFU: 19.78%
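Formula:
MFU = achieved model FLOPS / theoretical hardware peak FLOPS = FLOPs per step / (step_time × num_gpus × peak FLOPS per device)
Rating scale:
| MFU range | Rating | Description |
|---|---|---|
| ≥40% | Excellent | Very high hardware utilization |
| 30-40% | Good | Fairly high hardware utilization |
| 20-30% | Fair | Room for optimization |
| <20% | Needs optimization | Clear performance problems |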
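As a worked check of the MFU definition and the rating scale, the sketch below recomputes the Qwen3 MoE example (9.83 PFLOPs per step, 4.4 s per step, 32 devices, 353 TFLOPS peak). `compute_mfu` and `rate_mfu` are hypothetical helpers written for illustration, not functions of `mfu_calculator`; the choice to treat each range's lower bound as inclusive is an assumption, since the table does not say which band a boundary value belongs to.

```python
def compute_mfu(step_flops, step_time, num_gpus, peak_tflops):
    """MFU = FLOPs per step / (step_time * num_gpus * peak FLOPS per device)."""
    return step_flops / (step_time * num_gpus * peak_tflops * 1e12)

def rate_mfu(mfu):
    """Map an MFU fraction (0..1) to the rating scale; lower bounds are inclusive."""
    if mfu >= 0.40:
        return "Excellent"
    if mfu >= 0.30:
        return "Good"
    if mfu >= 0.20:
        return "Fair"
    return "Needs optimization"

mfu = compute_mfu(step_flops=9.83e15, step_time=4.4, num_gpus=32, peak_tflops=353)
print(f"MFU: {mfu * 100:.2f}% ({rate_mfu(mfu)})")  # MFU: 19.78% (Needs optimization)
```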
Formula:
Throughput = (gbs × seq_length) / (step_time × num_gpus)
Unit: tokens/s/GPU
Formula:
Cluster_Throughput = (Throughput × num_gpus × 3600 × 24) / 10^12
Unit: T tokens/day
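The two throughput formulas can be checked with a few lines of arithmetic. This is a minimal sketch; the function names are hypothetical and the inputs (global batch size 512, sequence length 2048, 2.5 s per step, 128 devices) are taken from the llama-7b quick-start configuration.

```python
def throughput_per_device(gbs, seq_length, step_time, num_gpus):
    """Tokens processed per second per device."""
    return (gbs * seq_length) / (step_time * num_gpus)

def cluster_throughput_t_tokens_per_day(per_device, num_gpus):
    """Cluster-wide throughput in trillions of tokens per day."""
    return per_device * num_gpus * 3600 * 24 / 1e12

per_device = throughput_per_device(gbs=512, seq_length=2048, step_time=2.5, num_gpus=128)
cluster = cluster_throughput_t_tokens_per_day(per_device, num_gpus=128)
print(f"{per_device:.2f} tokens/s/GPU")  # 3276.80 tokens/s/GPU
print(f"{cluster:.4f} T tokens/day")     # 0.0362 T tokens/day
```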
| GPU model | FP16/BF16 TFLOPS |
|---|---|
| A100 | 312 |
| H100 | 989 |
| V100 | 125 |
| RTX4090 | 165 |
| NPU model | FP16 TFLOPS | Notes |
|---|---|---|
| Ascend910 | 256 | TSMC 7nm EUV |
| Ascend910A | 256 | TSMC 7nm EUV |
| Ascend910A2 | 256 | TSMC 7nm EUV |
| Ascend910B | 320 | SMIC N+1 |
| Ascend910B1 | 320 | SMIC N+1 |
| Ascend910B2 | 353 | Default |
| Ascend910B3 | 353 | SMIC N+1 |
| Ascend910C | 800 | Dual-die package |
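One way to use the two tables in a script is a simple lookup with the documented Ascend910B2 default as the fallback. The sketch below only copies the values listed in the tables; `PEAK_TFLOPS` and `peak_flops` are hypothetical names and are not claimed to match anything inside `mfu_calculator`.

```python
# Peak FP16/BF16 throughput in TFLOPS, copied from the tables above.
PEAK_TFLOPS = {
    "A100": 312, "H100": 989, "V100": 125, "RTX4090": 165,
    "Ascend910": 256, "Ascend910A": 256, "Ascend910A2": 256,
    "Ascend910B": 320, "Ascend910B1": 320, "Ascend910B2": 353,
    "Ascend910B3": 353, "Ascend910C": 800,
}
DEFAULT_DEVICE = "Ascend910B2"

def peak_flops(device_name=None):
    """Return peak FLOPS (not TFLOPS); unknown or missing names use the default device."""
    tflops = PEAK_TFLOPS.get(device_name, PEAK_TFLOPS[DEFAULT_DEVICE])
    return tflops * 1e12

print(peak_flops("A100"))  # 312000000000000.0
print(peak_flops())        # 353000000000000.0
```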
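Standard attention:
FLOPs = 8BSH² + 4BS²H
GQA (Grouped Query Attention):
FLOPs = BSH² × (4 + 4 × kv_heads/num_heads) + 4BS²H
Standard FFN:
FLOPs = 4BSH × intermediate_size
SwiGLU FFN:
FLOPs = 6BSH × intermediate_size
MoE + SwiGLU:
FLOPs = 6BSH × activated_experts × expert_intermediate_size
Output logits (forward and backward combined):
FLOPs = 6BSH × vocab_size
FLOPs per layer = attention_flops + ffn_flops
Model FLOPs = FLOPs per layer × num_layers
FLOPs per step = 3 × Model FLOPs + logits_flops
Note: the backward pass is assumed to cost 2 × the forward-pass FLOPs, so the per-layer forward formulas are multiplied by 3, while logits_flops already includes the backward pass. B is the global batch size, S the sequence length, and H the hidden size. The GQA formula reduces to the standard attention formula when kv_heads = num_heads.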
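The formulas can be combined into a short calculation. The sketch below is an illustration under stated assumptions, not the implementation in `mfu_calculator.py`: all function names are hypothetical, and the GQA term scales both the K and V projections so that it equals the standard 8BSH² when kv_heads equals num_heads. The inputs are a 7B-class dense SwiGLU configuration with a global batch size of 512.

```python
def attention_flops(B, S, H, num_heads, kv_heads):
    """Forward FLOPs of one attention layer.

    Q and O projections cost 4*B*S*H^2; K and V projections cost
    4*B*S*H^2 scaled by kv_heads/num_heads.
    """
    kv_ratio = kv_heads / num_heads
    projections = B * S * H * H * (4 + 4 * kv_ratio)
    scores = 4 * B * S * S * H  # QK^T plus the attention-weighted sum over V
    return projections + scores

def ffn_flops(B, S, H, intermediate, swiglu=True, active_experts=1):
    """Forward FLOPs of one FFN layer; active_experts > 1 models MoE top-k routing."""
    factor = 6 if swiglu else 4
    return factor * B * S * H * intermediate * active_experts

def step_flops(B, S, H, num_layers, vocab_size, num_heads, kv_heads,
               intermediate, swiglu=True, active_experts=1):
    """FLOPs of one training step: 3 x forward layer FLOPs plus the logits term."""
    layer = (attention_flops(B, S, H, num_heads, kv_heads)
             + ffn_flops(B, S, H, intermediate, swiglu, active_experts))
    logits = 6 * B * S * H * vocab_size  # forward and backward combined
    return 3 * layer * num_layers + logits

flops = step_flops(B=512, S=2048, H=4096, num_layers=32, vocab_size=32000,
                   num_heads=32, kv_heads=32, intermediate=11008)
print(f"{flops:.3e}")  # 4.495e+16
```

With 2.5 s per step on 128 devices at 312 TFLOPS peak, this step count corresponds to an MFU of about 45%.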
============================================================
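MFU Calculation Report
============================================================
[Model Configuration]
- Model type: Dense
- Hidden size: 4096
- Layers: 32
- Attention heads: 32
- KV heads: 32
- Sequence length: 2048
- Vocabulary size: 32000
- FFN type: swiglu
- FFN intermediate size: 11008
[Training Configuration]
- Global batch size: 512
- Micro batch size: 4
- Number of GPUs: 128
- Time per step: 2.500 s
- Hardware: A100
- Hardware peak: 312 TFLOPS
[FLOPs Analysis]
- FLOPs per training step: 4.49e+16
- Achieved FLOPS: 1.80e+16 (140.46 TFLOPS per GPU)
[Performance Metrics]
- MFU: 45.02%
- Per-device throughput: 3276.80 tokens/s/GPU
- Cluster throughput: 0.036 T tokens/day
[Performance Assessment]
- MFU rating: Excellent (≥40%)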
============================================================
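Possible causes:
Solutions:
Recommendations:
Notes:
About the num_experts_per_tok parameter:
kv_heads/num_heads ratio
import time
# Measure several steps and take the average.
# train_step() stands for your own training step; on GPU/NPU, synchronize the
# device before reading the clock so that asynchronous kernels are included.
step_times = []
for i in range(10):
    start = time.perf_counter()
    train_step()
    step_times.append(time.perf_counter() - start)
avg_step_time = sum(step_times) / len(step_times)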
# Choose micro_batch_size according to device memory and hardware
# Rule of thumb: as large as possible without running out of memory (OOM)
# A100 80GB example
micro_batch_size = 8 # for a 7B model
# H100 80GB example
micro_batch_size = 16 # for a 7B model
# Compare MFU across different hardware
configs = [
    ("A100", 312),
    ("H100", 989),
    ("Ascend910B", 313),
]
for name, peak_flops in configs:
    training_config.hardware_name = name
    training_config.hardware_peak_flops = peak_flops
    calculator = MFUCalculator(model_config, training_config)
    mfu = calculator.calculate_mfu()
    print(f"{name}: MFU = {mfu*100:.2f}%")
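reference/mfu_reference.md
scripts/mfu_calculator.py
This skill provides a complete MFU calculation solution:
With this skill, you can accurately evaluate the hardware utilization of large model training and guide performance optimization work.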