From ai-infra-auto-driven-skills
Builds an operator-level compute template for an LLM, estimating FLOPs, tensor shapes, MFU, and parallelism trade-offs for serving configurations.
npx claudepluginhub bbuf/ai-infra-auto-driven-skills --plugin ai-infra-auto-driven-skills
How this skill is triggered: by the user, by Claude, or both
Slash command
/ai-infra-auto-driven-skills:model-compute-simulation
The summary Claude sees in its skill listing, used to decide when to auto-load this skill
Use this when the question is about operator order, tensor dimensions, FLOPs,
Use this when the question is about operator order, tensor dimensions, FLOPs, MFU, or parallelism checks. The simulator loads a model config, builds the representative operator sequence, prints tensor shapes and FLOPs, and can estimate MFU from measured latency.
Before running a simulation, collect or verify these inputs:
| Item | Why it matters | How to obtain | Default if user skips |
|---|---|---|---|
| Model name | Resolves to config in model-config-index.json; determines entire architecture | Ask user or infer from trace context | — (required) |
| Config accuracy | Indexed values may differ from actual serving config (e.g. routed_expert_intermediate_size, compress_ratios) | Ask user to provide config.json or verify key params against HuggingFace | Use indexed values with a caveat |
| GPU type | Determines peak FLOPS for MFU denominator | Ask user | — (required for MFU) |
| dtype (bf16 / fp8) | Affects peak FLOPS selection; fp8 doubles peak | Ask user | bf16 |
| Batch size & seq len | Directly affects FLOPs and tensor shapes | Ask user | B=1, S=1 (decode) |
| TP / DP / EP | TP splits GEMM FLOPs across GPUs; EP splits expert FLOPs | Ask user | TP=8, DP=1, EP=8 |
| Measured latency (ms) | Required for MFU numerator; must be per-GPU forward-pass wall-clock | Ask user or extract from a profiler trace | — (optional, no MFU without it) |
If the model is not in model-config-index.json, ask the user for a
config.json path or add an indexed config before running estimates.
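The "Config accuracy" check above can be sketched as a small comparison between the indexed values and a user-supplied config.json. This is a minimal illustration, not part of the skill's scripts: `diff_config` is a hypothetical helper, and the numeric values are placeholders rather than real model parameters.

```python
from typing import Any


def diff_config(
    indexed: dict[str, Any], actual: dict[str, Any], keys: list[str]
) -> dict[str, tuple[Any, Any]]:
    """Return {key: (indexed_value, actual_value)} for every key that differs."""
    mismatches: dict[str, tuple[Any, Any]] = {}
    for key in keys:
        indexed_value = indexed.get(key)
        actual_value = actual.get(key)
        if indexed_value != actual_value:
            mismatches[key] = (indexed_value, actual_value)
    return mismatches


# Placeholder values for illustration only (not taken from any real config).
indexed_cfg = {"hidden_size": 4096, "num_experts": 128,
               "routed_expert_intermediate_size": 1536}
serving_cfg = {"hidden_size": 4096, "num_experts": 128,
               "routed_expert_intermediate_size": 2048}

# Any key reported here should be resolved before trusting FLOPs estimates.
print(diff_config(indexed_cfg, serving_cfg, list(indexed_cfg)))
```

If the result is non-empty, prefer the serving config and report the estimate with a caveat, as the table's default suggests.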
Resolve the model name and load its configuration parameters:
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "<model name>" --list-models
The script resolves the model name against references/model-config-index.json, which stores public HuggingFace config parameters (hidden_size, num_experts, MLA ranks, etc.).
If the model is not indexed, tell the user to provide a config.json path or request an index update.
Run the simulator with batch size, sequence length, and parallelism configuration:
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "Qwen3-235B-A22B" \
--batch-size 1 --seq-len 1 \
--tp 8 --dp 1 --ep 8 \
--gpu h20 --dtype bf16
The simulator prints:
For decode: use --seq-len 1.
For prefill: use --seq-len <prompt_length>.
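To see why decode and prefill differ so much, the sketch below counts FLOPs for a single linear layer using the usual 2·M·K·N convention for a GEMM, with M = batch size × sequence length. It assumes TP shards the GEMM evenly across GPUs; the function name and the 4096-wide layer are hypothetical, and attention FLOPs (which scale differently with sequence length) are not covered.

```python
def per_gpu_gemm_flops(batch_size: int, seq_len: int, k: int, n: int, tp: int) -> int:
    """FLOPs per GPU for a (tokens x k) @ (k x n) GEMM split evenly over tp GPUs."""
    tokens = batch_size * seq_len      # M dimension of the GEMM
    total_flops = 2 * tokens * k * n   # one multiply + one add per MAC
    return total_flops // tp           # assumption: even TP sharding


# Hypothetical 4096 x 4096 projection, TP=8.
decode = per_gpu_gemm_flops(batch_size=1, seq_len=1, k=4096, n=4096, tp=8)
prefill = per_gpu_gemm_flops(batch_size=1, seq_len=8192, k=4096, n=4096, tp=8)

print(decode)             # FLOPs for one decode token
print(prefill // decode)  # prefill does seq_len times the GEMM work
```

For GEMMs the work grows linearly with the token count, so --seq-len 8192 produces 8192 times the per-layer GEMM FLOPs of --seq-len 1 at the same batch size.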
Provide the measured forward-pass latency to compute MFU:
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "Qwen3-235B-A22B" \
--batch-size 1 --seq-len 1 \
--tp 8 --dp 1 --ep 8 \
--gpu h20 --dtype bf16 \
--measured-ms 15.0
MFU = theoretical_min_time / measured_time × 100%
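The formula can be sketched in a few lines, assuming the theoretical minimum time is the per-GPU FLOPs divided by the GPU's peak FLOPS for the chosen dtype. The function name and the numbers are illustrative placeholders, not values from references/gpu-specs.json or from the simulator.

```python
def mfu_percent(flops_per_gpu: float, peak_flops_per_s: float, measured_ms: float) -> float:
    """MFU = theoretical_min_time / measured_time * 100."""
    if measured_ms <= 0:
        raise ValueError("measured_ms must be positive")
    theoretical_min_s = flops_per_gpu / peak_flops_per_s  # time at 100% utilization
    measured_s = measured_ms / 1e3
    return theoretical_min_s / measured_s * 100.0


# Placeholder inputs: 3e12 FLOPs per GPU, a 1e15 FLOP/s peak, 15 ms measured.
# The theoretical minimum is 3 ms, so roughly one fifth of peak is achieved.
print(f"{mfu_percent(3e12, 1e15, 15.0):.1f}%")
```

Because the measured latency sits in the denominator, an overstated latency (for example, one that includes scheduling gaps between forward passes) understates MFU.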
The simulator prints:
GPU peak FLOPS are loaded from references/gpu-specs.json. The bundled
hardware table includes H20, H100 SXM 80GB, H200 SXM 141GB, and B200 SXM
180GB. Use aliases such as --gpu h100, --gpu h200, or --gpu b200 when
running on those local boxes.
When you have per-kernel measured latency, compute per-operator MFU by mapping kernel durations to the compute flow.
--kernel-flow (kernel-level MFU, recommended)
Provide per-kernel detail as JSON, then feed it to the simulator for kernel-level MFU analysis. This preserves every kernel row from the compute flow and adds FLOPs/MFU columns.
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "Qwen3-235B-A22B" \
--batch-size 1 --seq-len 8192 \
--tp 8 --dp 1 --ep 8 \
--gpu h20 --dtype bf16 \
--kernel-flow @/tmp/layer3_detail.json
The --kernel-flow parameter accepts a JSON string or @file path. It produces
a kernel-level MFU table that preserves all kernel rows from the compute
flow and adds:
- Mapped Op: which operator this kernel maps to
- FLOPs: operator's total FLOPs
- Theo(us): theoretical minimum time
- MFU%: measured FLOPs utilization
- shape_in→shape_out: operator tensor dimensions

When --kernel-flow is provided, the static per-operator template is omitted
because the kernel-level MFU table already carries per-kernel shape and FLOPs
information. The output keeps the model summary, serving configuration, total
FLOPs, and kernel-level MFU table.
Mapping rules:
FP8 kernel MFU correction: Kernels in categories moe (fused_moe_kernel)
and gemm_fp8 use fp8 math internally even when --dtype bf16 is specified.
For these kernels, the MFU denominator uses the GPU's fp8 peak FLOPS
(2x bf16 peak) instead of bf16 peak. The resulting MFU is marked with a
superscript ⁸ (for example, 63.7%⁸) to show that the fp8 denominator was
used. gemm_bf16 kernels still use the bf16 peak FLOPS denominator.
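The correction described above amounts to choosing the peak per kernel category. The sketch below is an assumption-laden illustration rather than the simulator's code: the helper names are hypothetical, the bf16 peak is a placeholder, and the fp8 peak is taken as exactly twice the bf16 peak as stated above.

```python
# Categories that the text above says run fp8 math even under --dtype bf16.
FP8_CATEGORIES = {"moe", "gemm_fp8"}


def peak_for_kernel(category: str, bf16_peak: float) -> float:
    """Pick the MFU denominator: fp8 peak (2x bf16) for fp8 categories, else bf16."""
    return 2.0 * bf16_peak if category in FP8_CATEGORIES else bf16_peak


def kernel_mfu_label(category: str, flops: float, dur_us: float, bf16_peak: float) -> str:
    """Format per-kernel MFU, appending a superscript 8 when the fp8 peak was used."""
    theo_us = flops / peak_for_kernel(category, bf16_peak) * 1e6
    mfu = theo_us / dur_us * 100.0
    suffix = "⁸" if category in FP8_CATEGORIES else ""
    return f"{mfu:.1f}%{suffix}"


# Same FLOPs and duration, different category (placeholder peak of 1e14 FLOP/s).
print(kernel_mfu_label("gemm_bf16", flops=1e9, dur_us=20.0, bf16_peak=1e14))
print(kernel_mfu_label("gemm_fp8", flops=1e9, dur_us=20.0, bf16_peak=1e14))
```

With identical FLOPs and duration, the fp8-category kernel reports half the MFU of the bf16 one, which is why the marker matters when comparing rows.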
--kernel-detail (operator-level MFU, legacy)
Same input as --kernel-flow but outputs an operator-level summary table
(aggregated by operator, not per-kernel). Use when you want a compact view.
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "Qwen3-235B-A22B" \
--batch-size 1 --seq-len 8192 \
--tp 8 --dp 1 --ep 8 \
--gpu h20 --dtype bf16 \
--kernel-ms '{
"mla": 4.922, "moe": 1.644, "allreduce": 0.769,
"hadamard": 0.348, "mhc": 1.388, "gemm_fp8": 1.692,
"gemm_bf16": 0.125, "rmsnorm": 0.227, "quant": 0.311,
"rope": 0.209, "topk": 0.122, "activation": 0.071,
"other": 0.437
}'
The --kernel-ms parameter accepts a JSON object mapping kernel category names
to their measured durations in milliseconds. It uses FLOPs-proportional
distribution across entire categories, which is less precise than --kernel-detail
because generic GEMM categories (gemm_fp8, gemm_bf16) span multiple operator categories.
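FLOPs-proportional distribution can be sketched as follows: a category's measured time is split across the operators assigned to it in proportion to each operator's FLOPs. The function and the operator names below are hypothetical and the FLOPs values are placeholders; the simulator's actual mapping may differ.

```python
def distribute_category_ms(category_ms: float, op_flops: dict[str, float]) -> dict[str, float]:
    """Split one category's measured time across operators by FLOPs share."""
    if not op_flops:
        return {}
    total = sum(op_flops.values())
    if total == 0:
        # Assumption: with no FLOPs to weight by, fall back to an even split.
        return {op: category_ms / len(op_flops) for op in op_flops}
    return {op: category_ms * flops / total for op, flops in op_flops.items()}


# Placeholder operators sharing the gemm_fp8 category (1.692 ms measured).
shares = distribute_category_ms(1.692, {"q_proj": 1e9, "o_proj": 3e9})
for op, ms in shares.items():
    print(f"{op}: {ms:.3f} ms")
```

Every operator in a category ends up with the same implied MFU under this scheme, which is the loss of precision relative to per-kernel mapping.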
Output includes:
List known model IDs:
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py --list-models
List known GPU types:
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py --list-gpus
Emit JSON for automation:
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "GLM-5" --format json
Include:
--kernel-flow provided):
# | Half | Category | Simplified Name | dur(us) | % | Mapped Op | FLOPs | Theo(us) | MFU% | shape_in→shape_out
--kernel-detail or --measured-ms provided):
Use scripts/extract_compute_flow_from_trace.py to extract the real operator sequence and tensor dimensions from a torch profiler trace, then compare against the static template as ground truth validation.
# Extract compute flow from a trace
python3 skills/model-compute-simulation/scripts/extract_compute_flow_from_trace.py \
--input /path/to/trace.json.gz --format text
# Compare trace against static template
python3 skills/model-compute-simulation/scripts/extract_compute_flow_from_trace.py \
--input /path/to/trace.json.gz \
--compare qwen3-235b-a22b \
--batch-size 1 --seq-len 1 --tp 8 --ep 8
When the static template or trace extraction cannot fully confirm the compute process (e.g. ambiguous scope, missing shapes, new model architecture), follow this escalation hierarchy:
1. Static template (model_compute_simulator.py + model-config-index.json): fast, covers known models
2. Trace extraction (extract_compute_flow_from_trace.py): validates the template against real execution
3. Inference framework source code: when the trace is insufficient (missing Input Dims, CUDA Graph replay, compiled kernels without scope), read the model's forward flow directly from the serving framework source:
- python/sglang/srt/models/<model_name>.py: contains the forward() method with the exact operator sequence, tensor shapes, and parallelism split logic
- vllm/model_executor/models/<model_name>.py
- cpp/tensorrt_llm/pyexecutor/py_executor.cpp + model config files

When consulting framework source, focus on:

- forward() method: operator call order and residual connections
- q_lora_rank, o_lora_rank)

Action: If the framework source reveals discrepancies with the static template, update model-config-index.json and/or build_layer_ops() accordingly.
| Limitation | Detail | Workaround |
|---|---|---|
| record_shapes=True required | Trace must be captured with shape recording enabled; without it, Input Dims fields are absent and FLOPs cannot be computed | SGLang live capture and vLLM torch_profiler_with_stack=true already enable this; TensorRT-LLM requires a py_executor.py override adding record_shapes=True |
| CUDA Graph mode | During graph replay, cpu_op events may only appear once (at capture time); shape information for replayed iterations is not re-recorded | The script detects graph capture phases and annotates affected ops; use eager-mode traces for full coverage |
| TP-sliced dimensions | Trace shows post-TP-split dimensions (e.g. H/TP), not the full-model view | Use --tp in --compare mode to scale trace FLOPs back to full-model equivalents |
| Scope attribution quality | Python scope depends on with_stack=True; some frameworks or compiled paths may produce shallow or missing scope chains | Graceful degradation: ops with unresolved scope are categorized as "other" |
| Not a replacement for static templates | Trace extraction is a validation and discovery tool; static templates remain the primary fast-analysis path | Use trace extraction to verify templates for new models, then update model-config-index.json if discrepancies are found |
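The "TP-sliced dimensions" workaround in the table can be pictured as a simple rescale. This sketch assumes an operator's work is divided evenly across TP ranks; operators that are replicated on every rank should not be scaled, so the hypothetical `sharded` flag is left to the caller. It is not the --compare implementation.

```python
def full_model_flops(trace_flops_per_gpu: int, tp: int, sharded: bool = True) -> int:
    """Scale per-GPU trace FLOPs back to a full-model equivalent."""
    if tp < 1:
        raise ValueError("tp must be >= 1")
    # Assumption: a TP-sharded op does 1/tp of the full work on each GPU.
    return trace_flops_per_gpu * tp if sharded else trace_flops_per_gpu


def full_model_dim(trace_dim: int, tp: int) -> int:
    """Recover a full dimension such as H from a traced H/TP slice."""
    return trace_dim * tp


print(full_model_flops(4_194_304, tp=8))  # sharded GEMM seen on one rank
print(full_model_dim(512, tp=8))          # traced H/TP = 512 -> H = 4096
```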
- references/model-config-index.json: model configuration parameters (hidden_size, expert counts, MLA ranks, etc.).
- references/gpu-specs.json: GPU peak FLOPS specifications for MFU calculation.
- scripts/extract_compute_flow_from_trace.py: trace-based compute flow extraction and template validation tool.