Skill

model-compute-simulation

Builds an operator-level compute template for an LLM, estimating FLOPs, tensor shapes, MFU, and parallelism trade-offs for serving configurations.

Python

backend

ai-ml

npx claudepluginhub bbuf/ai-infra-auto-driven-skills --plugin ai-infra-auto-driven-skills

Popularity

Stars

542

Forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/ai-infra-auto-driven-skills:model-compute-simulation

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Use this when the question is about operator order, tensor dimensions, FLOPs,

Supporting Files

references/gpu-specs.json
references/model-config-index.json
scripts/extract_compute_flow_from_trace.py
scripts/model_compute_simulator.py

SKILL.md

264 lines · ~3.3k tokens

Stats

Language: Python

Stars: 542

Forks: 49

Maintenance: Excellent

Last Commit: Jun 8, 2026

Model Compute Simulation

Overview

Use this when the question is about operator order, tensor dimensions, FLOPs, MFU, or parallelism checks. The simulator loads a model config, builds the representative operator sequence, prints tensor shapes and FLOPs, and can estimate MFU from measured latency.

Confirmation Required

Before running a simulation, collect or verify these inputs:

Item	Why it matters	How to obtain	Default if user skips
Model name	Resolves to config in `model-config-index.json`; determines entire architecture	Ask user or infer from trace context	— (required)
Config accuracy	Indexed values may differ from actual serving config (e.g. `routed_expert_intermediate_size`, `compress_ratios`)	Ask user to provide `config.json` or verify key params against HuggingFace	Use indexed values with a caveat
GPU type	Determines peak FLOPS for MFU denominator	Ask user	— (required for MFU)
dtype (bf16 / fp8)	Affects peak FLOPS selection; fp8 doubles peak	Ask user	bf16
Batch size & seq len	Directly affects FLOPs and tensor shapes	Ask user	B=1, S=1 (decode)
TP / DP / EP	TP splits GEMM FLOPs across GPUs; EP splits expert FLOPs	Ask user	TP=8, DP=1, EP=8
Measured latency (ms)	Needed to compute achieved FLOPS (the MFU numerator); must be per-GPU forward-pass wall-clock time	Ask user or extract from a profiler trace	— (optional, no MFU without it)

If the model is not in model-config-index.json, ask the user for a config.json path or add an indexed config before running estimates.
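The defaults in the table above can be sketched as a small input record. This is a hypothetical helper, not part of the skill's scripts: the `SimInputs` class and `missing_inputs` function are illustrative names, and only the default values come from the table.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SimInputs:
    """Hypothetical container for the inputs listed in the table."""
    model: Optional[str] = None          # required, no default
    gpu: Optional[str] = None            # required only when MFU is requested
    dtype: str = "bf16"                  # table default
    batch_size: int = 1                  # table default (decode)
    seq_len: int = 1                     # table default (decode)
    tp: int = 8                          # table default
    dp: int = 1                          # table default
    ep: int = 8                          # table default
    measured_ms: Optional[float] = None  # optional, no MFU without it


def missing_inputs(inputs: SimInputs) -> List[str]:
    """Return the names of inputs that still have to be collected."""
    missing = []
    if not inputs.model:
        missing.append("model")
    # MFU needs a GPU type to pick the peak FLOPS denominator.
    if inputs.measured_ms is not None and not inputs.gpu:
        missing.append("gpu")
    return missing
```

Calling `missing_inputs` before a run gives a short list of questions to put to the user instead of silently falling back to defaults for required items.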

Workflow

Step 1: Load model config

Resolve the model name and load its configuration parameters:

python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "<model name>" --list-models

The script resolves the model name against references/model-config-index.json, which stores public HuggingFace config parameters (hidden_size, num_experts, MLA ranks, etc.).

If the model is not indexed, tell the user to provide a config.json path or request an index update.

Step 2: Generate execution flow and tensor dimensions

Run the simulator with batch size, sequence length, and parallelism configuration:

python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "Qwen3-235B-A22B" \
  --batch-size 1 --seq-len 1 \
  --tp 8 --dp 1 --ep 8 \
  --gpu h20 --dtype bf16

The simulator prints:

Per-layer operator sequence with FLOPs and tensor shapes (shape_in → shape_out)
Attention vs MoE/FFN FLOPs proportion per layer
Total model FLOPs for a single forward pass

For decode: use --seq-len 1. For prefill: use --seq-len <prompt_length>.
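To see why sequence length and TP change the numbers so much, here is a minimal sketch of projection-GEMM FLOPs. It assumes the common convention of 2·M·K·N FLOPs for an [M, K] × [K, N] matmul and an even split across TP ranks; the simulator's own accounting may differ, and the 4096 × 12288 weight shape is a made-up example rather than a value from any indexed model.

```python
def gemm_flops(m: int, k: int, n: int) -> int:
    """FLOPs for an [m, k] x [k, n] matmul, counting multiply and add."""
    return 2 * m * k * n


def per_gpu_gemm_flops(batch: int, seq_len: int, k: int, n: int, tp: int) -> int:
    """Per-GPU FLOPs, assuming the GEMM is split evenly across tp ranks."""
    tokens = batch * seq_len  # rows of the activation matrix
    return gemm_flops(tokens, k, n) // tp


# Hypothetical projection: K=4096 input features, N=12288 output features.
decode_flops = per_gpu_gemm_flops(batch=1, seq_len=1, k=4096, n=12288, tp=8)
prefill_flops = per_gpu_gemm_flops(batch=1, seq_len=8192, k=4096, n=12288, tp=8)
print(decode_flops, prefill_flops)
```

Under these assumptions the prefill case costs exactly 8192 times the decode case, because GEMM FLOPs scale linearly with the token count.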

Step 3: Estimate MFU with measured latency

Provide the measured forward-pass latency to compute MFU:

python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "Qwen3-235B-A22B" \
  --batch-size 1 --seq-len 1 \
  --tp 8 --dp 1 --ep 8 \
  --gpu h20 --dtype bf16 \
  --measured-ms 15.0

MFU = theoretical_min_time / measured_time × 100%

The simulator prints:

Overall MFU
Per-layer MFU (uniform layer-time assumption)
Per-operator FLOPs proportion (for identifying which ops dominate)

GPU peak FLOPS are loaded from references/gpu-specs.json. The bundled hardware table includes H20, H100 SXM 80GB, H200 SXM 141GB, and B200 SXM 180GB. Use aliases such as --gpu h100, --gpu h200, or --gpu b200 to target those GPUs.
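The MFU formula above can be worked through by hand. The sketch below assumes the theoretical minimum time is per-GPU FLOPs divided by peak FLOPS; the function names are illustrative, and the FLOPs and peak values are placeholders, not entries from gpu-specs.json.

```python
def theoretical_min_ms(flops_per_gpu: float, peak_tflops: float) -> float:
    """Time in ms to execute flops_per_gpu at the GPU's peak rate."""
    return flops_per_gpu / (peak_tflops * 1e12) * 1e3


def mfu_percent(flops_per_gpu: float, peak_tflops: float, measured_ms: float) -> float:
    """MFU = theoretical_min_time / measured_time x 100%."""
    return theoretical_min_ms(flops_per_gpu, peak_tflops) / measured_ms * 100.0


# Placeholder inputs: 1e12 FLOPs per GPU, 100 TFLOPS peak, 15 ms measured.
theo_ms = theoretical_min_ms(1.0e12, 100.0)
mfu = mfu_percent(1.0e12, 100.0, 15.0)
print(f"theoretical {theo_ms:.2f} ms, MFU {mfu:.1f}%")
```

With these placeholder values the theoretical time is 10 ms, so a 15 ms measurement corresponds to roughly 66.7% MFU.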

Step 4: Per-operator MFU with kernel-level latency

When you have per-kernel measured latency, compute per-operator MFU by mapping kernel durations to the compute flow.

Method A: `--kernel-flow` (kernel-level MFU, recommended)

Provide per-kernel detail as JSON, then feed it to the simulator for kernel-level MFU analysis. This preserves every kernel row from the compute flow and adds FLOPs/MFU columns.

python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "Qwen3-235B-A22B" \
  --batch-size 1 --seq-len 8192 \
  --tp 8 --dp 1 --ep 8 \
  --gpu h20 --dtype bf16 \
  --kernel-flow @/tmp/layer3_detail.json

The --kernel-flow parameter accepts a JSON string or @file path. It produces a kernel-level MFU table that preserves all kernel rows from the compute flow and adds:

Mapped Op: which operator this kernel maps to
FLOPs: operator's total FLOPs
Theo(us): theoretical minimum time
MFU%: measured FLOPs utilization
shape_in→shape_out: operator tensor dimensions

When --kernel-flow is provided, the static per-operator template is omitted because the kernel-level MFU table already carries per-kernel shape and FLOPs information. The output keeps the model summary, serving configuration, total FLOPs, and kernel-level MFU table.

Mapping rules:

Direct-match kernels (mla, moe, mhc, rmsnorm, hadamard, rope, quant, topk, etc.): time is assigned directly to the corresponding operators
Generic GEMM kernels (gemm_fp8, gemm_bf16): time is distributed to remaining unassigned projection GEMM operators by FLOPs share
Overhead kernels (allreduce, moe_align, moe_sort, other): rows preserved, FLOPs/MFU marked as N/A
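The FLOPs-share rule for generic GEMM kernels can be sketched as follows. This is an illustration of proportional distribution under stated assumptions, not the simulator's code; the operator names and FLOPs values are hypothetical.

```python
from typing import Dict


def distribute_by_flops(total_us: float, op_flops: Dict[str, float]) -> Dict[str, float]:
    """Split a measured duration across operators in proportion to their FLOPs."""
    total_flops = sum(op_flops.values())
    if total_flops == 0:
        # Nothing to weight by, so no time can be attributed.
        return {name: 0.0 for name in op_flops}
    return {name: total_us * flops / total_flops for name, flops in op_flops.items()}


# Hypothetical projection GEMMs still unassigned after direct matching.
unassigned = {"q_proj": 3.0e9, "o_proj": 1.0e9}
assigned_us = distribute_by_flops(400.0, unassigned)
print(assigned_us)
```

Here 400 us of generic GEMM time is split 3:1, matching the operators' FLOPs ratio. The approach assumes every operator in the group runs at a similar efficiency, which is why direct-match kernels are preferred where available.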

FP8 kernel MFU correction: Kernels in categories moe (fused_moe_kernel) and gemm_fp8 use fp8 math internally even when --dtype bf16 is specified. For these kernels, the MFU denominator uses the GPU's fp8 peak FLOPS (2x bf16 peak) instead of bf16 peak. The resulting MFU is marked with a superscript ⁸ (for example, 63.7%⁸) to show that the fp8 denominator was used. gemm_bf16 kernels still use the bf16 peak FLOPS denominator.
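A minimal sketch of the denominator selection described above, assuming fp8 peak is exactly twice the bf16 peak as the text states. The helper names and the numeric inputs are illustrative only.

```python
FP8_CATEGORIES = {"moe", "gemm_fp8"}  # categories the text says use fp8 math


def peak_for_kernel(category: str, bf16_peak_tflops: float) -> float:
    """Pick the peak FLOPS used as the MFU denominator for a kernel category."""
    if category in FP8_CATEGORIES:
        return 2.0 * bf16_peak_tflops  # fp8 peak, assumed 2x bf16
    return bf16_peak_tflops


def format_kernel_mfu(category: str, flops: float, dur_us: float,
                      bf16_peak_tflops: float) -> str:
    """Return MFU as a percentage string, marked with a superscript 8 for fp8."""
    peak = peak_for_kernel(category, bf16_peak_tflops)
    theo_us = flops / (peak * 1e12) * 1e6
    mfu = theo_us / dur_us * 100.0
    marker = "⁸" if category in FP8_CATEGORIES else ""
    return f"{mfu:.1f}%{marker}"


# Placeholder kernel: 1e9 FLOPs measured at 20 us on a 100 TFLOPS (bf16) GPU.
print(format_kernel_mfu("gemm_bf16", 1.0e9, 20.0, 100.0))
print(format_kernel_mfu("gemm_fp8", 1.0e9, 20.0, 100.0))
```

The same FLOPs and duration give half the MFU once the fp8 denominator applies, which is why the marker matters when comparing rows.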

Method B: `--kernel-detail` (operator-level MFU, legacy)

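Takes the same input as --kernel-flow but outputs an operator-level summary table (aggregated by operator rather than per kernel). Use it when you want a compact view.

The example below uses the related --kernel-ms flag, which takes only category-level totals: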

python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "Qwen3-235B-A22B" \
  --batch-size 1 --seq-len 8192 \
  --tp 8 --dp 1 --ep 8 \
  --gpu h20 --dtype bf16 \
  --kernel-ms '{
    "mla": 4.922, "moe": 1.644, "allreduce": 0.769,
    "hadamard": 0.348, "mhc": 1.388, "gemm_fp8": 1.692,
    "gemm_bf16": 0.125, "rmsnorm": 0.227, "quant": 0.311,
    "rope": 0.209, "topk": 0.122, "activation": 0.071,
    "other": 0.437
  }'

The --kernel-ms parameter accepts a JSON object mapping kernel category names to their measured durations in milliseconds. It uses FLOPs-proportional distribution across entire categories, which is less precise than --kernel-detail because generic GEMM categories (gemm_fp8, gemm_bf16) span multiple operator categories.
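When the category totals come from another tool, building the JSON by hand is error-prone. The sketch below assembles the argument list programmatically, using only the flags shown in the command above; `build_kernel_ms_argv` is a hypothetical helper, and the resulting list would be passed to something like `subprocess.run`.

```python
import json
from typing import Dict, List

SCRIPT = "skills/model-compute-simulation/scripts/model_compute_simulator.py"


def build_kernel_ms_argv(model: str, kernel_ms: Dict[str, float], *,
                         batch_size: int, seq_len: int,
                         tp: int, dp: int, ep: int,
                         gpu: str, dtype: str) -> List[str]:
    """Build the simulator command line with --kernel-ms as a JSON object."""
    for name, ms in kernel_ms.items():
        if ms < 0:
            raise ValueError(f"negative duration for kernel category {name!r}")
    return [
        "python3", SCRIPT, model,
        "--batch-size", str(batch_size), "--seq-len", str(seq_len),
        "--tp", str(tp), "--dp", str(dp), "--ep", str(ep),
        "--gpu", gpu, "--dtype", dtype,
        "--kernel-ms", json.dumps(kernel_ms),
    ]


argv = build_kernel_ms_argv(
    "Qwen3-235B-A22B",
    {"mla": 4.922, "moe": 1.644, "allreduce": 0.769},
    batch_size=1, seq_len=8192, tp=8, dp=1, ep=8, gpu="h20", dtype="bf16",
)
print(" ".join(argv[:3]))
```

Passing the list form to a process launcher avoids shell quoting problems with the embedded JSON.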

Output includes:

Model architecture summary (layers, hidden_size, attention_type, MoE config)
Per-layer compute flow: operator sequence with tensor dimensions, FLOPs, shape_in→shape_out
Per-operator MFU table: each operator's FLOPs, theoretical time, measured time (from trace), MFU%
Kernel → operator mapping explanation (direct-match vs FLOPs-proportional vs overhead)
Overall and per-layer MFU

When To Use It

when you need compute-level detail for a known model or config
when the user asks about execution flow, tensor dimensions, or FLOPs for a specific serving shape
when the user asks about MFU and can provide measured forward-pass latency
when comparing compute profiles across different parallelism configurations

Useful Commands

List known model IDs:

python3 skills/model-compute-simulation/scripts/model_compute_simulator.py --list-models

List known GPU types:

python3 skills/model-compute-simulation/scripts/model_compute_simulator.py --list-gpus

Emit JSON for automation:

python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "GLM-5" --format json

Reporting Checklist

Include:

Model architecture summary: model name, config source, num_layers, hidden_size, attention_type, MoE config (num_experts, topk, shared_experts), MHC, head_dim
Serving configuration: batch_size, seq_len, TP, DP, EP, GPU, dtype
Per-layer compute flow (showing first representative layer in detail):
- Operator sequence table: name, category, FLOPs, shape_in → shape_out
- Attention vs MoE/FFN FLOPs proportion
Total model FLOPs for a single forward pass
Kernel-level MFU table (when --kernel-flow provided):
- Preserves ALL kernel rows from the compute flow (never deleted)
- Per-kernel columns: # | Half | Category | Simplified Name | dur(us) | % | Mapped Op | FLOPs | Theo(us) | MFU% | shape_in→shape_out
- Direct-match kernels: show mapped operator FLOPs/MFU
- Overhead kernels: show N/A for FLOPs/MFU, row preserved
Operator-level MFU table (when --kernel-detail or --measured-ms provided):
- Each operator: name, category, total FLOPs, per-GPU FLOPs, theoretical time, measured time (from trace), MFU%
- Kernel category → operator mapping explained
Overall MFU and per-layer MFU
One-line summary: dominant compute category, MFU status, key bottleneck

Trace-Based Validation (extract_compute_flow_from_trace.py)

Use scripts/extract_compute_flow_from_trace.py to extract the real operator sequence and tensor dimensions from a torch profiler trace, then compare them against the static template, treating the trace as ground truth.

# Extract compute flow from a trace
python3 skills/model-compute-simulation/scripts/extract_compute_flow_from_trace.py \
  --input /path/to/trace.json.gz --format text

# Compare trace against static template
python3 skills/model-compute-simulation/scripts/extract_compute_flow_from_trace.py \
  --input /path/to/trace.json.gz \
  --compare qwen3-235b-a22b \
  --batch-size 1 --seq-len 1 --tp 8 --ep 8

Compute Flow Confirmation Hierarchy

When the static template or trace extraction cannot fully confirm the compute process (e.g. ambiguous scope, missing shapes, new model architecture), follow this escalation hierarchy:

Static template (model_compute_simulator.py + model-config-index.json) — fast, covers known models
Trace extraction (extract_compute_flow_from_trace.py) — validates template against real execution
Inference framework source code — when trace is insufficient (missing Input Dims, CUDA Graph replay, compiled kernels without scope), read the model's forward flow directly from the serving framework source:
- SGLang: python/sglang/srt/models/<model_name>.py — contains the forward() method with the exact operator sequence, tensor shapes, and parallelism split logic
- vLLM: vllm/model_executor/models/<model_name>.py
- TensorRT-LLM: tensorrt_llm/_torch/pyexecutor/py_executor.py + model config files
When consulting framework source, focus on:
- The forward() method: operator call order and residual connections
- QKV / O projection: whether LoRA-style down/up projections are used (q_lora_rank, o_lora_rank)
- MoE routing: top-k selection, shared vs routed expert split
- TP/EP slicing: which dimensions are split and how FLOPs divide across GPUs
- Any model-specific ops not in the static template (e.g. MHC, Hadamard, indexer)
Action: If the framework source reveals discrepancies with the static template, update model-config-index.json and/or build_layer_ops() accordingly.

Limitations of Trace Extraction

Limitation	Detail	Workaround
`record_shapes=True` required	Trace must be captured with shape recording enabled; without it, `Input Dims` fields are absent and FLOPs cannot be computed	SGLang live capture and vLLM `torch_profiler_with_stack=true` already enable this; TensorRT-LLM requires a `py_executor.py` override adding `record_shapes=True`
CUDA Graph mode	During graph replay, `cpu_op` events may only appear once (at capture time); shape information for replayed iterations is not re-recorded	The script detects graph capture phases and annotates affected ops; use eager-mode traces for full coverage
TP-sliced dimensions	Trace shows post-TP-split dimensions (e.g. `H/TP`), not the full-model view	Use `--tp` in `--compare` mode to scale trace FLOPs back to full-model equivalents
Scope attribution quality	Python scope depends on `with_stack=True`; some frameworks or compiled paths may produce shallow or missing scope chains	Graceful degradation: ops with unresolved scope are categorized as "other"
Not a replacement for static templates	Trace extraction is a validation and discovery tool; static templates remain the primary fast-analysis path	Use trace extraction to verify templates for new models, then update `model-config-index.json` if discrepancies are found
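Before running the extraction script, it can help to check whether a trace was captured with shape recording at all. The sketch below is a standalone check, separate from extract_compute_flow_from_trace.py. It assumes the Chrome trace layout that the torch profiler exports: a top-level `traceEvents` list whose `cpu_op` events carry `Input Dims` under `args`. The function names are illustrative.

```python
import gzip
import json
from typing import Any, Dict, List, Tuple


def load_trace_events(path: str) -> List[Dict[str, Any]]:
    """Load events from a .json or .json.gz Chrome-format trace."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        data = json.load(f)
    # Traces are usually an object with "traceEvents"; a bare list is also valid.
    return data.get("traceEvents", []) if isinstance(data, dict) else data


def shape_coverage(events: List[Dict[str, Any]]) -> Tuple[int, int]:
    """Return (cpu_op events with Input Dims, total cpu_op events)."""
    cpu_ops = [e for e in events if e.get("cat") == "cpu_op"]
    with_dims = [e for e in cpu_ops if "Input Dims" in e.get("args", {})]
    return len(with_dims), len(cpu_ops)


# Synthetic events standing in for a real trace.
sample_events = [
    {"cat": "cpu_op", "name": "aten::mm", "args": {"Input Dims": [[1, 4096], [4096, 512]]}},
    {"cat": "cpu_op", "name": "aten::add", "args": {}},
    {"cat": "kernel", "name": "gemm_kernel", "args": {}},
]
print(shape_coverage(sample_events))
```

A result of zero shaped events out of many `cpu_op` events suggests the trace was captured without `record_shapes=True` and needs to be recaptured.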

References

references/model-config-index.json: model configuration parameters (hidden_size, expert counts, MLA ranks, etc.).
references/gpu-specs.json: GPU peak FLOPS specifications for MFU calculation.
scripts/extract_compute_flow_from_trace.py: trace-based compute flow extraction and template validation tool.
