From ai-infra-auto-driven-skills
Inspects LLM torch profiler traces at forward-pass, layer, and kernel level. Outputs timing tables for Perfetto navigation and layer-level analysis.
npx claudepluginhub bbuf/ai-infra-auto-driven-skills --plugin ai-infra-auto-driven-skills
Slash command
/ai-infra-auto-driven-skills:llm-pipeline-analysis
Use this when a whole-trace profiler summary is too coarse. The scripts read a Chrome-trace JSON file, find layer-boundary anchor kernels, group kernels into forward passes and layers, and print timing tables you can use for Perfetto navigation or detailed timing analysis.
compress_ratios like DeepSeek-V4 NSA)

Before running scripts, collect or verify these inputs:
| Item | Why it matters | How to obtain | Default if user skips |
|---|---|---|---|
| Model name | Determines which config.json to use; affects layer classification | Ask user | — (required) |
| Model profile | Determines anchor kernel, blocks-per-layer, and kernel classification rules | Ask user or auto-infer from config | Auto-inferred from config |
| config.json path | Provides compress_ratios, num_hidden_layers, num_hash_layers, etc. | Ask user or search filesystem | — (required) |
| GPU type | Optional context for reports and hardware notes | Ask user | — |
| TP / EP | Parallelism config affects kernel naming and AllReduce count | Ask user or infer from trace filename (e.g. TP-0) | TP=8, EP=8 |
| Serving mode | Decode vs prefill changes kernel mix and FLOPs profile | Ask user | decode B=1 |
If the user cannot provide config.json, search common locations such as
/root/workspace/*/config.json and the HuggingFace cache. If it is still not
available, require an explicit --profile.
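The filesystem search described above could look roughly like the sketch below. The `/root/workspace/*/config.json` pattern comes from the text; the HuggingFace pattern assumes the default hub cache layout under `~/.cache/huggingface/hub`, which will differ if `HF_HOME` is set. The function name `find_config_candidates` and the `model_hint` filter are hypothetical and not part of the skill's scripts.

```python
import glob
import os
from typing import Iterable, List, Optional

# Assumed search patterns: the workspace glob is from the text above; the
# HuggingFace glob assumes the default cache location and snapshot layout.
DEFAULT_PATTERNS = (
    "/root/workspace/*/config.json",
    os.path.expanduser("~/.cache/huggingface/hub/models--*/snapshots/*/config.json"),
)


def find_config_candidates(
    model_hint: Optional[str] = None,
    patterns: Iterable[str] = DEFAULT_PATTERNS,
) -> List[str]:
    """Return config.json paths matching the patterns, optionally filtered by a name hint."""
    found: List[str] = []
    for pattern in patterns:
        found.extend(sorted(glob.glob(pattern)))
    if model_hint:
        hint = model_hint.lower()
        # Case-insensitive substring match on the full path (hypothetical filter)
        found = [path for path in found if hint in path.lower()]
    return found


if __name__ == "__main__":
    candidates = find_config_candidates()
    if candidates:
        for path in candidates:
            print(path)
    else:
        print("No config.json found; pass an explicit --profile instead.")
```

If several candidates match, it is probably safer to show them to the user than to pick one silently, since the config drives layer classification.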
Scripts use ModelProfile to determine layer boundary detection and kernel
classification. Profiles are auto-inferred from config.json or selected
via --profile:
| Profile | Anchor kernel | Blocks/layer | Layer structure | Auto-infer condition |
|---|---|---|---|---|
| dsv4_csa_hca | mhc_post_tilelang | 2 | attn + ffn halves | compress_ratios non-empty |
| dsv3_mla | flash_fwd_mla_combine | 1 | full layer | kv_lora_rank > 0 |
| generic | auto-detect or --anchor-kernel | 1 | full layer | fallback |
Use --profile generic --anchor-kernel YOUR_KERNEL for models not covered
by built-in profiles.
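As a minimal sketch of the auto-infer conditions in the table, the function below maps a parsed config.json to a profile name. The order of the checks (compress_ratios first, then kv_lora_rank) is an assumption; the scripts' actual ModelProfile logic may differ, and `infer_profile` is a hypothetical name.

```python
import json
from pathlib import Path
from typing import Any, Dict


def infer_profile(config: Dict[str, Any]) -> str:
    """Return a profile name using the auto-infer conditions from the table."""
    # dsv4_csa_hca: compress_ratios is present and non-empty
    if config.get("compress_ratios"):
        return "dsv4_csa_hca"
    # dsv3_mla: kv_lora_rank is a positive number
    if (config.get("kv_lora_rank") or 0) > 0:
        return "dsv3_mla"
    # Fallback when neither condition holds
    return "generic"


def infer_profile_from_file(config_path: str) -> str:
    """Load a config.json file and infer the profile from its fields."""
    return infer_profile(json.loads(Path(config_path).read_text()))


if __name__ == "__main__":
    print(infer_profile({"compress_ratios": [0, 0, 4, 128]}))  # dsv4_csa_hca
    print(infer_profile({"kv_lora_rank": 512}))                # dsv3_mla
    print(infer_profile({"hidden_size": 4096}))                # generic
```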
- torch.profiler trace in Chrome-trace JSON format (.json or .json.gz)
- config.json (for profile inference, compress_ratios, etc.)
- --anchor-kernel)

The scripts use an anchor kernel as a layer-boundary marker. The anchor and layer structure are determined by the active ModelProfile.
For example, with the dsv4_csa_hca profile, each transformer layer produces
2 consecutive mhc_post_tilelang calls:
mhc_post_tilelang ← end of attn half (attention + O-proj + AllReduce)
... ffn computation ...
mhc_post_tilelang ← end of ffn half (MoE experts + AllReduce)
... next layer attn ...
mhc_post_tilelang ← next layer's attn boundary
So for N layers with the dsv4_csa_hca profile, one forward pass has 2N
anchor blocks. With dsv3_mla or generic, each layer has 1 block.
Forward pass P starts at block index P * (N * blocks_per_layer).
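The block-index arithmetic can be sketched as follows. It assumes anchor blocks are counted from 0 and that the trace begins exactly at a forward-pass boundary; the helper names are hypothetical and not taken from the scripts.

```python
from typing import NamedTuple


class BlockPosition(NamedTuple):
    fwd_pass: int  # forward pass index P
    layer: int     # layer index within the pass
    block: int     # block index within the layer (0 = attn half for dsv4_csa_hca)


def pass_start_block(fwd_pass: int, num_layers: int, blocks_per_layer: int) -> int:
    """First anchor-block index of forward pass P: P * (N * blocks_per_layer)."""
    return fwd_pass * num_layers * blocks_per_layer


def locate_block(block_index: int, num_layers: int, blocks_per_layer: int) -> BlockPosition:
    """Map a global anchor-block index back to (pass, layer, block-in-layer)."""
    blocks_per_pass = num_layers * blocks_per_layer
    fwd_pass, offset = divmod(block_index, blocks_per_pass)
    layer, block = divmod(offset, blocks_per_layer)
    return BlockPosition(fwd_pass, layer, block)


if __name__ == "__main__":
    # Hypothetical model: 4 layers, 2 blocks per layer (as in dsv4_csa_hca)
    print(pass_start_block(5, num_layers=4, blocks_per_layer=2))  # 40
    print(locate_block(43, num_layers=4, blocks_per_layer=2))
    # BlockPosition(fwd_pass=5, layer=1, block=1)
```

With `blocks_per_layer=1` (dsv3_mla or generic), `block` is always 0 and the mapping reduces to a plain pass/layer split.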
layer_timeline_analyzer.py — Per-layer timeline and cluster stats
# Show all forward passes summary (cold-start vs steady-state)
python3 scripts/layer_timeline_analyzer.py \
--trace /path/to/TP-0.trace.json.gz \
--config /path/to/config.json \
--show-all-passes
# Detailed per-layer breakdown for a specific forward pass
python3 scripts/layer_timeline_analyzer.py \
--trace /path/to/TP-0.trace.json.gz \
--config /path/to/config.json \
--fwd-pass 5
# Auto-select first steady-state pass
python3 scripts/layer_timeline_analyzer.py \
--trace /path/to/TP-0.trace.json.gz \
--config /path/to/config.json
The script prints:
layer_kernel_breakdown.py — Per-layer kernel detail and compute flow
# Single layer kernel dump
python3 scripts/layer_kernel_breakdown.py \
--trace /path/to/TP-0.trace.json.gz \
--config /path/to/config.json \
--fwd-pass 5 --layer 3
# Compute flow format (with model architecture summary and category column)
python3 scripts/layer_kernel_breakdown.py \
--trace /path/to/TP-0.trace.json.gz \
--config /path/to/config.json \
--fwd-pass 5 --layer 3 --format compute-flow
# JSON export
python3 scripts/layer_kernel_breakdown.py \
--trace /path/to/TP-0.trace.json.gz \
--config /path/to/config.json \
--fwd-pass 5 --layer 3 --format json
# Compare two layers side-by-side
python3 scripts/layer_kernel_breakdown.py \
--trace /path/to/TP-0.trace.json.gz \
--config /path/to/config.json \
--fwd-pass 5 --layer 2 --compare-layer 3
Output formats:
- --format text (default): grouped summary + top hot kernels ranked by duration, with simplified names and percentages
- --format compute-flow: model architecture summary + per-kernel hotness table with Category, %, and ts_rel(ms) columns
- --format json: machine-readable per-kernel detail ranked by duration

perfetto_time_mapper.py — Perfetto UI time navigation
# Show all forward pass time ranges in Perfetto
python3 scripts/perfetto_time_mapper.py \
--trace /path/to/TP-0.trace.json.gz \
--config /path/to/config.json
# Layer-level time ranges for a specific forward pass
python3 scripts/perfetto_time_mapper.py \
--trace /path/to/TP-0.trace.json.gz \
--config /path/to/config.json \
--fwd-pass 5 --layers 2,3,38,42
The script prints:
python3 scripts/layer_timeline_analyzer.py \
--trace $TRACE --config $CONFIG --show-all-passes
Read the "all-passes" table. The first pass is cold-start (few tokens). Find the first pass where layer-0 wall-clock stabilizes (typically pass 3-5).
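One way to make "stabilizes" concrete is to take the layer-0 wall-clock values from the all-passes table and look for the first pass whose following passes stay within a small relative tolerance. The sketch below is an illustrative heuristic only: the 5% tolerance, the 3-pass window, and the sample durations are assumptions, and it is not necessarily how the script's auto-selection works.

```python
from typing import Optional, Sequence


def first_steady_pass(
    layer0_us: Sequence[float],
    rel_tol: float = 0.05,
    window: int = 3,
) -> Optional[int]:
    """Return the first pass index whose next `window - 1` passes stay within rel_tol of it."""
    for start in range(len(layer0_us) - window + 1):
        base = layer0_us[start]
        if base <= 0:
            continue  # skip empty or invalid measurements
        segment = layer0_us[start : start + window]
        if all(abs(value - base) / base <= rel_tol for value in segment):
            return start
    return None  # no stable window found


if __name__ == "__main__":
    # Made-up layer-0 durations in microseconds, one entry per forward pass
    durations = [950.0, 610.0, 480.0, 402.0, 398.0, 401.0, 399.0]
    print(first_steady_pass(durations))  # 3
```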
python3 scripts/layer_timeline_analyzer.py \
--trace $TRACE --config $CONFIG --fwd-pass 5
Identify:
Select 1-2 representative layers (one per bottleneck type), then:
# Human-readable compute flow table
python3 scripts/layer_kernel_breakdown.py \
--trace $TRACE --config $CONFIG \
--fwd-pass 5 --layer 3 --format compute-flow
# JSON export
python3 scripts/layer_kernel_breakdown.py \
--trace $TRACE --config $CONFIG \
--fwd-pass 5 --layer 3 --format json > /tmp/layer3_detail.json
The --format compute-flow output includes:
- # | Half | Category | Simplified Name | dur(us) | % | ts_rel(ms) | Input Dims
- Sorted by dur(us) descending by default; use ts_rel(ms) to jump back to the kernel's trace location.

python3 scripts/layer_kernel_breakdown.py \
--trace $TRACE --config $CONFIG \
--fwd-pass 5 --layer 2 --compare-layer 3
This shows the exact kernel difference between the two layer types.
python3 scripts/perfetto_time_mapper.py \
--trace $TRACE --config $CONFIG \
--fwd-pass 5 --layers 2,3,38,42
Use the printed time ranges to navigate directly in Perfetto.
The scripts classify layers based on config.json fields:
| Config field | Value | Layer Type | Description |
|---|---|---|---|
| compress_ratios[i] | 0 | FULL_ATTN | No NSA compression (layers 0-1) |
| compress_ratios[i] | 4 | C4_LIGHT | C128 sparse attention, fastest |
| compress_ratios[i] | 128 | C128_HEAVY | C4 attention + Hadamard + Indexer, bottleneck |
| i >= N - num_hash_layers | — | HASH | Hash-table routing with paged MQA |
| i == 0 | — | FIRST | First layer (empty KV cache) |
| i == N - 1 | — | FINAL | Final layer (lm_head output) |
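The table's rules can be applied to a parsed config.json as in the sketch below. Because the table does not say which rule wins when several match (for example, layer 0 is both FULL_ATTN and FIRST), this version returns every matching tag rather than assuming a precedence. `layer_tags` is a hypothetical helper, and the sample config values are made up.

```python
from typing import Any, Dict, List

# Mapping of compress_ratios[i] values to layer types, as listed in the table
RATIO_TO_TYPE = {0: "FULL_ATTN", 4: "C4_LIGHT", 128: "C128_HEAVY"}


def layer_tags(config: Dict[str, Any], layer: int) -> List[str]:
    """Return every layer-type tag from the table that applies to layer index `layer`."""
    n = config["num_hidden_layers"]
    ratios = config.get("compress_ratios") or []
    num_hash = config.get("num_hash_layers") or 0

    tags: List[str] = []
    if layer < len(ratios) and ratios[layer] in RATIO_TO_TYPE:
        tags.append(RATIO_TO_TYPE[ratios[layer]])
    if num_hash > 0 and layer >= n - num_hash:
        tags.append("HASH")
    if layer == 0:
        tags.append("FIRST")
    if layer == n - 1:
        tags.append("FINAL")
    return tags


if __name__ == "__main__":
    # Made-up 6-layer config for illustration only
    cfg = {
        "num_hidden_layers": 6,
        "compress_ratios": [0, 0, 4, 128, 4, 128],
        "num_hash_layers": 1,
    }
    for i in range(cfg["num_hidden_layers"]):
        print(i, layer_tags(cfg, i))
```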
Kernels are classified by the active ModelProfile's rules. Categories marked
with (DSv4) are specific to the dsv4_csa_hca profile; all profiles include
the universal categories.
| Category | Match Pattern | Profile | Typical Share (DSv4) |
|---|---|---|---|
| ★ MLA Attention | flash_fwd_splitkv_mla | DSv4, DSv3 | 21-33% |
| ★ MoE Fused | fused_moe_kernel | DSv4, DSv3 | 11-17% |
| ● NCCL AllReduce | AllReduce | universal | 5-8% |
| GEMM fp8 | deep_gemm | universal | 12-25% |
| GEMM bf16 | nvjet | universal | 11-13% |
| Hadamard Xform | hadamard | DSv4 | 0-2.4% |
| Indexer Cache | indexer | DSv4 | 0-0.1% |
| Paged MQA | paged_mqa_logits | DSv4 | 0-1.8% |
| MHC | mhc_pre_gemm_sqrsum, mhc_pre_big_fuse, mhc_post_tilelang | DSv4 | 10-15% |
| C4/C128 Prefill | c4_prefill, c128_prefill | DSv4 | 0-0.3% |
| RMSNorm | RMSNorm, rms_normalize | universal | 1-2% |
| FP8 Quant | quant, Quant | universal | 1-2% |
| TopK | topk | universal | 0-0.7% |
| RoPE | deepseek_rope, fused_norm_rope | DSv4, DSv3 | 1-2% |
| Activation | silu_mul_clamp, act_and_mul | universal | 0-0.5% |
| Other | — | universal | 2-5% |
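A rough sketch of pattern-based classification using the table above: each rule is a category, its substring patterns, and the profiles it applies to. Substring matching, case sensitivity, and first-match-wins ordering are assumptions here, as are the sample kernel names; the scripts' real matching rules may be stricter. The ★ and ● markers are dropped from the category names.

```python
from typing import FrozenSet, List, Tuple

ALL = frozenset({"dsv4_csa_hca", "dsv3_mla", "generic"})  # "universal" rows
DS = frozenset({"dsv4_csa_hca", "dsv3_mla"})              # "DSv4, DSv3" rows
V4 = frozenset({"dsv4_csa_hca"})                          # "DSv4" rows

# (category, substring patterns, profiles), in table order
RULES: List[Tuple[str, Tuple[str, ...], FrozenSet[str]]] = [
    ("MLA Attention", ("flash_fwd_splitkv_mla",), DS),
    ("MoE Fused", ("fused_moe_kernel",), DS),
    ("NCCL AllReduce", ("AllReduce",), ALL),
    ("GEMM fp8", ("deep_gemm",), ALL),
    ("GEMM bf16", ("nvjet",), ALL),
    ("Hadamard Xform", ("hadamard",), V4),
    ("Indexer Cache", ("indexer",), V4),
    ("Paged MQA", ("paged_mqa_logits",), V4),
    ("MHC", ("mhc_pre_gemm_sqrsum", "mhc_pre_big_fuse", "mhc_post_tilelang"), V4),
    ("C4/C128 Prefill", ("c4_prefill", "c128_prefill"), V4),
    ("RMSNorm", ("RMSNorm", "rms_normalize"), ALL),
    ("FP8 Quant", ("quant", "Quant"), ALL),
    ("TopK", ("topk",), ALL),
    ("RoPE", ("deepseek_rope", "fused_norm_rope"), DS),
    ("Activation", ("silu_mul_clamp", "act_and_mul"), ALL),
]


def classify_kernel(name: str, profile: str = "dsv4_csa_hca") -> str:
    """Return the first category whose pattern occurs in the kernel name, else 'Other'."""
    for category, patterns, profiles in RULES:
        if profile in profiles and any(pattern in name for pattern in patterns):
            return category
    return "Other"


if __name__ == "__main__":
    # Illustrative kernel names, not copied from a real trace
    for kernel in ("mhc_post_tilelang_kernel", "ncclDevKernel_AllReduce_Sum_bf16", "memcpy_kernel"):
        print(kernel, "->", classify_kernel(kernel))
```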
Include:
config.json):
layer_timeline_analyzer.py --show-all-passes):
layer_timeline_analyzer.py --fwd-pass N):
layer_kernel_breakdown.py --format compute-flow
# | Half | Category | Simplified Name | dur(us) | % | ts_rel(ms) | Input Dims
dur(us) descending) by default
--format json)