From ai-infra-auto-driven-skills
Analyzes torch.profiler traces for sglang, vLLM, and TensorRT-LLM. Inspects existing traces or drives live profiling, returning kernel, overlap, and fuse-pattern tables.
npx claudepluginhub bbuf/ai-infra-auto-driven-skills --plugin ai-infra-auto-driven-skills
Slash command:
/ai-infra-auto-driven-skills:llm-torch-profiler-analysis
Files in this skill:
- references/fuse-overlap-catalog.md
- references/heuristics.md
- references/overlap-catalog.md
- references/source-map.md
- references/vllm-torch-compile-fusions.md
- scripts/analyze_llm_torch_profile.py
- scripts/analyze_sglang_torch_profile.py
- scripts/make_trtllm_py_executor_override.py
- scripts/probe_llm_server.py
- scripts/profile_common.py
- scripts/render_triage_markdown_bundle.py
- scripts/run_llm_single_model_matrix_host.sh
- scripts/run_sglang_torch_profile_host.sh
- scripts/run_trtllm_pytorch_profile_host.sh
- scripts/run_vllm_torch_profile_host.sh
- scripts/triage_kernel_helpers.py
- scripts/triage_overlap_helpers.py
Use this skill for torch.profiler analysis across:
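- sglang
- vllm
- TensorRT-LLM

There is only one public workflow:
- triage

Preferred unified entrypoint:
- scripts/analyze_llm_torch_profile.py

Backwards-compatibility shim (kept so older docker exec ... analyze_sglang_torch_profile.py ... calls keep working; it just forwards to the unified entrypoint):
- scripts/analyze_sglang_torch_profile.py

Markdown bundling helper:
- scripts/render_triage_markdown_bundle.py

triage always prints the same three tables:
- kernel table
- overlap table
- fuse-pattern table
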
By default, all three tables only render rows at or above 1.0% cumulative GPU-time share.
Rows below that are hidden by default unless the user asks for a lower cutoff.
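The cutoff can be pictured with a small sketch. It reads "cumulative GPU-time share" as each row's share of the total GPU time in the table, which is an assumption; the row layout (`kernel`, `gpu_time_us`) and the function name are hypothetical and are not taken from the skill's scripts.

```python
from typing import Dict, List


def filter_rows_by_share(rows: List[Dict], min_share_pct: float = 1.0) -> List[Dict]:
    """Keep rows whose share of total GPU time is at or above the cutoff."""
    total_us = sum(row["gpu_time_us"] for row in rows)
    if total_us <= 0:
        return []
    kept = []
    for row in rows:
        share_pct = 100.0 * row["gpu_time_us"] / total_us
        if share_pct >= min_share_pct:
            kept.append({**row, "share_pct": round(share_pct, 2)})
    # Largest contributors first, as a triage table would list them.
    return sorted(kept, key=lambda r: r["gpu_time_us"], reverse=True)


# Hypothetical rows: total GPU time is 1000 us.
rows = [
    {"kernel": "gemm_a", "gpu_time_us": 900.0},
    {"kernel": "rmsnorm_b", "gpu_time_us": 95.0},
    {"kernel": "tiny_copy", "gpu_time_us": 5.0},
]
visible = filter_rows_by_share(rows)
print([r["kernel"] for r in visible])
```

Lowering `min_share_pct` mirrors a user asking for a lower cutoff: the 0.5% row reappears once the threshold drops to 0.5 or below.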
Keep the fuse-pattern table source-backed and deterministic. Do not turn it into a fuzzy matcher.
If exact source-backed matching is weak but a kernel cluster is still close to a known family, add one short note after the tables with exactly one of:
- high
- medium
- low

| Capability | SGLang | vLLM | TensorRT-LLM |
|---|---|---|---|
| Existing trace triage | yes | yes | yes |
| Single-trace live capture | yes | yes, if torch profiler is enabled on server | requires profiler control endpoints |
| Two-trace mapping+formal triage | yes | yes | yes |
| Stage-separated live workload | yes | yes | yes, with a writable shared trace dir or per-stage host runner |
| --profile-by-stage capture | yes | no | no |
| --profile-prefix control | yes | usually ignored on HTTP profiler route | usually ignored on HTTP profiler route |
For TensorRT-LLM, live capture only works when the server exposes /start_profile and
/stop_profile, and when the deployment already provides a shared trace path plus the
required env vars.
The current reference run is the 4x H100 matrix captured on 2026-04-23 on
h100_sglang under:
- /data/bbuf/validate/unified_llm_profiler_skill/runs/20260423_h100_large_model_matrix_v3

Rendered markdown bundle:
- /data/bbuf/validate/unified_llm_profiler_skill/runs/20260423_h100_large_model_matrix_v3/h100_large_model_matrix_v3_bundle.md

Validated model directories:
- mixtral_8x7b_instruct
- qwen2_5_32b_instruct
- qwen3_32b

Each model directory contains:
- analysis_sglang.txt
- analysis_vllm.txt
- analysis_trtllm.txt

Validated matrix:
| Model | SGLang | vLLM | TensorRT-LLM | Result |
|---|---|---|---|---|
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 4x H100 | 4x H100 | 4x H100 | three tables rendered correctly on all three frameworks; benchmark probes returned direct, non-empty text |
| Qwen/Qwen2.5-32B-Instruct | 4x H100 | 4x H100 | 4x H100 | three tables rendered correctly on all three frameworks; benchmark probes returned direct, non-empty text |
| Qwen/Qwen3-32B | 4x H100 | 4x H100 | 4x H100 | three tables rendered correctly on all three frameworks; vLLM and TensorRT-LLM chat probes often emitted <think> prefixes |
Use this run as the main H100 reference.
The older 2026-04-22 single-card Qwen3 matrix is still useful for bring-up, but it is
not the default reference anymore.
Stage-separated workload validation captured on 2026-05-01 on h100_sglang:
- /data/bbuf/validate/unified_llm_profiler_skill/runs/20260501_stage_split_validation
- /data/bbuf/validate/unified_llm_profiler_skill/runs/20260501_stage_split_validation_large

Validated models:
| Model | GPU | Workloads | Result |
|---|---|---|---|
| Qwen/Qwen2.5-0.5B-Instruct | 1x H100 | prefill 4090->1, decode 1->2048 | generated separate prefill/*.trace.json.gz and decode/*.trace.json.gz; kernel, overlap, and fuse tables rendered with separate extend/prefill and decode sections |
| Qwen/Qwen2.5-1.5B-Instruct | 1x H100 | prefill 4090->1, decode 1->2048 | generated separate prefill/*.trace.json.gz and decode/*.trace.json.gz; kernel, overlap, and fuse tables rendered with separate extend/prefill and decode sections |
| Qwen/Qwen2.5-7B-Instruct | 1x H100 | prefill 4090->1, decode 1->2048 | generated separate traces; prefill kernel table captured 28-layer GEMM/FA3/RMSNorm work, decode captured 5-step graph launches, and fuse rows were split by stage |
| Qwen/Qwen2.5-14B-Instruct | 1x H100 | prefill 4090->1, decode 1->2048 | generated separate traces; prefill kernel table captured 48-layer GEMM/FA3/RMSNorm work, decode captured 5-step graph launches, and fuse rows were split by stage |
| Qwen/Qwen3-8B | 2x H100, TP=2 | prefill 4090->1, decode 1->2048, warmup 10/capture 5 | generated separate prefill/decode traces and all three tables; unique probe prompts avoided prefix-cache pollution in the prefill table |
| mistralai/Mistral-7B-Instruct-v0.3 | 2x H100, TP=2 | prefill 4090->1, decode 1->2048, warmup 10/capture 5 | generated separate prefill/decode traces and all three tables; server logs showed no repeated-prompt prefix-cache shortcut during the active prefill window |
This validation also covers the compatibility fix for older SGLang profiler
state machines: workload-separated live capture labels stages by output
directory and avoids nesting SGLang's internal profile_by_stage state machine
inside each workload. The helper
adds one internal scheduler guard step because SGLang increments forward_ct
before checking whether the profiler should stop; without that guard, a
num_steps=1 prefill capture can stop just before the actual prefill forward.
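The off-by-one can be reproduced with a toy state machine. This is a minimal sketch of the ordering described above (increment the counter, check the stop condition, then run the forward pass); it is not SGLang's scheduler code, and `captured_forwards` and `guard_steps` are hypothetical names.

```python
def captured_forwards(num_steps: int, guard_steps: int, total_iters: int = 10) -> int:
    """Count forward passes recorded by a toy profiler.

    Toy model only: the counter is incremented and the stop condition is
    checked before the forward pass of the same iteration runs.
    """
    forward_ct = 0
    target_ct = forward_ct + num_steps + guard_steps
    profiling = True
    recorded = 0
    for _ in range(total_iters):
        forward_ct += 1
        if profiling and forward_ct >= target_ct:
            profiling = False  # stop fires before this iteration's forward
        if profiling:
            recorded += 1  # this forward pass lands inside the trace
    return recorded


# Without the guard, a one-step capture records nothing; with it, one step.
print(captured_forwards(num_steps=1, guard_steps=0))
print(captured_forwards(num_steps=1, guard_steps=1))
```

In this model the guard step restores the requested count for any `num_steps`, because the stop check always fires one iteration early.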
The 2026-05-01 two-card validation artifacts for the additional models are:
- /data/bbuf/validate/core_skill_validation_20260501/qwen3_8b/profiler
- /data/bbuf/validate/core_skill_validation_20260501/mistral_7b_instruct_v03/profiler

To render a validated run into one markdown document:
python3 scripts/render_triage_markdown_bundle.py \
--analysis-root /data/bbuf/validate/unified_llm_profiler_skill/runs/20260423_h100_large_model_matrix_v3 \
--output /data/bbuf/validate/unified_llm_profiler_skill/runs/20260423_h100_large_model_matrix_v3/h100_large_model_matrix_v3_bundle.md
The bundle groups by model and keeps the three tables for each framework.
H100 notes:
- extend/prefill and decode sections when the trace contains a clean stage split
- sglang.profiler
- Mixtral-8x7B-Instruct-v0.1, Qwen3-32B, and nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8
- --output-dir to match the server torch_profiler_dir; the validated H100 flow uses --profiler-config {"profiler":"torch","torch_profiler_dir":"..."} and then drives /start_profile and /stop_profile
- --backend pytorch; the H100 flow writes the trace with TLLM_TORCH_PROFILE_TRACE and then analyzes the saved trace
- 7021547 on 2026-05-15; the newer commits did not touch profiling/server code, so the b9e1945 profiler evidence still applies: PyTorch profiling uses record_shapes=True and with_modules=True, but not with_stack=True; keep the override path for table-quality Python locations unless the target image proves otherwise
- /data/..., not /home/...
- torch.profiler trace or profile directory from sglang, vllm, or TensorRT-LLM

For diffusion benchmark or profiling work, only analyze traces produced by the native SGLang diffusion backend.
If the run that generated the trace logs any of:
- Falling back to diffusers backend
- Using diffusers backend
- Loaded diffusers pipeline

stop the workflow instead of analyzing the trace. Handle it as a backend-selection issue, not as native-kernel profiler evidence.
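A pre-flight check for the three log markers above could look like the sketch below. The marker strings come from this document; the function name and the sample log lines are hypothetical, and the check assumes the markers appear verbatim as substrings of the run log.

```python
from typing import Iterable, Optional

# Marker strings listed above; any one of them means the run did not use
# the native SGLang diffusion backend.
DIFFUSERS_FALLBACK_MARKERS = (
    "Falling back to diffusers backend",
    "Using diffusers backend",
    "Loaded diffusers pipeline",
)


def find_diffusers_fallback(log_lines: Iterable[str]) -> Optional[str]:
    """Return the first fallback marker found in the log, or None."""
    for line in log_lines:
        for marker in DIFFUSERS_FALLBACK_MARKERS:
            if marker in line:
                return marker
    return None


# Hypothetical log excerpt for illustration.
sample_log = [
    "[info] loading model weights",
    "[warn] Falling back to diffusers backend for this pipeline",
]
hit = find_diffusers_fallback(sample_log)
if hit is not None:
    print(f"stop: backend-selection issue ({hit!r})")
```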
Live capture must not use one mixed prompt as the default.
By default, analyze_llm_torch_profile.py --url ... captures two labeled
workloads and then renders the same three tables with separate stage sections:
- prefill: input length 4090, output length 1
- decode: input length 1, output length 2048

Every live profiler path warms up 10 steps before arming the profiler and then
captures 5 active steps by default. Keep this warmup/active split aligned
across SGLang, vLLM, and TensorRT-LLM before comparing kernel tables.
Use these options to override the contract when the benchmark workload is known:
--profile-workload both \
--warmup-steps 10 --num-steps 5 \
--prefill-input-len 4090 --prefill-output-len 1 \
--decode-input-len 1 --decode-output-len 2048
Allowed --profile-workload values:
- both: default; capture prefill and decode separately
- prefill: capture only the long-input / one-token workload
- decode: capture only the one-input / long-output workload
- legacy: keep the old --probe-prompt / --probe-max-new-tokens behavior

For sglang-sota-humanize-loop, do not use the defaults if the slow SGLang
benchmark scenario has a known input/output distribution.
Set the profiler lengths from that slow scenario instead: prefill uses the slow
input length with output 1, and decode uses input 1 with the slow output
length. For a mixed dataset, profile the slowest representative bucket such as
the p50 or p95 input/output pair used in the benchmark report, and record the
bucket in the artifact notes.
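The mapping from a slow benchmark bucket to profiler lengths can be sketched as a small helper. The flag names are the ones shown above; the helper name and the example bucket (8192 input tokens, 512 output tokens) are hypothetical.

```python
from typing import List


def workload_flags(slow_input_len: int, slow_output_len: int,
                   warmup_steps: int = 10, num_steps: int = 5) -> List[str]:
    """Build workload flags from the slow scenario's input/output lengths."""
    if slow_input_len < 1 or slow_output_len < 1:
        raise ValueError("input and output lengths must be positive")
    return [
        "--profile-workload", "both",
        "--warmup-steps", str(warmup_steps),
        "--num-steps", str(num_steps),
        # Prefill: slow input length, single output token.
        "--prefill-input-len", str(slow_input_len),
        "--prefill-output-len", "1",
        # Decode: single input token, slow output length.
        "--decode-input-len", "1",
        "--decode-output-len", str(slow_output_len),
    ]


# Hypothetical p95 bucket from a benchmark report.
flags = workload_flags(slow_input_len=8192, slow_output_len=512)
print(" ".join(flags))
```

Recording the chosen bucket next to the generated flags keeps the artifact notes and the capture command in sync.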
python3 scripts/analyze_llm_torch_profile.py \
--input /path/to/profile_dir_or_trace.json.gz
Use this when one trace is enough. The overlap table stays conservative in single-trace mode and will tell you when a mapping/formal pair is needed.
python3 scripts/analyze_llm_torch_profile.py \
--framework sglang \
--url http://127.0.0.1:30000 \
--output-dir /data/bbuf/validate/unified_llm_profiler_skill/runs/example/sglang_profile_live \
--num-steps 5 \
--warmup-steps 10 \
--profile-by-stage \
--profile-workload both
The script sends POST /start_profile to the SGLang server directly.
Keep --output-dir under /data/... so later analysis and docs can see the trace.
The script writes server_args.json, warms up with the same workload shape,
sends the active probe requests after profiling is armed, captures separate
prefill/ and decode/ profile roots by default, and waits longer for trace
flush than the earlier implementation.
For the default workload-separated capture, the directory name labels the stage
and the SGLang internal profile_by_stage mode is not used inside each
workload. This avoids mixing a one-token prefill probe with a separate decode
profile. The helper still adds one internal guard step because older SGLang
profilers check the target counter before running the next forward.
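After a workload-separated capture, it can help to confirm that both stage directories actually received a trace before running triage. The sketch below assumes traces sit directly under `prefill/` and `decode/` with the `*.trace.json.gz` suffix mentioned in the validation table; the helper names are hypothetical and the temporary directory only stands in for a real `--output-dir`.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def collect_stage_traces(output_dir: str) -> Dict[str, List[Path]]:
    """Map each stage name to the trace files found in its directory."""
    root = Path(output_dir)
    return {
        stage: sorted((root / stage).glob("*.trace.json.gz"))
        for stage in ("prefill", "decode")
    }


def missing_stages(output_dir: str) -> List[str]:
    """List stages whose directory has no trace file yet."""
    return [s for s, traces in collect_stage_traces(output_dir).items() if not traces]


# Demo against a throwaway directory that only has a prefill trace.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "prefill").mkdir()
    (Path(tmp) / "prefill" / "rank0.trace.json.gz").touch()
    demo_missing = missing_stages(tmp)
print(demo_missing)
```

A non-empty result is a hint that trace flush has not finished or that one workload never armed the profiler.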
Launch vLLM with torch profiler enabled, for example:
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--profiler-config '{"profiler":"torch","torch_profiler_dir":"/data/bbuf/validate/unified_llm_profiler_skill/runs/example/vllm_profile"}'
Then run:
python3 scripts/analyze_llm_torch_profile.py \
--framework vllm \
--url http://127.0.0.1:8000 \
--output-dir /data/bbuf/validate/unified_llm_profiler_skill/runs/example/vllm_profile \
--num-steps 5 \
--warmup-steps 10 \
--no-profile-by-stage \
--profile-workload both
For vLLM, --output-dir must point to the same torch_profiler_dir the server uses.
The current vLLM profiler config already defaults torch_profiler_with_stack=true,
so the runner only needs to set torch_profiler_dir.
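The directory-match requirement is easy to get wrong when paths differ only by a trailing slash. Here is a minimal sketch of a check, assuming the `--profiler-config` JSON has the `profiler` and `torch_profiler_dir` keys shown above; the function name and the example path are hypothetical.

```python
import json
import os


def profiler_dir_matches(profiler_config_json: str, output_dir: str) -> bool:
    """Check that the server's torch_profiler_dir equals the runner's --output-dir."""
    config = json.loads(profiler_config_json)
    server_dir = config.get("torch_profiler_dir")
    if config.get("profiler") != "torch" or not server_dir:
        return False
    # normpath ignores trailing slashes and redundant separators.
    return os.path.normpath(server_dir) == os.path.normpath(output_dir)


server_config = '{"profiler":"torch","torch_profiler_dir":"/data/example/vllm_profile"}'
print(profiler_dir_matches(server_config, "/data/example/vllm_profile/"))
print(profiler_dir_matches(server_config, "/data/example/other_dir"))
```

This compares path strings only; it does not resolve symlinks or container mount differences, so a matching result still assumes both sides see the same filesystem.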
On h100_sglang, external vLLM containers should mount both:
- /data/.cache/huggingface:/root/.cache/huggingface
- /data/bbuf/validate/unified_llm_profiler_skill:/data/bbuf/validate/unified_llm_profiler_skill

Use TensorRT-LLM live capture only when the server exposes POST /start_profile and POST /stop_profile,
and the trace path is shared with the current machine.
Typical env expectations are:
- TLLM_PROFILE_START_STOP=<start>-<stop> such as 10-20
- TLLM_TORCH_PROFILE_TRACE=/shared/path/trace.json or .json.gz

Then run:
python3 scripts/analyze_llm_torch_profile.py \
--framework trtllm \
--url http://127.0.0.1:8000 \
--output-dir /shared/path \
--num-steps 5 \
--no-profile-by-stage \
--profile-workload both
If the deployment does not expose the profiler control endpoints, fall back to analyzing
an existing trace instead of trying live capture.
If the TensorRT-LLM trace output is configured as one fixed file path, use
scripts/run_trtllm_pytorch_profile_host.sh --stage prefill and --stage decode
instead of direct --profile-workload both, so each stage gets its own trace file.
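Before attempting TensorRT-LLM live capture, the two environment values described above can be sanity-checked. This sketch covers only the single `<start>-<stop>` form and the `.json` / `.json.gz` suffixes shown in this document; whether TensorRT-LLM accepts other forms is not covered here, and the helper names are hypothetical.

```python
from typing import Tuple


def parse_start_stop(value: str) -> Tuple[int, int]:
    """Parse a '<start>-<stop>' iteration range such as '10-20'."""
    start_text, sep, stop_text = value.partition("-")
    if not sep or not start_text.isdigit() or not stop_text.isdigit():
        raise ValueError(f"expected '<start>-<stop>', got {value!r}")
    start, stop = int(start_text), int(stop_text)
    if stop < start:
        raise ValueError(f"stop ({stop}) is before start ({start})")
    return start, stop


def trace_path_ok(path: str) -> bool:
    """Accept only the trace suffixes named in this document."""
    return path.endswith((".json", ".json.gz"))


print(parse_start_stop("10-20"))
print(trace_path_ok("/shared/path/trace.json.gz"))
```

When the trace path is one fixed file, a check like this pairs naturally with the per-stage host runner, since each stage then needs its own validated path.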
On the current TensorRT-LLM mainline path, py_executor.py creates the torch profiler
with record_shapes=True and with_modules=True but not with_stack=True.
For table-quality validation, use the override generator:
python3 scripts/make_trtllm_py_executor_override.py \
--source /path/to/original/py_executor.py \
--output /data/bbuf/validate/unified_llm_profiler_skill/overrides/trtllm/py_executor_with_stack.py
The matrix runner does this automatically on H100 before TensorRT-LLM capture starts.
This is the validated TensorRT-LLM flow on h100_sglang:
- trtllm-serve with TLLM_PROFILE_START_STOP=<start>-<stop> and TLLM_TORCH_PROFILE_TRACE=/data/.../trace.json
- --input /data/.../trace.json

python3 scripts/analyze_llm_torch_profile.py \
--mapping-input /path/to/graph_off_profile_dir \
--formal-input /path/to/graph_on_profile_dir
Use this when you need stronger overlap attribution and kernel-to-source mapping.
python3 scripts/analyze_llm_torch_profile.py \
--framework sglang \
--mapping-url http://127.0.0.1:31025 \
--formal-url http://127.0.0.1:31026 \
--num-steps 5 \
--profile-by-stage
For vllm or TensorRT-LLM, use the same shape but pass:
- --framework vllm or --framework trtllm
- --mapping-output-dir ...
- --formal-output-dir ...
- --no-profile-by-stage

profile_by_stage
- --profile-by-stage is only meaningful on the SGLang live-capture path.
- With --profile-workload both / prefill / decode, workload directories are the stage labels; the live-capture helper disables SGLang's internal stage profiler per workload, warms up first, and captures the requested active step count for the selected workload.
- profile_by_stage is still useful because prefill and decode usually have very different bottlenecks.
- profile_by_stage.
- For vllm and TensorRT-LLM, disable it with --no-profile-by-stage.

Use when you want the lowest-friction report:
Prefer this by default.
Use when you need:
Do not call the mapping pass a "fast profile".
It exists to recover kernel -> cpu_op -> python scope.
- --profile-workload both; use legacy only when reproducing an old trace contract.
- --profile-by-stage mainly for legacy or manually collected traces.
- h100_sglang, create or clean the target trace directory through docker exec sglang_bbuf ... so the path is definitely writable under /data.
- triage for the three-table report.
- PR-backed / in-flight sections. Prefer reporting:
high, medium, or low only.
Base that note on the full pattern shape, not on one kernel name alone.
Prefer semantic cues such as producer-consumer chain, source locations, CPU op names, TP context, and model-specific structure.
Do not rewrite the script table itself to include these heuristic judgments.

Load these only when needed:
Return:
high / medium / low when exact matching is inconclusive