Profiles GPU kernels with Nsight Compute, exports metrics/source/PM-sampling reports, compares baseline vs candidate, classifies stalls, and produces one actionable kernel edit.
npx claudepluginhub bbuf/kernel-pilot --plugin humanize
Slash command: /humanize:ncu-report
Use this skill when benchmark numbers are not enough and the next CUDA, Triton, CUTLASS/CuTe, CuTe DSL, or PTX kernel edit should be driven by Nsight Compute evidence.
The rule is:
profile first, diagnose second, optimize third
The output is not just "memory-bound" or "compute-bound". It must be an inference chain from measured counters to a likely mechanism to one actionable kernel edit.
Invoke ncu-report when any of these hold:
Do not run NCU while correctness is failing unless the failure is a profiler collection problem. Fix correctness first.
Store artifacts under the standalone optimization repo:
profile-artifacts/<version>/
  report.ncu-rep
  raw.csv
  details.txt
  source.csv      # when source export is available
  sampling.csv    # when PM sampling export is available
  kernel.ptx      # when PTX can be extracted
  kernel.sass     # when SASS can be extracted
  digest.md
When comparing a candidate, include the baseline or parent version path in the digest.
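The layout above can be checked mechanically before a digest is written. The following is a minimal sketch, not part of the skill itself: the function name `artifact_status` is hypothetical, and treating the entries without a "when ... available" comment as required is an assumption.

```python
from pathlib import Path

# File names taken from the layout above. Splitting them into required and
# optional is an assumption based on the "when ... available" comments.
REQUIRED = ["report.ncu-rep", "raw.csv", "details.txt", "digest.md"]
OPTIONAL = ["source.csv", "sampling.csv", "kernel.ptx", "kernel.sass"]


def artifact_status(root: Path, version: str) -> dict[str, list[str]]:
    """Create profile-artifacts/<version>/ if needed and list missing files."""
    out_dir = root / "profile-artifacts" / version
    out_dir.mkdir(parents=True, exist_ok=True)
    present = {p.name for p in out_dir.iterdir() if p.is_file()}
    return {
        "missing_required": [n for n in REQUIRED if n not in present],
        "missing_optional": [n for n in OPTIONAL if n not in present],
    }
```

Calling `artifact_status(Path("."), "v000_baseline")` before and after a capture shows which exports still need to be produced or noted as unavailable in the digest.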
If ncu --list-sections shows PmSampling or PmSampling_WarpStates, add those PM-sampling sections for timeline stalls and long-tail evidence.
Write digest.md using the template below. The final section must contain exactly one next edit.
When the digest points to an edit family but the exact implementation is not obvious, turn the measured symptom into a kernel-knowledge query. Use the counter name, bottleneck class, target architecture, DSL, and operator as search terms; then ground any borrowed implementation idea in PR/source evidence before editing code.
Useful mappings:
| Digest signal | Knowledge query shape |
|---|---|
| Long scoreboard, poor bytes/sector | memory-bound, vectorized-loads, cache policy, layout/coalescing PRs |
| Shared bank conflicts or short scoreboard | shared-memory, swizzling, bank-conflicts, TMA/ldmatrix examples |
| Barrier, membar, mbarrier, TMA wait | pipeline-stalls, mbarrier, pipeline-stages, producer/consumer split |
| Low tensor pipe on GEMM/attention | tcgen05 or wgmma, warp specialization, tile shape, epilogue fusion |
| Tail waves or uneven block lifetimes | tail-effect, persistent kernels, CLC, tile scheduling |
| I-cache, no-instruction, codegen pressure | PTX/SASS, inline PTX, specialization/splitting examples |
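
The mapping table can be turned into a small lookup when queries are built by script. This is only a sketch: the dictionary keys are hypothetical short labels (not NCU metric names), `build_query` is an illustrative helper, and the search terms are copied from the table.

```python
# Search terms copied from the mapping table above. The keys are hypothetical
# labels chosen for this sketch, not NCU metric or section names.
QUERY_TERMS = {
    "long_scoreboard": ["memory-bound", "vectorized-loads", "cache policy", "layout/coalescing"],
    "shared_bank_conflicts": ["shared-memory", "swizzling", "bank-conflicts", "TMA/ldmatrix"],
    "barrier_or_tma_wait": ["pipeline-stalls", "mbarrier", "pipeline-stages", "producer/consumer split"],
    "low_tensor_pipe": ["tcgen05", "wgmma", "warp specialization", "tile shape", "epilogue fusion"],
    "tail_effect": ["tail-effect", "persistent kernels", "CLC", "tile scheduling"],
    "icache_codegen": ["PTX/SASS", "inline PTX", "specialization/splitting"],
}


def build_query(signal: str, arch: str, dsl: str, operator: str) -> str:
    """Join operator, DSL, architecture, and the signal's terms into one query."""
    return " ".join([operator, dsl, arch, *QUERY_TERMS[signal]])
```

For example, `build_query("tail_effect", arch="sm_90", dsl="Triton", operator="gemm")` yields a single search string; the architecture, DSL, and operator values here are placeholders.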
Load examples.md for copyable command variants: baseline capture, focused regex capture, PyTorch extension profiling, PM sampling/source export, PTX/SASS hotspot analysis, report comparison, and script-assisted digest generation.
Minimal focused capture:
mkdir -p profile-artifacts/v000_baseline
ncu --target-processes all \
  --kernel-name regex:"<kernel-name-pattern>" \
  --launch-skip 5 --launch-count 1 \
  --set full --import-source on \
  --section SpeedOfLight \
  --section SchedulerStats \
  --section WarpStateStats \
  --section Occupancy \
  --section LaunchStats \
  --section MemoryWorkloadAnalysis \
  --section SourceCounters \
  -o profile-artifacts/v000_baseline/report \
  python benchmarks/<bench>.py --shape <shape> --dtype <dtype>
ncu --import profile-artifacts/v000_baseline/report.ncu-rep \
  --page raw --csv > profile-artifacts/v000_baseline/raw.csv
ncu --import profile-artifacts/v000_baseline/report.ncu-rep \
  --page details > profile-artifacts/v000_baseline/details.txt
If a section or page is unavailable on the installed NCU version, record that in the digest and continue with the available sections.
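One way to follow that rule is to build the capture command from the sections that are actually available and keep a list of the ones that were dropped. The sketch below mirrors the flags of the capture command above; `build_ncu_command` is a hypothetical helper, and how the set of available section identifiers is obtained (for example by reading `ncu --list-sections` by hand) is left to the caller.

```python
# Sections requested by the focused capture above.
WANTED_SECTIONS = [
    "SpeedOfLight", "SchedulerStats", "WarpStateStats", "Occupancy",
    "LaunchStats", "MemoryWorkloadAnalysis", "SourceCounters",
]


def build_ncu_command(kernel_regex, out_base, bench_argv, available):
    """Return (argv, skipped): the ncu command and the sections left out."""
    usable = [s for s in WANTED_SECTIONS if s in available]
    skipped = [s for s in WANTED_SECTIONS if s not in available]
    cmd = [
        "ncu", "--target-processes", "all",
        "--kernel-name", f"regex:{kernel_regex}",
        "--launch-skip", "5", "--launch-count", "1",
        "--set", "full", "--import-source", "on",
    ]
    for section in usable:
        cmd += ["--section", section]
    cmd += ["-o", out_base, *bench_argv]
    return cmd, skipped
```

The `skipped` list is what would be recorded in the digest; the returned `cmd` can be passed to `subprocess.run` on a machine with NCU installed.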
Optional helper:
python3 <skill-root>/scripts/ncu_report_digest.py \
  --csv profile-artifacts/v001_candidate/raw.csv \
  --baseline-csv profile-artifacts/v000_baseline/raw.csv \
  --kernel-regex "<kernel-name-pattern>" \
  --output profile-artifacts/v001_candidate/digest.md
The helper is a first pass only. Inspect the report and source/sampling evidence before treating the next edit as final.
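For a manual cross-check of the helper's output, baseline and candidate values can be compared directly. This sketch is not the helper script and does not parse NCU's CSV layout; it assumes the metric values have already been read into plain dictionaries, and `evidence_rows` is a hypothetical name. The rows follow the column order of the Evidence table in the digest template, with the interpretation left blank.

```python
def evidence_rows(baseline: dict[str, float], candidate: dict[str, float]) -> list[str]:
    """Format metrics present in both runs as Evidence-table rows."""
    rows = []
    for metric in sorted(baseline.keys() & candidate.keys()):
        b, c = baseline[metric], candidate[metric]
        delta = c - b
        # Relative change is undefined when the baseline value is zero.
        pct = f"{delta / b * 100:+.1f}%" if b else "n/a"
        rows.append(f"| {metric} | {b:g} | {c:g} | {delta:+g} ({pct}) | |")
    return rows
```

Metrics that appear in only one of the two runs are skipped here; in a real digest those are worth listing separately, since a missing counter can itself indicate a collection problem.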
Start with these groups, then load metrics.md for the full list and interpretation rules:
smsp__warp_issue_stalled_* reason.
Validate exact metric names on the target machine:
ncu --list-sections
ncu --query-metrics | grep -E 'warp_issue_stalled|pipe_tensor|dram__|lts__|bank_conflict|launch__'
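The same filter can be applied in a script when the query output has been saved to a file. This is a sketch under assumptions: it treats the output as plain text without relying on a particular column layout, and `matching_lines` and `missing_metrics` are hypothetical helpers.

```python
import re

# Same alternation as the grep -E pattern above.
PATTERN = re.compile(r"warp_issue_stalled|pipe_tensor|dram__|lts__|bank_conflict|launch__")


def matching_lines(query_output: str) -> list[str]:
    """Keep only the lines that the grep command above would keep."""
    return [line for line in query_output.splitlines() if PATTERN.search(line)]


def missing_metrics(wanted: list[str], query_output: str) -> list[str]:
    """Return wanted names that never appear as a whole token in the output."""
    missing = []
    for name in wanted:
        # Reject matches that are only a prefix or suffix of a longer name.
        token = re.compile(rf"(?<![\w.]){re.escape(name)}(?![\w.])")
        if not token.search(query_output):
            missing.append(name)
    return missing
```

Any name returned by `missing_metrics` should be replaced with the spelling the installed NCU version reports before it is used in a capture or a digest.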
### NCU Report Digest: <kernel> @ <version>
Environment
- GPU / arch:
- Driver / CUDA / NCU:
- Repo commit:
- Benchmark command:
- Shape / dtype:
- Baseline or parent report:
- Candidate report:
- Raw CSV:
- Source / sampling exports:
- PTX / SASS dumps:
Headline
- Bottleneck class:
- Dominant stall or hotspot:
- Confidence: High | Medium | Low
- Why this report is valid:
Evidence
| Metric | Baseline | Candidate | Delta | Interpretation |
|---|---:|---:|---:|---|
| <metric> | <value> | <value> | <delta> | <meaning> |
Source / PM-Sampling Hotspots
- <source/PTX/SASS line or phase>: <counter/timeline signal> -> <meaning>
PTX / SASS Analysis
- Hot instruction window:
- Suspected codegen or inline PTX issue:
- Instruction-level edit:
Inference Chain
1. Measured counter:
2. Likely mechanism:
3. Why other hypotheses are weaker:
4. Risk:
Next Concrete Edit
- File:
- Change:
- Validation:
- Expected metric movement:
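Because the final section must contain exactly one next edit, a digest can be checked before it is accepted. A minimal sketch, assuming the section is titled "Next Concrete Edit" and each edit starts with a "- File:" field as in the template; `count_next_edits` is a hypothetical name.

```python
def count_next_edits(digest_text: str) -> int:
    """Count '- File:' entries that appear after the 'Next Concrete Edit' title."""
    in_section = False
    count = 0
    for line in digest_text.splitlines():
        # Tolerate optional markdown heading markers around the title.
        stripped = line.strip().lstrip("#").strip()
        if stripped == "Next Concrete Edit":
            in_section = True
            continue
        if in_section and stripped.startswith("- File:"):
            count += 1
    return count
```

A digest passes when `count_next_edits(text) == 1`; zero means the chain stopped short of an edit, and more than one means the diagnosis has not been narrowed enough.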
__restrict__, alignment assertions, launch bounds, narrower unroll, or a targeted inline PTX sequence.
Record the .ncu-rep, CSV exports, and digest paths in the attempt ledger.