Help us improve
Share bugs, ideas, or general feedback.
From kernel-opt-agent
Profiles CUDA/CUTLASS/CuTe DSL/Triton GPU kernels: checks environment, validates correctness, collects Nsight Compute metrics, and classifies bottlenecks (memory/compute/latency/occupancy/mixed bound).
npx claudepluginhub fmh66/kernel-opt-agent --plugin kernel-opt-agentHow this skill is triggered — by the user, by Claude, or both
Slash command
/kernel-opt-agent:kernel-profileThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Use this skill to check environment readiness, validate kernel correctness, and collect Nsight Compute profiling artifacts for bottleneck diagnosis.
Profiles GPU kernels with Nsight Compute, exports metrics/source/PM-sampling reports, compares baseline vs candidate, classifies stalls, and produces one actionable kernel edit.
Benchmarks custom GPU kernels (CUDA-C++, CUTLASS, CuTe DSL, Triton) against PyTorch eager/torch.compile/FlashInfer baselines, validates correctness, and generates a benchmark report.
Classifies CUDA GPU kernel bottlenecks (compute-bound, memory-bound, latency-bound) via roofline analysis, occupancy calculations, compute/load ratio, SASS inspection, and decision matrix for optimizations like cp.async, tiling.
Share bugs, ideas, or general feedback.
Use this skill to check environment readiness, validate kernel correctness, and collect Nsight Compute profiling artifacts for bottleneck diagnosis.
env/scripts/env_check.py writes env_check.md.env/scripts/enc_config.py.ref.py must define reference(**kwargs).atol and rtol override default tolerances.cuda-cpp, cute-dsl, cutlass, or triton.cuda-cpp and cutlass kernels must expose extern "C" void solve(...) from a compiled shared library.cute-dsl and triton modules must define setup(**kwargs) and run_kernel(**kwargs).scripts/scripts.md for full usage and options.scripts/ncu_profile.py.
scripts/scripts.md for full usage and options.ncu_summary.md first, then consult reference/NCU.md for metric interpretation.Keep the same dimensions, seed, implementation, GPU, and pointer sizing across versions when comparing kernels.
| Implementation | Input | Profiling ABI | Notes |
|---|---|---|---|
cuda-cpp | .cu + compiled .so | extern "C" void solve(...) | Default for non-.py files |
cutlass | .cu + compiled .so | extern "C" void solve(...) | CUTLASS code must be wrapped by solve(...); compile with required CUTLASS include/library flags |
cute-dsl | .py | setup(**kwargs) + run_kernel(**kwargs) | Pass --implementation=cute-dsl explicitly |
triton | .py | setup(**kwargs) + run_kernel(**kwargs) | Default for .py files |
--backend is kept as a compatibility alias for --implementation. Prefer --implementation in new commands.
Run before correctness and profiling:
python env/scripts/env_check.py -o profile_out/env_check.md --gpu 0
Base required checks include PyTorch import, CUDA runtime availability, selected GPU, ncu, and nsight-python.
At least one implementation backend must be ready before profiling:
| Implementation | Readiness requirement |
|---|---|
cuda-cpp | nvcc executable |
cute-dsl | importable cutlass.cute Python package |
cutlass | nvcc executable plus CUTLASS C++ headers |
triton | importable triton Python package |
For CUTLASS header detection, set CUTLASS_PATH, CUTLASS_ROOT, or CUTLASS_HOME to the CUTLASS root when it is not in a standard location.
For more stable performance data, lock GPU SM clocks when permitted by the system:
python env/scripts/enc_config.py --gpu 0
If either step fails, fix the environment before collecting NCU profiles. See env/ENV.md for the detailed command reference.
Validates kernel output against a Python reference. See scripts/scripts.md for detailed command examples and the full options table.
Script: scripts/correctness_check.py
Output: <output-dir>/correctness.md
Collects Nsight Compute metrics. Run only after correctness passes. See scripts/scripts.md for detailed command examples and the full options table.
Script: scripts/ncu_profile.py
Outputs: ncu_summary.md, ncu_details.md
Use ncu_summary.md as the primary evidence:
| Condition | Classification |
|---|---|
| Memory SOL > 60% and much higher than SM SOL | Memory-Bound |
| SM SOL > 60% and much higher than Memory SOL | Compute-Bound |
| Both SM SOL and Memory SOL < 40% | Latency-Bound |
| Achieved occupancy is far below theoretical occupancy with a clear resource limit | Occupancy-Bound |
| No single dominant symptom | Mixed |
Secondary signals:
| Symptom | Likely issue |
|---|---|
| Global Load/Store Efficiency < 100%, Sectors/Request > 1 | Uncoalesced or misaligned memory access |
| L1/L2 hit rate too low | Poor locality or working set too large |
| Shared memory efficiency low or bank conflicts high | Shared memory bank conflicts |
| Issue slot utilization < 50% | Scheduler/compute underutilization |
| Eligible warps per cycle low | Not enough schedulable work, ILP, or occupancy |
| Register spill > 0 | Register pressure causing local memory traffic |
| Stall Long Scoreboard high | Global memory latency |
| Stall Barrier high | Synchronization overhead |
| Branch efficiency < 100% or divergent branches high | Warp divergence |
For complete metric definitions and category-specific interpretation, read reference/NCU.md only when needed.