From agent-almanac
Classifies CUDA GPU kernel bottlenecks (compute-bound, memory-bound, latency-bound) via roofline analysis, occupancy calculations, compute/load ratio, SASS inspection, and a decision matrix for optimizations such as cp.async and tiling.
npx claudepluginhub pjt222/agent-almanac
Systematically identify whether a GPU kernel is compute-bound, memory-bound, or latency-bound by measuring baseline performance, classifying on the roofline, computing occupancy and compute/load ratio per tile, inspecting SASS instruction mix and stall codes, checking the shared memory cliff, and applying a decision matrix to select the right optimization strategy.
Step 1: Measure baseline performance (input: .cubin or .cu source with build command). Run the kernel with CUDA events (BenchTimer) and record the time in milliseconds. Calculate effective throughput metrics:
nvcc --cubin -arch=sm_86 -O2 -o kernel.sm_86.cubin kernel.cu
nvcc -arch=sm_86 -O2 -o bench bench.cu -lcuda -I../../phase2/common
./bench 4096 4096 4096
effective_gflops = (2 * M * N * K) / (time_ms / 1000) / 1e9   # GEMM
effective_bw = total_bytes / (time_ms / 1000) / 1e9
effective_gflops = (4 * batch * heads * seq_len^2 * head_dim) / (time_ms / 1000) / 1e9   # attention
Expected: Baseline numbers: kernel time in ms, effective GFLOPS, and effective bandwidth.
On failure: Check that the kernel launches without error (CHECK_CU macro). Verify warmup runs precede measurement. Ensure problem dimensions are large enough to saturate the GPU (small problems may bottleneck on launch overhead).
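The throughput formulas can be checked with a few lines of Python. This is a sketch: the 10 ms time and the FP16 operand sizes below are made-up example inputs, and the real timing comes from CUDA events.

```python
def effective_metrics(time_ms, M, N, K, total_bytes):
    """Turn a measured GEMM kernel time into effective GFLOPS and GB/s."""
    seconds = time_ms / 1000.0
    gflops = (2 * M * N * K) / seconds / 1e9   # 2 FLOPs (mul + add) per MAC
    gbps = total_bytes / seconds / 1e9
    return gflops, gbps

# Hypothetical example: a 4096^3 FP16 GEMM measured at 10 ms,
# loading A and B from DRAM exactly once (2 bytes per element)
M = N = K = 4096
unique_bytes = 2 * (M * K + K * N)
gflops, gbps = effective_metrics(10.0, M, N, K, unique_bytes)
```

Only unique DRAM bytes go into `total_bytes`, matching the counting rule in Step 2.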
Step 2: Compute arithmetic intensity and compare it against the machine balance point to classify the kernel:
AI = FLOPs / bytes_loaded_from_global_memory. Count only unique bytes loaded from DRAM (not shared memory or register reuse).
balance = peak_compute / peak_bandwidth
If AI < balance, the kernel is memory-bound; if AI > balance, it is compute-bound.
GA104 (RTX 3070 Ti) Reference Values:
| Resource | Peak | Unit |
|---|---|---|
| FP32 FFMA | 21.7 | TFLOPS |
| FP16 Tensor Core (HMMA) | 174 | TFLOPS |
| INT8 Tensor Core (IMMA) | 696 | TOPS |
| DRAM Bandwidth | 608 | GB/s |
| L2 Cache | 4 | MB |
| SMs | 48 | count |
Derived Balance Points:
| Precision | Balance Point (FLOP/byte) |
|---|---|
| FP32 FFMA | 21700 / 608 = 35.7 |
| FP16 TC | 174000 / 608 = 286.2 |
| INT8 TC | 696000 / 608 = 1144.7 |
attained = effective_throughput / peak_throughput. If memory-bound, compare effective bandwidth to 608 GB/s; if compute-bound, compare effective GFLOPS to the relevant peak.
Expected: Classification as compute-bound, memory-bound, or latency-bound (low occupancy causing neither compute nor memory saturation), with numerical justification.
On failure: Recheck byte counting. Watch for redundant re-reads (e.g., 9x in direct conv2d without im2col). If neither compute nor memory is saturated, the kernel is likely latency-bound (see Step 3).
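The classification logic can be sketched in Python using the GA104 peaks above. The 30% saturation threshold used to flag latency-bound kernels is an illustrative assumption, not a hardware constant.

```python
# GA104 (RTX 3070 Ti) peaks from the tables above (GFLOPS / GOPS and GB/s)
PEAK = {"fp32": 21700.0, "fp16_tc": 174000.0, "int8_tc": 696000.0}
PEAK_BW = 608.0

def classify(flops, dram_bytes, precision, attained_gflops, attained_bw):
    """Roofline classification per the rules above; returns AI, balance, class, fraction."""
    ai = flops / dram_bytes                  # arithmetic intensity, FLOP/byte
    balance = PEAK[precision] / PEAK_BW      # machine balance point
    bound = "memory" if ai < balance else "compute"
    # Attained fraction of the binding resource's peak
    frac = attained_bw / PEAK_BW if bound == "memory" else attained_gflops / PEAK[precision]
    if frac < 0.3:  # illustrative threshold: neither resource near saturation
        bound = "latency"
    return ai, balance, bound, frac
```

A 4096^3 FP16 GEMM loading A and B once has AI = 2048 FLOP/byte, far above the 286 FLOP/byte balance point, so it classifies as compute-bound unless attained throughput is far from peak.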
Step 3: Determine active warps per SM from the launch configuration and resource usage:
nvcc --cubin -arch=sm_86 -O2 --resource-usage -o kernel.sm_86.cubin kernel.cu 2>&1 | grep -E 'registers|smem'
warps_per_block = threads_per_block / 32
Register limit: floor(65536 / (registers_per_thread * threads_per_block))
Smem limit: floor(available_smem_per_SM / smem_per_block) -- see Step 6 for the cliff
Warp limit: floor(48 / warps_per_block) (GA104 max: 48 warps/SM)
blocks_per_SM = min(register_limit, smem_limit, warp_limit, block_limit)
active_warps_per_SM = blocks_per_SM * warps_per_block
Expected: Occupancy table showing blocks/SM, active warps/SM, and the limiting factor (registers, smem, or warps).
On failure: Check cuFuncSetAttribute for dynamic shared memory. Verify --resource-usage reports match the actual launch configuration. If register count is unexpectedly high, try --maxrregcount=N to cap registers (trading register spills for occupancy).
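A minimal occupancy calculator for sm_86 mirroring these limits (the 16 blocks/SM cap and 100 KB smem/SM are the GA104 values; the example launch configuration is hypothetical):

```python
def occupancy_sm86(threads_per_block, regs_per_thread, smem_per_block,
                   regs_per_sm=65536, smem_per_sm=100 * 1024,
                   max_warps_per_sm=48, max_blocks_per_sm=16):
    """Blocks/SM and active warps/SM on GA104 (sm_86), per the formulas above."""
    warps_per_block = threads_per_block // 32
    reg_limit = regs_per_sm // (regs_per_thread * threads_per_block)
    smem_limit = smem_per_sm // smem_per_block if smem_per_block else max_blocks_per_sm
    warp_limit = max_warps_per_sm // warps_per_block
    blocks = min(reg_limit, smem_limit, warp_limit, max_blocks_per_sm)
    return blocks, blocks * warps_per_block

# Hypothetical launch: 256 threads/block, 64 regs/thread, 48 KB smem/block
blocks, active_warps = occupancy_sm86(256, 64, 48 * 1024)  # smem-limited: 2 blocks, 16 warps
```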
Step 4: Count compute instructions and load bytes per K-tile from SASS (not source code):
cuobjdump -sass kernel.sm_86.cubin > kernel.sass
grep -c 'HMMA' kernel.sass   # FP16 Tensor Core ops
grep -c 'IMMA' kernel.sass   # INT8 Tensor Core ops
grep -c 'FFMA' kernel.sass   # FP32 fused multiply-add
grep -c 'LDG' kernel.sass    # global memory loads (also matches LDGSTS lines)
Compute the per-tile ratio: compute_ops / load_ops.
Expected: Compute/load ratio with classification (high/medium/low) and cp.async recommendation.
On failure: Count from SASS disassembly, not source code -- the compiler may fuse, eliminate, or reorder instructions. Ensure you are counting instructions within the inner loop only (the K-tile iteration), not the entire kernel.
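The counting can also be scripted over the SASS listing. A Python sketch, with two caveats: it counts the whole text, whereas the real analysis should slice out the inner K-loop first, and matching 'LDG' also catches LDGSTS lines.

```python
import re
from collections import Counter

def tile_ratio(sass_text, compute_ops=("HMMA", "IMMA", "FFMA")):
    """Count compute vs. global-load instructions in a SASS listing and
    return (compute/load ratio, per-opcode counts)."""
    counts = Counter()
    for line in sass_text.splitlines():
        for op in (*compute_ops, "LDG"):
            if re.search(rf"\b{op}", line):
                counts[op] += 1
    compute = sum(counts[op] for op in compute_ops)
    return compute / max(counts["LDG"], 1), counts
```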
Step 5: Examine the full SASS instruction mix and stall codes:
cuobjdump -sass kernel.sm_86.cubin > kernel.sass
grep -c 'HMMA.16816' kernel.sass # FP16 Tensor Core
grep -c 'IMMA.16816' kernel.sass # INT8 Tensor Core
grep -c 'FFMA' kernel.sass # FP32 fused multiply-add
grep -c 'LDGSTS' kernel.sass # cp.async (global->shared)
grep -c 'LDG' kernel.sass # Global load
grep -c 'STS' kernel.sass # Shared store
grep -c 'LDS' kernel.sass # Shared load
grep -c 'BAR.SYNC' kernel.sass # Barrier synchronization
grep -c 'SHFL' kernel.sass # Warp shuffle (reductions)
grep -c 'MUFU' kernel.sass # Special function unit
grep 'HMMA' kernel.sass | head -5 # Expect S08 minimum (hardware constraint)
grep 'IMMA' kernel.sass | head -5 # Compiler emits S04, reducible to S02 via CuAssembler
grep 'FFMA' kernel.sass | head -5 # Check for S04 (reducible to S01 on independent FFMAs)
Expected: Instruction count table and stall code summary with identified optimization targets.
On failure: Ensure cuobjdump architecture matches the kernel compilation target (both must be sm_86). If SASS output is empty, the cubin may be corrupt -- recompile.
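A helper for tallying stall codes per opcode, assuming the SASS listing carries control-code annotations in the Sxx form shown above:

```python
import re
from collections import Counter

def stall_summary(sass_text, opcode):
    """Tally stall-code annotations (e.g. 'S08') on lines containing `opcode`."""
    stalls = Counter()
    for line in sass_text.splitlines():
        if opcode in line:
            match = re.search(r"\bS(\d{2})\b", line)
            if match:
                stalls["S" + match.group(1)] += 1
    return stalls
```

A summary like {"S08": 412} for HMMA confirms the hardware-minimum stall; anything above S08 on HMMA, or S04 on independent FFMAs, is an optimization target.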
Step 6: Determine whether shared memory usage crosses the architecture-specific occupancy cliff:
Read smem/block from the --resource-usage output (Step 3) or from cuobjdump --res-usage kernel.sm_86.cubin.
Expected: Smem/block value with blocks/SM count and an explicit statement of whether the 50 KB cliff is crossed.
On failure: If above cliff and occupancy is the bottleneck, the optimization strategy must change: reduce tile size to get smem under 50 KB, or accept 1 block/SM and compensate with higher compute/load ratio per tile (more register reuse, longer K-tiles).
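The cliff check itself is one line of arithmetic; a sketch using GA104's 100 KB smem/SM:

```python
def smem_cliff(smem_per_block_kb, smem_per_sm_kb=100):
    """GA104: blocks/SM allowed by shared memory, and whether the 50 KB cliff
    is crossed (smem/block above half the SM's capacity forces 1 block/SM)."""
    blocks = int(smem_per_sm_kb // smem_per_block_kb)
    crossed = smem_per_block_kb > smem_per_sm_kb / 2
    return blocks, crossed
```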
Step 7: Synthesize findings from Steps 2-6 into an optimization strategy:
| Condition | Strategy |
|---|---|
| Memory-bound + low compute/load ratio (<5:1) + smem under cliff | Software pipelining with cp.async (LDGSTS). Overlap global loads with compute. |
| Memory-bound + high compute/load ratio (>20:1) + 8+ warps | Warp interleaving already hides latency. Focus on algorithmic changes: implicit GEMM, split-Q, im2col. |
| Compute-bound + FFMA-heavy | CuAssembler stall code tightening: S04 -> S01 on independent FFMAs. |
| Compute-bound + HMMA-heavy | S08 is hardware minimum, cannot reduce. Increase tile reuse (larger M/N tiles, longer K-loop). |
| Compute-bound + IMMA-heavy | CuAssembler: S04 -> S02 on IMMA instructions (compiler is conservative). |
| Latency-bound (low occupancy, neither saturated) | Reduce smem or registers to get more blocks/SM. Get above 8 warps/SM. |
| Smem above cliff | Reduce tile size or restructure to get smem/block under 50 KB (GA104). |
Expected: Ranked list of recommended optimizations with predicted gain range and potential conflicts.
On failure: If no clear winner emerges, run micro-benchmarks isolating each strategy (e.g., test cp.async alone, test reduced tile size alone) to measure actual impact before combining.
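The decision matrix can be expressed directly as code. This is a hypothetical mapping that mirrors the table above; the strategy strings are shorthand for the table rows, and the ambiguous middle case falls back to micro-benchmarking.

```python
def recommend(bound, compute_load_ratio, warps_per_sm, smem_kb, dominant_op):
    """Map analysis results to a strategy, following the decision matrix above."""
    if smem_kb > 50:
        return "reduce tile size: get smem/block under the 50 KB cliff"
    if bound == "latency":
        return "reduce smem/registers for more blocks/SM; target 8+ warps/SM"
    if bound == "memory":
        if compute_load_ratio < 5:
            return "software pipelining with cp.async (LDGSTS)"
        if compute_load_ratio > 20 and warps_per_sm >= 8:
            return "algorithmic change: implicit GEMM, split-Q, im2col"
        return "ambiguous: micro-benchmark each strategy in isolation"
    # compute-bound: strategy depends on the dominant instruction
    return {"FFMA": "CuAssembler stall tightening: S04 -> S01 on independent FFMAs",
            "IMMA": "CuAssembler: S04 -> S02 on IMMA instructions",
            "HMMA": "increase tile reuse (S08 is the hardware minimum)"}[dominant_op]
```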
Step 8: Produce a structured bottleneck report:
## Bottleneck Analysis Report: [kernel_name]
### Baseline
- Problem: [dimensions]
- Kernel time: [X] ms
- Effective GFLOPS: [Y] | Effective BW: [Z] GB/s
### Roofline Classification
- Arithmetic intensity: [AI] FLOP/byte
- Balance point: [BP] FLOP/byte ([precision])
- Classification: **[compute|memory|latency]-bound**
- Attained fraction: [X]% of peak
### Occupancy
| Resource | Per Block | Limit/SM | Blocks/SM |
|----------|-----------|----------|-----------|
| Registers | [N]/thread | 65536 | [B] |
| Shared mem | [X] KB | 100 KB (cliff: 50 KB) | [B] |
| Warps | [W] | 48 | [B] |
| **Limiting** | | | **[min(B)]** |
- Active warps/SM: [W] ([sufficient|insufficient] for latency hiding)
### Compute/Load Ratio
- Compute ops/tile: [N] [HMMA|IMMA|FFMA]
- Load bytes/tile: [N] bytes ([N] LDG x [N] bytes)
- Ratio: [X]:1 — **[high|medium|low]**
- cp.async recommendation: [beneficial|neutral|detrimental]
### SASS Instruction Mix
| Instruction | Count | Notes |
|-------------|-------|-------|
| HMMA.16816 | [N] | Stall: S08 (hardware min) |
| IMMA.16816 | [N] | Stall: S04 (reducible to S02) |
| FFMA | [N] | Stall: S04 (reducible to S01) |
| LDG | [N] | |
| LDGSTS | [N] | cp.async |
| BAR.SYNC | [N] | |
### Smem Cliff
- Smem/block: [X] KB — [under|over] 50 KB cliff
- Blocks/SM: [B] — [no occupancy loss|occupancy halved]
### Recommended Optimizations (ranked)
1. [Strategy] — estimated [X-Y]% gain
2. [Strategy] — estimated [X-Y]% gain
3. [Strategy] — estimated [X-Y]% gain
Expected: Complete markdown report consumable by a kernel-optimizer agent or human developer.
On failure: Re-run with different problem sizes (e.g., 1024, 2048, 4096, 8192) to confirm findings are not size-specific. Small problems may appear latency-bound when the real bottleneck at scale is memory bandwidth.
Related skills:
pipeline-gpu-kernel -- implement software pipelining with cp.async when analysis identifies a memory-bound kernel with low compute/load ratio
simulate-cpu-architecture -- complementary architecture analysis for CPU-side bottlenecks in host-device workflows