Applies software pipelining (double-buffering) to tiled CUDA GPU kernels to overlap global memory loads with Tensor Core compute. Selects LDG vs cp.async variants, verifies smem occupancy cliffs, and checks SASS overlap for memory-bound kernels.
Apply software pipelining (double-buffering) to a tiled GPU kernel so that global memory loads for tile N+1 overlap with Tensor Core computation on tile N. Transform a sequential load-sync-compute-sync K-loop into a prologue/loop/epilogue structure, choose between LDG-register and cp.async (LDGSTS) variants based on compute/load ratio, verify shared memory stays under the architecture occupancy cliff, and confirm load/compute overlap in the final SASS.
Use this skill when analyze-kernel-bottleneck identifies a memory-bound kernel with a low compute/load ratio per tile. Inputs: the kernel source (.cu) with a tiled K-loop containing separate load and compute phases, and the compute/load ratio (from analyze-kernel-bottleneck; it will be estimated if not provided).

Confirm the kernel has a tiled K-loop with distinct load and compute phases separated by __syncthreads(): cooperative load into shared memory, __syncthreads(), compute (HMMA/IMMA/FFMA) on the shared memory tiles, __syncthreads(). Calculate the doubled shared memory cost and verify it stays under the architecture occupancy cliff: smem_a_size = BM * BK * sizeof(T) and smem_b_size = BK * BN * sizeof(T), so smem_doubled = smem_a_size * 2 + smem_b_size * 2. Example budget:

Single buffer: smem_a[BM*BK] + smem_b[BK*BN] = 2 KB + 2 KB = 4 KB
Double buffer: smem_a[2][BM*BK] + smem_b[2][BK*BN] = 4 KB + 4 KB = 8 KB
8 KB << 50 KB cliff -> 2 blocks/SM -> 8 warps

Also check num_tiles = K / BK: pipelining requires num_tiles >= 2 (at least one prologue load plus one main-loop iteration).

Expected: A shared memory budget table showing single-buffer and double-buffer costs, confirming the doubled allocation stays under the architecture cliff with at least 2 blocks/SM occupancy.
On failure: If double-buffer exceeds the cliff, reduce tile size (halve BK or BM) until smem_doubled <= 50 KB for GA104. Alternatively, use register-only prefetch (LDG variant) without doubling shared memory — store prefetched data in registers and write to the same single buffer after __syncthreads().
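As a quick sanity check, the budget arithmetic can be scripted on the host. A minimal sketch, assuming illustrative tile sizes BM = BN = BK = 32, half-precision elements, and a 50 KB per-block cliff (substitute the real kernel's values):

// === Shared memory budget check (host-side sketch, illustrative values) ===
#include <cstdio>
constexpr int BM = 32, BN = 32, BK = 32;     // assumed tile sizes
constexpr size_t ELEM_BYTES = 2;             // half precision
int main() {
    size_t smem_a = BM * BK * ELEM_BYTES;
    size_t smem_b = BK * BN * ELEM_BYTES;
    size_t single  = smem_a + smem_b;
    size_t doubled = 2 * (smem_a + smem_b);
    const size_t cliff = 50 * 1024;          // assumed per-block occupancy cliff
    printf("single = %zu B, doubled = %zu B, cliff = %zu B\n", single, doubled, cliff);
    printf("double-buffering %s the cliff\n", doubled <= cliff ? "stays under" : "crosses");
    return 0;
}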
Select between LDG-register and cp.async (LDGSTS) based on the compute/load ratio per tile.
Compute ratio = (2 * BM * BN * BK) / ((BM * BK + BK * BN) * sizeof(T)) for GEMM-like kernels (2 FLOPs per multiply-add, divided by the bytes loaded per tile).

LDG-register variant (ratio >= 5 or CUDA < 11.0):
- Issue LDGs for tile N+1 into registers, then compute on buf[N % 2] (compute overlaps with the outstanding LDGs).
- After __syncthreads(), STS the registers into buf[(N+1) % 2], then __syncthreads() again.
- Costs (BM * BK + BK * BN) / BLOCK_SIZE registers per thread for staging.

cp.async (LDGSTS) variant (ratio < 5, CUDA >= 11.0):
- __pipeline_memcpy_async tile N+1 directly to buf[(N+1) % 2] (asynchronous, bypasses the register file), then __pipeline_commit() before compute.
- Compute on buf[N % 2].
- __pipeline_wait_prior(0) + __syncthreads() after compute.
- Requires #include <cuda_pipeline.h>.

Expected: Selected variant with justification based on compute/load ratio and target architecture.
On failure: If the ratio is ambiguous (5-20:1 range), implement both variants and benchmark. The cp.async variant is the safer default when the CUDA version supports it.
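A worked instance of the selection rule, under the assumption of BM = BN = 64, BK = 32 tiles with half-precision (2-byte) elements:

// === Compute/load ratio example (assumed tile sizes, not prescribed) ===
constexpr int BM = 64, BN = 64, BK = 32;
constexpr int ELEM_BYTES = 2;                                                      // half
constexpr long long flops_per_tile = 2LL * BM * BN * BK;                           // 262,144 FLOPs
constexpr long long bytes_per_tile = (BM * BK + BK * BN) * (long long)ELEM_BYTES;  // 8,192 bytes
constexpr double ratio = double(flops_per_tile) / double(bytes_per_tile);          // 32.0

A ratio of 32 is well above the 5:1 threshold, so under these assumptions the LDG-register variant would be chosen; smaller BM/BN tiles or wider element types lower the ratio toward the cp.async regime.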
Transform the sequential load-sync-compute-sync loop into a pipelined prologue/loop/epilogue structure.
Identify the three sections. The original loop body becomes three pieces:
- Prologue: load tile 0 into buf[0], synchronize, then enter the main loop.
- Main loop: for num_tiles - 1 iterations, overlap loading tile N+1 with computing tile N.
- Epilogue: compute the final tile loaded by the last main-loop iteration.

LDG-register variant structure:
// === LDG-register variant ===
// Prologue: load tile 0 into buf[0]
cooperative_load_tile(smem_a[0], smem_b[0], global_a, global_b, /*k_offset=*/0);
__syncthreads();
for (int tile = 0; tile < num_tiles - 1; tile++) {
int cur_buf = tile & 1;
int next_buf = 1 - cur_buf;
// Phase 1: LDG next tile into registers (non-blocking)
half reg_a[ELEMS_PER_THREAD_A], reg_b[ELEMS_PER_THREAD_B];
prefetch_tile_to_registers(reg_a, reg_b, global_a, global_b,
(tile + 1) * BK);
// Phase 2: Compute on current buffer (overlaps with LDG flight)
tensor_core_mma(smem_a[cur_buf], smem_b[cur_buf], acc);
// Phase 3: Drain registers into next buffer
__syncthreads();
store_registers_to_smem(smem_a[next_buf], smem_b[next_buf],
reg_a, reg_b);
__syncthreads();
}
// Epilogue: compute last tile
tensor_core_mma(smem_a[(num_tiles - 1) & 1], smem_b[(num_tiles - 1) & 1], acc);
// === cp.async variant ===
#include <cuda_pipeline.h>
// Prologue: async load tile 0 into buf[0]
cpasync_load_tile(smem_a[0], smem_b[0], global_a, global_b, /*k_offset=*/0);
__pipeline_commit();
__pipeline_wait_prior(0);
__syncthreads();
for (int tile = 0; tile < num_tiles - 1; tile++) {
int cur_buf = tile & 1;
int next_buf = 1 - cur_buf;
// Phase 1: cp.async next tile into next buffer (async, direct to smem)
cpasync_load_tile(smem_a[next_buf], smem_b[next_buf],
global_a, global_b, (tile + 1) * BK);
__pipeline_commit();
// Phase 2: Compute on current buffer (overlaps with LDGSTS in flight)
tensor_core_mma(smem_a[cur_buf], smem_b[cur_buf], acc);
// Phase 3: Wait for async copies to complete
__pipeline_wait_prior(0);
__syncthreads();
}
// Epilogue: compute last tile
tensor_core_mma(smem_a[(num_tiles - 1) & 1], smem_b[(num_tiles - 1) & 1], acc);
The main loop runs num_tiles - 1 iterations (tiles 0 through num_tiles - 2 index which tile is computed, while tiles 1 through num_tiles - 1 are loaded). The epilogue computes the tile loaded in the last iteration.

Expected: Restructured K-loop source code with clear prologue, main loop, and epilogue sections for the chosen variant.
On failure: The most common bug is an off-by-one in buffer indexing or forgetting the epilogue compute pass. Verify: prologue loads into buf[0], first main loop iteration computes buf[0] and loads into buf[1], second iteration computes buf[1] and loads into buf[0], and so on. The epilogue computes buf[(num_tiles - 1) & 1].
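To check the indexing on the host before touching the kernel, a small illustrative sketch (num_tiles is an example value) that prints the schedule described above:

// === Buffer schedule dry run (host-side sketch) ===
#include <cstdio>
int main() {
    const int num_tiles = 4;                       // example value
    printf("prologue: load tile 0 -> buf[0]\n");
    for (int tile = 0; tile < num_tiles - 1; tile++) {
        int cur_buf  = tile & 1;
        int next_buf = 1 - cur_buf;
        printf("iter %d: compute buf[%d], load tile %d -> buf[%d]\n",
               tile, cur_buf, tile + 1, next_buf);
    }
    printf("epilogue: compute buf[%d]\n", (num_tiles - 1) & 1);
    return 0;
}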
Declare the double-buffered shared memory and implement the load functions.
// Before (single buffer)
__shared__ half smem_a[BM * BK];
__shared__ half smem_b[BK * BN];
// After (double buffer)
__shared__ half smem_a[2][BM * BK];
__shared__ half smem_b[2][BK * BN];
__device__ void cpasync_load_tile(half* dst_a, half* dst_b,
const half* src_a, const half* src_b,
int k_offset) {
// Each thread copies its portion (16 bytes = 8 half values per cp.async)
int tid = threadIdx.x;
int bytes_per_thread = 16; // cp.async supports 4-, 8-, or 16-byte copies (the .cg variant requires 16)
// A tile: BM * BK elements, distributed across BLOCK_SIZE threads
int elems_a = BM * BK / BLOCK_SIZE;
for (int i = 0; i < elems_a; i += 8) {
int idx = tid * elems_a + i;
__pipeline_memcpy_async(dst_a + idx,
src_a + k_offset * BM + idx,
bytes_per_thread);
}
// B tile: BK * BN elements, distributed similarly
int elems_b = BK * BN / BLOCK_SIZE;
for (int i = 0; i < elems_b; i += 8) {
int idx = tid * elems_b + i;
__pipeline_memcpy_async(dst_b + idx,
src_b + k_offset * BN + idx,
bytes_per_thread);
}
}
// Declare register staging (size = elements per thread)
half reg_a[BM * BK / BLOCK_SIZE];
half reg_b[BK * BN / BLOCK_SIZE];
// Prefetch: LDG from global to registers (non-blocking, issued early)
for (int i = 0; i < BM * BK / BLOCK_SIZE; i++) {
int idx = threadIdx.x * (BM * BK / BLOCK_SIZE) + i;
reg_a[i] = global_a[k_offset * BM + idx];
}
// ... similarly for reg_b
// Store: STS from registers to shared memory (after __syncthreads)
for (int i = 0; i < BM * BK / BLOCK_SIZE; i++) {
int idx = threadIdx.x * (BM * BK / BLOCK_SIZE) + i;
smem_a[next_buf][idx] = reg_a[i];
}
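The main loop above calls prefetch_tile_to_registers and store_registers_to_smem; a hedged sketch of how the staging snippets might be packaged into those helpers (BM, BN, BK, BLOCK_SIZE and the tile layout follow the assumptions of the snippets above):

// === LDG-register staging helpers (sketch; layout assumptions as above) ===
constexpr int ELEMS_PER_THREAD_A = BM * BK / BLOCK_SIZE;
constexpr int ELEMS_PER_THREAD_B = BK * BN / BLOCK_SIZE;

__device__ void prefetch_tile_to_registers(half* reg_a, half* reg_b,
                                           const half* global_a, const half* global_b,
                                           int k_offset) {
    // LDGs issued here stay in flight while the caller computes on the current buffer
    for (int i = 0; i < ELEMS_PER_THREAD_A; i++) {
        int idx = threadIdx.x * ELEMS_PER_THREAD_A + i;
        reg_a[i] = global_a[k_offset * BM + idx];
    }
    for (int i = 0; i < ELEMS_PER_THREAD_B; i++) {
        int idx = threadIdx.x * ELEMS_PER_THREAD_B + i;
        reg_b[i] = global_b[k_offset * BN + idx];
    }
}

__device__ void store_registers_to_smem(half* smem_a_next, half* smem_b_next,
                                        const half* reg_a, const half* reg_b) {
    // Called only after __syncthreads(), so compute has finished reading this buffer
    for (int i = 0; i < ELEMS_PER_THREAD_A; i++)
        smem_a_next[threadIdx.x * ELEMS_PER_THREAD_A + i] = reg_a[i];
    for (int i = 0; i < ELEMS_PER_THREAD_B; i++)
        smem_b_next[threadIdx.x * ELEMS_PER_THREAD_B + i] = reg_b[i];
}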
Add __launch_bounds__(BLOCK_SIZE) to the kernel to give the compiler accurate occupancy information. Compile with nvcc --cubin -arch=sm_86 -O2 -o kernel.sm_86.cubin kernel.cu.

Expected: Compilable kernel with double-buffered shared memory and the chosen load mechanism. Successful cubin generation with no errors.
On failure: If compilation fails on pipeline API calls, ensure #include <cuda_pipeline.h> is present and CUDA toolkit is >= 11.0. If register spills occur (check nvcc --resource-usage), reduce the register staging array sizes by increasing BLOCK_SIZE or reducing BK.
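A minimal sketch of where the launch bound and the doubled buffers sit in the kernel (the kernel name and parameter list are illustrative):

// === Kernel skeleton with launch bounds (sketch) ===
__global__ void __launch_bounds__(BLOCK_SIZE)
gemm_pipelined(const half* __restrict__ global_a,
               const half* __restrict__ global_b,
               float* __restrict__ global_c,
               int M, int N, int K) {
    __shared__ half smem_a[2][BM * BK];   // double-buffered A tiles
    __shared__ half smem_b[2][BK * BN];   // double-buffered B tiles
    // ... prologue / main loop / epilogue as structured above
}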
Run the pipelined kernel against the CPU reference to confirm identical numerical output.
Build with nvcc -arch=sm_86 -O2 -o bench bench.cu -lcuda -I../../phase2/common. Use the appropriate tolerance preset (abs=0.5, rel=0.1; abs=1e-2, rel=1e-2; or abs=1e-3, rel=1e-3, depending on the precision being tested).

Expected: PASS at both small and target problem sizes with error bounds identical to the non-pipelined baseline.
On failure: Buffer indexing bug is the most likely cause. Verify: compute reads from buf[tile & 1] while loads write to buf[1 - (tile & 1)]. Check the epilogue processes buffer index (num_tiles - 1) & 1, not num_tiles & 1. For cp.async, verify __pipeline_wait_prior(0) completes before __syncthreads() — otherwise compute may read partially-written data.
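A minimal sketch of the tolerance check this step relies on (the function name is illustrative; the actual harness lives in bench.cu):

// === abs/rel tolerance comparison against the CPU reference (sketch) ===
#include <cmath>
#include <cstdio>
bool matches_reference(const float* gpu, const float* ref, int n,
                       float abs_tol, float rel_tol) {
    for (int i = 0; i < n; i++) {
        float diff = fabsf(gpu[i] - ref[i]);
        if (diff > abs_tol && diff > rel_tol * fabsf(ref[i])) {
            printf("MISMATCH at %d: gpu=%f ref=%f diff=%f\n", i, gpu[i], ref[i], diff);
            return false;
        }
    }
    return true;
}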
Measure the pipelined kernel against the non-pipelined baseline at the target problem size.
Compute speedup = pipelined_metric / baseline_metric.

| Variant | GFLOPS | Speedup vs Baseline |
|------------------|--------|---------------------|
| Baseline | XXX | 1.00x |
| LDG-register | XXX | X.XXx |
| cp.async (LDGSTS)| XXX | X.XXx |
Expected: Performance comparison table showing improvement. The chosen variant should show measurable speedup consistent with the compute/load ratio prediction.
On failure: If performance regresses, check three things: (1) SASS for unexpected instruction overhead (extra BAR.SYNC, register spills). (2) Shared memory did not cross the occupancy cliff — verify with nvcc --resource-usage or cuobjdump -res-usage. (3) The problem size produces enough tiles (K / BK >= 4) for pipelining to amortize the prologue/epilogue overhead.
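A minimal timing sketch behind the table above, assuming an M x N x K GEMM and a caller-supplied launch function (names are illustrative):

// === Kernel timing and GFLOPS conversion (sketch) ===
#include <cuda_runtime.h>

template <typename Launch>
float time_kernel_ms(Launch launch, int iters) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    launch();                                  // warm-up
    cudaEventRecord(start);
    for (int i = 0; i < iters; i++) launch();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / iters;
}

// GFLOPS = FLOPs / (ms * 1e-3 s) / 1e9 = FLOPs / (ms * 1e6); a GEMM does 2*M*N*K FLOPs
double gemm_gflops(double ms, long long M, long long N, long long K) {
    return (2.0 * M * N * K) / (ms * 1e6);
}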
Inspect the compiled SASS to confirm that global loads and Tensor Core instructions overlap within the main loop body.
Run cuobjdump -sass kernel.sm_86.cubin | grep -E 'IMMA|HMMA|LDGSTS|LDG|BAR' and check the main loop body:
- LDGSTS or LDG instructions appear before HMMA or IMMA instructions.
- No BAR.SYNC sits between the load instructions and the compute instructions (they must be free to overlap in the warp scheduler).
- A BAR.SYNC appears after the compute block, gating the next iteration's use of the loaded data.

# Full SASS pipeline verification
cuobjdump -sass kernel.sm_86.cubin | grep -E 'IMMA|HMMA|LDGSTS|LDG|BAR'
# Count Tensor Core compute (HMMA/IMMA) instructions in the SASS
cuobjdump -sass kernel.sm_86.cubin | grep -c 'HMMA\|IMMA'
# Check for register spills
nvcc --resource-usage --cubin -arch=sm_86 -O2 kernel.cu 2>&1 | grep -i spill
Expected: Annotated SASS excerpt showing the load-before-compute pattern with no intervening barriers. Zero register spills.
On failure: If the compiler reordered loads after compute (defeating the overlap), try: (1) #pragma unroll 1 on the main loop to prevent over-aggressive unrolling. (2) Separate load and compute into distinct inline functions to create a sequencing hint. (3) Use asm volatile("" ::: "memory") as a compiler fence between load and compute blocks (last resort — may inhibit other optimizations).
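If option (3) is needed, a sketch of where the fence sits in the main-loop body (last resort, as noted above):

// === Compiler fence between prefetch and compute (last-resort sketch) ===
prefetch_tile_to_registers(reg_a, reg_b, global_a, global_b, (tile + 1) * BK);
asm volatile("" ::: "memory");   // compiler-only barrier: keeps the LDGs from being sunk below the MMAs
tensor_core_mma(smem_a[cur_buf], smem_b[cur_buf], acc);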
Verification checklist:
- Main loop indexes buffers with the buf[tile & 1] pattern.
- Prologue loads into buf[0].
- Epilogue computes buf[(num_tiles - 1) & 1].
- SASS shows loads and compute free to overlap (no BAR.SYNC between LDGSTS/LDG and IMMA/HMMA).
- No register spills (nvcc --resource-usage).

Pitfalls:
- Compute smem_doubled before implementing. A kernel using 28 KB single-buffered jumps to 56 KB doubled, crossing the cliff and halving occupancy. This can turn a +20% pipelining gain into a -50% occupancy regression.
- Use buf[tile & 1] for the current compute buffer and buf[1 - (tile & 1)] for the next load buffer. A common mistake is using buf[(tile + 1) & 1] for the next buffer, which is equivalent to buf[1 - (tile & 1)] only when the buffer count is 2, but reads wrong if accidentally applied to the compute index.
- __pipeline_commit() must be called BEFORE the compute phase (it seals the batch of async copies). __pipeline_wait_prior(0) must be called AFTER the compute phase (it blocks until all committed copies complete). Swapping these makes the async copies synchronous, eliminating all overlap benefit.
- In the LDG-register variant, one __syncthreads() is needed between compute and the STS drain (so compute finishes reading the current buffer before it gets overwritten), and another __syncthreads() after the STS drain (so all threads finish writing before the next iteration reads). In the cp.async variant, __syncthreads() after __pipeline_wait_prior(0) ensures all threads see the completed async copies.
- __pipeline_memcpy_async requires the source address to be valid and aligned. At matrix edges where K is not a multiple of BK, the last tile may read out of bounds. Fall back to scalar loads with bounds checking for the final tile (see the guarded-load sketch at the end of this document), or pad the input matrices to a multiple of BK.

Related skill: analyze-kernel-bottleneck, which identifies whether the kernel is memory-bound and calculates the compute/load ratio that drives variant selection.
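For the boundary-condition pitfall above, a hedged sketch of a bounds-checked fallback load for a final partial K tile (the tile layout follows the earlier assumptions; out-of-range elements are zero-filled):

// === Guarded load for a partial last tile (sketch; A tile shown, B is analogous) ===
__device__ void load_tile_a_guarded(half* dst_a, const half* src_a,
                                    int k_offset, int K) {
    int elems = BM * BK / BLOCK_SIZE;
    for (int i = 0; i < elems; i++) {
        int idx = threadIdx.x * elems + i;
        int k   = k_offset + idx / BM;   // global K column under the assumed BM-per-column layout
        dst_a[idx] = (k < K) ? src_a[k_offset * BM + idx] : __float2half(0.0f);
    }
}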