Applies software pipelining (double-buffering) to tiled CUDA GPU kernels to overlap global memory loads with Tensor Core compute. Selects LDG vs cp.async variants, verifies smem occupancy cliffs, and checks SASS overlap for memory-bound kernels.
Apply software pipelining (double-buffering) to a tiled GPU kernel so that global memory loads for tile N+1 overlap with Tensor Core computation on tile N. Transform a sequential load-sync-compute-sync K-loop into a prologue/loop/epilogue structure, choose between LDG-register and cp.async (LDGSTS) variants based on compute/load ratio, verify shared memory stays under the architecture occupancy cliff, and confirm load/compute overlap in the final SASS.
Use this skill when analyze-kernel-bottleneck identifies a memory-bound kernel with a low compute/load ratio per tile. Inputs: the kernel source (.cu) with a tiled K-loop containing separate load and compute phases, and the compute/load ratio (from analyze-kernel-bottleneck; it will be estimated if not provided).

Confirm the kernel has a tiled K-loop with distinct load and compute phases separated by __syncthreads(): cooperative load into shared memory, __syncthreads(), compute (HMMA/IMMA/FFMA) on the shared memory tiles, __syncthreads(). Calculate the doubled shared memory cost and verify it stays under the architecture occupancy cliff: smem_a_size = BM * BK * sizeof(T) and smem_b_size = BK * BN * sizeof(T), so smem_doubled = smem_a_size * 2 + smem_b_size * 2. Example budget:

Single buffer: smem_a[BM*BK] + smem_b[BK*BN] = 2 KB + 2 KB = 4 KB
Double buffer: smem_a[2][BM*BK] + smem_b[2][BK*BN] = 4 KB + 4 KB = 8 KB
8 KB << 50 KB cliff -> 2 blocks/SM -> 8 warps

Also check num_tiles = K / BK: pipelining requires num_tiles >= 2 (at least one prologue load plus one main-loop iteration).

Expected: A shared memory budget table showing single-buffer and double-buffer costs, confirming the doubled allocation stays under the architecture cliff with at least 2 blocks/SM occupancy.
On failure: If double-buffer exceeds the cliff, reduce tile size (halve BK or BM) until smem_doubled <= 50 KB for GA104. Alternatively, use register-only prefetch (LDG variant) without doubling shared memory — store prefetched data in registers and write to the same single buffer after __syncthreads().
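As a quick sanity check, the budget arithmetic can be scripted on the host. A minimal sketch, assuming illustrative tile sizes BM = BN = BK = 32, half-precision elements, and a 50 KB per-block cliff (substitute the real kernel's values):

// === Shared memory budget check (host-side sketch, illustrative values) ===
#include <cstdio>
constexpr int BM = 32, BN = 32, BK = 32;     // assumed tile sizes
constexpr size_t ELEM_BYTES = 2;             // half precision
int main() {
    size_t smem_a = BM * BK * ELEM_BYTES;
    size_t smem_b = BK * BN * ELEM_BYTES;
    size_t single  = smem_a + smem_b;
    size_t doubled = 2 * (smem_a + smem_b);
    const size_t cliff = 50 * 1024;          // assumed per-block occupancy cliff
    printf("single = %zu B, doubled = %zu B, cliff = %zu B\n", single, doubled, cliff);
    printf("double-buffering %s the cliff\n", doubled <= cliff ? "stays under" : "crosses");
    return 0;
}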
Select between LDG-register and cp.async (LDGSTS) based on the compute/load ratio per tile.
Compute ratio = (2 * BM * BN * BK) / ((BM * BK + BK * BN) * sizeof(T)) for GEMM-like kernels (2 FLOPs per multiply-add, divided by the bytes loaded per tile).

LDG-register variant (ratio >= 5 or CUDA < 11.0):
- Issue LDGs for tile N+1 into registers, then compute on buf[N % 2] (compute overlaps with the outstanding LDGs).
- After __syncthreads(), STS the registers into buf[(N+1) % 2], then __syncthreads() again.
- Costs (BM * BK + BK * BN) / BLOCK_SIZE registers per thread for staging.

cp.async (LDGSTS) variant (ratio < 5, CUDA >= 11.0):
- __pipeline_memcpy_async tile N+1 directly to buf[(N+1) % 2] (asynchronous, bypasses the register file), then __pipeline_commit() before compute.
- Compute on buf[N % 2].
- __pipeline_wait_prior(0) + __syncthreads() after compute.
- Requires #include <cuda_pipeline.h>.

Expected: Selected variant with justification based on compute/load ratio and target architecture.
On failure: If the ratio is ambiguous (5-20:1 range), implement both variants and benchmark. The cp.async variant is the safer default when the CUDA version supports it.
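A worked instance of the selection rule, under the assumption of BM = BN = 64, BK = 32 tiles with half-precision (2-byte) elements:

// === Compute/load ratio example (assumed tile sizes, not prescribed) ===
constexpr int BM = 64, BN = 64, BK = 32;
constexpr int ELEM_BYTES = 2;                                                      // half
constexpr long long flops_per_tile = 2LL * BM * BN * BK;                           // 262,144 FLOPs
constexpr long long bytes_per_tile = (BM * BK + BK * BN) * (long long)ELEM_BYTES;  // 8,192 bytes
constexpr double ratio = double(flops_per_tile) / double(bytes_per_tile);          // 32.0

A ratio of 32 is well above the 5:1 threshold, so under these assumptions the LDG-register variant would be chosen; smaller BM/BN tiles or wider element types lower the ratio toward the cp.async regime.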
Transform the sequential load-sync-compute-sync loop into a pipelined prologue/loop/epilogue structure.
Identify the three sections. The original loop body becomes three pieces:
- Prologue: load tile 0 into buf[0], synchronize, then enter the main loop.
- Main loop: for num_tiles - 1 iterations, overlap loading tile N+1 with computing tile N.
- Epilogue: compute the final tile loaded by the last main-loop iteration.

LDG-register variant structure:
// === LDG-register variant ===
// Prologue: load tile 0 into buf[0]
cooperative_load_tile(smem_a[0], smem_b[0], global_a, global_b, /*k_offset=*/0);
__syncthreads();
for (int tile = 0; tile < num_tiles - 1; tile++) {
int cur_buf = tile & 1;
int next_buf = 1 - cur_buf;
// Phase 1: LDG next tile into registers (non-blocking)
half reg_a[ELEMS_PER_THREAD_A], reg_b[ELEMS_PER_THREAD_B];
prefetch_tile_to_registers(reg_a, reg_b, global_a, global_b,
(tile + 1) * BK);
// Phase 2: Compute on current buffer (overlaps with LDG flight)
tensor_core_mma(smem_a[cur_buf], smem_b[cur_buf], acc);
// Phase 3: Drain registers into next buffer
__syncthreads();
store_registers_to_smem(smem_a[next_buf], smem_b[next_buf],
reg_a, reg_b);
__syncthreads();
}
// Epilogue: compute last tile
tensor_core_mma(smem_a[(num_tiles - 1) & 1], smem_b[(num_tiles - 1) & 1], acc);
// === cp.async variant ===
#include <cuda_pipeline.h>
// Prologue: async load tile 0 into buf[0]
cpasync_load_tile(smem_a[0], smem_b[0], global_a, global_b, /*k_offset=*/0);
__pipeline_commit();
__pipeline_wait_prior(0);
__syncthreads();
for (int tile = 0; tile < num_tiles - 1; tile++) {
int cur_buf = tile & 1;
int next_buf = 1 - cur_buf;
// Phase 1: cp.async next tile into next buffer (async, direct to smem)
cpasync_load_tile(smem_a[next_buf], smem_b[next_buf],
global_a, global_b, (tile + 1) * BK);
__pipeline_commit();
// Phase 2: Compute on current buffer (overlaps with LDGSTS in flight)
tensor_core_mma(smem_a[cur_buf], smem_b[cur_buf], acc);
// Phase 3: Wait for async copies to complete
__pipeline_wait_prior(0);
__syncthreads();
}
// Epilogue: compute last tile
tensor_core_mma(smem_a[(num_tiles - 1) & 1], smem_b[(num_tiles - 1) & 1], acc);
The main loop runs num_tiles - 1 iterations (tiles 0 through num_tiles - 2 index which tile is computed, while tiles 1 through num_tiles - 1 are loaded). The epilogue computes the tile loaded in the last iteration.

Expected: Restructured K-loop source code with clear prologue, main loop, and epilogue sections for the chosen variant.
On failure: The most common bug is an off-by-one in buffer indexing or forgetting the epilogue compute pass. Verify: prologue loads into buf[0], first main loop iteration computes buf[0] and loads into buf[1], second iteration computes buf[1] and loads into buf[0], and so on. The epilogue computes buf[(num_tiles - 1) & 1].
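To check the indexing on the host before touching the kernel, a small illustrative sketch (num_tiles is an example value) that prints the schedule described above:

// === Buffer schedule dry run (host-side sketch) ===
#include <cstdio>
int main() {
    const int num_tiles = 4;                       // example value
    printf("prologue: load tile 0 -> buf[0]\n");
    for (int tile = 0; tile < num_tiles - 1; tile++) {
        int cur_buf  = tile & 1;
        int next_buf = 1 - cur_buf;
        printf("iter %d: compute buf[%d], load tile %d -> buf[%d]\n",
               tile, cur_buf, tile + 1, next_buf);
    }
    printf("epilogue: compute buf[%d]\n", (num_tiles - 1) & 1);
    return 0;
}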
Declare the double-buffered shared memory and implement the load functions.
// Before (single buffer)
__shared__ half smem_a[BM * BK];
__shared__ half smem_b[BK * BN];
// After (double buffer)
__shared__ half smem_a[2][BM * BK];
__shared__ half smem_b[2][BK * BN];
__device__ void cpasync_load_tile(half* dst_a, half* dst_b,
const half* src_a, const half* src_b,
int k_offset) {
// Each thread copies its portion (16 bytes = 8 half values per cp.async)
int tid = threadIdx.x;
int bytes_per_thread = 16; // cp.async supports 4-, 8-, or 16-byte copies (the .cg variant requires 16)
// A tile: BM * BK elements, distributed across BLOCK_SIZE threads
int elems_a = BM * BK / BLOCK_SIZE;
for (int i = 0; i < elems_a; i += 8) {
int idx = tid * elems_a + i;
__pipeline_memcpy_async(dst_a + idx,
src_a + k_offset * BM + idx,
bytes_per_thread);
}
// B tile: BK * BN elements, distributed similarly
int elems_b = BK * BN / BLOCK_SIZE;
for (int i = 0; i < elems_b; i += 8) {
int idx = tid * elems_b + i;
__pipeline_memcpy_async(dst_b + idx,
src_b + k_offset * BN + idx,
bytes_per_thread);
}
}
// Declare register staging (size = elements per thread)
half reg_a[BM * BK / BLOCK_SIZE];
half reg_b[BK * BN / BLOCK_SIZE];
// Prefetch: LDG from global to registers (non-blocking, issued early)
for (int i = 0; i < BM * BK / BLOCK_SIZE; i++) {
int idx = threadIdx.x * (BM * BK / BLOCK_SIZE) + i;
reg_a[i] = global_a[k_offset * BM + idx];
}
// ... similarly for reg_b
// Store: STS from registers to shared memory (after __syncthreads)
for (int i = 0; i < BM * BK / BLOCK_SIZE; i++) {
int idx = threadIdx.x * (BM * BK / BLOCK_SIZE) + i;
smem_a[next_buf][idx] = reg_a[i];
}
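The main loop above calls prefetch_tile_to_registers and store_registers_to_smem; a hedged sketch of how the staging snippets might be packaged into those helpers (BM, BN, BK, BLOCK_SIZE and the tile layout follow the assumptions of the snippets above):

// === LDG-register staging helpers (sketch; layout assumptions as above) ===
constexpr int ELEMS_PER_THREAD_A = BM * BK / BLOCK_SIZE;
constexpr int ELEMS_PER_THREAD_B = BK * BN / BLOCK_SIZE;

__device__ void prefetch_tile_to_registers(half* reg_a, half* reg_b,
                                           const half* global_a, const half* global_b,
                                           int k_offset) {
    // LDGs issued here stay in flight while the caller computes on the current buffer
    for (int i = 0; i < ELEMS_PER_THREAD_A; i++) {
        int idx = threadIdx.x * ELEMS_PER_THREAD_A + i;
        reg_a[i] = global_a[k_offset * BM + idx];
    }
    for (int i = 0; i < ELEMS_PER_THREAD_B; i++) {
        int idx = threadIdx.x * ELEMS_PER_THREAD_B + i;
        reg_b[i] = global_b[k_offset * BN + idx];
    }
}

__device__ void store_registers_to_smem(half* smem_a_next, half* smem_b_next,
                                        const half* reg_a, const half* reg_b) {
    // Called only after __syncthreads(), so compute has finished reading this buffer
    for (int i = 0; i < ELEMS_PER_THREAD_A; i++)
        smem_a_next[threadIdx.x * ELEMS_PER_THREAD_A + i] = reg_a[i];
    for (int i = 0; i < ELEMS_PER_THREAD_B; i++)
        smem_b_next[threadIdx.x * ELEMS_PER_THREAD_B + i] = reg_b[i];
}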
Add __launch_bounds__(BLOCK_SIZE) to the kernel to give the compiler accurate occupancy information. Compile with nvcc --cubin -arch=sm_86 -O2 -o kernel.sm_86.cubin kernel.cu.

Expected: Compilable kernel with double-buffered shared memory and the chosen load mechanism. Successful cubin generation with no errors.
On failure: If compilation fails on pipeline API calls, ensure #include <cuda_pipeline.h> is present and CUDA toolkit is >= 11.0. If register spills occur (check nvcc --resource-usage), reduce the register staging array sizes by increasing BLOCK_SIZE or reducing BK.
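A minimal sketch of where the launch bound and the doubled buffers sit in the kernel (the kernel name and parameter list are illustrative):

// === Kernel skeleton with launch bounds (sketch) ===
__global__ void __launch_bounds__(BLOCK_SIZE)
gemm_pipelined(const half* __restrict__ global_a,
               const half* __restrict__ global_b,
               float* __restrict__ global_c,
               int M, int N, int K) {
    __shared__ half smem_a[2][BM * BK];   // double-buffered A tiles
    __shared__ half smem_b[2][BK * BN];   // double-buffered B tiles
    // ... prologue / main loop / epilogue as structured above
}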
Run the pipelined kernel against the CPU reference to confirm identical numerical output.
Build with nvcc -arch=sm_86 -O2 -o bench bench.cu -lcuda -I../../phase2/common. Use the appropriate tolerance preset (abs=0.5, rel=0.1; abs=1e-2, rel=1e-2; or abs=1e-3, rel=1e-3, depending on the precision being tested).

Expected: PASS at both small and target problem sizes with error bounds identical to the non-pipelined baseline.
On failure: Buffer indexing bug is the most likely cause. Verify: compute reads from buf[tile & 1] while loads write to buf[1 - (tile & 1)]. Check the epilogue processes buffer index (num_tiles - 1) & 1, not num_tiles & 1. For cp.async, verify __pipeline_wait_prior(0) completes before __syncthreads() — otherwise compute may read partially-written data.
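A minimal sketch of the tolerance check this step relies on (the function name is illustrative; the actual harness lives in bench.cu):

// === abs/rel tolerance comparison against the CPU reference (sketch) ===
#include <cmath>
#include <cstdio>
bool matches_reference(const float* gpu, const float* ref, int n,
                       float abs_tol, float rel_tol) {
    for (int i = 0; i < n; i++) {
        float diff = fabsf(gpu[i] - ref[i]);
        if (diff > abs_tol && diff > rel_tol * fabsf(ref[i])) {
            printf("MISMATCH at %d: gpu=%f ref=%f diff=%f\n", i, gpu[i], ref[i], diff);
            return false;
        }
    }
    return true;
}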
Measure the pipelined kernel against the non-pipelined baseline at the target problem size.
Compute speedup = pipelined_metric / baseline_metric.

| Variant | GFLOPS | Speedup vs Baseline |
|------------------|--------|---------------------|
| Baseline | XXX | 1.00x |
| LDG-register | XXX | X.XXx |
| cp.async (LDGSTS)| XXX | X.XXx |
Expected: Performance comparison table showing improvement. The chosen variant should show measurable speedup consistent with the compute/load ratio prediction.
On failure: If performance regresses, check three things: (1) SASS for unexpected instruction overhead (extra BAR.SYNC, register spills). (2) Shared memory did not cross the occupancy cliff — verify with nvcc --resource-usage or cuobjdump -res-usage. (3) The problem size produces enough tiles (K / BK >= 4) for pipelining to amortize the prologue/epilogue overhead.
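A minimal timing sketch behind the table above, assuming an M x N x K GEMM and a caller-supplied launch function (names are illustrative):

// === Kernel timing and GFLOPS conversion (sketch) ===
#include <cuda_runtime.h>

template <typename Launch>
float time_kernel_ms(Launch launch, int iters) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    launch();                                  // warm-up
    cudaEventRecord(start);
    for (int i = 0; i < iters; i++) launch();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / iters;
}

// GFLOPS = FLOPs / (ms * 1e-3 s) / 1e9 = FLOPs / (ms * 1e6); a GEMM does 2*M*N*K FLOPs
double gemm_gflops(double ms, long long M, long long N, long long K) {
    return (2.0 * M * N * K) / (ms * 1e6);
}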
Inspect the compiled SASS to confirm that global loads and Tensor Core instructions overlap within the main loop body.
Run cuobjdump -sass kernel.sm_86.cubin | grep -E 'IMMA|HMMA|LDGSTS|LDG|BAR' and check the main loop body:
- LDGSTS or LDG instructions appear before HMMA or IMMA instructions.
- No BAR.SYNC sits between the load instructions and the compute instructions (they must be free to overlap in the warp scheduler).
- A BAR.SYNC appears after the compute block, gating the next iteration's use of the loaded data.

# Full SASS pipeline verification
cuobjdump -sass kernel.sm_86.cubin | grep -E 'IMMA|HMMA|LDGSTS|LDG|BAR'
# Count Tensor Core compute (HMMA/IMMA) instructions in the SASS
cuobjdump -sass kernel.sm_86.cubin | grep -c 'HMMA\|IMMA'
# Check for register spills
nvcc --resource-usage --cubin -arch=sm_86 -O2 kernel.cu 2>&1 | grep -i spill
Expected: Annotated SASS excerpt showing the load-before-compute pattern with no intervening barriers. Zero register spills.
On failure: If the compiler reordered loads after compute (defeating the overlap), try: (1) #pragma unroll 1 on the main loop to prevent over-aggressive unrolling. (2) Separate load and compute into distinct inline functions to create a sequencing hint. (3) Use asm volatile("" ::: "memory") as a compiler fence between load and compute blocks (last resort — may inhibit other optimizations).
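If option (3) is needed, a sketch of where the fence sits in the main-loop body (last resort, as noted above):

// === Compiler fence between prefetch and compute (last-resort sketch) ===
prefetch_tile_to_registers(reg_a, reg_b, global_a, global_b, (tile + 1) * BK);
asm volatile("" ::: "memory");   // compiler-only barrier: keeps the LDGs from being sunk below the MMAs
tensor_core_mma(smem_a[cur_buf], smem_b[cur_buf], acc);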
Verification checklist:
- Main loop indexes buffers with the buf[tile & 1] pattern.
- Prologue loads into buf[0].
- Epilogue computes buf[(num_tiles - 1) & 1].
- SASS shows loads and compute free to overlap (no BAR.SYNC between LDGSTS/LDG and IMMA/HMMA).
- No register spills (nvcc --resource-usage).

Pitfalls:
- Compute smem_doubled before implementing. A kernel using 28 KB single-buffered jumps to 56 KB doubled, crossing the cliff and halving occupancy. This can turn a +20% pipelining gain into a -50% occupancy regression.
- Use buf[tile & 1] for the current compute buffer and buf[1 - (tile & 1)] for the next load buffer. A common mistake is using buf[(tile + 1) & 1] for the next buffer, which is equivalent to buf[1 - (tile & 1)] only when the buffer count is 2, but reads wrong if accidentally applied to the compute index.
- __pipeline_commit() must be called BEFORE the compute phase (it seals the batch of async copies). __pipeline_wait_prior(0) must be called AFTER the compute phase (it blocks until all committed copies complete). Swapping these makes the async copies synchronous, eliminating all overlap benefit.
- In the LDG-register variant, one __syncthreads() is needed between compute and the STS drain (so compute finishes reading the current buffer before it gets overwritten), and another __syncthreads() after the STS drain (so all threads finish writing before the next iteration reads). In the cp.async variant, __syncthreads() after __pipeline_wait_prior(0) ensures all threads see the completed async copies.
- __pipeline_memcpy_async requires the source address to be valid and aligned. At matrix edges where K is not a multiple of BK, the last tile may read out of bounds. Fall back to scalar loads with bounds checking for the final tile (see the guarded-load sketch at the end of this document), or pad the input matrices to a multiple of BK.

Related skill: analyze-kernel-bottleneck, which identifies whether the kernel is memory-bound and calculates the compute/load ratio that drives variant selection.
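For the boundary-condition pitfall above, a hedged sketch of a bounds-checked fallback load for a final partial K tile (the tile layout follows the earlier assumptions; out-of-range elements are zero-filled):

// === Guarded load for a partial last tile (sketch; A tile shown, B is analogous) ===
__device__ void load_tile_a_guarded(half* dst_a, const half* src_a,
                                    int k_offset, int K) {
    int elems = BM * BK / BLOCK_SIZE;
    for (int i = 0; i < elems; i++) {
        int idx = threadIdx.x * elems + i;
        int k   = k_offset + idx / BM;   // global K column under the assumed BM-per-column layout
        dst_a[idx] = (k < K) ? src_a[k_offset * BM + idx] : __float2half(0.0f);
    }
}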