NVIDIA CUDA parallel computing platform — use when writing .cu kernels, using cuBLAS/cuDNN/cuFFT/cuSPARSE/cuRAND/cuSolver, Thrust, or Cooperative Groups for GPU-accelerated computing
CUDA is NVIDIA's parallel computing platform and programming model for GPU-accelerated applications. It provides direct access to the GPU's virtual instruction set and parallel compute elements for executing kernels in C, C++, and Fortran.
- cuda-samples version: v13.1 (CUDA Toolkit 13.1)
- CUDALibrarySamples: main (Feb 2025)
- Language: C/C++ (`.cu` files)
- Licenses: BSD-3-Clause (cuda-samples), Apache-2.0 (CUDALibrarySamples)
```cuda
// Minimal kernel + launch (device buffers left uninitialized for brevity)
__global__ void addVectors(float *a, float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];                  // guard against overrun
}

int main() {
    int n = 1 << 20;
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));
    cudaMalloc(&d_c, n * sizeof(float));
    int threads = 256;
    int blocks = (n + threads - 1) / threads;       // round up to cover all n
    addVectors<<<blocks, threads>>>(d_a, d_b, d_c, n);
    cudaDeviceSynchronize();                        // wait for the kernel to finish
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```
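The minimal example above launches on uninitialized device buffers. In practice, input data lives on the host and is copied to the device with `cudaMemcpy` before the launch, with results copied back afterward. A sketch of that round trip (same kernel, illustrative values):

```cuda
// Vector add with explicit host <-> device transfers.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

__global__ void addVectors(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    std::vector<float> h_a(n, 1.0f), h_b(n, 2.0f), h_c(n);

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // Host -> device
    cudaMemcpy(d_a, h_a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    addVectors<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // Device -> host; a blocking cudaMemcpy on the default stream
    // also waits for the preceding kernel to complete.
    cudaMemcpy(h_c.data(), d_c, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %.1f\n", h_c[0]);  // 1.0 + 2.0 = 3.0
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```

Because the device-to-host `cudaMemcpy` on the default stream is blocking, no separate `cudaDeviceSynchronize()` is needed before reading `h_c`.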
- `__global__`: function executed on the GPU by many parallel threads
- `<<<gridDim, blockDim>>>`: launch configuration for parallelism
- Device memory: allocated with `cudaMalloc` and freed with `cudaFree`
- Unified memory (`cudaMallocManaged`): automatically migrates data between CPU and GPU

| Domain | File | Description |
|---|---|---|
| CUDA Runtime | api-runtime.md | Device mgmt, memory, streams, events, kernel launch |
| cuBLAS | api-cublas.md | Dense linear algebra: GEMM, GEMV, TRSM, batched ops |
| cuFFT | api-cufft.md | 1D/2D/3D FFT and batched transforms |
| cuSPARSE | api-cusparse.md | Sparse matrix ops: SpMM, SpMV, format conversions |
| cuRAND | api-curand.md | Random number generation on GPU |
| cuSolver | api-cusolver.md | Dense/sparse solvers: QR, LU, eigenvalue, SVD |
| Thrust | api-thrust.md | STL-like GPU algorithms: sort, reduce, transform, scan |
| Cooperative Groups | api-cooperative-groups.md | Flexible thread synchronization beyond blocks |
| Workflows | workflows.md | Complete working examples |
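The unified-memory path mentioned earlier replaces explicit copies with a single `cudaMallocManaged` allocation visible to both CPU and GPU. A minimal sketch (illustrative kernel and values):

```cuda
// Unified memory: one allocation, no explicit cudaMemcpy.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

int main() {
    int n = 1 << 20;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));  // accessible from host and device
    for (int i = 0; i < n; ++i) x[i] = 1.0f;   // initialize directly on the host
    scale<<<(n + 255) / 256, 256>>>(x, 3.0f, n);
    cudaDeviceSynchronize();                   // required before the host reads x
    printf("x[0] = %.1f\n", x[0]);
    cudaFree(x);
    return 0;
}
```

Unlike the explicit-copy version, the host must synchronize before touching managed memory that a kernel has written.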
See references/workflows.md for complete examples.
Quick reference:
- Wrap every runtime API call in a `CUDA_CHECK(err)` macro pattern
- Call `cudaDeviceSynchronize()` or a stream synchronize before reading results on the host
- Use `cudaOccupancyMaxPotentialBlockSize` to tune block dimensions
- Pass `cudaStreamNonBlocking` when creating non-default streams to avoid implicit synchronization with the null stream
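`CUDA_CHECK` is a project convention rather than a runtime API; one common formulation, shown with a couple of checked calls:

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Abort with file/line context if a runtime call fails.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error %s at %s:%d\n",             \
                    cudaGetErrorString(err_), __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

int main() {
    float *d;
    CUDA_CHECK(cudaMalloc(&d, 1024 * sizeof(float)));
    CUDA_CHECK(cudaMemset(d, 0, 1024 * sizeof(float)));
    // Kernel launches return no status directly; check afterward:
    //   kernel<<<grid, block>>>(...);
    //   CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaFree(d));
    return 0;
}
```

The `do { ... } while (0)` wrapper makes the macro behave as a single statement, so it composes safely with `if`/`else`.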