NVIDIA CUDA parallel computing platform — use when writing .cu kernels, using cuBLAS/cuDNN/cuFFT/cuSPARSE/cuRAND/cuSolver, Thrust, or Cooperative Groups for GPU-accelerated computing
CUDA is NVIDIA's parallel computing platform and programming model for GPU-accelerated applications. It provides direct access to the GPU's virtual instruction set and parallel compute elements for executing kernels in C, C++, and Fortran.
- cuda-samples version: v13.1 (CUDA Toolkit 13.1)
- CUDALibrarySamples: main (Feb 2025)
- Language: C/C++ (`.cu` files)
- Licenses: BSD-3-Clause (cuda-samples), Apache-2.0 (CUDALibrarySamples)
```cuda
// Minimal kernel + launch (device buffers left uninitialized for brevity)
__global__ void addVectors(float *a, float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];                  // guard against overrun
}

int main() {
    int n = 1 << 20;
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));
    cudaMalloc(&d_c, n * sizeof(float));
    int threads = 256;
    int blocks = (n + threads - 1) / threads;       // round up to cover all n
    addVectors<<<blocks, threads>>>(d_a, d_b, d_c, n);
    cudaDeviceSynchronize();                        // wait for the kernel to finish
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```
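The minimal example above launches on uninitialized device buffers. In practice, input data lives on the host and is copied to the device with `cudaMemcpy` before the launch, with results copied back afterward. A sketch of that round trip (same kernel, illustrative values):

```cuda
// Vector add with explicit host <-> device transfers.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

__global__ void addVectors(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    std::vector<float> h_a(n, 1.0f), h_b(n, 2.0f), h_c(n);

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // Host -> device
    cudaMemcpy(d_a, h_a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    addVectors<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // Device -> host; a blocking cudaMemcpy on the default stream
    // also waits for the preceding kernel to complete.
    cudaMemcpy(h_c.data(), d_c, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %.1f\n", h_c[0]);  // 1.0 + 2.0 = 3.0
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```

Because the device-to-host `cudaMemcpy` on the default stream is blocking, no separate `cudaDeviceSynchronize()` is needed before reading `h_c`.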
- `__global__`: function executed on the GPU by many parallel threads
- `<<<gridDim, blockDim>>>`: launch configuration for parallelism
- Device memory: allocated with `cudaMalloc` and freed with `cudaFree`
- Unified memory (`cudaMallocManaged`): automatically migrates data between CPU and GPU

| Domain | File | Description |
|---|---|---|
| CUDA Runtime | api-runtime.md | Device mgmt, memory, streams, events, kernel launch |
| cuBLAS | api-cublas.md | Dense linear algebra: GEMM, GEMV, TRSM, batched ops |
| cuFFT | api-cufft.md | 1D/2D/3D FFT and batched transforms |
| cuSPARSE | api-cusparse.md | Sparse matrix ops: SpMM, SpMV, format conversions |
| cuRAND | api-curand.md | Random number generation on GPU |
| cuSolver | api-cusolver.md | Dense/sparse solvers: QR, LU, eigenvalue, SVD |
| Thrust | api-thrust.md | STL-like GPU algorithms: sort, reduce, transform, scan |
| Cooperative Groups | api-cooperative-groups.md | Flexible thread synchronization beyond blocks |
| Workflows | workflows.md | Complete working examples |
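The unified-memory path mentioned earlier replaces explicit copies with a single `cudaMallocManaged` allocation visible to both CPU and GPU. A minimal sketch (illustrative kernel and values):

```cuda
// Unified memory: one allocation, no explicit cudaMemcpy.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

int main() {
    int n = 1 << 20;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));  // accessible from host and device
    for (int i = 0; i < n; ++i) x[i] = 1.0f;   // initialize directly on the host
    scale<<<(n + 255) / 256, 256>>>(x, 3.0f, n);
    cudaDeviceSynchronize();                   // required before the host reads x
    printf("x[0] = %.1f\n", x[0]);
    cudaFree(x);
    return 0;
}
```

Unlike the explicit-copy version, the host must synchronize before touching managed memory that a kernel has written.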
See references/workflows.md for complete examples.
Quick reference:
- Wrap every runtime API call in a `CUDA_CHECK(err)` macro pattern
- Call `cudaDeviceSynchronize()` or a stream synchronize before reading results on the host
- Use `cudaOccupancyMaxPotentialBlockSize` to tune block dimensions
- Pass `cudaStreamNonBlocking` when creating non-default streams to avoid implicit synchronization with the null stream
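`CUDA_CHECK` is a project convention rather than a runtime API; one common formulation, shown with a couple of checked calls:

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Abort with file/line context if a runtime call fails.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error %s at %s:%d\n",             \
                    cudaGetErrorString(err_), __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

int main() {
    float *d;
    CUDA_CHECK(cudaMalloc(&d, 1024 * sizeof(float)));
    CUDA_CHECK(cudaMemset(d, 0, 1024 * sizeof(float)));
    // Kernel launches return no status directly; check afterward:
    //   kernel<<<grid, block>>>(...);
    //   CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaFree(d));
    return 0;
}
```

The `do { ... } while (0)` wrapper makes the macro behave as a single statement, so it composes safely with `if`/`else`.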