Enforces readable C/C++/Rust/CUDA code rules: short functions, flat control flow, clear naming, idiomatic patterns. Use when writing, reviewing, or refactoring.
Apply these rules when writing, reviewing, or refactoring C, C++, Rust, or CUDA code. Inspired by *The Art of Readable Code* by Dustin Boswell and Trevor Foucher.
Core principle: Code should be easy to understand. The time it takes someone else (or future you) to understand the code is the ultimate metric.
- If an `if` sits inside a loop inside another `if`, extract the inner block into a helper function with a descriptive name (a sketch of the extracted helpers follows the example below).
- Use `continue` or `break` to skip iterations rather than wrapping the loop body in a conditional.

// Bad: nested and hard to follow
for (auto& user : users) {
    if (user.is_active()) {
        for (auto& order : user.orders()) {
            if (order.is_pending()) {
                process(order);
            }
        }
    }
}

// Good: flat, each function name explains what it does
auto active_users = get_active_users(users);
for (auto& user : active_users) {
    process_pending_orders(user.orders());
}
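A minimal sketch of the extracted helpers, assuming the `User`/`Order` interface from the example above (names and signatures are illustrative, not a fixed API):

// Possible shape of the helpers called in the "good" version (needs <vector>)
std::vector<User> get_active_users(const std::vector<User>& users) {
    std::vector<User> active;
    for (const auto& user : users) {
        if (user.is_active()) active.push_back(user);
    }
    return active;
}

void process_pending_orders(std::vector<Order>& orders) {
    for (auto& order : orders) {
        if (order.is_pending()) process(order);
    }
}

Returning copies from get_active_users is only one option; returning pointers or indices avoids the copy when the users must be mutated in place.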
- Use specific, concrete names: `fetch_page` not `get`, `num_retries` not `n`.
- Avoid generic names like `tmp`, `data`, `result`, `val`, `info`, `handle` — unless the scope is tiny (2-3 lines).
- Encode important details in the name: `max_items` not `limit`. If a boolean, use `is_`, `has_`, `should_`, `can_` prefixes.
- Use only well-known abbreviations (`num`, `max`, `min`, `err` are fine; `svc_mgr_cfg` is not). A naming sketch follows the example below.
- Put the varying value on the left of comparisons: `if (length > 10)` not `if (10 < length)`.
- Order `if/else` blocks deliberately: positive case first, simpler case first, or the more interesting case first.
- Use the ternary operator only for the simplest expressions; anything more complex deserves a full `if/else`.
- Break complex boolean conditions into named explaining variables.

// Bad
if (!(age >= 18 && has_id && !is_banned)) {
    deny();
}

// Good
bool is_eligible = age >= 18 && has_id && !is_banned;
if (!is_eligible) {
    deny();
}
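A small before/after sketch of the naming rules above (all names are illustrative):

// Bad: generic names, unlabeled boolean, ambiguous limit
std::string get(const std::string& u, int n);
bool banned;
int limit = 50;

// Good: specific verbs, prefixed boolean, detail encoded in the name
std::string fetch_page(const std::string& url, int num_retries);
bool is_banned;
int max_items_per_page = 50;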
- Name magic numbers: `if (retries > MAX_RETRIES)` not `if (retries > 3)`.
- Flag known problems with `// TODO:`, `// HACK:`, `// XXX:` plus an explanation.
- Write code so that auto-formatters (rustfmt, clang-format) produce readable output; if the formatted result is noisy, restructure the code instead of fighting the formatter.

// Bad: rustfmt expands this to 4 lines per field — noisy and repetitive
fn from_dict(cfg: &Bound<'_, PyDict>) -> PyResult<Self> {
    Ok(Self {
        rom: cfg.get_item("rom")?.ok_or_else(|| missing("rom"))?.extract()?,
        // ... each field becomes 4 lines after rustfmt
    })
}

// Good: extract a helper so each field stays one clean line
fn get_required<'py, T: FromPyObject<'py>>(cfg: &Bound<'py, PyDict>, key: &str) -> PyResult<T> {
    cfg.get_item(key)?
        .ok_or_else(|| PyKeyError::new_err(key.to_string()))?
        .extract()
}

fn from_dict(cfg: &Bound<'_, PyDict>) -> PyResult<Self> {
    Ok(Self {
        rom: get_required(cfg, "rom")?,
        actions: get_required(cfg, "actions")?,
    })
}
// Bad: clang-format wraps this into a hard-to-scan block
auto result = container.find(key)->second.get_value().transform(func).value_or(default_val);
// Good: name the intermediate step
auto& entry = container.find(key)->second;
auto result = entry.get_value().transform(func).value_or(default_val);
- Don't scatter `free()` calls across multiple return paths. A single cleanup section (the `goto cleanup` pattern below) is easier to audit.
- Consider `__attribute__((cleanup))` (GCC/Clang) when available for automatic cleanup; a sketch follows the example.

// Good: single cleanup path
int process_file(const char *path) {
    int ret = -1;
    FILE *fp = fopen(path, "r");
    if (!fp) return -1;

    char *buf = malloc(BUF_SIZE);
    if (!buf) goto cleanup_file;

    // ... do work; on failure, set ret and goto cleanup_buf ...

    ret = 0;
cleanup_buf:
    free(buf);
cleanup_file:
    fclose(fp);
    return ret;
}
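A minimal sketch of the `__attribute__((cleanup))` alternative (a GCC/Clang extension, not portable C; `BUF_SIZE` and the helper names are illustrative):

// Cleanup handlers receive a pointer to the annotated variable
// and run automatically on every path out of the scope.
static void close_file(FILE **fp) { if (*fp) fclose(*fp); }
static void free_buf(char **buf) { free(*buf); }   // free(NULL) is a no-op

int process_file(const char *path) {
    FILE *fp __attribute__((cleanup(close_file))) = fopen(path, "r");
    if (!fp) return -1;

    char *buf __attribute__((cleanup(free_buf))) = malloc(BUF_SIZE);
    if (!buf) return -1;            // fp is closed automatically

    // ... do work ...
    return 0;                       // buf is freed, then fp is closed
}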
Use `const` liberally:
- Mark pointer parameters `const` when the function doesn't modify the pointed-to data: `const char *msg`.
- Declare local variables `const` when they don't change after initialization.

Integer and size types:
- Use `<stdint.h>` types (`uint32_t`, `int64_t`) for data that crosses boundaries (files, network, hardware).
- Use `size_t` for sizes and counts, `ptrdiff_t` for pointer differences.
- Plain `int` and `unsigned` are fine for simple loop counters and local arithmetic.

Macros:
- Wrap statement-like macros in `do { ... } while(0)`.
- Parenthesize macro parameters: `#define SQUARE(x) ((x) * (x))`.
- Prefer `static inline` functions over macros when possible (type safety, debuggability).
- Use `_Generic` (C11) for type-safe "overloading" instead of macro tricks.

C++ ownership and resources:
- Use `std::unique_ptr` for exclusive ownership, `std::shared_ptr` only when shared ownership is genuinely needed.
- Avoid raw `new`/`delete` in application code — let smart pointers and containers handle it.
- Pass large read-only arguments by `const&`.
- Use `std::move` only when you truly want to transfer ownership — don't `std::move` from things you'll use again.

C++ vocabulary types and idioms:
- `std::array` over C arrays, `std::string` over `char*`, `std::vector` over `malloc`/`realloc`.
- `std::optional` over sentinel values, `std::variant` over type-unsafe unions.
- Range-based `for` loops: `for (const auto& item : container)`.
- Structured bindings: `auto [key, value] = *map.begin();`.
- `std::format` (C++20) or `fmt::format` over `sprintf` / string concatenation.
- `if constexpr` over SFINAE when possible.

Use `constexpr` and `const` aggressively:
- Mark functions `constexpr` when they can be evaluated at compile time.
- Use `constexpr` variables instead of `#define` for constants.
- Mark member functions `const` when they don't modify state.
- Use `consteval` (C++20) for functions that must be compile-time evaluated.

C++ error handling:
- Prefer `std::expected` (C++23) or `std::optional` for expected failures.
- Throw specific exception types such as `std::runtime_error`, not `std::exception`.
- Mark functions `noexcept` when they cannot throw (destructors, move operations).

Rust borrowing and iterators:
- Prefer borrowing (`&T`, `&mut T`) over cloning. Clone only when a separately owned copy is genuinely needed.
- Take `&str` over `String` in function parameters when you don't need ownership.
- Prefer iterator chains (`.iter().filter().map().collect()`) over manual loops with indices.
- Write `for item in &collection` instead of `for i in 0..collection.len()`.
- Reach for `enumerate()`, `zip()`, `chain()`, `chunks()` — the iterator API is rich.
- Avoid `.unwrap()` in production code — use `?`, `unwrap_or`, `unwrap_or_else`, or pattern matching (a sketch follows the iterator example below).

// Bad: manual indexing
let mut names = Vec::new();
for i in 0..users.len() {
    if users[i].is_active {
        names.push(users[i].name.clone());
    }
}

// Good: idiomatic iterator chain
let names: Vec<_> = users.iter()
    .filter(|u| u.is_active)
    .map(|u| u.name.clone())
    .collect();
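A sketch of the `.unwrap()` rule from the list above (the file path and fallback are illustrative):

use std::fs;
use std::io;

// Bad: panics if the file is missing or unreadable
// let config = fs::read_to_string("config.toml").unwrap();

// Good: propagate the error to the caller with `?` ...
fn load_config(path: &str) -> io::Result<String> {
    let config = fs::read_to_string(path)?;
    Ok(config)
}

// ... or fall back explicitly when the failure is acceptable
fn load_config_or_default(path: &str) -> String {
    fs::read_to_string(path).unwrap_or_else(|_| String::new())
}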
Rust types and visibility:
- Model alternatives with `enum` data variants instead of class hierarchies or tagged unions.
- `match` exhaustively — the compiler ensures you handle all cases.
- Use `if let` / `while let` for single-variant matching instead of a full `match`.
- Return `Result<T, E>` over panicking — make errors part of the type signature.
- Use newtypes (e.g., `struct UserId(u64)`) to prevent mixing up same-typed values.
- Use `Option<T>` instead of sentinel values or null pointers.
- Add `#[must_use]` on functions whose return values shouldn't be ignored.
- Implement the `From`/`Into` traits for type conversions over manual conversion functions.
- Keep `pub` surfaces small — expose only what's needed.
- Use `pub(crate)` for crate-internal visibility instead of full `pub`.

CUDA kernel naming:
- Name kernels after what they compute: `reduce_sum` not `kernel1` or `myKernel`.
- Signal where code runs: `reduce_sum_kernel` for `__global__` entry points, `warp_reduce` for `__device__` helpers.
- Name launch parameters: `threads_per_block`, `num_blocks` — not `tpb`, `nb`, or bare `256`.

// Bad: opaque names, magic numbers
__global__ void k1(float *a, float *b, int n) {
    int i = blockIdx.x * 256 + threadIdx.x;
    if (i < n) b[i] = a[i] * 2.0f;
}
k1<<<(n+255)/256, 256>>>(d_in, d_out, n);

// Good: clear intent, named constants
constexpr int THREADS_PER_BLOCK = 256;

__global__ void scale_kernel(const float *input, float *output,
                             float scale_factor, int num_elements) {
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < num_elements) {
        output[idx] = input[idx] * scale_factor;
    }
}

const int num_blocks = (num_elements + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
scale_kernel<<<num_blocks, THREADS_PER_BLOCK>>>(d_input, d_output, 2.0f, num_elements);
- Don't mix raw `cudaMalloc`/`cudaMemcpy` calls with application logic — wrap them in RAII classes or helper functions (a usage sketch follows the wrapper below).
- Separate `.cu` kernel files from `.cpp` host logic files, or at minimum group host and device code into clearly labeled sections.

// Good: RAII wrapper hides allocation/deallocation
template <typename T>
class DeviceBuffer {
    T *ptr_ = nullptr;
    size_t size_ = 0;
public:
    explicit DeviceBuffer(size_t count) : size_(count) {
        check_cuda(cudaMalloc(&ptr_, count * sizeof(T)));
    }
    ~DeviceBuffer() { cudaFree(ptr_); }

    DeviceBuffer(const DeviceBuffer&) = delete;
    DeviceBuffer& operator=(const DeviceBuffer&) = delete;
    DeviceBuffer(DeviceBuffer&& o) noexcept : ptr_(o.ptr_), size_(o.size_) { o.ptr_ = nullptr; o.size_ = 0; }

    T *get() { return ptr_; }
    const T *get() const { return ptr_; }
    size_t size() const { return size_; }

    void copy_from_host(const T *host_data) {
        check_cuda(cudaMemcpy(ptr_, host_data, size_ * sizeof(T), cudaMemcpyHostToDevice));
    }
    void copy_to_host(T *host_data) const {
        check_cuda(cudaMemcpy(host_data, ptr_, size_ * sizeof(T), cudaMemcpyDeviceToHost));
    }
};
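A usage sketch for the wrapper above, reusing `scale_kernel`, `THREADS_PER_BLOCK`, and `check_cuda` from the other examples in this document (the host data and sizes are illustrative):

// Host-side flow: allocate, upload, launch, download, with no raw cudaMalloc/cudaFree
const int num_elements = 1 << 20;
std::vector<float> host_data(num_elements, 1.0f);   // needs <vector>

DeviceBuffer<float> d_data(num_elements);
d_data.copy_from_host(host_data.data());

const int num_blocks = (num_elements + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
scale_kernel<<<num_blocks, THREADS_PER_BLOCK>>>(d_data.get(), d_data.get(), 2.0f, num_elements);
check_cuda(cudaGetLastError());

d_data.copy_to_host(host_data.data());
// Device memory is released automatically when d_data goes out of scope.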
- Check every CUDA API call through a `check_cuda` macro or inline function — not raw `if` blocks after every call.
- After kernel launches, check `cudaGetLastError()` + `cudaDeviceSynchronize()` during development.

// Good: concise, captures file/line info
inline void check_cuda(cudaError_t err, const char *file, int line) {
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error at %s:%d — %s\n",
                file, line, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}
// The macro reuses the function name; inside the expansion the name refers to the function.
#define check_cuda(err) check_cuda((err), __FILE__, __LINE__)

// Usage
check_cuda(cudaMalloc(&d_ptr, size));
my_kernel<<<grid, block>>>(d_ptr, n);
check_cuda(cudaGetLastError());
check_cuda(cudaDeviceSynchronize());
- Compute global indices once at the top of the kernel, then guard with an early return instead of wrapping the entire body in an `if`.
- Name 2D/3D indices after what they mean: `row`, `col`, `depth` — not `x`, `y`, `z`.

// Bad: index computed inline, entire body wrapped
__global__ void process(float *data, int width, int height) {
    if (blockIdx.x * blockDim.x + threadIdx.x < width &&
        blockIdx.y * blockDim.y + threadIdx.y < height) {
        int idx = (blockIdx.y * blockDim.y + threadIdx.y) * width +
                  (blockIdx.x * blockDim.x + threadIdx.x);
        data[idx] = data[idx] * 2.0f;
    }
}

// Good: named indices, early return
__global__ void process(float *data, int width, int height) {
    const int col = blockIdx.x * blockDim.x + threadIdx.x;
    const int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col >= width || row >= height) return;

    const int idx = row * width + col;
    data[idx] = data[idx] * 2.0f;
}
- Name shared memory arrays after what they hold: `shared_tile` not `smem` or `s`.
- Comment the size and layout of dynamically sized shared memory (`extern __shared__`).

// Good: clear phases, descriptive names
__global__ void tiled_matmul_kernel(const float *A, const float *B,
                                    float *C, int N) {
    __shared__ float tile_A[TILE_SIZE][TILE_SIZE];
    __shared__ float tile_B[TILE_SIZE][TILE_SIZE];

    const int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    const int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float accumulator = 0.0f;

    // Assumes N is a multiple of TILE_SIZE (no bounds checks, for clarity).
    for (int tile_idx = 0; tile_idx < N / TILE_SIZE; ++tile_idx) {
        // Phase 1: Load tiles from global memory
        tile_A[threadIdx.y][threadIdx.x] = A[row * N + tile_idx * TILE_SIZE + threadIdx.x];
        tile_B[threadIdx.y][threadIdx.x] = B[(tile_idx * TILE_SIZE + threadIdx.y) * N + col];
        __syncthreads();  // all threads have loaded their tile

        // Phase 2: Compute partial dot product from tiles
        for (int k = 0; k < TILE_SIZE; ++k) {
            accumulator += tile_A[threadIdx.y][k] * tile_B[k][threadIdx.x];
        }
        __syncthreads();  // done reading the tiles; safe to overwrite them next iteration
    }
    C[row * N + col] = accumulator;
}
- Extract complex per-thread logic into `__device__` helper functions.
- Use `__forceinline__ __device__` for small helpers that you want inlined without relying on compiler heuristics.

// Good: kernel reads like pseudocode, details in helpers
__forceinline__ __device__
float warp_reduce_sum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xffffffff, val, offset);
    }
    return val;
}

__forceinline__ __device__
float block_reduce_sum(float val) {
    __shared__ float warp_sums[32];
    const int lane = threadIdx.x % warpSize;
    const int warp_id = threadIdx.x / warpSize;

    val = warp_reduce_sum(val);
    if (lane == 0) warp_sums[warp_id] = val;
    __syncthreads();

    val = (threadIdx.x < blockDim.x / warpSize) ? warp_sums[lane] : 0.0f;
    if (warp_id == 0) val = warp_reduce_sum(val);
    return val;
}

__global__ void reduce_sum_kernel(const float *input, float *output, int n) {
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const float val = (idx < n) ? input[idx] : 0.0f;
    const float block_sum = block_reduce_sum(val);
    if (threadIdx.x == 0) atomicAdd(output, block_sum);
}
- Mark kernel parameters `const` for read-only device pointers — it documents intent and enables compiler optimizations.
- Add `__restrict__` when pointers don't alias — but add a comment explaining the non-aliasing guarantee.
- With unified memory (`cudaMallocManaged`), comment the expected access pattern (host-only init, device-only compute, etc.) — the implicit page migration behavior is not obvious.

// Good: const + restrict with clear intent
__global__ void vector_add_kernel(
    const float *__restrict__ a,      // read-only, no alias with output
    const float *__restrict__ b,      // read-only, no alias with output
    float *__restrict__ output,       // write-only
    int num_elements)
{
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= num_elements) return;
    output[idx] = a[idx] + b[idx];
}
- Put `__syncthreads()` on its own line, never buried inside a conditional branch that not all threads take — this is undefined behavior and hard to spot (a sketch follows the launch example below).
- Comment each `__syncthreads()` with the invariant it establishes: "all threads have loaded their tile", "partial sums are written to shared memory".
- Prefer warp-level primitives (`__shfl_sync`, `__ballot_sync`) with explicit masks over `__syncthreads()` when only the threads of one warp need to synchronize.
- Wrap kernel launches in small helper functions with named block and grid sizes.

// Bad: magic numbers, unclear intent
foo<<<(n+127)/128, 128, 0, stream>>>(d_ptr, n);

// Good: named, computed, self-documenting
void launch_scale_kernel(float *d_data, float factor, int n, cudaStream_t stream) {
    constexpr int BLOCK_SIZE = 256;
    const int grid_size = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    scale_kernel<<<grid_size, BLOCK_SIZE, 0, stream>>>(d_data, factor, n);
    check_cuda(cudaGetLastError());
}
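A sketch of the barrier rule above (the kernel, `staging`, and `num_valid` are illustrative; `num_valid` is assumed to be at most the block size of 256):

__global__ void gather_kernel(const float *input, float *staging, int num_valid) {
    __shared__ float shared_data[256];

    // Bad (undefined behavior): a barrier inside a branch that threads
    // with threadIdx.x >= num_valid never reach.
    //
    //   if (threadIdx.x < num_valid) {
    //       shared_data[threadIdx.x] = input[threadIdx.x];
    //       __syncthreads();
    //   }

    // Good: the condition guards only the store; every thread reaches the barrier.
    if (threadIdx.x < num_valid) {
        shared_data[threadIdx.x] = input[threadIdx.x];
    }
    __syncthreads();  // invariant: all num_valid elements are now in shared memory

    if (threadIdx.x < num_valid) {
        staging[threadIdx.x] = shared_data[num_valid - 1 - threadIdx.x];  // e.g., reverse
    }
}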
- Name streams and events by purpose: `compute_stream`, `transfer_stream` — not `s1`, `s2`.
- Document the dependency graph between streams and events in a comment next to the code that wires them together.

// Good: dependency graph documented, streams named by purpose
// Dependency graph:
//   upload (transfer_stream) --> compute (compute_stream) --> download (transfer_stream)
//   Event 'upload_done' gates compute start.
//   Event 'compute_done' gates download start.
// Note: h_input/h_output should be pinned (cudaMallocHost) for the async copies to overlap.
cudaStream_t transfer_stream, compute_stream;
cudaEvent_t upload_done, compute_done;
check_cuda(cudaStreamCreate(&transfer_stream));
check_cuda(cudaStreamCreate(&compute_stream));
check_cuda(cudaEventCreate(&upload_done));
check_cuda(cudaEventCreate(&compute_done));

// Stage 1: async upload
cudaMemcpyAsync(d_input, h_input, size, cudaMemcpyHostToDevice, transfer_stream);
cudaEventRecord(upload_done, transfer_stream);

// Stage 2: compute waits for upload
cudaStreamWaitEvent(compute_stream, upload_done);
process_kernel<<<grid, block, 0, compute_stream>>>(d_input, d_output, n);
cudaEventRecord(compute_done, compute_stream);

// Stage 3: download waits for compute
cudaStreamWaitEvent(transfer_stream, compute_done);
cudaMemcpyAsync(h_output, d_output, size, cudaMemcpyDeviceToHost, transfer_stream);

// Sync before host reads the result
cudaStreamSynchronize(transfer_stream);