Rust GPU Compute Development Guide — helps developers write GPU-accelerated Rust code with correct data-parallel patterns and awareness of GPU execution constraints. Use when the user mentions GPU compute, WGSL, compute shaders, wgpu, boids, particle systems, simulations, data-parallel Rust, or VectorWare-style Rust runtimes.
From weykon/rust-gpu-compute. Install: `npx claudepluginhub weykon/rust-gpu-compute`. This skill uses the workspace's default tool permissions.
This skill is based on a simple but important idea:
A Rust-flavored API can make GPU programming easier to write, but it does not remove the GPU execution model.
If a runtime lets you express GPU work with Rust code, you still need to respect warp behavior, memory bandwidth, synchronization costs, and host↔device transfer overhead.
This guide is qualitative rather than benchmark-driven. It is meant to support reasoning and code review: judging whether a workload is a good fit for the GPU and how to restructure it, not predicting exact speedups or providing universal thresholds and hardware-independent performance guarantees.
Threads inside the same warp execute together. If some threads take one branch and others take another branch, the warp often serializes those paths.
Bad pattern:

```wgsl
if (local_id == 0u) {
    sum += values[i];
} else {
    sum -= values[i];
}
```

Better pattern:

```wgsl
// select() is evaluated without a branch, so all lanes in the
// warp stay on the same code path.
let sign = select(-1.0, 1.0, local_id == 0u);
sum += sign * values[i];
```
Practical guidance: keep all threads in a workgroup on the same code path where possible, hoist data-dependent branches out of inner loops, and prefer branchless formulations (`select`, arithmetic masking) for per-lane differences.
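The same transformation can be sketched on the CPU in Rust: the data-dependent branch becomes a computed sign, so every element runs the same instructions. The function names here are illustrative, not part of any runtime API.

```rust
// Branchy version: each element takes a data-dependent path.
// On a GPU, lanes of the same warp would serialize here.
fn reduce_branchy(values: &[f32], ids: &[u32]) -> f32 {
    let mut sum = 0.0;
    for (&v, &id) in values.iter().zip(ids) {
        if id == 0 { sum += v } else { sum -= v }
    }
    sum
}

// Branchless version: the condition is folded into a sign
// multiplier, so all lanes execute identical instructions.
fn reduce_branchless(values: &[f32], ids: &[u32]) -> f32 {
    values
        .iter()
        .zip(ids)
        .map(|(&v, &id)| v * if id == 0 { 1.0 } else { -1.0 })
        .sum()
}
```

Both compute the same result; the difference is only in control flow shape, which is exactly what a warp cares about.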
GPU acceleration is not free. Uploading data to the GPU and downloading results has a fixed cost.
Rule of thumb: if the dataset is small or the computation is short-lived, CPU execution is often faster.
Bad pattern:

```rust
// Re-uploads input and downloads results for every small batch:
// the fixed transfer cost is paid over and over.
for batch in data.chunks(100) {
    gpu.upload(batch);
    gpu.run();
    gpu.download();
}
```

Better pattern:

```rust
// Upload once, keep the data resident on the device,
// download once at the end.
let device_buffer = gpu.upload(&data);
for _ in 0..iterations {
    gpu.run(&device_buffer);
}
let result = gpu.download(&device_buffer);
```
Practical guidance: batch work so data crosses the host-device boundary as few times as possible, keep intermediate results resident on the device, and amortize the fixed dispatch and transfer overhead over many iterations.
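A rough way to reason about the rule of thumb above is a back-of-envelope break-even model. All constants below are illustrative placeholders, not measurements; the point is the shape of the comparison, not the numbers.

```rust
// Toy break-even model: the GPU wins only when the compute
// savings exceed the fixed dispatch + transfer overhead.
// All rates are illustrative placeholders, not measured values.
fn gpu_worthwhile(
    bytes: u64,
    cpu_ns_per_byte: f64,
    gpu_ns_per_byte: f64,
    transfer_ns_per_byte: f64,
    fixed_overhead_ns: f64,
) -> bool {
    let cpu_time = bytes as f64 * cpu_ns_per_byte;
    let gpu_time =
        fixed_overhead_ns + bytes as f64 * (gpu_ns_per_byte + transfer_ns_per_byte);
    gpu_time < cpu_time
}
```

With any plausible constants, small inputs lose to the fixed overhead and large inputs win, which is the qualitative behavior the rule of thumb describes.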
Depending on the stack, backend, and runtime, GPU-side Rust often supports a restricted subset of the language or makes some language features impractical in hot device code.
Common limitations or pressure points include:
- `unwrap()` / `panic!()` behavior is often unsupported, undesirable, or highly runtime-specific on device
- `std` support is often reduced or replaced by a more restricted environment

Preferred mindset: write device code as if it were `no_std`: plain data types, no allocation in hot paths, and failure encoded in return values rather than unwinding.
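A small sketch of that mindset: instead of indexing that can panic, encode the fallback in the value. The function names are made up for illustration and are not part of any runtime API.

```rust
// Host-flavored style: fine on the CPU, but an out-of-bounds
// index panics, and panic machinery is often unavailable or
// undesirable in device code.
fn load_panicky(data: &[f32], i: usize) -> f32 {
    data[i]
}

// Device-flavored style: failure is a value, not an unwind.
// Here we fall back to a neutral value instead of panicking.
fn load_checked(data: &[f32], i: usize) -> f32 {
    data.get(i).copied().unwrap_or(0.0)
}
```

The same pattern applies to arithmetic (`checked_*` / saturating ops) and to any place where host Rust would reach for `unwrap()`.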
Adjacent threads should access adjacent memory addresses. If thread access is strided or random, memory bandwidth drops quickly.
Bad pattern:

```wgsl
// Adjacent threads hit addresses STRIDE apart: uncoalesced.
let value = data[global_id * STRIDE];
```

Better pattern:

```wgsl
// Adjacent threads hit adjacent addresses: coalesced.
let value = data[global_id];
```
Practical guidance: index buffers so that thread i touches element i, prefer struct-of-arrays layouts over array-of-structs, and treat any strided or gather-style access pattern as a bandwidth red flag.
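Coalescing is largely a data-layout question. A common restructuring is array-of-structs to struct-of-arrays, so that "thread i reads element i" touches contiguous memory. A CPU-side sketch of the layout change (field and type names are illustrative):

```rust
// Array-of-structs: when thread i reads particles[i].x, adjacent
// threads stride over the y/vx/vy fields -> poor coalescing.
struct ParticleAoS {
    x: f32,
    y: f32,
    vx: f32,
    vy: f32,
}

// Struct-of-arrays: when thread i reads xs[i], adjacent threads
// read adjacent addresses -> coalesced loads.
struct ParticlesSoA {
    xs: Vec<f32>,
    ys: Vec<f32>,
    vxs: Vec<f32>,
    vys: Vec<f32>,
}

impl ParticlesSoA {
    fn from_aos(aos: &[ParticleAoS]) -> Self {
        Self {
            xs: aos.iter().map(|p| p.x).collect(),
            ys: aos.iter().map(|p| p.y).collect(),
            vxs: aos.iter().map(|p| p.vx).collect(),
            vys: aos.iter().map(|p| p.vy).collect(),
        }
    }
}
```

On the GPU side, each of the SoA vectors would become its own buffer (or a tightly packed region), so a kernel touching only positions never pulls velocity data through the cache.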
Not every compute task should go to the GPU.
| Workload | Best default |
|---|---|
| large map / reduce style transforms | often GPU |
| image processing / filters | often GPU |
| Monte Carlo simulation | often GPU |
| particle updates | often GPU |
| heavy branchy business logic | often CPU |
| small, short-lived workloads with low arithmetic intensity | often CPU |
| irregular pointer-chasing | often CPU |
| nested, data-dependent control flow | usually CPU |
These categories are not interchangeable. They refer to different layers of the stack: shader authoring, code generation / compilation, and runtime execution model.
- Traditional WGSL path: Rust host code -> WGSL kernel -> GPU
- VectorWare-style runtime path: Rust host code -> Rust thread-like API -> runtime lowering -> GPU warp execution
- Rust-to-WGSL transpiler goal: Rust compute DSL -> generated WGSL / SPIR-V -> GPU
The point of this comparison is to separate layers, not to imply that all three approaches are equally mature, equally portable, or interchangeable in practice.
A VectorWare-style model should be read as one runtime approach, not as a universal replacement for shader programming. It can make some Rust concurrency patterns feel more natural, but it does not erase GPU-specific resource limits or performance trade-offs.
You get: familiar Rust syntax and tooling, thread-like abstractions for expressing parallel work, and host and device code in one language.

You do not get: freedom from warp divergence, memory coalescing, workgroup limits, or host-to-device transfer costs. Those constraints belong to the hardware, not the API.
Boids are a good example of code that looks easy on the CPU but needs restructuring for the GPU.
CPU-shaped version:
```rust
fn update_boid(boid: &mut Boid, neighbors: &[Boid]) {
    for neighbor in neighbors {
        if distance(boid, neighbor) < RANGE {
            boid.velocity += separation(boid, neighbor);
        }
    }
}
```
This is usually not GPU-friendly because it combines: per-element mutable state, an O(n²) all-pairs neighbor scan, and a data-dependent branch inside the inner loop.
More GPU-friendly direction:
In practice, a scalable boids pipeline often looks more like: build grid -> index or sort -> neighbor pass -> integrate. The exact shape depends on the simulation size and memory budget, but naive all-pairs traversal stops scaling quickly.
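The grid step above can be sketched on the CPU in a few lines. This is a minimal sketch assuming a fixed cell size equal to the interaction range; a real GPU version would replace the `HashMap` with a sorted index or counting pass, but the structure is the same: bin, then visit only nearby cells.

```rust
use std::collections::HashMap;

// Assumed cell size; typically chosen equal to the interaction range.
const CELL: f32 = 10.0;

fn cell_of(x: f32, y: f32) -> (i32, i32) {
    ((x / CELL).floor() as i32, (y / CELL).floor() as i32)
}

// Pass 1 (build grid): bin boid indices by grid cell.
fn build_grid(pos: &[(f32, f32)]) -> HashMap<(i32, i32), Vec<usize>> {
    let mut grid: HashMap<(i32, i32), Vec<usize>> = HashMap::new();
    for (i, &(x, y)) in pos.iter().enumerate() {
        grid.entry(cell_of(x, y)).or_default().push(i);
    }
    grid
}

// Pass 2 (neighbor pass): candidates come only from the 3x3 block
// of surrounding cells, instead of an all-pairs O(n^2) scan.
fn neighbors(grid: &HashMap<(i32, i32), Vec<usize>>, x: f32, y: f32) -> Vec<usize> {
    let (cx, cy) = cell_of(x, y);
    let mut out = Vec::new();
    for dx in -1..=1 {
        for dy in -1..=1 {
            if let Some(ids) = grid.get(&(cx + dx, cy + dy)) {
                out.extend_from_slice(ids);
            }
        }
    }
    out
}
```

The integrate pass then reads only these candidates per boid, which is what lets the pipeline keep scaling where the naive all-pairs loop stops.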
When helping with Rust GPU compute code, apply the sections above as a review checklist. Before moving a Rust workload to the GPU, check:

- Is the dataset large enough, and the kernel long-lived enough, to amortize transfer and dispatch overhead?
- Is the inner loop mostly branch-free, or can divergent branches be made branchless?
- Do adjacent threads read adjacent memory, or does the layout need an AoS-to-SoA restructuring?
- Does the hot path avoid panics, allocation, and `std` features the device environment may not provide?
This repository exists to capture practical constraints and working patterns for Rust GPU compute, and to make it easier to reason about when Rust-on-GPU approaches are a good fit.