High-performance Rust optimization. Profiling, benchmarking, SIMD, memory optimization, and zero-copy techniques. Focuses on measurable improvements with evidence-based optimization.
npx claudepluginhub terraphim/terraphim-skills --plugin terraphim-engineering-skills
This skill uses the workspace's default tool permissions.
You are a Rust performance expert specializing in optimization, profiling, and high-performance systems. You make evidence-based optimizations and avoid premature optimization.
CRITICAL: If an optimization changes parsing, I/O, or float formatting, add or extend a regression test BEFORE benchmarking.
Optimization Workflow:
1. BASELINE -> Establish current behavior with tests
2. TEST -> Add regression tests for the code you'll change
3. OPTIMIZE -> Make the change
4. VERIFY -> Run tests to prove correctness preserved
5. BENCHMARK -> Only now measure the improvement
# The workflow in practice
cargo test # 1-2. Verify baseline and add regression tests
# ... make optimization ...
cargo test # 4. Verify correctness preserved
cargo bench # 5. Measure improvement
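Step 2 in practice -- a hypothetical regression test (the `format_value` helper stands in for whatever function the optimization will touch) that pins current float output before any change:
// Hypothetical example: lock in today's formatting so the optimized path must match it.
#[test]
fn float_formatting_is_stable() {
    assert_eq!(format_value(0.1_f64), "0.1");
    assert_eq!(format_value(12345.678_f64), "12345.678");
}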
Covered areas: profiling, benchmarking, optimization, and memory efficiency.
# CPU profiling with samply
cargo build --release
samply record ./target/release/my-app
# Memory profiling with heaptrack
heaptrack ./target/release/my-app
heaptrack_gui heaptrack.my-app.*.gz
# Cache analysis with cachegrind
valgrind --tool=cachegrind ./target/release/my-app
# Flamegraph generation
cargo flamegraph -- <args>
Maintain multiple build profiles for different purposes (following ripgrep's approach):
# Cargo.toml
[profile.release]
opt-level = 3
lto = "thin"
codegen-units = 1
[profile.release-lto]
inherits = "release"
lto = "fat"
[profile.bench]
inherits = "release"
debug = true # Enable profiling symbols
IMPORTANT: Always document which profile was used in benchmark reports.
Every performance-related change must include a Criterion benchmark comparing the original and optimized implementations:
use criterion::{black_box, criterion_group, criterion_main, Criterion, BenchmarkId};
fn benchmark_variants(c: &mut Criterion) {
    let mut group = c.benchmark_group("processing");
    for size in [100, 1000, 10000].iter() {
        let data = generate_data(*size);
        group.bench_with_input(
            BenchmarkId::new("original", size),
            &data,
            |b, data| b.iter(|| original_impl(black_box(data))),
        );
        group.bench_with_input(
            BenchmarkId::new("optimized", size),
            &data,
            |b, data| b.iter(|| optimized_impl(black_box(data))),
        );
    }
    group.finish();
}
criterion_group!(benches, benchmark_variants);
criterion_main!(benches);
# Compare implementations with hyperfine
hyperfine --warmup 3 \
'./target/release/app-before input.txt' \
'./target/release/app-after input.txt'
# With statistical analysis
hyperfine --warmup 3 --runs 10 --export-markdown bench.md \
'./target/release/app input.txt'
## Performance Results
**Machine**: M1 MacBook Pro, 16GB RAM
**Profile**: release-lto (LTO=fat, codegen-units=1)
**Dataset**: 1GB test file, 1 billion rows
| Metric | Before | After | Change |
|-----------------|-----------|-----------|--------|
| Time (mean) | 45.2s | 12.3s | -73% |
| Memory (peak) | 2.1 GB | 850 MB | -60% |
| Throughput | 22 MB/s | 81 MB/s | +3.7x |
**Profiling**: Flamegraph shows hot path moved from X to Y.
// Before: Allocates on every call
fn process(items: &[Item]) -> Vec<String> {
    items.iter().map(|i| i.name.clone()).collect()
}
// After: Reuse buffer
fn process_into(items: &[Item], output: &mut Vec<String>) {
    output.clear();
    output.extend(items.iter().map(|i| i.name.clone()));
}
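Usage sketch of the reuse pattern (hypothetical `batches` source and `consume` sink): the Vec's capacity is allocated once and reused across batches.
let mut names: Vec<String> = Vec::with_capacity(1024);
for batch in &batches {
    process_into(batch, &mut names); // clears and refills, reusing the Vec's capacity
    consume(&names);                 // hypothetical consumer of this batch's names
}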
// Use SmallVec for small collections
use smallvec::SmallVec;
type Tags = SmallVec<[String; 4]>; // Stack-allocated for <= 4 items
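Quick sketch of the inline-versus-spill behavior (SmallVec's `spilled()` reports whether the container moved to the heap):
let mut tags: Tags = SmallVec::new();
tags.push("fast".to_string());
tags.push("rust".to_string());
assert!(!tags.spilled()); // 2 <= 4 items: stored inline, no heap allocation for the container
tags.extend((0..5).map(|i| i.to_string()));
assert!(tags.spilled());  // 7 items: spilled to the heap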
// Before: Array of Structs (AoS)
struct Entity {
    position: Vec3,
    velocity: Vec3,
    health: f32,
}
let entities: Vec<Entity>;
// After: Struct of Arrays (SoA) - better cache locality
struct Entities {
    positions: Vec<Vec3>,
    velocities: Vec<Vec3>,
    health: Vec<f32>,
}
// Process all positions together (cache-friendly)
fn update_positions(entities: &mut Entities, dt: f32) {
    for (pos, vel) in entities.positions.iter_mut().zip(&entities.velocities) {
        *pos += *vel * dt;
    }
}
use std::borrow::Cow;
// Parse without copying when possible
struct ParsedData<'a> {
    name: Cow<'a, str>,
    values: &'a [u8],
}
fn parse(input: &[u8]) -> Result<ParsedData<'_>> {
    // Borrow from input when no transformation needed
    // Only allocate when escaping/decoding required
}
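Minimal sketch of the borrow-or-allocate decision behind `ParsedData::name` (hypothetical `field_text` helper; assumes UTF-8 input with backslash escapes):
fn field_text(bytes: &[u8]) -> Cow<'_, str> {
    let s = std::str::from_utf8(bytes).expect("valid UTF-8");
    if s.contains('\\') {
        Cow::Owned(s.replace("\\n", "\n")) // decoding an escape forces one allocation
    } else {
        Cow::Borrowed(s)                   // common case: zero-copy borrow from the input
    }
}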
// Use portable-simd (nightly-only; requires #![feature(portable_simd)]) or explicit intrinsics
use std::simd::{f32x8, SimdFloat};
fn sum_simd(data: &[f32]) -> f32 {
    let chunks = data.chunks_exact(8);
    let remainder = chunks.remainder();
    let sum = chunks
        .map(|chunk| f32x8::from_slice(chunk))
        .fold(f32x8::splat(0.0), |acc, x| acc + x)
        .reduce_sum();
    sum + remainder.iter().sum::<f32>()
}
// Use string interning for repeated strings
use string_interner::{StringInterner, DefaultSymbol};
struct Interned {
    interner: StringInterner,
}
impl Interned {
    fn intern(&mut self, s: &str) -> DefaultSymbol {
        self.interner.get_or_intern(s)
    }
}
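Usage sketch: repeated strings intern to the same symbol and can be resolved back on demand.
let mut names = Interned { interner: StringInterner::default() };
let a = names.intern("content-type");
let b = names.intern("content-type");
assert_eq!(a, b); // one stored string, shared symbol
assert_eq!(names.interner.resolve(a), Some("content-type"));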
// Use CompactString for small strings
use compact_str::CompactString;
let small: CompactString = "hello".into(); // No heap allocation
use crossbeam::epoch::{self, Atomic, Owned};
use std::sync::atomic::Ordering;
// Epoch-based reclamation: safe memory management without GC
struct ConcurrentStack<T> {
    head: Atomic<Node<T>>,
}
struct Node<T> {
    data: T,
    next: Atomic<Node<T>>,
}
impl<T> ConcurrentStack<T> {
    fn push(&self, data: T) {
        let mut node = Owned::new(Node {
            data,
            next: Atomic::null(),
        });
        let guard = epoch::pin();
        loop {
            let head = self.head.load(Ordering::Relaxed, &guard);
            node.next.store(head, Ordering::Relaxed);
            match self.head.compare_exchange(
                head, node, Ordering::Release, Ordering::Relaxed, &guard,
            ) {
                Ok(_) => break,
                Err(e) => node = e.new,
            }
        }
    }
}
// Concurrent queue for producer-consumer pipelines
use crossbeam::queue::ArrayQueue;
let queue = ArrayQueue::new(1024);
// Producer: queue.push(item) -- returns Err if full
// Consumer: queue.pop() -- returns None if empty
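Sketch of a bounded producer/consumer pair over ArrayQueue (std scoped threads; yields on full/empty for brevity):
let queue = ArrayQueue::<u64>::new(1024);
std::thread::scope(|s| {
    s.spawn(|| {
        for i in 0..10_000u64 {
            while queue.push(i).is_err() { // Err means the queue is full: yield and retry
                std::thread::yield_now();
            }
        }
    });
    s.spawn(|| {
        let mut received = 0u32;
        while received < 10_000 {
            if queue.pop().is_some() {     // None means the queue is momentarily empty
                received += 1;
            } else {
                std::thread::yield_now();
            }
        }
    });
});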
use rayon::prelude::*;
// Simple parallel iteration
fn process_all(items: &mut [Item]) {
    items.par_iter_mut().for_each(|item| {
        item.transform();
    });
}
// Parallel chunking for better cache locality
fn sum_parallel(data: &[f64]) -> f64 {
    data.par_chunks(1024)
        .map(|chunk| chunk.iter().sum::<f64>())
        .sum()
}
// Custom thread pool for isolated workloads
let pool = rayon::ThreadPoolBuilder::new()
    .num_threads(4)
    .thread_name(|i| format!("search-worker-{}", i))
    .stack_size(8 * 1024 * 1024)
    .build()
    .unwrap();
pool.install(|| {
    // All rayon operations here use this pool
    data.par_iter().for_each(|item| process(item));
});
use std::sync::atomic::{AtomicU64, AtomicBool, Ordering};
// Ordering guide:
// Relaxed -- No ordering guarantees. Counters, statistics.
// Acquire -- Reads see all writes before the paired Release.
// Release -- Writes become visible to paired Acquire reads.
// AcqRel -- Both Acquire and Release. Read-modify-write ops.
// SeqCst -- Total global ordering. Rarely needed, highest cost.
struct Metrics {
    request_count: AtomicU64, // Relaxed: just a counter
    is_ready: AtomicBool,     // Acquire/Release: guards initialization
}
impl Metrics {
    fn increment(&self) {
        self.request_count.fetch_add(1, Ordering::Relaxed);
    }
    fn mark_ready(&self) {
        // Release: all prior writes visible to Acquire readers
        self.is_ready.store(true, Ordering::Release);
    }
    fn wait_ready(&self) {
        // Acquire: sees all writes before the Release store
        while !self.is_ready.load(Ordering::Acquire) {
            std::hint::spin_loop();
        }
    }
}
// BAD: Adjacent atomics on the same cache line cause contention
struct BadCounters {
    counter_a: AtomicU64, // Same 64-byte cache line as counter_b
    counter_b: AtomicU64,
}
// GOOD: Pad to separate cache lines
#[repr(align(64))]
struct PaddedCounter {
    value: AtomicU64,
}
struct GoodCounters {
    counter_a: PaddedCounter, // Own cache line
    counter_b: PaddedCounter, // Own cache line
}
loom and ThreadSanitizer are mandatory for lock-free code, along with stress tests that exercise concurrent access.
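A minimal loom sketch (assumes a `loom` dev-dependency; in real code the atomic types are usually aliased behind cfg(loom) so the same source compiles against std and loom):
#[cfg(test)]
mod concurrency_tests {
    use loom::sync::atomic::{AtomicUsize, Ordering};
    use loom::sync::Arc;
    use loom::thread;

    #[test]
    fn no_lost_increments() {
        // loom::model re-runs the closure under every legal interleaving
        loom::model(|| {
            let counter = Arc::new(AtomicUsize::new(0));
            let handles: Vec<_> = (0..2)
                .map(|_| {
                    let counter = Arc::clone(&counter);
                    thread::spawn(move || {
                        counter.fetch_add(1, Ordering::Relaxed);
                    })
                })
                .collect();
            for h in handles {
                h.join().unwrap();
            }
            assert_eq!(counter.load(Ordering::Relaxed), 2);
        });
    }
}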
# Cargo.toml -- additional profiles beyond release/release-lto
[profile.release-small]
inherits = "release"
opt-level = "z" # Optimize for binary size
strip = "symbols" # Remove symbol table
panic = "abort" # No unwinding machinery
codegen-units = 1 # Better optimization, slower compile
[profile.release-wasm]
inherits = "release"
opt-level = "s" # Balance size and speed for WASM
lto = true
[profile.release]
strip = "none" # Keep everything (debugging)
# strip = "debuginfo" # Remove debug info, keep symbols (profiling)
# strip = "symbols" # Remove all symbols (distribution)
When to use each:
"none" -- Development, debugging, profiling"debuginfo" -- Production with profiling capability (flamegraphs still work)"symbols" -- Final distribution binaries (smallest size)# Cargo.toml
[features]
default = ["tls"]
tls = ["dep:rustls"]
simd = [] # Enable SIMD code paths
jemalloc = ["dep:tikv-jemallocator"]
# Reduce binary size by making features optional
full = ["tls", "simd", "jemalloc"]
minimal = [] # No optional features
// Use cfg to conditionally compile
#[cfg(feature = "simd")]
fn process_fast(data: &[u8]) -> Vec<u8> {
    // SIMD implementation
}
#[cfg(not(feature = "simd"))]
fn process_fast(data: &[u8]) -> Vec<u8> {
    // Scalar fallback
}
# Install cross (uses Docker for cross-compilation)
cargo install cross
# Build for Linux from macOS
cross build --release --target x86_64-unknown-linux-gnu
# Build for ARM (Raspberry Pi)
cross build --release --target aarch64-unknown-linux-gnu
# Build for Windows from macOS/Linux
cross build --release --target x86_64-pc-windows-gnu
// jemalloc for better multithreaded allocation performance
#[cfg(feature = "jemalloc")]
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;
// Arena allocation for request-scoped data
use bumpalo::Bump;
fn handle_request(data: &[u8]) -> Response {
    let arena = Bump::new();
    // All allocations freed at once when arena drops
    let parsed = arena.alloc(parse(data));
    let transformed = arena.alloc(transform(parsed));
    build_response(transformed)
}
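Variant sketch: keep one arena per worker and reset it between requests so the backing memory is reused (hypothetical `requests` iterator and `handle_one` helper):
let mut arena = Bump::new();
for req in requests {
    handle_one(&arena, req); // all per-request allocations land in the arena
    arena.reset();           // cheap mass free; the current chunk's capacity is retained
}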
// Likely/unlikely branch hints
#[cold]
fn handle_error() { ... }
// Force inlining
#[inline(always)]
fn hot_function() { ... }
// Prevent inlining
#[inline(never)]
fn cold_function() { ... }
// Enable specific optimizations
#[target_feature(enable = "avx2")]
unsafe fn simd_process() { ... }
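Sketch of calling a `#[target_feature]` function safely via runtime detection (`simd_process_avx2` and `process_scalar` are hypothetical):
fn process_dispatch(data: &[u8]) -> u64 {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            // SAFETY: the AVX2 requirement was just verified at runtime.
            return unsafe { simd_process_avx2(data) };
        }
    }
    process_scalar(data) // portable fallback
}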
// Check struct size and alignment
println!("Size: {}", std::mem::size_of::<MyStruct>());
println!("Align: {}", std::mem::align_of::<MyStruct>());
// With #[repr(C)], fields keep declaration order, so order them largest-first to reduce padding
#[repr(C)]
struct Optimized {
    large: u64,  // 8 bytes
    medium: u32, // 4 bytes
    small: u16,  // 2 bytes
    tiny: u8,    // 1 byte
    _pad: u8,    // explicit padding
}
Before submitting a performance-related PR:
[ ] Regression tests added/extended for changed code paths
[ ] Tests pass BEFORE benchmarking
[ ] Benchmark script included (Criterion or hyperfine)
[ ] Before/after numbers on same machine
[ ] Build profile explicitly noted (release, release-lto, etc.)
[ ] If >50% improvement: flamegraph/perf evidence included
[ ] If unsafe code: invariants documented + tests proving them