From go-dev
Profiles and optimizes Go code for CPU hotspots, memory allocations, and concurrency using pprof, benchmarks, benchstat, and statistical verification.
npx claudepluginhub gopherguides/gopher-ai --plugin go-dev
Slash command
/go-dev:go-profiling-optimization
<!-- cache:start -->
Applies Go performance optimization patterns for allocation, CPU, memory layout, GC, pooling, caching, and hot-path optimization. Use when profiling identifies a bottleneck.
Orchestrates cross-language performance profiling and optimization: diagnoses symptoms, dispatches expert agents, benchmarks before/after changes.
Persona: You are a Go performance engineer. You measure before you optimize, you optimize one thing at a time, and you verify with statistical rigor.
Modes:
Principle: "Never optimize without profiling data. You will optimize the wrong thing."
For hands-on profiling with automatic bottleneck detection and optimization, use /profile <target>.
NEVER optimize without profiling data. Every optimization must be driven by evidence.
Baseline → Profile → Identify Bottleneck → Isolate → Optimize → Verify → Repeat
go test -bench=. -benchmem -count=6
benchstat old.bench new.bench
| Profile | Flag | Use When |
|---|---|---|
| CPU | -cpuprofile=cpu.pprof | Function is slow, high CPU usage |
| Memory (heap) | -memprofile=mem.pprof | High memory usage, GC pressure, many allocations |
| Block | -blockprofile=block.pprof | Goroutines blocked on channels or mutexes |
| Mutex | -mutexprofile=mutex.pprof | Lock contention suspected |
| Goroutine | runtime/pprof.Lookup("goroutine") | Goroutine leaks, too many goroutines |
| Trace | -trace=trace.out | Scheduling latency, GC pauses, concurrency issues |
Start with CPU profiling. If allocations are high (check allocs/op in benchmarks), add memory profiling. Use trace only for concurrency investigation.
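As a starting point for the baseline step, here is a minimal sketch of a benchmark that reports allocs/op. `joinWords` is a hypothetical function under test, not something from this skill, and `testing.Benchmark` is used only so the sketch runs as a plain program. In a real project the same body would live in a `BenchmarkXxx` function in a `_test.go` file.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// joinWords is a hypothetical function under test, used only for illustration.
func joinWords(words []string) string {
	var b strings.Builder
	for _, w := range words {
		b.WriteString(w)
	}
	return b.String()
}

// sink keeps the result alive so the compiler cannot discard the call.
var sink string

// benchJoinWords has the same shape as a BenchmarkXxx function in a _test.go file.
func benchJoinWords(b *testing.B) {
	words := []string{"alpha", "beta", "gamma", "delta"}
	b.ReportAllocs() // same effect as -benchmem for this benchmark
	b.ResetTimer()   // exclude the setup above from the measurement
	for i := 0; i < b.N; i++ {
		sink = joinWords(words)
	}
}

func main() {
	// testing.Benchmark runs a benchmark function outside of "go test".
	res := testing.Benchmark(benchJoinWords)
	fmt.Printf("%d iterations, %d ns/op, %d allocs/op\n",
		res.N, res.NsPerOp(), res.AllocsPerOp())
}
```

If allocs/op is higher than expected for the work being done, that is the signal to add a memory profile.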
Always sort by cumulative (-cum) to find where time is actually spent. A function with low flat but high cum is calling expensive children.
go tool pprof -list=FunctionName cpu.pprof
Shows source code with per-line timing. Time values on the LEFT show how much each line costs. This is your most powerful diagnostic — it tells you the EXACT lines to optimize.
go tool pprof -peek=FunctionName cpu.pprof
Shows who calls the hot function and what it calls. Useful for understanding the full hot path.
go build -gcflags="-m" ./...
Shows the compiler's allocation decisions:
- escapes to heap: variable allocated on heap (costs GC time)
- does not escape: stays on stack (free, automatic cleanup)
- moved to heap: compiler couldn't prove it stays in scope
- interface{} / any (interface boxing)

// Escapes: returns pointer, forces heap allocation
func newThing() *Thing {
	t := Thing{Name: "x"}
	return &t
}

// Stays on stack: caller owns the memory
func initThing(t *Thing) {
	t.Name = "x"
}
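The effect of the two shapes above can be checked at run time with `testing.AllocsPerRun`. This is a sketch under one assumption that matters: both functions are marked `//go:noinline`, because otherwise the compiler may inline `newThing` and keep the value on the stack, which would hide the difference.

```go
package main

import (
	"fmt"
	"testing"
)

type Thing struct{ Name string }

// newThing returns a pointer to a local, so the value must live on the heap.
//
//go:noinline
func newThing() *Thing {
	t := Thing{Name: "x"}
	return &t
}

// initThing only writes through the pointer, so the caller's value can stay on the stack.
//
//go:noinline
func initThing(t *Thing) {
	t.Name = "x"
}

// allocsNew reports the average heap allocations per call of newThing.
func allocsNew() float64 {
	return testing.AllocsPerRun(1000, func() {
		_ = newThing()
	})
}

// allocsInit reports the average heap allocations when the caller owns the value.
func allocsInit() float64 {
	return testing.AllocsPerRun(1000, func() {
		var t Thing
		initThing(&t)
	})
}

func main() {
	fmt.Printf("newThing: %.0f allocs/call, initThing: %.0f allocs/call\n",
		allocsNew(), allocsInit())
}
```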
// Avoid: grows slice multiple times, each grow allocates
items := []string{}
for _, v := range data {
	items = append(items, v)
}

// Good: allocate once with known capacity
items := make([]string, 0, len(data))
for _, v := range data {
	items = append(items, v)
}
// Avoid: each += allocates a new string
result := ""
for _, s := range parts {
	result += s
}

// Good: Builder grows one internal buffer (amortized), and String() does not copy it
var b strings.Builder
for _, s := range parts {
	b.WriteString(s)
}
result := b.String()
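// Avoid: buf escapes through the io.Reader interface call, so it is heap-allocated every call
func readByte(r io.Reader) (byte, error) {
	var buf [1]byte
	_, err := r.Read(buf[:])
	return buf[0], err
}

// Good: caller provides a reusable buffer (capacity must be at least 1)
func readByte(r io.Reader, buf []byte) (byte, error) {
	// Read exactly one byte so a larger buffer does not silently consume extra input
	_, err := r.Read(buf[:1])
	return buf[0], err
}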
// Avoid — each Read() call hits the OS
count := process(file)
// Good — bufio batches reads, dramatically fewer syscalls
br := bufio.NewReader(file)
count := process(br)
var bufPool = sync.Pool{
	New: func() any {
		return new(bytes.Buffer)
	},
}

func process() {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer bufPool.Put(buf)
	// use buf...
}
CRITICAL: sync.Pool objects may be collected at any GC cycle. Never rely on pool for correctness — only for performance.
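A related pitfall is returning memory that still belongs to a pooled object. The sketch below shows one safe shape, assuming a hypothetical `greet` function: the result is copied out with `String()` before the buffer goes back to the pool.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

var bufPool = sync.Pool{
	New: func() any {
		return new(bytes.Buffer)
	},
}

// greet is a hypothetical function that formats into a pooled buffer.
func greet(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // a pooled buffer may still hold data from a previous use
	defer bufPool.Put(buf)

	buf.WriteString("hello, ")
	buf.WriteString(name)

	// String() copies the bytes, so the result stays valid after Put.
	// Returning buf.Bytes() instead would hand out memory the pool may reuse.
	return buf.String()
}

func main() {
	fmt.Println(greet("gopher"))
	fmt.Println(greet("world"))
}
```

The code behaves identically whether `Get` returns a recycled buffer or a fresh one from `New`, which is what keeps the pool a pure performance optimization.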
// Wastes memory: padding between fields
type Bad struct {
	a bool  // 1 byte + 7 padding
	b int64 // 8 bytes
	c bool  // 1 byte + 7 padding
} // = 24 bytes

// Compact: fields ordered by size descending
type Good struct {
	b int64 // 8 bytes
	a bool  // 1 byte
	c bool  // 1 byte + 6 padding
} // = 16 bytes
// Avoid in hot loops — each call boxes the int into interface{}
fmt.Sprintf("%d", n)
// Good — no interface boxing
strconv.Itoa(n)
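Both calls produce the same text, so the swap is safe wherever only the decimal form is needed. The sketch below also shows `strconv.AppendInt`, which writes into a caller-supplied buffer; the wrapper function names are illustrative.

```go
package main

import (
	"fmt"
	"strconv"
)

// formatSprintf converts via fmt, which passes n as an interface value.
func formatSprintf(n int) string {
	return fmt.Sprintf("%d", n)
}

// formatItoa converts directly, with no interface boxing.
func formatItoa(n int) string {
	return strconv.Itoa(n)
}

// appendInt appends the decimal form of n to dst, so a hot loop can reuse one buffer.
func appendInt(dst []byte, n int) []byte {
	return strconv.AppendInt(dst, int64(n), 10)
}

func main() {
	buf := make([]byte, 0, 32)
	for _, n := range []int{7, -42, 12345} {
		buf = appendInt(buf[:0], n) // reuse the same backing array
		fmt.Println(formatSprintf(n), formatItoa(n), string(buf))
	}
}
```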
Generally available since Go 1.21 (preview in Go 1.20). Uses production CPU profiles to guide compiler optimizations (inlining, devirtualization). Typical improvement: roughly 2-14% CPU reduction in the Go team's published benchmarks.
- Place the profile as default.pgo in the main package directory
- go build picks up default.pgo automatically
# Collect production profile (30 seconds) into the main package directory
curl -o cmd/myapp/default.pgo 'http://localhost:6060/debug/pprof/profile?seconds=30'
# Rebuild with PGO (automatic: compiler finds default.pgo)
go build -o myapp ./cmd/myapp/
Use production profiles, NOT synthetic benchmarks. PGO optimizes the code paths that actually run in production.
Controls GC frequency. Default: GOGC=100 (GC runs when heap doubles).
- GOGC=50: GC runs more often, lower memory usage, higher CPU for GC
- GOGC=200: GC runs less often, higher memory usage, lower CPU for GC
- GOGC=off: disables GC entirely (batch jobs that exit quickly)
GOMEMLIMIT sets a soft memory limit. GC becomes more aggressive as the heap approaches this limit.
GOMEMLIMIT=512MiB ./myapp
CRITICAL: This is a SOFT limit. It does not prevent OOM if live heap exceeds the limit. It tells the GC to work harder to stay under the target.
- GOGC=off (minimize GC overhead)
Use runtime/trace when profiling shows the code isn't CPU-bound but is still slow. That pattern usually points to goroutine scheduling issues, GC pauses, or contention.
go test -trace=trace.out -bench=. -run=^$ ./pkg/
Traces can be converted to pprof profiles for text-based analysis:
go tool trace -pprof=net trace.out > trace-net.pprof # Network blocking
go tool trace -pprof=sync trace.out > trace-sync.pprof # Sync blocking
go tool trace -pprof=syscall trace.out > trace-syscall.pprof # Syscall blocking
go tool trace -pprof=sched trace.out > trace-sched.pprof # Scheduler latency
When performance problems span multiple services, use OpenTelemetry for distributed tracing.
OpenTelemetry complements local profiling: pprof finds hotspots within a service, OTel finds which service is the bottleneck.
For production systems, consider always-on profiling:
Continuous profiling catches performance regressions that only appear under production load patterns.
- unsafe) without measurement
- Using interface{} / any in performance-critical paths without benchmarking the cost
- Omitting the -count flag: a single benchmark run has no statistical validity
- Calling runtime.ReadMemStats() frequently: it triggers a stop-the-world pause
- Use b.Loop() (Go 1.24+) instead of for i := 0; i < b.N; i++. It prevents the compiler from optimizing away the benchmark target. For Go < 1.24, use a sink variable with b.N
- Use b.ReportAllocs() or -benchmem to track allocations per operation
- Use -count=6 minimum for benchstat comparisons (more runs = better statistics)
- Use b.StopTimer()/b.StartTimer() to exclude setup from timing
- Compare with benchstat old.bench new.bench and check p-value < 0.05 for significance
Understanding relative costs helps prioritize optimizations:
| Operation | Latency | Relative |
|---|---|---|
| L1 cache reference | 0.5 ns | 1x |
| Main memory reference | 100 ns | 200x |
| SSD random read (NVMe) | 100 µs | 200,000x |
| HDD disk seek | 10 ms | 20,000,000x |
| Network round-trip (same datacenter) | 500 µs | 1,000,000x |
| Network round-trip (cross datacenter) | 150 ms | 300,000,000x |
Focus on I/O and allocation patterns first. Nanosecond-level CPU optimizations rarely matter unless you're in a tight inner loop processing millions of items.
Powered by Gopher Guides training materials.