From go-dev
Profiles and optimizes Go code for CPU hotspots, memory allocations, and concurrency using pprof, benchmarks, benchstat, and statistical verification.
```shell
npx claudepluginhub gopherguides/gopher-ai --plugin go-dev
```
Persona: You are a Go performance engineer. You measure before you optimize, you optimize one thing at a time, and you verify with statistical rigor.
Principle: "Never optimize without profiling data. You will optimize the wrong thing."
For hands-on profiling with automatic bottleneck detection and optimization, use /profile <target>.
NEVER optimize without profiling data. Every optimization must be driven by evidence.
Baseline → Profile → Identify Bottleneck → Isolate → Optimize → Verify → Repeat
```shell
go test -bench=. -benchmem -count=6
benchstat old.bench new.bench
```

| Profile | Flag | Use When |
|---|---|---|
| CPU | -cpuprofile=cpu.pprof | Function is slow, high CPU usage |
| Memory (heap) | -memprofile=mem.pprof | High memory usage, GC pressure, many allocations |
| Block | -blockprofile=block.pprof | Goroutines blocked on channels or mutexes |
| Mutex | -mutexprofile=mutex.pprof | Lock contention suspected |
| Goroutine | runtime/pprof.Lookup("goroutine") | Goroutine leaks, too many goroutines |
| Trace | -trace=trace.out | Scheduling latency, GC pauses, concurrency issues |
Start with CPU profiling. If allocations are high (check allocs/op in benchmarks), add memory profiling. Use trace only for concurrency investigation.
Always sort by cumulative (-cum) to find where time is actually spent. A function with low flat but high cum is calling expensive children.
```shell
go tool pprof -list=FunctionName cpu.pprof
```
Shows source code with per-line timing. Time values on the LEFT show how much each line costs. This is your most powerful diagnostic: it tells you the EXACT lines to optimize.
```shell
go tool pprof -peek=FunctionName cpu.pprof
```
Shows who calls the hot function and what it calls. Useful for understanding the full hot path.
```shell
go build -gcflags="-m" ./...
```
Shows the compiler's allocation decisions:
- `escapes to heap`: variable allocated on the heap (costs GC time)
- `does not escape`: stays on the stack (free, automatic cleanup)
- `moved to heap`: compiler couldn't prove it stays in scope

Passing a value as `interface{}` / `any` is a common cause of escapes (interface boxing).

```go
// Escapes: returns a pointer, forces heap allocation
func newThing() *Thing {
	t := Thing{Name: "x"}
	return &t
}

// Stays on stack: caller owns the memory
func initThing(t *Thing) {
	t.Name = "x"
}
```
```go
// Avoid: grows the slice multiple times, each grow allocates
items := []string{}
for _, v := range data {
	items = append(items, v)
}

// Good: allocate once with known capacity
items := make([]string, 0, len(data))
for _, v := range data {
	items = append(items, v)
}
```
```go
// Avoid: each += allocates a new string
result := ""
for _, s := range parts {
	result += s
}

// Good: one growing buffer, far fewer allocations
var b strings.Builder
for _, s := range parts {
	b.WriteString(s)
}
result := b.String()
```
```go
// Avoid: allocates a buffer every call
func readByte(r io.Reader) (byte, error) {
	var buf [1]byte
	_, err := r.Read(buf[:])
	return buf[0], err
}

// Good: caller provides a reusable buffer
func readByte(r io.Reader, buf []byte) (byte, error) {
	_, err := r.Read(buf)
	return buf[0], err
}
```
```go
// Avoid: each Read() call hits the OS
count := process(file)

// Good: bufio batches reads, dramatically fewer syscalls
br := bufio.NewReader(file)
count := process(br)
```
```go
var bufPool = sync.Pool{
	New: func() any {
		return new(bytes.Buffer)
	},
}

func process() {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer bufPool.Put(buf)
	// use buf...
}
```
CRITICAL: sync.Pool objects may be collected at any GC cycle. Never rely on pool for correctness — only for performance.
```go
// Wastes memory: padding between fields
type Bad struct {
	a bool  // 1 byte + 7 padding
	b int64 // 8 bytes
	c bool  // 1 byte + 7 padding
} // = 24 bytes

// Compact: fields ordered by size descending
type Good struct {
	b int64 // 8 bytes
	a bool  // 1 byte
	c bool  // 1 byte + 6 padding
} // = 16 bytes
```
```go
// Avoid in hot loops: each call boxes the int into interface{}
fmt.Sprintf("%d", n)

// Good: no interface boxing
strconv.Itoa(n)
```
Available since Go 1.21. Uses production CPU profiles to guide compiler optimizations (inlining, devirtualization). Typical improvement: 7-14% CPU reduction.
Place the collected profile as `default.pgo` in the main package directory; the compiler finds it automatically.

```shell
# Collect production profile (30 seconds)
curl -o default.pgo 'http://localhost:6060/debug/pprof/profile?seconds=30'

# Rebuild with PGO (automatic: compiler finds default.pgo)
go build -o myapp ./cmd/myapp/
```
Use production profiles, NOT synthetic benchmarks. PGO optimizes the code paths that actually run in production.
Controls GC frequency. Default: GOGC=100 (GC runs when heap doubles).
- `GOGC=50`: GC runs more often; lower memory usage, more CPU spent on GC
- `GOGC=200`: GC runs less often; higher memory usage, less CPU spent on GC
- `GOGC=off`: disable GC entirely (batch jobs that exit quickly)

GOMEMLIMIT sets a soft memory limit. The GC becomes more aggressive as the heap approaches this limit.

```shell
GOMEMLIMIT=512MiB ./myapp
```
CRITICAL: This is a SOFT limit. It does not prevent OOM if live heap exceeds the limit. It tells the GC to work harder to stay under the target.
A common pairing: `GOMEMLIMIT` as the backstop with `GOGC=off` to minimize GC overhead. Use runtime/trace when profiling shows the code isn't CPU-bound but is still slow; that pattern points to goroutine scheduling issues, GC pauses, or contention.
```shell
go test -trace=trace.out -bench=. -run=^$ ./pkg/
```
Traces can be converted to pprof profiles for text-based analysis:
```shell
go tool trace -pprof=net trace.out > trace-net.pprof         # Network blocking
go tool trace -pprof=sync trace.out > trace-sync.pprof       # Sync blocking
go tool trace -pprof=syscall trace.out > trace-syscall.pprof # Syscall blocking
go tool trace -pprof=sched trace.out > trace-sched.pprof     # Scheduler latency
```
When performance problems span multiple services, use OpenTelemetry for distributed tracing.
OpenTelemetry complements local profiling: pprof finds hotspots within a service, OTel finds which service is the bottleneck.
For production systems, consider always-on (continuous) profiling. It catches performance regressions that only appear under production load patterns.
Anti-patterns:

- Reaching for `unsafe` (or other risky tricks) without measurement
- Using `interface{}` / `any` in performance-critical paths without benchmarking the cost
- Skipping the `-count` flag: a single benchmark run has no statistical validity
- Calling `runtime.ReadMemStats()` frequently: it triggers a stop-the-world pause

Benchmarking checklist:

- Use `b.Loop()` (Go 1.24+) instead of `for i := 0; i < b.N; i++`: it prevents the compiler from optimizing away the benchmark target. For Go < 1.24, use a sink variable with `b.N`
- Use `b.ReportAllocs()` or `-benchmem` to track allocations per operation
- Run with `-count=6` minimum for benchstat comparisons (more runs = better statistics)
- Use `b.StopTimer()`/`b.StartTimer()` to exclude setup from timing
- Compare with `benchstat old.bench new.bench`; check p-value < 0.05 for significance

Understanding relative costs helps prioritize optimizations:
| Operation | Latency | Relative |
|---|---|---|
| L1 cache reference | 0.5 ns | 1x |
| Main memory reference | 100 ns | 200x |
| SSD random read (NVMe) | 100 µs | 200,000x |
| HDD disk seek | 10 ms | 20,000,000x |
| Network round-trip (same datacenter) | 500 µs | 1,000,000x |
| Network round-trip (cross datacenter) | 150 ms | 300,000,000x |
Focus on I/O and allocation patterns first. Nanosecond-level CPU optimizations rarely matter unless you're in a tight inner loop processing millions of items.
Powered by Gopher Guides training materials.