Profiling methodology and optimization strategy for performance work. Use when the user asks to "make this faster", "optimize", "profile", "reduce latency", "fix slow", "improve throughput", or when investigating performance regressions.
Optimize what you've measured, not what you suspect. Performance work without profiling is superstition. Measure first, hypothesize second, optimize third, measure again.
Never skip a step: measure, hypothesize, optimize, then measure again.
Every optimization trades one resource for another. Make the trade explicit.
| Trade-off | Example |
|---|---|
| Latency vs. throughput | Batching increases throughput but raises per-request latency |
| Memory vs. CPU | Caching trades memory for fewer computations |
| Simplicity vs. speed | Hand-rolled loops beat abstractions but obscure intent |
| Startup vs. runtime | Lazy loading delays startup cost to first use |
| Bandwidth vs. latency | Compression saves bandwidth, costs CPU time |
| Consistency vs. speed | Eventual consistency is faster than strong consistency |
Ask: "Which resource is scarce in this context?" Optimize for the scarce one.
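The memory-vs-CPU row is the easiest trade to make explicit in code. A minimal sketch using the standard library's `functools.lru_cache`, with a bounded size so the memory side of the trade is stated up front (`expensive` is an illustrative stand-in):

```python
from functools import lru_cache

# Trade memory for CPU: remember up to 4096 results so repeat calls
# skip the computation. `maxsize` bounds the memory cost explicitly.
@lru_cache(maxsize=4096)
def expensive(n: int) -> int:
    # Stand-in for a costly pure computation.
    return sum(i * i for i in range(n))

expensive(10_000)              # miss: computed
expensive(10_000)              # hit: served from cache
print(expensive.cache_info())  # reports hits=1, misses=1
```

If memory is the scarce resource in your context, shrink `maxsize` or skip the cache entirely.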
Start with the outermost measurement, narrow inward:
| Tool type | Reveals | Misses |
|---|---|---|
| Wall-clock timer | Total duration | Where time is spent |
| CPU profiler | Hot functions | I/O waits, lock contention |
| Memory profiler | Allocations, leaks | Cache effects |
| Flame graph | Call hierarchy costs | Inlined functions |
| Tracing | Request flow, latency | Aggregate behavior |
| Load testing | Throughput limits | Root cause of limits |
Use at least two tool types. A CPU profiler won't find a database bottleneck.
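For example, pairing a wall-clock timer with Python's built-in `cProfile` covers the first two rows of the table; `workload` here is an arbitrary stand-in:

```python
import cProfile
import io
import pstats
import time

def workload():
    # Stand-in for the code under investigation.
    return sorted(str(i) for i in range(50_000))

# Tool 1: wall-clock timer reveals total duration.
start = time.perf_counter()
workload()
elapsed = time.perf_counter() - start
print(f"wall clock: {elapsed:.3f}s")

# Tool 2: CPU profiler reveals where that time is spent.
profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```

Neither tool shows I/O waits or lock contention; reach for tracing or a load test when those are suspected.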
**I/O-bound**
Symptoms: Low CPU usage, high wait times, slow under load.
Causes: Synchronous I/O, N+1 queries, unbatched requests, no connection pooling.
Remedies: Batch operations, add caching, use async I/O, pool connections.
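A runnable sketch of the batching remedy, with a counter standing in for real database round trips (the `fetch_*` functions are illustrative, not a real client API):

```python
# `round_trips` stands in for real network or database calls.
round_trips = 0

def fetch_one(uid):
    global round_trips
    round_trips += 1                 # one round trip per id
    return {"id": uid}

def fetch_batch(uids):
    global round_trips
    round_trips += 1                 # one round trip for the whole batch
    return [{"id": u} for u in uids]

ids = list(range(100))

unbatched = [fetch_one(u) for u in ids]   # N+1 pattern
n_plus_one_trips = round_trips

round_trips = 0
batched = fetch_batch(ids)                # batched
print(n_plus_one_trips, round_trips)      # → 100 1
```

Same data either way; the unbatched version pays per-call latency one hundred times.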
**CPU-bound**
Symptoms: High CPU usage, scales with input size, unaffected by I/O improvements.
Causes: Inefficient algorithms, unnecessary computation, poor data structures.
Remedies: Better algorithms first (O(n) beats optimized O(n^2)), then micro-optimize the hot path.
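A concrete instance of "better algorithms first": duplicate detection drops from O(n^2) to O(n) by swapping a list for a set, with no change in behavior:

```python
def has_duplicate_quadratic(items):
    seen = []                      # list membership test is O(n)
    for x in items:
        if x in seen:
            return True
        seen.append(x)
    return False

def has_duplicate_linear(items):
    seen = set()                   # set membership test is O(1) average
    for x in items:
        if x in seen:
            return True
        seen.add(x)
    return False

data = list(range(10_000)) + [0]   # one duplicate at the end
assert has_duplicate_quadratic(data) == has_duplicate_linear(data) == True
```

No amount of micro-optimization inside the quadratic loop catches up with the complexity-class change.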
**Memory-bound**
Symptoms: Growing memory usage, GC pauses, OOM errors, cache thrashing.
Causes: Unbounded caches, leaked references, large intermediate allocations, fragmentation.
Remedies: Bound caches (LRU), stream instead of buffer, pool allocations, reduce object size.
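A small sketch of "stream instead of buffer": both functions compute the same total, but the streaming version holds one line in memory at a time instead of the whole file:

```python
import os
import tempfile

def total_buffered(path):
    with open(path) as f:
        lines = f.readlines()                # whole file in memory at once
    return sum(len(line) for line in lines)

def total_streamed(path):
    with open(path) as f:
        return sum(len(line) for line in f)  # one line in memory at a time

# Demo on a tiny temp file; the win appears when the file is large.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("alpha\nbeta\ngamma\n")
    path = f.name
try:
    buffered = total_buffered(path)
    streamed = total_streamed(path)
finally:
    os.remove(path)
print(buffered, streamed)  # → 17 17
```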
**Contention-bound**
Symptoms: Low individual resource usage but poor throughput under concurrency.
Causes: Lock contention, shared mutable state, thread pool exhaustion, connection limits.
Remedies: Reduce critical section scope, use lock-free structures, partition state, increase pool size.
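A minimal illustration of reducing critical-section scope: the computation runs outside the lock, so the lock only serializes the shared-dictionary update:

```python
import threading

counts = {}
lock = threading.Lock()

def expensive_compute(key):
    # Stand-in for real work that touches no shared state.
    return sum(range(100))

def record(key):
    value = expensive_compute(key)   # compute OUTSIDE the critical section
    with lock:                       # lock held only for the shared update
        counts[key] = counts.get(key, 0) + value

threads = [threading.Thread(target=record, args=("hits",)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counts["hits"])  # → 39600 (8 threads x 4950 each)
```

Had `expensive_compute` run inside the `with lock:` block, all eight threads would execute it serially and concurrency would buy nothing.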
- **Premature optimization** — Optimizing before measuring. The bottleneck is never where you think.
- **Micro-benchmarking in isolation** — Benchmarking a function outside its real context misses cache effects, GC pressure, and contention.
- **Optimizing the wrong metric** — Reducing P50 latency when users complain about P99. Improving throughput when the problem is startup time.
- **Death by a thousand cuts** — No single bottleneck, just accumulated inefficiency. Profile holistically, not function-by-function.
- **Caching without an invalidation strategy** — A cache speeds reads, but stale data causes correctness bugs. Define TTL and invalidation before adding a cache.
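A minimal TTL cache sketch (illustrative, not production code) of what defining TTL and invalidation up front looks like: every entry expires, and writers can invalidate explicitly:

```python
import time

class TTLCache:
    """Toy cache where staleness is bounded by design."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}             # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]     # expired: invalidate on read
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)

    def invalidate(self, key):       # explicit invalidation on writes
        self._store.pop(key, None)

cache = TTLCache(ttl_seconds=0.05)
cache.set("user:1", {"name": "Ada"})
assert cache.get("user:1") == {"name": "Ada"}
time.sleep(0.06)
assert cache.get("user:1") is None   # expired, not silently stale
```

Real deployments usually also need write-through invalidation and a size bound (LRU), but the principle is the same: decide how entries die before any entry is born.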
When analyzing performance:
## Performance Analysis
### Goal
[Specific metric and target: "Reduce API P99 latency from 800ms to 200ms"]
### Baseline
[Current measurements with methodology]
### Profile Summary
[Where time/memory/resources go, ranked by impact]
### Recommendations
1. [Change] — [Expected improvement] — [Trade-off]
2. [Change] — [Expected improvement] — [Trade-off]
### Not Optimizing
[What was considered but rejected, and why]
"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered." — Donald Knuth
Optimize the critical 3%, not the other 97%.
- /debugging — Performance regressions are bugs; profiling is debugging for speed
- skills/FRAMEWORKS.md — Full framework index
- RECIPE.md — Agent recipe for parallel decomposition (2 workers)