From grimoire
Profiles system performance to identify bottlenecks using measurement before optimization — follows Brendan Gregg's USE Method and Google pprof methodology.
npx claudepluginhub jeffreytse/grimoire --plugin grimoire
How this skill is triggered (by the user, by Claude, or both): Slash command /grimoire:audit-performance
The summary Claude sees in its skill listing, used to decide when to auto-load this skill:
Measure before optimizing. Profile systematically to find real bottlenecks — not guessed ones.
Adopted by: Netflix (Brendan Gregg's USE Method, documented in "Systems Performance", 2nd ed., 2020, used across Netflix engineering), Google (pprof, Google's open-source profiler, used in production Go and C++ services), Meta (continuous profiling in production), Cloudflare (published profiling-first optimization case studies including 10× throughput gains on their Rust workers).
Impact: Studies consistently show engineers guess wrong about where performance bottlenecks are: Jon Bentley's "Programming Pearls" (1986) found developers misidentify the hot path 90% of the time. Measure-first optimization at Google yielded 3–10× improvements in specific subsystems vs. intuition-driven rewrites that often produced no measurable gain (Google SRE Book, Chapter 19). Cloudflare's profiling-first approach on their TLS stack achieved a 40% latency reduction by fixing a single memory allocation hotspot that no engineer had suspected.
Why best: Premature optimization wastes engineering time on non-bottlenecks. The 90/10 rule of thumb, usually cited alongside Knuth's 1974 warning against premature optimization, says that roughly 90% of time is spent in 10% of code. Only measurement identifies which 10%. The alternative, rewriting in a "faster" language, usually loses to profile-and-fix in the same language for the same effort.
Sources: Brendan Gregg, "Systems Performance" (2nd ed., 2020); Google SRE Book (2016); Jon Bentley, "Programming Pearls" (1986); Cloudflare engineering blog
Before profiling, write one sentence:
"[What] is [how slow/expensive] under [what conditions], measured by [what metric]."
Example: "The /search endpoint takes >2s at p99 under 100 RPS load, measured by our Datadog APM trace."
Without this, you don't know when you're done. Without a baseline, you can't measure improvement.
Measure before touching any code. Capture:
Use realistic load, not synthetic benchmarks. Tools:
- HTTP load: wrk, k6, hey, vegeta
- Database: pgbench, sysbench
- CPU and process: perf stat, time, language-native benchmarks
Record the baseline. You cannot claim improvement without before/after numbers.
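As a minimal sketch of what "record the baseline" can look like in code, the snippet below times a callable repeatedly and reports p50, p95, and p99. `sample_operation` is a hypothetical stand-in for the real code path, and an in-process loop like this only illustrates the bookkeeping; the baseline itself should still come from realistic load generated by the tools above.

```python
import statistics
import time


def measure_latencies(operation, runs=200):
    """Call `operation` repeatedly and return per-call latencies in milliseconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies_ms):
    """Return p50, p95, and p99 so the same numbers can be compared after a fix."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1 through 99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def sample_operation():
    # Hypothetical stand-in for the code path under test.
    sum(i * i for i in range(10_000))


if __name__ == "__main__":
    baseline = summarize(measure_latencies(sample_operation))
    print(baseline)
```

Saving the printed dictionary alongside the commit hash gives the "before" half of a before/after comparison.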
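For every resource in the system (CPU, memory, disk, network, each service), check three things: utilization (how busy it is), saturation (how much work is queued waiting for it), and errors.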
High utilization + saturation = bottleneck. Start there.
# CPU utilization and saturation
top -b -n1 | head -20
mpstat -P ALL 1 3
# Memory
free -m
vmstat 1 5
# Disk I/O
iostat -x 1 5
# Network
ss -s
netstat -s | grep -E 'retransmit|error'
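Once the commands above have produced numbers, the triage rule (high utilization plus saturation, or any errors) can be sketched as a small filter. The `ResourceReading` type, the thresholds, and the sample readings are all hypothetical and would need tuning for a real system.

```python
from dataclasses import dataclass


@dataclass
class ResourceReading:
    name: str
    utilization: float  # fraction of capacity in use, 0.0 to 1.0
    saturation: float   # queued work relative to capacity, e.g. run queue / CPU count
    errors: int         # error events seen during the sampling window


# Hypothetical thresholds; tune them for the system being measured.
UTILIZATION_LIMIT = 0.80
SATURATION_LIMIT = 1.0


def find_suspects(readings):
    """Return resource names worth profiling first, most saturated first."""
    suspects = [
        r for r in readings
        if r.errors > 0
        or (r.utilization >= UTILIZATION_LIMIT and r.saturation >= SATURATION_LIMIT)
    ]
    suspects.sort(key=lambda r: r.saturation, reverse=True)
    return [r.name for r in suspects]


# Hypothetical readings, as might be transcribed from mpstat, vmstat, and iostat.
readings = [
    ResourceReading("cpu", utilization=0.95, saturation=2.5, errors=0),
    ResourceReading("memory", utilization=0.60, saturation=0.0, errors=0),
    ResourceReading("disk", utilization=0.85, saturation=1.2, errors=0),
    ResourceReading("network", utilization=0.10, saturation=0.0, errors=0),
]
print(find_suspects(readings))  # ['cpu', 'disk']
```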
Go:
# Quote the URL so the shell does not treat "?" as a glob character
go tool pprof -http=:8080 "http://localhost:6060/debug/pprof/profile?seconds=30"
Python:
python -m cProfile -o profile.out my_script.py
python -m pstats profile.out
# Or: py-spy top --pid <pid>
Node.js:
node --prof app.js
node --prof-process isolate-*.log > processed.txt
# Or, with Clinic.js installed (npm install -g clinic): clinic flame -- node app.js
Java:
# async-profiler 3.x ships the asprof launcher; 2.x uses ./profiler.sh with the same flags
asprof -d 30 -f profile.html <pid>
Browser (frontend): Open Chrome DevTools → Performance → Record → reproduce the slow action → Stop. Look at the flame chart for long tasks (>50ms).
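Read the flame chart: the widest frames are the hottest. Chrome DevTools and pprof's flame view draw callers at the top and callees below, so work top-down: the actual bottleneck is at the bottom of the deepest wide column. Classic flame graphs are inverted (root at the bottom), so there the hot code sits at the top.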
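For Python, the same profile can also be captured in-process rather than from the command line. The sketch below uses the standard-library `cProfile` and `pstats` modules; `handler` and `slow_square_sum` are hypothetical functions that exist only to give the profiler something to measure.

```python
import cProfile
import io
import pstats


def slow_square_sum(n):
    # Deliberately wasteful: builds a full list before summing it.
    return sum([i * i for i in range(n)])


def handler():
    # Hypothetical request handler used only for this illustration.
    return [slow_square_sum(20_000) for _ in range(50)]


profiler = cProfile.Profile()
profiler.enable()
handler()
profiler.disable()

report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
# Sort by time spent inside each function itself, then show up to 10 entries.
stats.sort_stats(pstats.SortKey.TIME).print_stats(10)
print(report.getvalue())
```

Sorting by `SortKey.CUMULATIVE` instead shows which callers own the most total time, which is the tabular counterpart of reading wide frames in a flame chart.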
Distinguish:
# Go heap profile
go tool pprof http://localhost:6060/debug/pprof/heap
# Node.js heap snapshot
node --inspect app.js # Then open Chrome DevTools → Memory → Take snapshot
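The Go and Node.js commands above have a rough Python counterpart in the standard-library `tracemalloc` module. This sketch compares two snapshots to find the source lines whose memory grew; the ever-growing `cache` is a hypothetical leak used for illustration.

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

# Hypothetical leak: a cache that only ever grows.
cache = {}
for request_id in range(1_000):
    cache[request_id] = bytearray(1024)

after = tracemalloc.take_snapshot()
tracemalloc.stop()

# Rank source lines by how much more memory they hold after the workload.
growth = after.compare_to(before, "lineno")
for stat in growth[:3]:
    print(stat)
```

Taking the second snapshot after a sustained workload, rather than immediately, helps separate a genuine leak from normal warm-up allocations.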
In many applications, most request latency is database latency. Always check:
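-- PostgreSQL: slow queries (requires the pg_stat_statements extension;
-- on PostgreSQL 12 and earlier the columns are mean_time and total_time)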
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
-- PostgreSQL: missing indexes
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;
-- Look for: Seq Scan on large tables, high actual rows >> estimated rows
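The before/after effect of an index is easy to reproduce locally. The sketch below swaps PostgreSQL for an in-memory SQLite database (an assumption made so the example is self-contained) and uses SQLite's `EXPLAIN QUERY PLAN`, whose output differs from PostgreSQL's `EXPLAIN`; the table and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL)")
conn.executemany(
    "INSERT INTO orders (user_id, total) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(10_000)],
)

QUERY = "SELECT * FROM orders WHERE user_id = ?"


def plan(connection, sql, params):
    """Return the plan detail strings SQLite reports for a query."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]  # the fourth column holds the description


plan_before = plan(conn, QUERY, (42,))
conn.execute("CREATE INDEX idx_orders_user_id ON orders (user_id)")
plan_after = plan(conn, QUERY, (42,))

print("before:", plan_before)  # expect a full-table SCAN
print("after: ", plan_after)   # expect a SEARCH that uses idx_orders_user_id
```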
Signs of DB bottleneck:
- Seq Scan on a table with >10k rows where an index should exist
- actual rows far exceed estimated rows → stale statistics → run ANALYZE
After profiling, list findings:
| Bottleneck | Current cost | Fix | Estimated gain | Effort |
|---|---|---|---|---|
| Missing index on orders.user_id | 800ms/query | Add index | ~750ms savings | Low |
| JSON serialization in hot loop | 40% CPU | Use binary format | ~30% CPU | Medium |
| N+1 query in /users endpoint | 50 DB calls/req | Eager load | ~40 calls/req | Low |
Prioritize by: (estimated_gain × traffic_share) / effort. Fix the highest-value, lowest-effort items first.
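The prioritization formula can be applied mechanically once each finding has rough numbers. In this sketch the `Finding` type, the traffic shares, and the effort estimates in days are hypothetical; only the formula itself comes from the text.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    name: str
    estimated_gain_ms: float  # expected latency saved per affected request
    traffic_share: float      # fraction of requests that hit this path, 0.0 to 1.0
    effort_days: float        # rough engineering cost


def priority(finding):
    """Score = (estimated_gain x traffic_share) / effort; higher means fix sooner."""
    return (finding.estimated_gain_ms * finding.traffic_share) / finding.effort_days


# Hypothetical numbers, loosely modelled on the findings table.
findings = [
    Finding("Missing index on orders.user_id", 750, 0.30, 0.5),
    Finding("JSON serialization in hot loop", 120, 0.80, 3.0),
    Finding("N+1 query in /users endpoint", 200, 0.10, 1.0),
]

for f in sorted(findings, key=priority, reverse=True):
    print(f"{priority(f):8.1f}  {f.name}")
```

The absolute scores mean little; what matters is that every finding is scored in the same units so the ordering is comparable.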
For each fix:
If the fix shows no improvement, the bottleneck is elsewhere. Don't ship it as a "performance improvement."
Defining the problem correctly:
"The POST /checkout endpoint takes 3.2s p99 at 50 RPS under our staging load test, up from 800ms two weeks ago. Regression introduced in deploy #1847."
Reading pprof output: Flat% = time spent in this function itself. Cum% = time in this function + all callees. Fix functions with high flat% — they're burning CPU directly.
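To make the flat/cum distinction concrete, here is a small sketch that derives both numbers from raw stack samples, roughly the way a sampling profiler would. The samples and function names are invented for illustration and are not pprof's actual data format.

```python
from collections import Counter

# Hypothetical stack samples, outermost caller first, as a sampling profiler
# might record them. Each sample stands for one tick of CPU time.
samples = [
    ("main", "handle_request", "encode_json"),
    ("main", "handle_request", "encode_json"),
    ("main", "handle_request", "encode_json"),
    ("main", "handle_request", "query_db"),
    ("main", "handle_request"),
]

flat = Counter()
cum = Counter()
for stack in samples:
    flat[stack[-1]] += 1          # only the innermost frame was executing
    for function in set(stack):   # every frame on the stack accumulates time
        cum[function] += 1

total = len(samples)
for function in sorted(cum, key=lambda name: flat[name], reverse=True):
    print(f"{function:15s} flat={flat[function] / total:4.0%} cum={cum[function] / total:4.0%}")
```

Here `main` has 100% cum but 0% flat, so it is only a caller, while `encode_json` at 60% flat is where the CPU time is actually spent.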
Spotting N+1:
# 1 query: SELECT * FROM orders WHERE user_id = 1
# Then for each order:
# SELECT * FROM products WHERE id = ? ← repeated 50 times
# Fix: SELECT o.*, p.* FROM orders o JOIN products p ON ...
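The pattern above can be reproduced end to end with an in-memory SQLite database. This is a sketch under assumptions: the schema is hypothetical, and the `run` helper simply counts statements, standing in for an ORM or database query log.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, product_id INTEGER);
""")
conn.executemany("INSERT INTO products (id, name) VALUES (?, ?)",
                 [(i, f"product-{i}") for i in range(1, 51)])
conn.executemany("INSERT INTO orders (user_id, product_id) VALUES (1, ?)",
                 [(i,) for i in range(1, 51)])

executed = []  # every statement issued through run() is recorded here


def run(sql, params=()):
    """Execute a query and record it, standing in for a query log."""
    executed.append(sql)
    return conn.execute(sql, params).fetchall()


# N+1 pattern: one query for the orders, then one query per order.
orders = run("SELECT id, product_id FROM orders WHERE user_id = ?", (1,))
names_slow = [run("SELECT name FROM products WHERE id = ?", (pid,))[0][0]
              for _, pid in orders]
n_plus_one_count = len(executed)

# Fix: fetch the same data with a single JOIN.
executed.clear()
rows = run("SELECT o.id, p.name FROM orders o "
           "JOIN products p ON p.id = o.product_id "
           "WHERE o.user_id = ? ORDER BY o.id", (1,))
join_count = len(executed)

print(n_plus_one_count, join_count)  # 51 1
```

In a real application the same count usually comes from the ORM's query logging or from `pg_stat_statements` call counts rather than a hand-written wrapper.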