From performance-engineer
Design a load test plan — define scenarios, configure realistic load patterns, script tests, and define success criteria.
npx claudepluginhub hpsgd/turtlestack --plugin performance-engineer
Design a load test plan for $ARGUMENTS.
Before writing any test scripts, understand what you are testing and what "normal" looks like.
If no traffic data exists, estimate from user count and expected usage patterns. Document the assumption.
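Such an estimate can be a few lines of arithmetic. All numbers below are illustrative assumptions, not measured traffic; replace them with your own user counts and document the values you chose:

```javascript
// Rough load estimate from user counts. Every input here is an
// assumption to be replaced with real numbers and documented.
const dailyActiveUsers = 10_000;
const requestsPerUserPerDay = 40; // assumed average across a session
const peakToAverageRatio = 3;     // assumed: peak hour runs ~3x the daily average

const avgRps = (dailyActiveUsers * requestsPerUserPerDay) / 86_400;
const peakRps = avgRps * peakToAverageRatio;

console.log(`average ~${avgRps.toFixed(1)} req/s, peak ~${peakRps.toFixed(1)} req/s`);
```

The peak figure, not the average, is what the baseline scenario should target.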
Design tests for each type. Do not skip any — each type reveals different problems.
| Test type | Purpose | Duration | Load pattern | What it reveals |
|---|---|---|---|---|
| Baseline | Establish normal performance | 5 minutes | Current production load | What "good" looks like — your comparison point |
| Stress | Find the breaking point | 15 minutes | Ramp from 1x to 10x current load | Where errors start, which component fails first |
| Endurance | Find slow leaks and degradation | 1–4 hours | Sustained 2x load | Memory leaks, connection pool exhaustion, log disk filling |
| Spike | Test auto-scaling and recovery | 10 minutes | Sudden 5x spike, then return to normal | Recovery time, auto-scaling behaviour, queue backlog clearance |
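The endurance and spike rows above can be sketched as k6 scenario definitions. This is a sketch, not a drop-in config: the VU counts assume a 50-VU baseline and must be scaled to your own traffic.

```javascript
// Hypothetical k6 scenarios for the endurance and spike tests.
// VU counts assume a baseline of 50 VUs; adjust to your system.
export const options = {
  scenarios: {
    endurance: {
      executor: 'constant-vus',
      vus: 100,        // 2x the assumed baseline, held for hours
      duration: '2h',
    },
    spike: {
      executor: 'ramping-vus',
      startVUs: 50,
      stages: [
        { duration: '10s', target: 250 }, // sudden 5x spike
        { duration: '3m', target: 250 },  // hold the spike
        { duration: '10s', target: 50 },  // drop back to normal
        { duration: '6m', target: 50 },   // watch recovery and queue drain
      ],
    },
  },
};
```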
For each scenario, define: the endpoints or user journey exercised, virtual user count, duration, ramp pattern, and success criteria.
Test with production-like data. Empty databases lie about performance.
| Requirement | Why it matters |
|---|---|
| Production-like data volume | Query performance degrades with table size. 100 rows ≠ 10M rows |
| Realistic data distribution | Hotspots, popular items, skewed access patterns affect caching and indexing |
| Diverse user profiles | Different users have different data volumes (power users vs new users) |
| Representative payloads | Request and response sizes affect network, serialisation, and memory |
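One way to approximate realistic skew when seeding data or choosing request targets is a Zipf-like sampler, so a few popular items dominate the way they do in production. This is a sketch in plain Node; the catalogue size and exponent are assumptions to fit to your own access logs:

```javascript
// Sample item IDs with a Zipf-like skew: item weight is 1/rank^exponent,
// so low-ranked ("popular") items are hit far more often than the tail.
function zipfSampler(numItems, exponent = 1.0) {
  // Precompute cumulative weights once.
  const cumulative = [];
  let total = 0;
  for (let rank = 1; rank <= numItems; rank++) {
    total += 1 / Math.pow(rank, exponent);
    cumulative.push(total);
  }
  return () => {
    const r = Math.random() * total;
    // Binary search for the first cumulative weight >= r.
    let lo = 0, hi = numItems - 1;
    while (lo < hi) {
      const mid = (lo + hi) >> 1;
      if (cumulative[mid] < r) lo = mid + 1; else hi = mid;
    }
    return lo + 1; // item IDs are 1-based ranks
  };
}

const sample = zipfSampler(10_000); // assumed catalogue of 10k items
const counts = new Map();
for (let i = 0; i < 100_000; i++) {
  const id = sample();
  counts.set(id, (counts.get(id) ?? 0) + 1);
}
console.log('hits for most popular item:', counts.get(1));
```

Uniform-random IDs would make caches look far more effective than they are in production; skewed sampling exposes hotspot behaviour.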
Preferred: k6 — scriptable, CI-friendly, JavaScript-based, built-in metrics. Alternative: Locust — Python-native, distributed by default.
k6 script skeleton:
```javascript
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  scenarios: {
    baseline: {
      executor: 'constant-vus',
      vus: 50, // adjust to current production load
      duration: '5m',
    },
    stress: {
      executor: 'ramping-vus',
      startVUs: 0,
      stages: [
        { duration: '2m', target: 50 },  // ramp to baseline
        { duration: '5m', target: 200 }, // ramp to 4x
        { duration: '5m', target: 500 }, // ramp to 10x
        { duration: '3m', target: 0 },   // ramp down
      ],
      startTime: '6m', // start after baseline completes
    },
  },
  thresholds: {
    http_req_duration: ['p(95)<500'], // p95 < 500ms
    http_req_failed: ['rate<0.01'],   // error rate < 1%
  },
};

export default function () {
  const res = http.get('https://api.example.com/endpoint');
  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time < 500ms': (r) => r.timings.duration < 500,
  });
  sleep(1); // think time — real users don't fire requests without pause
}
```
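Assuming the skeleton is saved as loadtest.js (a hypothetical filename), a run might look like:

```shell
# Run the plan and write the end-of-test summary as JSON.
k6 run --summary-export=summary.json loadtest.js

# Or stream every data point for later analysis in Grafana/InfluxDB.
k6 run --out json=results.json loadtest.js
```

Either JSON output can be committed or uploaded so later runs have a baseline to compare against.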
Define pass/fail thresholds BEFORE running the tests:
| Metric | Target | Enforcement |
|---|---|---|
| p50 response time | < 200ms for API, < 1s for page load | k6 threshold |
| p95 response time | < 500ms for API, < 3s for page load | k6 threshold — build fails if exceeded |
| p99 response time | < 1s for API, < 5s for page load | k6 threshold |
| Throughput | Sustain 3x current load without degradation | Stress test verification |
| Error rate | < 0.1% under normal load, < 1% under stress | k6 threshold |
| CPU utilisation | < 70% at normal load | Monitoring during test |
| Memory utilisation | Stable (no upward trend during endurance test) | Monitoring during test |
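The API-side rows of this table translate into k6 threshold expressions roughly like the following. This is a sketch: the page-load and utilisation rows need browser-level tooling and infrastructure monitoring respectively, not k6 thresholds.

```javascript
// The API latency and error-rate targets above as k6 thresholds.
// A breached threshold marks the run as failed, which fails the build.
export const options = {
  thresholds: {
    http_req_duration: [
      'p(50)<200',  // p50 < 200ms
      'p(95)<500',  // p95 < 500ms
      'p(99)<1000', // p99 < 1s
    ],
    http_req_failed: ['rate<0.001'], // < 0.1% errors at normal load
  },
};
```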
The test environment must meet these requirements:

| Requirement | Why |
|---|---|
| Isolated environment | Shared staging gives shared noise. Results are meaningless if other tests are running |
| Production-like sizing | Testing on a single-node dev instance tells you nothing about production |
| Monitoring active | CPU, memory, disk I/O, network, database connections — all must be observable during the test |
| Pre-flight check | Before running: verify environment is clean, no existing load, baseline metrics are normal |
Execution logistics:

| Item | Detail |
|---|---|
| When to run | Off-peak for shared environments. Any time for isolated environments |
| Who monitors | Someone watches dashboards during the test — automated tests need human observation for unexpected patterns |
| Results storage | k6 Cloud, InfluxDB + Grafana, or JSON output committed to repo |
| Comparison baseline | Every run is compared to the previous baseline. Regressions are flagged automatically |
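The automated regression flagging can start as a simple numeric comparison against the stored baseline. A sketch: the 10% tolerance is an assumption to tune per metric, and the p95 values would come from the stored and current summary JSON files.

```javascript
// Flag a regression when the current run's p95 exceeds the stored
// baseline by more than a tolerance. The 10% default is an assumption.
function isRegression(baselineP95Ms, currentP95Ms, tolerance = 0.10) {
  return currentP95Ms > baselineP95Ms * (1 + tolerance);
}

// Example: baseline p95 was 420ms; current run came in at 480ms.
console.log(isRegression(420, 480)); // true: 480 > 420 * 1.1 = 462
```

Wired into CI after each run, this turns "compare to baseline" from a manual chore into a failing check.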
Document the plan using this template:

# Load Test Plan: [target system/endpoint]
## Target
- **System:** [what is being tested]
- **Current load:** [requests/sec, concurrent users]
- **Data profile:** [database size, key table counts]
- **Dependencies:** [external services called]
## Scenarios
| Scenario | VUs | Duration | Ramp | Success criteria |
|---|---|---|---|---|
| Baseline | [n] | 5m | None | p95 < 500ms, errors < 0.1% |
| Stress | [n→10n] | 15m | Linear | Find breaking point, graceful degradation |
| Endurance | [2n] | 2h | None | No memory leak, stable latency |
| Spike | [5n sudden] | 10m | Step | Recovery < 2 minutes |
## Thresholds
| Metric | Baseline | Stress | Endurance | Spike |
|---|---|---|---|---|
| p95 response | < 500ms | < 2s | < 500ms (stable) | < 500ms (post-recovery) |
| Error rate | < 0.1% | < 5% | < 0.1% | < 1% (post-recovery) |
| CPU | < 70% | documented | < 70% (stable) | recovers to < 70% |
## Environment
- **Target:** [URL/endpoint]
- **Data:** [production-like, [n] records]
- **Monitoring:** [tools in use]
- **Isolation:** [dedicated/shared]
## Schedule
- **Date:** [when]
- **Monitor:** [who watches]
- **Results:** [where stored]
/performance-engineer:capacity-plan — use load test results to validate or update capacity plans.
/performance-engineer:performance-profile — when load tests reveal bottlenecks, profile the specific endpoints to find root causes.