From harness-claude
Detects load test infrastructure (k6, Artillery, Gatling, JMeter), designs scenarios for critical endpoints, executes stress tests, and analyzes results against thresholds. For pre-release validation and scaling.
npx claudepluginhub intense-visions/harness-engineering --plugin harness-claude

This skill uses the workspace's default tool permissions.
> Stress testing, capacity planning, and performance benchmarking with k6, Artillery, and Gatling. Detects existing load test infrastructure, designs test scenarios for critical paths, executes tests, and analyzes results against defined thresholds.
Discover load testing tooling. Scan the project for load test infrastructure:
- *.k6.js, k6/ directory, import { check } from 'k6' patterns
- artillery.yml, artillery/ directory, config.target in YAML files
- gatling/, *Simulation.scala, gatling.conf
- *.jmx files, jmeter/ directory
- load-tests/, perf/, benchmark/ directories

Inventory existing test scenarios. For each discovered test file:
Map critical endpoints. Identify endpoints that should be load tested:
git log --oneline --since="30 days ago"

Identify coverage gaps. Compare critical endpoints against existing test scenarios:
Check test infrastructure. Verify the testing environment:
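The file-pattern scan above can be reduced to a small matcher. A minimal sketch, assuming the project's files are already listed as paths (the pattern table and function name are illustrative, not part of the skill's API):

```javascript
// Map file-path patterns to the load testing tool they indicate.
const TOOL_PATTERNS = [
  { tool: "k6", match: (p) => p.endsWith(".k6.js") || p.startsWith("k6/") },
  { tool: "artillery", match: (p) => p === "artillery.yml" || p.startsWith("artillery/") },
  { tool: "gatling", match: (p) => p.startsWith("gatling/") || p.endsWith("Simulation.scala") },
  { tool: "jmeter", match: (p) => p.endsWith(".jmx") || p.startsWith("jmeter/") },
];

// Return the set of tools detected across a list of project file paths.
function detectLoadTestTools(paths) {
  const found = new Set();
  for (const path of paths) {
    for (const { tool, match } of TOOL_PATTERNS) {
      if (match(path)) found.add(tool);
    }
  }
  return [...found];
}
```

A project containing k6/smoke-api.k6.js would detect as k6; one containing both artillery.yml and a *.jmx file would report both tools, which is worth flagging before designing new scenarios.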
Select test profiles. Design scenarios for each critical endpoint based on the testing goal:
Define virtual user profiles. Model realistic user behavior:
Set performance thresholds. Define pass/fail criteria per endpoint:
Generate test scripts. Produce test files in the detected tool format:
- k6: stages, thresholds, and checks
- Artillery: phases, ensure, and scenario flow

Design ramp-up stages. For each profile, define the VU ramp schedule:
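In k6, the ramp schedule and pass/fail criteria live in the script's options object. A minimal sketch of what a generated load profile might look like (the peak VUs, durations, and limits are illustrative values, not the skill's defaults):

```javascript
// Build a k6-style stage schedule: ramp to peak, hold, ramp down.
// Durations use k6 time notation ("2m", "5m", ...).
function rampProfile(peakVUs, rampUp, hold, rampDown) {
  return [
    { duration: rampUp, target: peakVUs },
    { duration: hold, target: peakVUs },
    { duration: rampDown, target: 0 },
  ];
}

// k6 options sketch: VU stages plus pass/fail thresholds.
const options = {
  stages: rampProfile(100, "2m", "5m", "1m"),
  thresholds: {
    http_req_duration: ["p(95)<800"], // 95th percentile latency under 800ms
    http_req_failed: ["rate<0.01"],   // error rate under 1%
  },
};
```

k6 aborts with a non-zero exit code when a threshold fails, which is what makes these limits usable as automated pass/fail criteria in CI.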
Validate test environment. Before executing:
Run smoke test first. Execute each test script with minimal load:
Execute the selected test profile. Run the load test and capture:
k6 run --out json=results.json test.k6.js
artillery run --output report.json artillery.yml

Monitor system resources during execution. If accessible:
Capture baseline or compare to previous run. Store results for trend analysis:
load-tests/results/YYYY-MM-DD-<profile>.json

Evaluate against thresholds. For each endpoint tested:
Identify bottlenecks. Correlate performance data with system metrics:
Calculate capacity projections. Based on stress test results:
Compare to baseline. If previous results exist:
Produce the load test report. Output a structured summary:
Load Test Report: <profile> — <date>
Target: <URL> | Tool: <k6/Artillery/Gatling> | Duration: <time>
Commit: <SHA>
Results:
Endpoint | p50 | p95 | p99 | Errors | RPS | Status
GET /api/products | 45ms | 120ms | 340ms | 0.1% | 850 | PASS
POST /api/orders | 180ms | 520ms | 1100ms | 0.8% | 120 | WARN (p99 > 1000ms)
GET /api/search | 95ms | 680ms | 2100ms | 2.3% | 340 | FAIL (p99 > 1000ms, errors > 1%)
Capacity: Max sustainable 1200 RPS at p95 < 500ms (current production: ~400 RPS, 3x headroom)
Bottleneck: /api/search becomes I/O-bound at 500 RPS — database full-text search query
Recommendation: Add search index or migrate to Elasticsearch for /api/search
Archive results. Save the full report and raw data for historical comparison.
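The PASS/WARN/FAIL column in the sample report implies a simple evaluation rule. One possible sketch, assuming FAIL is reserved for error-rate breaches and WARN for latency-only breaches (the policy and names are assumptions, not the skill's exact logic):

```javascript
// Classify one endpoint's measured metrics against its limits.
// FAIL if the error-rate limit is breached, WARN if only a latency
// limit is breached, PASS otherwise. (Illustrative policy.)
function classify(metrics, limits) {
  const breaches = [];
  if (metrics.p99 > limits.p99Ms) breaches.push(`p99 > ${limits.p99Ms}ms`);
  if (metrics.errorRate > limits.maxErrorRate) breaches.push(`errors > ${limits.maxErrorRate}%`);
  if (breaches.length === 0) return { status: "PASS", breaches };
  const status = metrics.errorRate > limits.maxErrorRate ? "FAIL" : "WARN";
  return { status, breaches };
}

// Capacity headroom: sustainable RPS divided by current production RPS.
function headroom(maxSustainableRps, productionRps) {
  return maxSustainableRps / productionRps;
}
```

Applied to the sample rows: 340ms p99 at 0.1% errors classifies PASS, 1100ms at 0.8% classifies WARN, and 2100ms at 2.3% classifies FAIL with both breaches recorded, matching the report's annotations.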
- harness skill run harness-load-testing -- Primary CLI entry point. Runs all four phases.
- harness validate -- Run after generating test scripts to verify project structure.
- harness check-deps -- Verify load testing tool dependencies are declared (k6, artillery npm package).
- emit_interaction -- Used before execution (checkpoint:human-verify) to confirm target environment and test profile.
- Glob -- Discover existing load test files, result archives, and configuration.
- Grep -- Search for endpoint definitions, route handlers, and threshold configurations.
- Write -- Generate load test scripts and result reports.
- Edit -- Update existing test scripts with new scenarios or adjusted thresholds.

Phase 1: DETECT
Tool: k6 (found k6/ directory with 2 existing scripts)
Existing tests:
- k6/smoke-api.k6.js: GET /api/health (1 VU, 10s)
- k6/load-products.k6.js: GET /api/products (50 VUs, 5m, p95 < 300ms)
Coverage gaps:
- POST /api/orders — revenue-critical, no load test
- GET /api/search — high-traffic, no load test
- No stress or soak test profiles exist
Phase 2: DESIGN
New scenarios:
- k6/load-orders.k6.js: POST /api/orders
Stages: ramp to 100 VUs over 2m, hold 5m, ramp down 1m
Thresholds: p95 < 800ms, errors < 1%, RPS > 80
- k6/stress-api.k6.js: All endpoints
Stages: step-up from 50 to 500 VUs in 50-VU increments, 2m per step
Thresholds: find breaking point, record p95 at each step
- k6/soak-api.k6.js: Critical endpoints at expected load
Duration: 2 hours at 200 VUs
Thresholds: p95 < 500ms, memory growth < 50MB/hour
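The step-up schedule for k6/stress-api.k6.js above can be generated rather than hand-written. A sketch of that generation (the helper name is illustrative; note that each k6 stage ramps to its target over the stated duration rather than jumping instantly):

```javascript
// Generate k6 stages that climb from startVUs to endVUs in fixed
// increments, spending holdDuration (k6 time notation) on each step.
function stepUpStages(startVUs, endVUs, increment, holdDuration) {
  const stages = [];
  for (let vus = startVUs; vus <= endVUs; vus += increment) {
    stages.push({ duration: holdDuration, target: vus });
  }
  return stages;
}

// 50 to 500 VUs in 50-VU increments, 2 minutes per step.
const stressStages = stepUpStages(50, 500, 50, "2m");
```

Recording p95 at each step, as the scenario specifies, turns this schedule into a capacity curve: the breaking point is the first step where latency or error thresholds are breached.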
Phase 3: EXECUTE
Environment: https://staging.example.com (verified non-production)
Smoke: All scripts pass with 1 VU
Load test results captured to load-tests/results/2026-03-27-load.json
Phase 4: ANALYZE
Results: POST /api/orders p95=620ms (PASS), GET /api/search p99=2100ms (FAIL)
Bottleneck: Full-text search on PostgreSQL LIKE query at 300+ RPS
Capacity: 800 RPS sustainable, current production 250 RPS (3.2x headroom)
Recommendation: Add pg_trgm index or migrate search to Elasticsearch
Phase 1: DETECT
Tool: Artillery (found artillery.yml with 1 scenario)
Existing: query { products } at 20 RPS for 60s
Gaps: mutations not tested, no spike profile, thresholds not defined
Phase 2: DESIGN
New config: artillery/graphql-load.yml
Phases:
- warm-up: 5 RPS for 30s
- load: 50 RPS for 5m
- spike: jump to 200 RPS for 30s, back to 50 RPS
Scenarios:
- query { products(limit: 20) } — 60% weight
- mutation { createOrder(input: $input) } — 25% weight
- query { user(id: $id) { orders } } — 15% weight
Ensure:
- p99 < 1500ms
- maxErrorRate < 2
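The phases, weighted scenarios, and ensure limits above would look roughly like this in artillery/graphql-load.yml. This is a sketch assembled from the values listed, not a verbatim generated file; the spike's return to 50 RPS is modeled here as a short trailing phase:

```yaml
config:
  target: "http://staging.internal:3000"
  phases:
    - duration: 30      # warm-up: 5 RPS for 30s
      arrivalRate: 5
    - duration: 300     # load: 50 RPS for 5m
      arrivalRate: 50
    - duration: 30      # spike: jump to 200 RPS for 30s
      arrivalRate: 200
    - duration: 60      # back to 50 RPS to observe recovery
      arrivalRate: 50
  ensure:
    p99: 1500
    maxErrorRate: 2
scenarios:
  - name: products
    weight: 60
    flow:
      - post:
          url: "/graphql"
          json:
            query: "query { products(limit: 20) }"
  - name: createOrder
    weight: 25
    flow:
      - post:
          url: "/graphql"
          json:
            query: "mutation { createOrder(input: $input) }"
  - name: userOrders
    weight: 15
    flow:
      - post:
          url: "/graphql"
          json:
            query: "query { user(id: $id) { orders } }"
```

The ensure block is what gives the run a pass/fail exit code; without it, Artillery reports metrics but never fails the build.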
Phase 3: EXECUTE
Target: http://staging.internal:3000/graphql
Smoke: PASS (all queries resolve, auth tokens valid)
Full run: artillery run --output report.json artillery/graphql-load.yml
Phase 4: ANALYZE
Results:
query products: p95=89ms, p99=210ms — PASS
mutation createOrder: p95=340ms, p99=890ms — PASS
query user.orders: p95=520ms, p99=2400ms — FAIL
Bottleneck: N+1 query in user.orders resolver (no DataLoader)
Spike recovery: System recovered to baseline within 15s after spike — PASS
Recommendation: Add DataLoader for orders resolver, re-test after fix
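The recommended DataLoader fix batches the per-user order lookups that cause the N+1. A minimal hand-rolled sketch of the same batching idea (the real fix would use the dataloader npm package; db.ordersByUserIds is a hypothetical data-access helper):

```javascript
// Collect keys requested in the same microtask tick and resolve them
// with a single batch call. This is the core idea behind DataLoader:
// N per-user order queries collapse into one batched query.
function createBatchLoader(batchFn) {
  let queue = [];
  let scheduled = false;
  return function load(key) {
    return new Promise((resolve, reject) => {
      queue.push({ key, resolve, reject });
      if (scheduled) return;
      scheduled = true;
      queueMicrotask(async () => {
        const batch = queue;
        queue = [];
        scheduled = false;
        try {
          const results = await batchFn(batch.map((item) => item.key));
          batch.forEach((item, i) => item.resolve(results[i]));
        } catch (err) {
          batch.forEach((item) => item.reject(err));
        }
      });
    });
  };
}

// Usage sketch inside a resolver (helper names are hypothetical):
// const loadOrders = createBatchLoader((ids) => db.ordersByUserIds(ids));
// resolve: (user) => loadOrders(user.id)
```

Under load this changes the query count per request from O(users) to O(1), which is exactly the kind of algorithmic fix that shows up identically in staging and production regardless of instance size.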
| Rationalization | Reality |
|---|---|
| "The smoke test passed, so the full load test will probably be fine too." | A smoke test at 1-2 VUs tells you the script runs — it says nothing about behavior at 100 or 1000 VUs. Connection pool exhaustion, lock contention, and GC pressure only appear under load. Smoke passing is the floor, not the ceiling. |
| "Staging is smaller than production, so results won't be accurate anyway — no point running the full test." | Staging results are always useful as a proxy: they reveal algorithmic bottlenecks, N+1 queries, and missing indexes that scale identically regardless of instance count. Document the scale factor and use it. Do not skip testing because the environment is imperfect. |
| "We haven't changed the API, so the old load test baselines still apply." | Baselines go stale when dependencies update, traffic patterns shift, or adjacent services change. A deployment that adds one middleware layer or changes a database index can move p99 by 200ms. Baselines must be re-validated, not assumed. |
| "The p95 threshold is arbitrary — let's just relax it until the test passes." | A threshold without a documented basis is a guess. A threshold lowered to make a failing test pass is a suppressed regression. Thresholds must be derived from SLOs or measured baselines. If the SLO is wrong, change the SLO explicitly with stakeholder sign-off. |
| "We'll run the soak test later — we just need to ship the load test first." | Soak tests catch failures that only emerge over hours: memory leaks, connection pool exhaustion, log file growth. If the feature involves a long-lived process, background worker, or WebSocket, skipping the soak test means the failure surfaces in production. |
The [checkpoint:human-verify] gate must be passed with documented approval before any test executes against the target environment.