From armory
Designs structured benchmarks comparing algorithms, models, or implementations with metrics, test cases, hardware context, and reproduction steps.
npx claudepluginhub mathews-tom/armory --plugin armoryThis skill uses the workspace's default tool permissions.
Standardizes performance comparison methodology: metric selection, test case design,
Implements Playwright E2E testing patterns: Page Object Model, test organization, configuration, reporters, artifacts, and CI/CD integration for stable suites.
Guides Next.js 16+ Turbopack for faster dev via incremental bundling, FS caching, and HMR; covers webpack comparison, bundle analysis, and production builds.
Discovers and evaluates Laravel packages via LaraPlugins.io MCP. Searches by keyword/feature, filters by health score, Laravel/PHP compatibility; fetches details, metrics, and version history.
Standardizes performance comparison methodology: metric selection, test case design, environment capture, result formatting, and tradeoff analysis. Produces reproducible benchmark reports that support informed decisions — not just "A is faster than B" but "A is faster for small inputs while B scales better."
| File | Contents | Load When |
|---|---|---|
references/metric-selection.md | Metric catalog (latency percentiles, throughput, memory, accuracy), selection criteria per task type | Always |
references/test-case-design.md | Representative input selection, scale variation, edge case coverage, warmup strategies | Always |
references/environment-capture.md | Hardware/software context recording, reproducibility requirements, variance control | Always |
references/statistical-rigor.md | Sample sizing, variance measurement, significance testing, outlier handling | Results need statistical validation |
Choose metrics that match the decision context:
| Metric Category | Specific Metrics | When Important |
|---|---|---|
| Latency | P50, P95, P99, mean, std dev | User-facing operations, API calls |
| Throughput | ops/sec, tokens/sec, MB/sec | Batch processing, streaming |
| Memory | Peak RSS, avg RSS, allocation rate | Resource-constrained environments |
| Accuracy | F1, BLEU, exact match, precision/recall | ML models, algorithms with quality tradeoffs |
| Cost | $/1K operations, $/hour, $/GB | Cloud services, API comparisons |
| Startup | Time to first operation, cold start | Serverless, CLI tools |
Select 2-4 metrics. More than 4 makes comparison tables unreadable.
Create a matrix of inputs that reveal performance characteristics:
Record everything needed to reproduce the results:
Produce comparison tables with clear winners per metric, followed by tradeoff analysis.
# Benchmark: {Descriptive Title}
**Date:** {YYYY-MM-DD}
**Hardware:** {CPU}, {RAM}, {GPU if applicable}
**Software:** {runtime versions}
**Configuration:** {key settings that affect results}
## Candidates
| # | Candidate | Version | Configuration |
|---|-----------|---------|---------------|
| A | {name} | {version} | {relevant config} |
| B | {name} | {version} | {relevant config} |
## Test Cases
| # | Name | Input Size | Description | Warmup | Iterations |
|---|------|------------|-------------|--------|------------|
| 1 | Small | {size} | {what it represents} | {N} | {N} |
| 2 | Medium | {size} | {what it represents} | {N} | {N} |
| 3 | Large | {size} | {what it represents} | {N} | {N} |
## Results
### Latency (ms, lower is better)
| Test Case | A (P50 / P95 / P99) | B (P50 / P95 / P99) | Winner |
|-----------|---------------------|---------------------|--------|
| Small | {values} | {values} | {A or B} |
| Medium | {values} | {values} | {A or B} |
| Large | {values} | {values} | {A or B} |
### Memory (MB, lower is better)
| Test Case | A (Peak) | B (Peak) | Winner |
|-----------|----------|----------|--------|
| Small | {value} | {value} | {A or B} |
| Medium | {value} | {value} | {A or B} |
| Large | {value} | {value} | {A or B} |
## Analysis
### Overall Winner
**{Candidate}** wins on {N} of {M} metrics across all test cases.
### Tradeoff Summary
- **Choose A when:** {conditions where A is the better choice}
- **Choose B when:** {conditions where B is the better choice}
### Caveats
- {Limitation of this benchmark}
- {Condition under which results may differ}
## Reproduction
```bash
# Environment setup
{commands to recreate the environment}
# Run benchmark
{commands to execute the benchmark}
## Configuring Scope
| Mode | Candidates | Depth | When to Use |
|------|-----------|-------|-------------|
| `quick` | 2 candidates, 1-2 metrics | Single test case, no statistics | Rough comparison, sanity check |
| `standard` | 2-3 candidates, 2-4 metrics | 3 test cases, mean + std dev | Default for most comparisons |
| `rigorous` | Any count, full metric suite | Multiple test cases, percentiles, significance tests | Publication, critical decisions |
## Calibration Rules
1. **Measure, don't guess.** Intuition about performance is unreliable. "Obviously
faster" is not a benchmark result.
2. **Apples to apples.** Candidates must be compared under identical conditions.
Different hardware, configuration, or input data invalidates the comparison.
3. **Report variance, not just means.** A mean of 50ms with std dev of 100ms is not
the same as a mean of 50ms with std dev of 2ms. Always report spread.
4. **Warm up before measuring.** First-run performance includes JIT, cache warming,
and connection setup. Exclude warmup iterations from results.
5. **Representative inputs only.** Benchmarking with synthetic best-case input is
misleading. Use data that resembles production workloads.
6. **State the winner per metric, not overall.** "A is better" is lazy. "A has lower
latency; B uses less memory" is useful.
## Error Handling
| Problem | Resolution |
|---------|------------|
| Cannot run candidates locally | Design the benchmark specification. Document what to measure and how. The user executes separately. |
| Results are noisy (high variance) | Increase iteration count. Check for background processes. Use dedicated hardware or containers for isolation. |
| Candidates serve different purposes | Acknowledge that the comparison is partial. Benchmark only the overlapping functionality. |
| No baseline exists | Establish one candidate as the baseline. Report relative performance (e.g., "B is 1.3x faster than A"). |
| Hardware context unavailable | Document what is known. Note that results may not be reproducible without full context. |
## When NOT to Benchmark
Push back if:
- The comparison is not performance-related (feature comparison → use a decision matrix or ADR instead)
- The candidates are fundamentally different tools (comparing a database to a message queue)
- The user wants to benchmark trivial operations (comparing two string concatenation methods in Python)
- Results from others already exist and conditions match — link to existing benchmarks instead