Help us improve
Share bugs, ideas, or general feedback.
From builder-ai
Use before merging, deploying, or demo'ing any LLM feature. Requires documented eval results — pass rate, failure analysis, baseline comparison. Blocks "it looked good when I tested it" completions.
npx claudepluginhub rbraga01/a-team --plugin builder-aiHow this skill is triggered — by the user, by Claude, or both
Slash command
/builder-ai:eval-before-shipThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
```
Provides a checklist for code reviews covering functionality, security, performance, maintainability, tests, and quality. Use for pull requests, audits, team standards, and developer training.
Share bugs, ideas, or general feedback.
AN LLM FEATURE IS NOT READY UNTIL NUMBERS EXIST.
"It looked good when I tested it" is not an eval.
"I ran a few examples and it worked" is not an eval.
A named suite, a defined metric, a pass rate, a failure analysis,
and a baseline comparison IS an eval. All five. Not four.
Trigger before any of these:
An eval must have all five components:
| Component | What It Means | What Does NOT Count |
|---|---|---|
| Named test suite | File with labelled examples in evals/ | "I tested it manually" |
| Defined metric | Accuracy %, faithfulness score, task pass rate | "It seemed accurate" |
| Pass threshold | Explicit minimum (e.g., ≥ 85%) | No threshold = no standard |
| Failure analysis | ≥ 5 failures examined and categorised | "There were a few errors" |
| Baseline comparison | This version vs. previous version or control | First release exempt; all subsequent require it |
Answer these before touching the prompt:
If you cannot answer these before building, the task is not well enough specified to build.
evals/
<feature-name>/
test-set.jsonl ← labelled examples, one JSON object per line
harness.py ← eval runner
results-<date>.md ← documented results
test-set.jsonl format:
{"id": "001", "input": "...", "expected": "...", "tags": ["edge-case"]}
Label ground truth before running any model. Seeing outputs first contaminates labels.
Minimum 50 examples; 200+ for features used at volume. Include: edge cases, short inputs, long inputs, ambiguous inputs, adversarial rephrasing.
python evals/<feature-name>/harness.py --model <model-id> --seed 42
Record: pass rate (X/N = Y%), latency p50/p95, failure category breakdown.
Examine every failing case. Assign each to one category:
The most common failure category is your next fix.
Suite: evals/email-classifier/test-set.jsonl (200 examples)
Model: claude-sonnet-4-6, temperature: 0.0, seed: 42
Pass rate: 178/200 = 89% (threshold: ≥ 85% ✓)
Failure breakdown:
- Format violation: 12 (long emails > 2000 tokens)
- Hallucination: 10 (invented labels not in schema)
Baseline: v1 (82%) → v2 (89%), delta: +7pp ✓
Results: evals/email-classifier/results-2026-06-07.md
These thoughts mean you have NOT completed an eval — stop and build one:
When eval-before-ship is satisfied, state it like this:
Eval complete.
Suite: evals/<feature>/test-set.jsonl — N examples
Model: <model-id>, temperature: X, seed: Y
Pass rate: A/N = B% (threshold: ≥ C% ✓)
Top failure mode: <category> (N cases — <root cause>)
Failure analysis: evals/<feature>/failure-analysis-<date>.md ✓
Baseline: <previous> = X% → this version = Y%, delta: +Zpp ✓
[First release: baseline field may be marked "N/A — initial version"]
Results stored: evals/<feature>/results-<date>.md ✓
The pass rate, failure analysis, and baseline comparison are not optional. On the first release, baseline may be marked "N/A — initial version"; for all subsequent releases it is required.
LLM outputs are stochastic. A prompt that works on the examples you tried does not predict performance on the next 1000 inputs. Eval suites catch regressions before users do — and they make it possible to improve without guessing.