From harness-evolver
Verifies the stability of an Evolver agent's score by running 3 evaluations on the current code, computing mean ± std from the combined_scores, and reporting a STABLE/MARGINAL/UNSTABLE verdict.
npx claudepluginhub raphaelchristi/harness-evolver --plugin harness-evolver
How this skill is triggered (by the user, by Claude, or both):
Slash command
/harness-evolver:certify
This skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
Verify score stability by running evaluation multiple times and reporting statistical confidence.
Verify score stability by running evaluation multiple times and reporting statistical confidence.
TOOLS="${EVOLVER_TOOLS:-$([ -d ".evolver/tools" ] && echo ".evolver/tools" || echo "$HOME/.evolver/tools")}"
EVOLVER_PY="${EVOLVER_PY:-$([ -f "$HOME/.evolver/venv/bin/python" ] && echo "$HOME/.evolver/venv/bin/python" || echo "python3")}"
Read .evolver.json to get the best experiment and dataset.
Run evaluation 3 times on the current code (not a worktree — the best code is already merged):
for i in 1 2 3; do
  "$EVOLVER_PY" "$TOOLS/run_eval.py" \
    --config .evolver.json \
    --worktree-path "." \
    --experiment-prefix "certify-run-$i"
done
After all 3 runs complete, read results and compute statistics:
"$EVOLVER_PY" "$TOOLS/read_results.py" --experiments "certify-run-1-{suffix},certify-run-2-{suffix},certify-run-3-{suffix}" --config .evolver.json --format summary
Calculate mean and standard deviation from the 3 combined_scores.
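As a minimal sketch of this step, the Python below computes the mean, standard deviation, and range of a list of scores using only the standard library. The function name `summarize_scores` and the example score values are hypothetical. Using the sample standard deviation (`statistics.stdev`, n - 1 denominator) is also an assumption, since the text does not say whether the sample or population form is intended.

```python
import statistics


def summarize_scores(scores):
    """Return run count, mean, std, min, and max for a list of combined_scores."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to estimate a standard deviation")
    return {
        "runs": len(scores),
        "mean": statistics.mean(scores),
        # Assumption: sample standard deviation (n - 1 denominator).
        # statistics.pstdev would give the population version instead.
        "std": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }


# Hypothetical combined_scores from the three certify runs.
summary = summarize_scores([0.80, 0.84, 0.82])
print(f"Mean: {summary['mean']:.3f}")
print(f"Std: {summary['std']:.3f}")
```

With only three runs, the sample and population forms differ noticeably (by a factor of about 1.22), so whichever one is used should be applied consistently when comparing against the verdict thresholds.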
CERTIFICATION REPORT
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Runs: 3
Mean: {mean:.3f}
Std: {std:.3f}
Range: {min:.3f} — {max:.3f}
Verdict: {STABLE|MARGINAL|UNSTABLE}
STABLE (std < 0.05): Score is reliable. The agent performs consistently.
MARGINAL (0.05 <= std < 0.10): Score varies moderately. Consider adding rubrics to reduce judge variance.
UNSTABLE (std >= 0.10): Score is unreliable. The LLM judge interprets criteria differently across runs. Add few-shot examples or tighter rubrics.
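The three thresholds above can be sketched as a small classifier. This is an illustration only: the function name `classify_stability` is hypothetical and not part of the plugin's tools. The boundary handling follows the inequalities as written (0.05 counts as MARGINAL, 0.10 as UNSTABLE).

```python
def classify_stability(std):
    """Map a standard deviation to a verdict using the thresholds listed above."""
    if std < 0.05:
        return "STABLE"
    if std < 0.10:
        return "MARGINAL"
    return "UNSTABLE"


# Hypothetical standard deviations, one per verdict band.
for value in (0.02, 0.07, 0.15):
    print(f"std={value:.3f} -> {classify_stability(value)}")
```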
If STABLE: suggest /harness:deploy to finalize.
If UNSTABLE: suggest adding rubrics to dataset examples, or running /harness:evolve with heavy mode for more thorough evaluation.