Skill

harness:certify

Verifies the Evolver agent's score stability by running 3 evaluations on the current code, computing mean ± std from the combined_scores, and reporting a STABLE/MARGINAL/UNSTABLE verdict.

Python

Bash

testing

ai-ml

npx claudepluginhub raphaelchristi/harness-evolver --plugin harness-evolver

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/harness-evolver:certify

User invocable

Model invocable

Inline context

Default effort

Tool Access

This skill is limited to the following tools:

Read, Bash, Glob

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Verify score stability by running evaluation multiple times and reporting statistical confidence.

SKILL.md

64 lines · ~500 tokens

Stats

Language: Python

Stars: 21

Forks: 2

Maintenance: Excellent

Last Commit: Apr 18, 2026

/harness:certify

Verify score stability by running evaluation multiple times and reporting statistical confidence.

Resolve Tool Path

TOOLS="${EVOLVER_TOOLS:-$([ -d ".evolver/tools" ] && echo ".evolver/tools" || echo "$HOME/.evolver/tools")}"
EVOLVER_PY="${EVOLVER_PY:-$([ -f "$HOME/.evolver/venv/bin/python" ] && echo "$HOME/.evolver/venv/bin/python" || echo "python3")}"

What To Do

Read .evolver.json to get the best experiment and dataset.

Run evaluation 3 times on the current code (not a worktree — the best code is already merged):

for i in 1 2 3; do
    $EVOLVER_PY $TOOLS/run_eval.py \
        --config .evolver.json \
        --worktree-path "." \
        --experiment-prefix "certify-run-$i"
done

After all 3 runs complete, read results and compute statistics:

$EVOLVER_PY $TOOLS/read_results.py --experiments "certify-run-1-{suffix},certify-run-2-{suffix},certify-run-3-{suffix}" --config .evolver.json --format summary

Calculate mean and standard deviation from the 3 combined_scores.
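A minimal sketch of that calculation, assuming the three combined_scores have already been read into a list. The source does not say whether the standard deviation is the sample or the population form; this sketch uses the sample form (`statistics.stdev`), and both the `summarize` helper and the score values are hypothetical.

```python
import statistics


def summarize(scores):
    """Return mean, sample std, min, and max for a list of combined_scores."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to estimate a standard deviation")
    return {
        "mean": statistics.mean(scores),
        # Sample standard deviation (divides by n - 1); this choice is an assumption.
        "std": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }


# Hypothetical combined_scores from three certify runs.
summary = summarize([0.80, 0.82, 0.84])
print(f"Mean: {summary['mean']:.3f}")
print(f"Std:  {summary['std']:.3f}")
```

With only three runs the estimate of the spread is coarse, so a std close to a threshold is worth treating with caution.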

Report

CERTIFICATION REPORT
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Runs:  3
Mean:  {mean:.3f}
Std:   {std:.3f}
Range: {min:.3f} — {max:.3f}

Verdict: {STABLE|MARGINAL|UNSTABLE}

STABLE (std < 0.05): Score is reliable. The agent performs consistently.

MARGINAL (0.05 <= std < 0.10): Score varies moderately. Consider adding rubrics to reduce judge variance.

UNSTABLE (std >= 0.10): Score is unreliable. The LLM judge interprets criteria differently across runs. Add few-shot examples or tighter rubrics.
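The three thresholds above can be sketched as a small function. The name `verdict` is hypothetical and not part of the plugin's tools; only the cut-offs (0.05 and 0.10) come from the text.

```python
def verdict(std):
    """Map the std of the combined_scores to a certification verdict."""
    if std < 0.05:
        return "STABLE"
    if std < 0.10:
        return "MARGINAL"
    return "UNSTABLE"


# Boundary values fall into the higher-variance band, matching the
# "0.05 <= std < 0.10" and "std >= 0.10" definitions.
for value in (0.02, 0.05, 0.10):
    print(value, verdict(value))
```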

After Certification

If STABLE: suggest /harness:deploy to finalize. If MARGINAL: suggest adding rubrics to reduce judge variance. If UNSTABLE: suggest adding rubrics to dataset examples, or running /harness:evolve with heavy mode for more thorough evaluation.
