Skill

harness:evolve

Runs propose-evaluate-iterate loop to optimize and evolve AI agent performance using LangSmith evaluations and git worktrees for isolation. Requires .evolver.json setup.

Python

npx claudepluginhub raphaelchristi/harness-evolver --plugin harness-evolver

Popularity

Stars

Forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/harness-evolver:evolve

User invocable

Model invocable

Inline context

Default effort

Tool Access

This skill is limited to the following tools:

ReadWriteEditBashGlobGrepAgentAskUserQuestion

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Run the propose-evaluate-iterate loop. LangSmith is the evaluation backend, git worktrees provide isolation.

SKILL.md

326 lines · ~3.5k tokens

Similar Skills

harness:setup

Sets up Harness Evolver v3 in Python projects: explores codebase for entry points, configures LangSmith, runs baseline evaluation for LLM agent optimization.

8 tools

harness-evolver

evolve-interview

Socratic interview to clarify evolution tasks before execution. Asks targeted questions across four dimensions, scores ambiguity, produces initial.py, evaluate.py, config.json when ambiguity drops below 20%.

claude-evolve

autoresearch

Runs autonomous optimization loops to iteratively improve prompts, templates, configs, or code using four-way separation of main agent, eval agent, test runner, and deterministic eval.py judge. Invoke via /autoresearch or 'optimize this prompt'.

4 files

autoresearch

Stats

LanguagePython

Stars21

Forks2

MaintenanceExcellent

Last CommitApr 18, 2026

Actions

View Source View Plugin View on GitHub View README

Help us improve

Share bugs, ideas, or general feedback.

Stats

Actions

Help us improve

Share bugs, ideas, or general feedback.

/harness:evolve

Run the propose-evaluate-iterate loop. LangSmith is the evaluation backend, git worktrees provide isolation.

Setup

.evolver.json must exist. If not, tell user to run harness:setup.

TOOLS="${EVOLVER_TOOLS:-$([ -d ".evolver/tools" ] && echo ".evolver/tools" || echo "$HOME/.evolver/tools")}"
EVOLVER_PY="${EVOLVER_PY:-$([ -f "$HOME/.evolver/venv/bin/python" ] && echo "$HOME/.evolver/venv/bin/python" || echo "python3")}"

Never pass LANGSMITH_API_KEY inline. Tools resolve it automatically via _common.ensure_langsmith_api_key().

Arguments

--iterations N (default: ask or 5)
--mode light|balanced|heavy — override mode from config
--no-interactive — skip prompts, use defaults (for cron/background runs)

If interactive, ask iterations (3/5/10), target score (0.8/0.9/0.95/none), and execution mode (interactive/background).

Mode Parameters

MODES = {
  "light":    {"proposers": 2, "waves": 1, "concurrency": 5, "timeout": 60, "sample": 10, "analysis": "summary", "pairwise": False, "archive": "winner"},
  "balanced": {"proposers": 3, "waves": 2, "concurrency": 3, "timeout": 120, "sample": None, "analysis": "summary", "pairwise": "if_close", "archive": "all"},
  "heavy":    {"proposers": 5, "waves": 2, "concurrency": 3, "timeout": 300, "sample": None, "analysis": "full", "pairwise": True, "archive": "all"},
}

Read mode from config, allow --mode override:

MODE=$(python3 -c "import json; print(json.load(open('.evolver.json')).get('mode', 'balanced'))")

If not --no-interactive, confirm or switch:

{
  "question": "Mode: {MODE}. Continue?",
  "header": "Mode",
  "options": [
    {"label": "Yes, continue with {MODE}"},
    {"label": "Switch to light (~2 min/iter)"},
    {"label": "Switch to balanced (~8 min/iter)"},
    {"label": "Switch to heavy (~25 min/iter)"}
  ]
}

If changed, update config and re-read MODE.

Pre-Loop

Preflight

$EVOLVER_PY $TOOLS/preflight.py --config .evolver.json

Validates API key, config schema, LangSmith state, dataset health, and canary in one pass. If it fails, ask user: fix and retry, continue anyway, or abort. If health issues are auto-correctable, run /harness:health first.

Baseline LLM-Judge

If LLM evaluators (correctness, conciseness) are configured but baseline only has code-based scores, spawn the evaluator agent on the baseline experiment. Re-read and update best_score in .evolver.json after scoring.

Resolve Project Directory

Read project_dir from config. If non-empty, all worktree paths include it: {worktree}/{project_dir}/.

The Loop (per iteration)

0. Read State + Start Iteration Trace

BEST=$(python3 -c "import json; b=json.load(open('.evolver.json')).get('best_experiment'); print(b if b else '')")
PROJECT_DIR=$(python3 -c "import json; print(json.load(open('.evolver.json')).get('project_dir', ''))")
ITER_START=$(date +%s)

Start iteration trace (logs to LangSmith for observability):

ITER_TRACE=$($EVOLVER_PY $TOOLS/log_iteration.py --config .evolver.json --action start --version v{NNN} 2>/dev/null)
ITER_RUN_ID=$(echo "$ITER_TRACE" | python3 -c "import sys,json; print(json.load(sys.stdin).get('run_id',''))" 2>/dev/null)
ITER_DOTTED_ORDER=$(echo "$ITER_TRACE" | python3 -c "import sys,json; print(json.load(sys.stdin).get('dotted_order',''))" 2>/dev/null)

If log_iteration.py fails (no LangSmith, no key), the loop continues — tracing is optional.

If $BEST is empty (no baseline ran), skip data gathering — proposers work from code analysis only.

1. Gather Data (parallel)

Analysis format depends on mode (MODES[MODE]["analysis"]):

if [ -n "$BEST" ]; then
    ANALYSIS_FMT=$(python3 -c "m={'light':'summary','balanced':'summary','heavy':'full'}; print(m.get('$MODE','summary'))")
    $EVOLVER_PY $TOOLS/trace_insights.py --from-experiment "$BEST" --format $ANALYSIS_FMT --output trace_insights.json &
    $EVOLVER_PY $TOOLS/read_results.py --experiment "$BEST" --config .evolver.json --split train --format $ANALYSIS_FMT --output best_results.json &
    wait
fi

2. Generate Strategy + Lenses

From trace_insights.json, best_results.json, evolution_memory.md, production_seed.json:

strategy.md — Current iteration data ONLY. No stale info. Contents: target files, failure clusters (latest experiment), top 3 promoted memory insights (rec >= 2), approaches to avoid, top 3 failing examples with judge feedback. Cap at 1500 tokens.

lenses.json — Investigation questions for proposers:

One per failure cluster (max 3), one architecture, one production, one evolution_memory, one open
If evolution_archive/ has 3+ iterations, one archive_branch lens that suggests revisiting a losing candidate's approach
Sort by severity, cap at 5 lenses

3. Spawn Proposers (mode-dependent)

Proposer count: MODES[MODE]["proposers"] (light=2, balanced=3, heavy=5). Cap lenses at this number. Waves: MODES[MODE]["waves"] (light=1 single wave, balanced/heavy=2 two-wave).

Build IDENTICAL shared prefix (objective + files_to_read + context) for KV-cache sharing. Only the <lens> block differs — place it LAST. Include evolution_archive/ in <files_to_read> so proposers can grep prior candidates.

IMPORTANT: After each proposer worktree is created, copy untracked files and set trace nesting. Always use absolute paths:

SRC="$(dirname "$(git rev-parse --git-common-dir)")"
[ -n "$PROJECT_DIR" ] && SRC="$SRC/$PROJECT_DIR"
# If langsmith-tracing companion is installed, proposer traces nest under iteration:
[ -n "$ITER_DOTTED_ORDER" ] && export CC_LANGSMITH_PARENT_DOTTED_ORDER="$ITER_DOTTED_ORDER"
# For each worktree (after Agent creates it, before agent reads files):
cp "$SRC/.evolver.json" "$WT_PROJECT/.evolver.json"
[ -f "$SRC/.env" ] && cp "$SRC/.env" "$WT_PROJECT/.env"
[ -d "$SRC/evolution_archive" ] && cp -r "$SRC/evolution_archive" "$WT_PROJECT/evolution_archive"

Do NOT suppress stderr with 2>/dev/null — if the copy fails, you need to see the error.

Wave 1 — critical + high severity lenses, run independently in parallel:

Agent(
  subagent_type: "harness-proposer",
  isolation: "worktree",
  run_in_background: true,
  prompt: "{SHARED_PREFIX}\n\n<lens>\n{lens.question}\nSource: {lens.source}\n</lens>"
)

Wait for wave 1 to complete. Report each completion as it happens.

Wave 2 — medium + open lenses, see wave 1 results before starting:

Add to the shared context for wave 2 proposers:

<prior_proposals>
Wave 1 proposers completed:
- Proposer {id} ({lens}): {approach from proposal.md} — {committed/abstained}
...
</prior_proposals>

Wave 2 proposers see what wave 1 tried and can build on it, avoid duplication, or take complementary approaches. Research shows +14% quality when agents observe prior outputs.

If only 1-2 lenses total, run as single wave.

4. Evaluate Candidates

Run evaluations with mode parameters. run_eval.py auto-copies config files to worktrees:

CONCURRENCY=$(python3 -c "m={'light':5,'balanced':3,'heavy':3}; print(m.get('$MODE',3))")
TIMEOUT=$(python3 -c "m={'light':60,'balanced':120,'heavy':300}; print(m.get('$MODE',120))")
SAMPLE=$(python3 -c "m={'light':'10','balanced':'','heavy':''}; s=m.get('$MODE',''); print(f'--sample {s} --sample-split train' if s else '')")

for WT in {worktree_paths_with_commits}; do
    WT_PROJECT="$WT"
    [ -n "$PROJECT_DIR" ] && WT_PROJECT="$WT/$PROJECT_DIR"
    $EVOLVER_PY $TOOLS/run_eval.py --config "$SRC/.evolver.json" --worktree-path "$WT_PROJECT" --experiment-prefix v{NNN}-{id} --concurrency $CONCURRENCY --timeout $TIMEOUT $SAMPLE &
done
wait  # CRITICAL: wait for ALL evals before judge

Note: $SRC is set via git rev-parse --git-common-dir — resolves to the main repo root even when CWD is inside a worktree (--show-toplevel returns the worktree root, which is wrong).

Auto-spawn LLM-as-judge — check if LLM evaluators are configured and automatically spawn the evaluator agent. Do NOT leave this as a manual step for the user:

LLM_EVALS=$(python3 -c "import json; c=json.load(open('.evolver.json')); llm=[k for k in c['evaluators'] if k in ('correctness','conciseness')]; print(','.join(llm) if llm else '')")

If LLM_EVALS is non-empty, spawn the evaluator agent immediately after evals complete:

Agent(
  subagent_type: "harness-evaluator",
  prompt: "Experiments: {names}. Evaluators: {LLM_EVALS}. Dataset: {dataset_name}. Use rubrics from example metadata when available."
)

Wait for evaluator to complete before comparing. This is NOT optional — the combined score is meaningless without LLM-judge scores.

5. Compare + Constraint Gate + Merge

$EVOLVER_PY $TOOLS/read_results.py --experiments "{names}" --config .evolver.json --split held_out --output comparison.json

Winner = highest score on held-out data. Report Pareto front and diversity grid if multiple non-dominated candidates.

Pairwise comparison (mode-dependent: light=never, balanced=if top 2 within 5%, heavy=always):

$EVOLVER_PY $TOOLS/read_results.py --pairwise "{winner},{runner_up}" --config .evolver.json --split held_out

If pairwise disagrees with independent scoring, flag for user review.

Resolve project_dir for constraint worktree path. Baseline stays . because CWD is already the project directory:

WINNER_PROJECT="{winner_wt}"
[ -n "$PROJECT_DIR" ] && WINNER_PROJECT="{winner_wt}/$PROJECT_DIR"
$EVOLVER_PY $TOOLS/constraint_check.py --config .evolver.json --worktree-path "$WINNER_PROJECT" --baseline-path "."

If constraints fail, try next-best. If none pass, skip merge.

Efficiency gate (before merge): Check if winner's tokens or latency regressed significantly:

If tokens increased >2x AND score improved <2%: reject this candidate, try next-best
If latency increased >50% AND score improved <5%: reject this candidate, try next-best
In interactive mode: ask user to override if desired. In background mode: auto-reject.

If winner beats current best AND passes efficiency gate:

# 1. Backup config (merge will overwrite with worktree's stale copy)
$EVOLVER_PY $TOOLS/update_config.py --config .evolver.json --action backup

# 2. Merge
git merge {winner_branch} --no-edit -m "evolve: merge v{NNN} (score: {score})"

# 3. Restore config (merge brought stale copy)
$EVOLVER_PY $TOOLS/update_config.py --config .evolver.json --action restore

# 4. Update config with enriched history (one command, no inline Python)
$EVOLVER_PY $TOOLS/update_config.py --config .evolver.json --action update \
    --winner-experiment "{winner}" --winner-score {score} \
    --approach "{approach}" --lens "{lens}" \
    --tokens {tokens} --latency-ms {latency} --error-count {errors} \
    --passing {passing} --total {total} \
    --per-evaluator '{json_dict}' --code-loc {loc}

Git-tag for rollback:

git tag "evo-iter-v{NNN}" -m "harness: v{NNN} score={score}"

Note: uses evo-iter- prefix to avoid conflicts with /harness:deploy tags.

6. Post-Iteration

Archive candidates (light=winner only, balanced/heavy=all) for future proposer reference:

for CANDIDATE in {all_worktree_paths}; do
    $EVOLVER_PY $TOOLS/archive.py --config .evolver.json --version v{NNN}-{id} --experiment "{exp}" --worktree-path "$CANDIDATE" --score {score} --approach "{approach}" --lens "{lens}" $([ "{exp}" = "{winner}" ] && echo "--won")
done

Regression tracking (if not first iteration):

$EVOLVER_PY $TOOLS/regression_tracker.py --config .evolver.json --previous-experiment "$PREV" --current-experiment "$WINNER" --add-guards --auto-guard-failures --max-guards 5

Report: Iteration {i}/{N}: v{NNN} scored {score} (best: {best_score})

End iteration trace:

ITER_DURATION=$(( $(date +%s) - ITER_START ))
$EVOLVER_PY $TOOLS/log_iteration.py --config .evolver.json --action end \
    --run-id "$ITER_RUN_ID" --score {winner_score} --merged {true|false} \
    --approach "{approach}" --lens "{lens}" --candidates {num_evaluated} \
    --duration $ITER_DURATION 2>/dev/null

Consolidate (background):

Agent(subagent_type: "harness-consolidator", run_in_background: true, prompt: "Update evolution_memory.md...")

Proactive evaluator evolution: After reading all proposal.md files, check for ## Suggested Evaluators sections. If any proposer suggested new evaluators or rubrics, surface them:

Proposer v{NNN}-{id} suggested new evaluator: "{name}" — {description}

If multiple proposers suggest the same evaluator, prioritize it. Do NOT add evaluators that have no implementation — add_evaluator.py only supports code evaluators with templates (see CODE_EVALUATOR_TEMPLATES in the tool) and LLM evaluators (correctness, conciseness). If a suggestion doesn't match a known template, log it for the architect/critic to implement manually rather than silently adding a no-op entry.

Auto-trigger critic if score jumped >0.3 or hit target in <3 iterations.

Auto-trigger architect (opus model) if 3 consecutive iterations within 1% or score dropped.

Cleanup worktrees (free disk space after eval):

$EVOLVER_PY $TOOLS/cleanup_worktrees.py --dir "$SRC"

7. Gate Check

Score plateau: 3 scores within 2% → consider architect or stop
Target reached: best_score >= target_score → stop
Diminishing returns: avg improvement <0.5% over 5 iterations → stop

(Cost/latency regressions are now checked pre-merge in step 5, not post-merge.)

Final Report

$EVOLVER_PY $TOOLS/evolution_chart.py --config .evolver.json

Plus: LangSmith URL, git log --oneline summary, suggest /harness:deploy.

harness:evolve

Popularity

Invocation

Tool Access

Context Preview

SKILL.md

Similar Skills

Help us improve

Help us improve

Find plugins for your project

harness:evolve

Popularity

Invocation

Tool Access

Context Preview

SKILL.md

/harness:evolve

Setup

Arguments

Mode Parameters

Pre-Loop

Preflight

Baseline LLM-Judge

Resolve Project Directory

The Loop (per iteration)

0. Read State + Start Iteration Trace

1. Gather Data (parallel)

2. Generate Strategy + Lenses

3. Spawn Proposers (mode-dependent)

4. Evaluate Candidates

5. Compare + Constraint Gate + Merge

6. Post-Iteration

7. Gate Check

Final Report

Similar Skills

Help us improve

/harness:evolve

Setup

Arguments

Mode Parameters

Pre-Loop

Preflight

Baseline LLM-Judge

Resolve Project Directory

The Loop (per iteration)

0. Read State + Start Iteration Trace

1. Gather Data (parallel)

2. Generate Strategy + Lenses

3. Spawn Proposers (mode-dependent)

4. Evaluate Candidates

5. Compare + Constraint Gate + Merge

6. Post-Iteration

7. Gate Check

Final Report