Plugin

harness-evolver

Name: harness-evolver
Author: raphaelchristi

Evolve LLM agent code autonomously in Python projects using LangSmith evaluations, multi-agent proposers, and git worktrees. Run propose-evaluate-iterate loops to boost performance, check dataset quality and score stability, analyze architectures on stagnation, audit evaluators for issues, generate diverse tests, monitor progress charts, and commit tagged improvements.

Component Overview

Agents

harness-architect, harness-consolidator +4

Skills

harness:certify, harness:deploy +4

Hooks

SessionStart

Install

npx claudepluginhub raphaelchristi/harness-evolver --plugin harness-evolver

Component Details

Agents (6)

harness-architect

/harness-architect

Use this agent when the evolution loop stagnates or regresses. Analyzes the agent architecture and recommends topology changes (single-call → RAG, chain → ReAct, etc.).

harness-consolidator

/harness-consolidator

Background agent for cross-iteration memory consolidation. Runs after each iteration to extract learnings and update evolution_memory.md. Read-only analysis — does not modify agent code.

harness-critic

/harness-critic

Use this agent when scores converge suspiciously fast, evaluator quality is questionable, or the agent reaches high scores in few iterations. Detects gaming AND implements fixes.

harness-evaluator

/harness-evaluator

Use this agent to evaluate experiment outputs using LLM-as-judge. Reads run inputs/outputs from LangSmith via langsmith-cli, judges correctness, and writes scores back as feedback. No external API keys needed.

harness-proposer

/harness-proposer

Self-organizing agent optimizer. Investigates a data-driven lens (question), decides its own approach, and modifies real code in an isolated git worktree. May self-abstain if it cannot add meaningful value.
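As a rough illustration of what "isolated git worktree" means in practice, the sketch below builds the standard `git worktree add -b <branch> <path>` command from Python. The helper names, the branch name `proposal/v001`, and the path `../wt-v001` are hypothetical and not taken from the plugin; how harness-proposer actually creates its worktrees is not documented here.

```python
import subprocess
from pathlib import Path


def worktree_add_command(repo: Path, branch: str, path: Path) -> list[str]:
    """Build the git command that creates a new branch checked out in its own worktree."""
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path)]


def create_worktree(repo: Path, branch: str, path: Path) -> None:
    """Run the command; raises CalledProcessError if git rejects it."""
    subprocess.run(worktree_add_command(repo, branch, path), check=True)


# Hypothetical branch and directory names, for illustration only.
cmd = worktree_add_command(Path("."), "proposal/v001", Path("../wt-v001"))
print(" ".join(cmd))
```

Each worktree has its own working directory and branch, so a proposer's edits stay out of the main checkout until they are merged.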

harness-testgen

/harness-testgen

Use this agent to generate test inputs for the evaluation dataset. Spawned by the setup skill when no test data exists.

Skills (6)

harness:certify

/certify

Use when the user wants to verify that the evolved agent's score is stable and reliable. Runs evaluation multiple times and reports mean ± std.
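A minimal sketch of the "mean ± std" report, assuming the repeated evaluation runs each yield a single score. The scores below are made up, and the choice of sample standard deviation (rather than population) is an assumption; the plugin does not state which it uses.

```python
from statistics import mean, stdev


def summarize(scores: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) of repeated evaluation scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(scores), stdev(scores)


# Hypothetical scores from five repeated evaluation runs.
runs = [0.90, 0.95, 1.00, 0.95, 0.95]
avg, spread = summarize(runs)
print(f"{avg:.3f} ± {spread:.3f}")
```

A small spread relative to the gain over baseline suggests the improvement is not just run-to-run noise.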

harness:deploy

/deploy

Use when the user is done evolving and wants to finalize, clean up, tag the result, or push the optimized agent.

harness:evolve

/evolve

Use when the user wants to run the optimization loop, improve agent performance, evolve the agent, or iterate on quality. Requires .evolver.json to exist (run harness:setup first).

harness:health

/health

Use when the user wants to check dataset quality, diagnose eval issues, or before running evolve. Checks size, difficulty distribution, dead examples, coverage, and splits. Auto-corrects issues found.

harness:setup

/setup

Use when the user wants to set up the evolver in their project, optimize an LLM agent, improve agent performance, or mentions evolver for the first time in a project without .evolver.json.

harness:status

/status

Use when the user asks about evolution progress, current scores, best version, how many iterations ran, or whether the loop is stagnating.

Hooks (1)

Review workflow modifications before installing

Event Hooks

1 hook across 1 event

README

Harness Evolver

Point at any LLM agent codebase. Harness Evolver will autonomously improve it — prompts, routing, tools, architecture — using multi-agent evolution with LangSmith as the evaluation backend.

Install

Claude Code Plugin (recommended)

/plugin marketplace add raphaelchristi/harness-evolver-marketplace
/plugin install harness-evolver

npx (first-time setup or non-Claude Code runtimes)

npx harness-evolver@latest

Works with Claude Code, Cursor, Codex, and Windsurf.

Quick Start

cd my-llm-project
export LANGSMITH_API_KEY="lsv2_pt_..."
claude

/harness:setup      # explores project, configures LangSmith
/harness:health     # check dataset quality (auto-corrects issues)
/harness:evolve     # runs the optimization loop
/harness:status     # check progress (rich ASCII chart)
/harness:deploy     # tag, push, finalize

What It Looks Like

Tested on a RAG agent (Agno framework, Gemini 3.1 Flash Lite, light mode):

xychart-beta
    title "agno-deepknowledge: 0.575 → 1.000 (+74%)"
    x-axis ["base", "v001", "v002", "v003", "v004", "v005", "v006", "v007"]
    y-axis "Correctness" 0 --> 1
    line [0.575, 0.575, 0.950, 0.950, 0.950, 0.950, 0.950, 1.0]
    bar [0.575, 0.333, 0.950, 0.720, 0.875, 0.680, 0.880, 1.0]

Iter	Score	Merged?	What the proposer did
baseline	0.575	—	Original agent — hallucinations, broken tool calls, no retry logic
v001	0.333	Yes	Anti-hallucination prompt (100% correct when API responded, but 60% hit rate limits)
v002	0.950	Yes	Breakthrough: inlined 17-line KB into prompt, eliminated vector search entirely. 5.7x faster, zero rate limits
v003	0.720	No	Attempted hybrid retrieval — regressed, rejected by constraint gate
v004	0.875	No	Response completeness fix — improved one case but regressed others
v005	0.680	No	Reduced tool calls — broke edge cases, rejected
v006	0.880	Yes	Evolution memory insight: combined v001's anti-hallucination with one-shot example from archive
v007	1.000	Yes	One-shot example injection + rubric-aligned responses — perfect on held-out

The line shows the best score so far, which only goes up because regressions aren't merged. The bars show each candidate's raw score. 4 merged, 3 rejected by gate checks. Not every iteration improves, and that's the point.
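The relationship between the chart's bars and line can be reproduced as a running maximum over the candidate scores. The sketch below uses the scores from the chart; it only recomputes the plotted line and the headline gain, and does not model the plugin's merge or gate decisions.

```python
from itertools import accumulate

# Raw candidate scores (the bars), baseline first, as listed in the chart.
candidate_scores = [0.575, 0.333, 0.950, 0.720, 0.875, 0.680, 0.880, 1.0]

# Best score so far (the line): a running maximum over the bars.
best_so_far = list(accumulate(candidate_scores, max))

# Relative gain of the final best score over the baseline.
relative_gain = (best_so_far[-1] - candidate_scores[0]) / candidate_scores[0]

print(best_so_far)
print(f"{relative_gain:+.0%}")  # prints "+74%"
```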

How It Works


LangSmith-Native	No custom scripts. Uses LangSmith Datasets, Experiments, and LLM-as-judge. Everything visible in the LangSmith UI.
Real Code Evolution	Proposers modify actual code in isolated git worktrees. Winners merge automatically.
Self-Organizing Proposers	Two-wave spawning, dynamic lenses from failure data, archive branching from losing candidates. Self-abstention when redundant.
Rubric-Based Evaluation	LLM-as-judge with justification-before-score, rubrics, few-shot calibration, pairwise comparison.
Smart Gating	Constraint gates, efficiency gate (cost/latency pre-merge), regression guards, Pareto selection, holdout enforcement, rate-limit early abort, stagnation detection.
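To make "Pareto selection" concrete, here is a generic sketch of a Pareto front over score, cost, and latency. The `Candidate` fields, units, and example values are assumptions for illustration; the plugin's actual objectives and selection code are not shown on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    score: float    # higher is better
    cost: float     # lower is better (hypothetical units)
    latency: float  # lower is better (hypothetical, seconds)


def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on every objective and strictly better on one."""
    no_worse = a.score >= b.score and a.cost <= b.cost and a.latency <= b.latency
    strictly_better = a.score > b.score or a.cost < b.cost or a.latency < b.latency
    return no_worse and strictly_better


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Made-up candidates: "c" is worse than "a" on all three objectives.
candidates = [
    Candidate("a", score=0.95, cost=1.0, latency=2.0),
    Candidate("b", score=0.90, cost=0.5, latency=1.0),
    Candidate("c", score=0.85, cost=1.2, latency=2.5),
]
front = pareto_front(candidates)
print([c.name for c in front])  # prints ['a', 'b']
```

Candidates on the front represent different trade-offs (here, higher score versus lower cost and latency), so a further rule such as an efficiency or constraint gate is still needed to pick one.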


Evolution Loop

/harness:evolve
  |
  +- 1. Preflight  (validate state + dataset health + baseline scoring)
  +- 2. Analyze    (trace insights + failure clusters + strategy synthesis)
  +- 3. Propose    (spawn N proposers in git worktrees, two-wave)
  +- 4. Evaluate   (canary → run target → auto-spawn LLM-as-judge → rate-limit abort)
  +- 5. Select     (held-out comparison → Pareto front → efficiency gate → constraint gate → merge)
  +- 6. Learn      (archive candidates + regression guards + evolution memory)
  +- 7. Gate       (plateau → target check → critic/architect → continue or stop)
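Step 7 can be pictured as a small decision function over the history of best scores. The sketch below is a simplification under stated assumptions: the `patience` and `min_delta` parameters, the return labels, and the order of the checks are hypothetical, not the plugin's actual rules.

```python
def has_plateaued(best_scores: list[float], patience: int = 3, min_delta: float = 0.01) -> bool:
    """True if the best score improved by less than min_delta over the last `patience` iterations."""
    if len(best_scores) <= patience:
        return False
    reference = best_scores[-patience - 1]
    return best_scores[-1] - reference < min_delta


def gate(best_scores: list[float], target: float) -> str:
    """Decide what to do after an iteration (hypothetical labels)."""
    if best_scores[-1] >= target:
        return "stop"      # target reached
    if has_plateaued(best_scores):
        return "escalate"  # e.g. hand over to a critic or architect agent
    return "continue"


# Best-score history from the example run, up to v006.
history = [0.575, 0.575, 0.950, 0.950, 0.950, 0.950, 0.950]
print(gate(history, target=1.0))  # prints "escalate"
```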



Stats

Version: 6.4.2

Stars: 12

Forks: 2

Maintenance: Excellent

License: MIT

Last Commit: Apr 18, 2026

Added: Mar 31, 2026


Available In

harness-evolver-marketplace

Safety Signals

Caution

Uses power tools

Uses Bash, Write, or Edit tools
