Optimize LLM agent code performance through automated evolution loops. Runs multi-agent proposals, LangSmith evaluations, and git worktrees to iteratively improve agents, with built-in evaluator auditing, dataset quality checks, stagnation detection, and architecture analysis when progress stalls.
npx claudepluginhub raphaelchristi/harness-evolver --plugin harness-evolver
Use this agent when the evolution loop stagnates or regresses. Analyzes the agent architecture and recommends topology changes (single-call → RAG, chain → ReAct, etc.).
Background agent for cross-iteration memory consolidation. Runs after each iteration to extract learnings and update evolution_memory.md. Read-only analysis; does not modify agent code.
Use this agent when scores converge suspiciously fast, evaluator quality is questionable, or the agent reaches high scores in few iterations. Detects gaming AND implements fixes.
Use this agent to evaluate experiment outputs using LLM-as-judge. Reads run inputs/outputs from LangSmith via langsmith-cli, judges correctness, and writes scores back as feedback. No external API keys needed.
Self-organizing agent optimizer. Investigates a data-driven lens (question), decides its own approach, and modifies real code in an isolated git worktree. May self-abstain if it cannot add meaningful value.
Use when the user wants to verify that the evolved agent's score is stable and reliable. Runs evaluation multiple times and reports mean ± std.
Use when the user is done evolving and wants to finalize, clean up, tag the result, or push the optimized agent.
Use when the user wants to run the optimization loop, improve agent performance, evolve the agent, or iterate on quality. Requires .evolver.json to exist (run harness:setup first).
Use when the user wants to check dataset quality, diagnose eval issues, or before running evolve. Checks size, difficulty distribution, dead examples, coverage, and splits. Auto-corrects issues found.
Use when the user wants to set up the evolver in their project, optimize an LLM agent, improve agent performance, or mentions evolver for the first time in a project without .evolver.json.
Point at any LLM agent codebase. Harness Evolver will autonomously improve it (prompts, routing, tools, architecture) using multi-agent evolution with LangSmith as the evaluation backend.
/plugin marketplace add raphaelchristi/harness-evolver-marketplace
/plugin install harness-evolver
npx harness-evolver@latest
Works with Claude Code, Cursor, Codex, and Windsurf.
cd my-llm-project
export LANGSMITH_API_KEY="lsv2_pt_..."
claude
/harness:setup # explores project, configures LangSmith
/harness:health # check dataset quality (auto-corrects issues)
/harness:evolve # runs the optimization loop
/harness:status # check progress (rich ASCII chart)
/harness:deploy # tag, push, finalize
Tested on a RAG agent (Agno framework, Gemini 3.1 Flash Lite, light mode):
xychart-beta
title "agno-deepknowledge: 0.575 → 1.000 (+74%)"
x-axis ["base", "v001", "v002", "v003", "v004", "v005", "v006", "v007"]
y-axis "Correctness" 0 --> 1
line [0.575, 0.575, 0.950, 0.950, 0.950, 0.950, 0.950, 1.0]
bar [0.575, 0.333, 0.950, 0.720, 0.875, 0.680, 0.880, 1.0]
| Iter | Score | Merged? | What the proposer did |
|---|---|---|---|
| baseline | 0.575 | n/a | Original agent: hallucinations, broken tool calls, no retry logic |
| v001 | 0.333 | Yes | Anti-hallucination prompt (100% correct when API responded, but 60% hit rate limits) |
| v002 | 0.950 | Yes | Breakthrough: inlined 17-line KB into prompt, eliminated vector search entirely. 5.7x faster, zero rate limits |
| v003 | 0.720 | No | Attempted hybrid retrieval; regressed, rejected by constraint gate |
| v004 | 0.875 | No | Response completeness fix; improved one case but regressed others |
| v005 | 0.680 | No | Reduced tool calls; broke edge cases, rejected |
| v006 | 0.880 | Yes | Evolution memory insight: combined v001's anti-hallucination with one-shot example from archive |
| v007 | 1.000 | Yes | One-shot example injection + rubric-aligned responses; perfect on held-out |
The line shows the best score so far (it only goes up: regressions aren't merged). The bars show each candidate's raw score. 4 merged, 3 rejected by gate checks. Not every iteration improves; that's the point.
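As a small illustration of how the two chart series relate, the sketch below recomputes the best-so-far line as a running maximum of the raw candidate scores and rederives the headline improvement. The score values are copied from the chart; the variable names are illustrative and not part of Harness Evolver.

```python
from itertools import accumulate

# Raw candidate scores, copied from the bars in the chart.
labels = ["base", "v001", "v002", "v003", "v004", "v005", "v006", "v007"]
raw_scores = [0.575, 0.333, 0.950, 0.720, 0.875, 0.680, 0.880, 1.0]

# The best-so-far line is the running maximum of the raw scores.
best_so_far = list(accumulate(raw_scores, max))

# Relative improvement of the final best score over the baseline.
baseline, final = raw_scores[0], best_so_far[-1]
relative_gain = (final - baseline) / baseline

for label, raw, best in zip(labels, raw_scores, best_so_far):
    print(f"{label:>5}  raw={raw:.3f}  best={best:.3f}")
print(f"relative gain: {relative_gain:+.0%}")  # prints "relative gain: +74%"
```

The running maximum reproduces the line series in the chart, and (1.000 - 0.575) / 0.575 rounds to the +74% in the title.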
| Feature | Description |
|---|---|
| LangSmith-Native | No custom scripts. Uses LangSmith Datasets, Experiments, and LLM-as-judge. Everything visible in the LangSmith UI. |
| Real Code Evolution | Proposers modify actual code in isolated git worktrees. Winners merge automatically. |
| Self-Organizing Proposers | Two-wave spawning, dynamic lenses from failure data, archive branching from losing candidates. Self-abstention when redundant. |
| Rubric-Based Evaluation | LLM-as-judge with justification-before-score, rubrics, few-shot calibration, pairwise comparison. |
| Smart Gating | Constraint gates, efficiency gate (cost/latency pre-merge), regression guards, Pareto selection, holdout enforcement, rate-limit early abort, stagnation detection. |
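To make the Pareto selection mentioned under Smart Gating more concrete, here is a minimal sketch of a Pareto front over score, cost, and latency. This is not the plugin's implementation: the `Candidate` fields, the `dominates` and `pareto_front` helpers, and all numbers are hypothetical, chosen only to show the idea that a candidate survives unless another one is at least as good on every axis and strictly better on one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    # Hypothetical fields; the real tool may track different metrics.
    name: str
    score: float      # higher is better
    cost_usd: float   # lower is better
    latency_s: float  # lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on every axis and strictly better on one."""
    no_worse = (
        a.score >= b.score
        and a.cost_usd <= b.cost_usd
        and a.latency_s <= b.latency_s
    )
    strictly_better = (
        a.score > b.score
        or a.cost_usd < b.cost_usd
        or a.latency_s < b.latency_s
    )
    return no_worse and strictly_better


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Made-up values for illustration only, not measured results.
candidates = [
    Candidate("a", 0.90, 0.020, 1.5),
    Candidate("b", 0.85, 0.010, 1.0),
    Candidate("c", 0.80, 0.030, 2.0),  # worse than "a" on all three axes
    Candidate("d", 0.95, 0.050, 3.0),
]
front = pareto_front(candidates)
print([c.name for c in front])  # prints ['a', 'b', 'd']
```

A front like this leaves the final choice to later gates: the efficiency and constraint gates in the table would then decide which surviving candidate, if any, is merged.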
/harness:evolve
|
+- 1. Preflight (validate state + dataset health + baseline scoring)
+- 2. Analyze (trace insights + failure clusters + strategy synthesis)
+- 3. Propose (spawn N proposers in git worktrees, two-wave)
+- 4. Evaluate (canary → run target → auto-spawn LLM-as-judge → rate-limit abort)
+- 5. Select (held-out comparison → Pareto front → efficiency gate → constraint gate → merge)
+- 6. Learn (archive candidates + regression guards + evolution memory)
+- 7. Gate (plateau → target check → critic/architect → continue or stop)
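The Gate step can be pictured with a short sketch, assuming a simple plateau rule: the best score has improved by less than some threshold over the last few iterations. The function names, the `window`, `min_delta`, and `target` parameters, and the returned labels are all hypothetical; the actual stagnation detection and the order of checks may differ.

```python
def has_plateaued(best_scores: list[float], window: int = 3,
                  min_delta: float = 0.01) -> bool:
    """True if the best score gained less than min_delta over `window` iterations."""
    if len(best_scores) <= window:
        return False  # not enough history to judge
    return best_scores[-1] - best_scores[-1 - window] < min_delta


def gate(best_scores: list[float], target: float = 1.0) -> str:
    """Decide the next action (hypothetical labels, not the plugin's output)."""
    if best_scores[-1] >= target:
        return "stop"      # target reached
    if has_plateaued(best_scores):
        return "escalate"  # e.g. hand off to the critic or architect agent
    return "continue"


# Best-so-far scores taken from the chart, truncated at different iterations.
history = [0.575, 0.575, 0.950, 0.950, 0.950, 0.950, 0.950, 1.0]
print(gate(history[:3]))  # after v002: prints "continue"
print(gate(history[:6]))  # after v005: prints "escalate"
print(gate(history))      # after v007: prints "stop"
```

With these assumed defaults, the run in the chart would have been flagged as stalled after three iterations stuck at 0.950, which is the kind of situation the critic and architect agents are described as handling.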
Detailed loop with all sub-steps