LangSmith-native autonomous agent optimization — evolves LLM agent code using multi-agent proposers, LangSmith experiments, and git worktrees.

```
npx claudepluginhub raphaelchristi/harness-evolver
```
Point at any LLM agent codebase. Harness Evolver will autonomously improve it — prompts, routing, tools, architecture — using multi-agent evolution with LangSmith as the evaluation backend.
```
/plugin marketplace add raphaelchristi/harness-evolver-marketplace
/plugin install harness-evolver
```

```
npx harness-evolver@latest
```
Works with Claude Code, Cursor, Codex, and Windsurf.
```shell
cd my-llm-project
export LANGSMITH_API_KEY="lsv2_pt_..."
claude
```

```
/harness:setup    # explores the project, configures LangSmith
/harness:health   # checks dataset quality (auto-corrects issues)
/harness:evolve   # runs the optimization loop
/harness:status   # checks progress (rich ASCII chart)
/harness:deploy   # tag, push, finalize
```
Tested on a RAG agent (Agno framework, Gemini 3.1 Flash Lite, light mode):
```mermaid
xychart-beta
    title "agno-deepknowledge: 0.575 → 1.000 (+74%)"
    x-axis ["base", "v001", "v002", "v003", "v004", "v005", "v006", "v007"]
    y-axis "Correctness" 0 --> 1
    line [0.575, 0.575, 0.950, 0.950, 0.950, 0.950, 0.950, 1.0]
    bar [0.575, 0.333, 0.950, 0.720, 0.875, 0.680, 0.880, 1.0]
```
| Iter | Score | Merged? | What the proposer did |
|---|---|---|---|
| baseline | 0.575 | — | Original agent — hallucinations, broken tool calls, no retry logic |
| v001 | 0.333 | Yes | Anti-hallucination prompt (100% correct when API responded, but 60% hit rate limits) |
| v002 | 0.950 | Yes | Breakthrough: inlined 17-line KB into prompt, eliminated vector search entirely. 5.7x faster, zero rate limits |
| v003 | 0.720 | No | Attempted hybrid retrieval — regressed, rejected by constraint gate |
| v004 | 0.875 | No | Response completeness fix — improved one case but regressed others |
| v005 | 0.680 | No | Reduced tool calls — broke edge cases, rejected |
| v006 | 0.880 | Yes | Evolution memory insight: combined v001's anti-hallucination with one-shot example from archive |
| v007 | 1.000 | Yes | One-shot example injection + rubric-aligned responses — perfect on held-out |
The line shows the best score so far (it only goes up — regressions aren't merged). The bars show each candidate's raw score. Four candidates were merged, three rejected by gate checks. Not every iteration improves — that's the point.
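The monotone "best score" line is just a running maximum over the per-iteration candidate scores. A minimal sketch (hypothetical helper function, not the plugin's actual code):

```python
def best_so_far(baseline: float, candidate_scores: list[float]) -> list[float]:
    """Running best score across iterations; a regression never lowers it."""
    best = baseline
    history = [best]
    for score in candidate_scores:
        best = max(best, score)  # a worse candidate leaves the best unchanged
        history.append(best)
    return history

# Reproduces the line series in the chart above from the bar series:
print(best_so_far(0.575, [0.333, 0.950, 0.720, 0.875, 0.680, 0.880, 1.0]))
# [0.575, 0.575, 0.95, 0.95, 0.95, 0.95, 0.95, 1.0]
```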
| Feature | Description |
|---|---|
| LangSmith-Native | No custom scripts. Uses LangSmith Datasets, Experiments, and LLM-as-judge. Everything visible in the LangSmith UI. |
| Real Code Evolution | Proposers modify actual code in isolated git worktrees. Winners merge automatically. |
| Self-Organizing Proposers | Two-wave spawning, dynamic lenses from failure data, archive branching from losing candidates. Self-abstention when redundant. |
| Rubric-Based Evaluation | LLM-as-judge with justification-before-score, rubrics, few-shot calibration, pairwise comparison. |
| Smart Gating | Constraint gates, efficiency gate (cost/latency pre-merge), regression guards, Pareto selection, holdout enforcement, rate-limit early abort, stagnation detection. |
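Pareto selection keeps every candidate that is not strictly dominated on the quality/cost trade-off. A hypothetical illustration (assuming each candidate reduces to a `(score, cost)` pair; not the plugin's actual implementation):

```python
def pareto_front(candidates: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Keep candidates not dominated by any other.

    Each candidate is (score, cost); higher score and lower cost are better.
    A candidate is dominated if another is at least as good on both axes
    and strictly better on at least one.
    """
    front = []
    for i, (s, c) in enumerate(candidates):
        dominated = any(
            s2 >= s and c2 <= c and (s2 > s or c2 < c)
            for j, (s2, c2) in enumerate(candidates)
            if j != i
        )
        if not dominated:
            front.append((s, c))
    return front

# The strong-but-costly and cheap-but-weaker candidates both survive;
# the one beaten on both axes is dropped.
print(pareto_front([(0.95, 1.0), (0.88, 0.4), (0.72, 1.2)]))
# [(0.95, 1.0), (0.88, 0.4)]
```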
```
/harness:evolve
 |
 +- 1. Preflight (validate state + dataset health + baseline scoring)
 +- 2. Analyze  (trace insights + failure clusters + strategy synthesis)
 +- 3. Propose  (spawn N proposers in git worktrees, two-wave)
 +- 4. Evaluate (canary → run target → auto-spawn LLM-as-judge → rate-limit abort)
 +- 5. Select   (held-out comparison → Pareto front → efficiency gate → constraint gate → merge)
 +- 6. Learn    (archive candidates + regression guards + evolution memory)
 +- 7. Gate     (plateau → target check → critic/architect → continue or stop)
```
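The final gate stage can be sketched as a simple control loop over stages 1–6 (hypothetical names and thresholds; the real loop also involves critic/architect review and rate-limit aborts):

```python
def evolution_loop(run_iteration, baseline: float,
                   target: float = 1.0, patience: int = 3,
                   max_iters: int = 20) -> float:
    """Run evolve iterations until the target is hit or progress stalls."""
    best, stale = baseline, 0
    for i in range(max_iters):
        score = run_iteration(i)    # stages 1-6: propose, evaluate, select
        if score > best:
            best, stale = score, 0  # improvement: reset stagnation counter
        else:
            stale += 1              # rejected or regressed candidate
        if best >= target:
            return best             # target check: stop on success
        if stale >= patience:
            return best             # plateau: stop after N stale iterations
    return best

# Driving it with the candidate scores from the results table above:
scores = [0.333, 0.950, 0.720, 0.875, 0.680, 0.880, 1.0]
print(evolution_loop(lambda i: scores[i], baseline=0.575, patience=5))
# 1.0
```

Note how `patience` changes the outcome: with a tighter budget (e.g. `patience=3`) the loop would stop during the v003–v005 stretch of rejected candidates and return 0.95 instead of reaching 1.0.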
Detailed loop with all sub-steps