Harness Forge is a Claude Code skill that runs an end-to-end
harness-optimization loop — propose → score → keep the Pareto-best → repeat — to improve the
code around a fixed model: its memory, retrieval, context construction, summarization, prompt
templates, and tool-selection logic. The model never changes; the scaffolding gets better.
It is a native reimplementation of the method in
Meta-Harness: End-to-End Optimization of Model Harnesses
(Lee, Nair, Zhang, Lee, Khattab & Finn, 2026). The original
reference repo ships ~1,260 lines of Python
(claude_wrapper.py + meta_harness.py) whose job is to drive a headless Claude: spawn a
session, parse its output, track tool calls, log everything, loop. Inside Claude Code, that
runtime already exists as first-class tools. So Harness Forge keeps only the irreducible domain
logic — a cheap scorer — and expresses the entire outer loop as native orchestration. The whole
search becomes ~75 lines instead of ~1,260.
The idea in one picture
seed the frontier with the incumbent harness (the thing to beat)
repeat:
PROPOSE k candidate harness variants ← parallel proposer agents write code
VALIDATE each imports / type-checks
SCORE each on a held-out-protected eval ← a $0, deterministic scorer
FRONTIER Pareto-merge: quality up, cost down, floor-respecting
final: score the frontier once on the untouched test split
The proposer is the mutation operator. The frontier is the search memory. The model is frozen
throughout — which is exactly why this fits a fixed / off-the-shelf-API deployment, where you
can't change the weights and the gain has to come from the harness.
The paper's headline result was +7.7 accuracy points at ~4× fewer context tokens on text
classification — a pure harness-side win. Harness Forge reproduces that shape of result natively.
Why native?
claude_wrapper.py is a hand-rolled agent runtime. Claude Code is an agent runtime. So every
orchestration piece has a native equivalent, and the Python driver becomes redundant:
| Meta-Harness (Python) | Harness Forge (native) |
|---|
claude_wrapper.run() — drive a headless Claude | Agent / agent() inside a Workflow |
meta_harness.py outer loop | a Workflow script (parallel / while) |
pending_eval.json handshake | a typed schema return — no file round-trip |
evolution_summary.jsonl / frontier.json | workflow variables + a results JSONL |
SKILL.md proposer prior | a skill / prior file the proposer agent reads |
| "run N iterations" | the workflow loop, /loop, or CronCreate |
| 3 candidates / iteration (serial) | parallel() — proposers run concurrently |
inner_loop.py scorer | stays a script — the one irreducible piece |
The only thing you still write is the cheap scorer + rubric + candidate interface. Everything
orchestration-shaped is free.
Quick start
1. Install the skill — one line:
curl -fsSL https://raw.githubusercontent.com/001TMF/harness-forge/main/install.sh | bash
Or as a Claude Code plugin (inside Claude Code):
/plugin marketplace add 001TMF/harness-forge
/plugin install harness-forge@tmf-skills
Other ways
# project-scoped (./.claude/skills, this repo only)
curl -fsSL https://raw.githubusercontent.com/001TMF/harness-forge/main/install.sh | bash -s -- --project
# via skills.sh (vercel-labs/skills)
npx skills add 001TMF/harness-forge --skill meta-harness -a claude-code
# manual
git clone https://github.com/001TMF/harness-forge.git
cp -r harness-forge/skills/meta-harness ~/.claude/skills/meta-harness
It auto-triggers when you talk about optimizing a harness, scaffold, prompt system, memory or
retrieval policy, or summarizer — or invoke it directly as the meta-harness skill.
2. Run the worked example ($0, no model, no network):
cd harness-forge/examples/memory-summary
python score_baselines.py
# -> baseline_incumbent fidelity=1.000 chars=269 (the system to beat)
3. Run a real search — invoke the Workflow tool with the example's loop script: