From harness-forge
Optimizes the harness code around a fixed base model (memory, retrieval, prompts, tool selection) via evolutionary Pareto search using native Agent/Workflow tools instead of a standalone Python driver.
Slash command
/harness-forge:meta-harness
Meta-Harness optimizes the harness, not the model. The harness is the code around a fixed base model that decides what to store, retrieve, compress, and show while the model works. You hold the model frozen and search over that scaffolding: propose candidate variants, score each on a cheap deterministic eval, keep a Pareto frontier (quality up, cost down), and iterate. The proposer is an LLM agent writing code; the inner loop is a cheap scorer.
The Stanford repo (stanford-iris-lab/meta-harness) ships a Python driver —
claude_wrapper.py (~720 lines) + meta_harness.py (~540 lines) — that reimplements an
agent runtime to drive a headless Claude: spawn a session, parse stream-json, track tool
calls, log everything, loop. You already are that runtime. So you run the same loop with
native tools (Agent, Workflow, /loop) and keep only the irreducible domain logic — a $0
scorer. The orchestration was never the hard part; your harness provides it.
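To make the "$0 scorer" concrete, here is a minimal sketch under stated assumptions: a candidate is a function that compresses a list of memory notes into a summary string, quality is the fraction of required facts that survive, and cost is the average summary length in whitespace tokens. The names `EvalCase`, `score_candidate`, and `keep_first_two` are hypothetical illustrations; they are not taken from the Stanford repo or from assets/scorer-template.py.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    notes: tuple[str, ...]            # raw memory the harness must compress
    required_facts: tuple[str, ...]   # substrings a good summary must keep


# Hypothetical candidate signature: notes in, summary text out.
Candidate = Callable[[Sequence[str]], str]


def score_candidate(candidate: Candidate, cases: Sequence[EvalCase]) -> dict:
    """Deterministic, model-free scoring: no API calls, so it costs $0."""
    kept = total = tokens = 0
    for case in cases:
        summary = candidate(case.notes)
        tokens += len(summary.split())   # cost proxy: whitespace tokens
        total += len(case.required_facts)
        kept += sum(fact in summary for fact in case.required_facts)
    return {
        "quality": kept / total if total else 0.0,
        "cost": tokens / len(cases) if cases else 0.0,
    }


def keep_first_two(notes: Sequence[str]) -> str:
    """Toy incumbent harness: keep only the first two notes."""
    return " ".join(notes[:2])


# Made-up eval cases, for illustration only.
CASES = [
    EvalCase(("ally: Mira", "debt: 40 gold", "weather: rain"), ("Mira", "40 gold")),
    EvalCase(("quest: find the key", "rumor: dragon", "ally: Tov"), ("key", "Tov")),
]

incumbent_score = score_candidate(keep_first_two, CASES)
print(incumbent_score)
```

Because the scorer is a pure function of the candidate and a fixed case list, running it twice on the incumbent should give identical numbers, which is a quick sanity check before starting a search.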
This skill is the method, reusable for any harness-optimization task. A fully worked
example (optimizing proteus's campaign-memory summarizer) lives at ~/mh-proteus/ and is
walked through in references/proteus-example.md.
Strong fit when several of these hold (full criteria in references/method.md):
Poor fit: no stable eval loop, or purely subjective quality with no measurable criterion.
seed frontier with the incumbent harness (the thing to beat)
repeat until budget/convergence:
    PROPOSE   k candidate harness variants (proposer agents write code)
    VALIDATE  each imports / type-checks (cheap reject of broken candidates)
    SCORE     each on the held-out-protected eval set ($0 deterministic scorer)
    FRONTIER  Pareto-merge (quality up, cost down), floor-respecting
FINAL: score the frontier once on the untouched TEST split
The proposer is the mutation+crossover operator. The frontier is the persistent search memory. The held-out test split is touched exactly once, at the end — never during the search.
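The loop above can be sketched in plain Python. This is an illustration, not the skill's Workflow template: the candidate is reduced to a toy integer "budget", and `toy_score`, `toy_propose`, and `toy_validate` are hypothetical stand-ins for the real scorer, proposer agents, and import check. The floor is assumed to be a minimum quality.

```python
import random
from typing import NamedTuple


class Scored(NamedTuple):
    candidate: int   # toy stand-in for a harness variant
    quality: float   # higher is better
    cost: float      # lower is better


def dominates(a: Scored, b: Scored) -> bool:
    """a is at least as good on both axes and strictly better on one."""
    return (a.quality >= b.quality and a.cost <= b.cost
            and (a.quality > b.quality or a.cost < b.cost))


def pareto_merge(frontier: list[Scored], new: list[Scored], floor: float) -> list[Scored]:
    """Merge new results into the frontier; anything below the quality floor is dropped."""
    pool = [s for s in frontier + new if s.quality >= floor]
    keep = {s for s in pool if not any(dominates(o, s) for o in pool)}
    return sorted(keep, key=lambda s: (s.cost, s.candidate))


def search(incumbent, propose, validate, score, rounds, k, floor):
    frontier = pareto_merge([], [score(incumbent, "eval")], floor)   # seed
    for _ in range(rounds):
        candidates = [c for c in propose(frontier, k) if validate(c)]
        scored = [score(c, "eval") for c in candidates]
        frontier = pareto_merge(frontier, scored, floor)
    # The TEST split is scored exactly once, after the search has finished.
    return frontier, [score(s.candidate, "test") for s in frontier]


# Toy domain (all hypothetical)
split_log: list[str] = []
rng = random.Random(0)


def toy_score(budget: int, split: str) -> Scored:
    split_log.append(split)   # record which split each scoring call touched
    return Scored(budget, min(1.0, budget / 8), float(budget))


def toy_propose(frontier: list[Scored], k: int) -> list[int]:
    # Mutation only: nudge the budget of a randomly chosen frontier member.
    return [rng.choice(frontier).candidate + rng.randint(-2, 2) for _ in range(k)]


def toy_validate(budget: int) -> bool:
    return budget > 0


frontier, test_scores = search(6, toy_propose, toy_validate, toy_score,
                               rounds=5, k=4, floor=0.5)
```

Recording the split on every scoring call, as `split_log` does here, is one cheap way to audit that the test split was only touched in the final step.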
The orchestration is native; the domain is yours. Build these five — templates in assets/,
how-to in references/building-blocks.md:
Templates: assets/candidate_base-template.py, assets/scorer-template.py, assets/proposer-prior-template.md. scripts/pareto.py computes the frontier deterministically.

These are where naive harness searches silently fail. Full treatment in references/method.md.
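The VALIDATE step of the loop (cheap rejection of candidates that do not import) might look like the sketch below. It assumes each candidate is a single Python file exposing a callable named `summarize`; that entry-point name and the `validate_candidate` helper are hypothetical, not taken from assets/candidate_base-template.py. It covers only the import check, not type-checking.

```python
import importlib.util
import tempfile
from pathlib import Path


def validate_candidate(path: Path, entry_point: str = "summarize") -> tuple[bool, str]:
    """Cheap reject: the file must import cleanly and expose a callable entry point."""
    try:
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)   # raises on syntax or import errors
    except Exception as exc:              # any failure at import time is a reject
        return False, f"import failed: {exc!r}"
    if not callable(getattr(module, entry_point, None)):
        return False, f"missing callable {entry_point!r}"
    return True, "ok"


# Three throwaway candidate files: one valid, one with a syntax error,
# one that imports but has no callable entry point.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp, "cand_good.py")
    good.write_text("def summarize(notes):\n    return ' '.join(notes[:2])\n")
    broken = Path(tmp, "cand_broken.py")
    broken.write_text("def summarize(notes)\n    return notes\n")   # missing colon
    wrong = Path(tmp, "cand_wrong.py")
    wrong.write_text("summarize = 'not a function'\n")
    results = {p.name: validate_candidate(p)[0] for p in (good, broken, wrong)}

print(results)
```

Importing a candidate executes its top-level code, so in a real search this check would more safely run in a subprocess or sandbox rather than in the orchestrating process.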
| Mode | Use when | How |
|---|---|---|
| Workflow (default) | a real search; want parallel proposers, journaled + resumable | assets/workflow-template.js via the Workflow tool |
| skill + /loop | leanest; you act as the proposer yourself, serially | a mini-skill body looped with /loop |
| Team | rarely — durable, long-lived proposer/scorer/curator roles | TeamCreate + tasks + messaging |
Default to Workflow — it is the closest 1:1 to the Python harness and the best for an actual
search. The mapping from each Meta-Harness piece to its native equivalent, and full mode details,
are in references/native-execution.md.
Check fit against the criteria in references/method.md. If you cannot name a cheap eval that varies with the candidate, stop and build one first.

Build the domain blocks from the assets/ templates (or reuse an existing scaffold like ~/mh-proteus/). Validate the scorer runs at $0 on the incumbent before going further.

In assets/workflow-template.js, set the working dir, candidate count k, rounds/budget, and the floor.

- references/method.md: theory, full fit criteria, the frozen-replay defect, all guardrails, how to choose the objective. Read when framing a new search or unsure about fit.
- references/native-execution.md: the Meta-Harness→native mapping table and all three execution modes in depth (Workflow / loop / Team), including how scoring runs inside a Workflow.
- references/building-blocks.md: how to build each of the five blocks, with worked patterns.
- references/proteus-example.md: the end-to-end worked example at ~/mh-proteus/.
- assets/workflow-template.js: the native search loop (the default mode). Parameterized.
- assets/scorer-template.py, assets/candidate_base-template.py, assets/proposer-prior-template.md: templates for the domain blocks you supply.
- scripts/pareto.py: deterministic Pareto-frontier computation over a results JSONL.

npx claudepluginhub 001tmf/harness-forge --plugin harness-forge