Skill

meta-harness

Optimizes the harness code around a fixed base model (memory, retrieval, prompts, tool selection) via evolutionary Pareto search using native Agent/Workflow tools instead of a standalone Python driver.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/harness-forge:meta-harness

Supporting Files

- assets/candidate_base-template.py
- assets/proposer-prior-template.md
- assets/scorer-template.py
- assets/workflow-template.js
- references/building-blocks.md
- references/method.md
- references/native-execution.md
- references/proteus-example.md
- scripts/pareto.py

SKILL.md

148 lines · ~2.3k tokens

Stats

Language: Python
Stars: 23
Forks: 3
Maintenance: Excellent
Last commit: Jun 14, 2026

Meta-Harness (native)

What this is

Meta-Harness optimizes the harness, not the model. The harness is the code around a fixed base model that decides what to store, retrieve, compress, and show while the model works. You hold the model frozen and search over that scaffolding: propose candidate variants, score each on a cheap deterministic eval, keep a Pareto frontier (quality up, cost down), and iterate. The proposer is an LLM agent writing code; the inner loop is a cheap scorer.

The Stanford repo (stanford-iris-lab/meta-harness) ships a Python driver — claude_wrapper.py (~720 lines) + meta_harness.py (~540 lines) — that reimplements an agent runtime to drive a headless Claude: spawn a session, parse stream-json, track tool calls, log everything, loop. You already are that runtime. So you run the same loop with native tools (Agent, Workflow, /loop) and keep only the irreducible domain logic — a $0 scorer. The orchestration was never the hard part; your harness provides it.

This skill is the method, reusable for any harness-optimization task. A fully worked example (optimizing proteus's campaign-memory summarizer) lives at ~/mh-proteus/ and is walked through in references/proteus-example.md.

When to use this

Strong fit when several of these hold (full criteria in references/method.md):

- The base model is fixed and the opportunity is better retrieval / memory / context / prompting / tool scaffolding. (This is the whole premise: if the gain must come from the model weights, this is the wrong tool; do RL/fine-tuning instead.)
- There are repeated episodes / tasks, not a one-off.
- There is a cheap, deterministic eval with a real success signal, or you can build one.
- The search set is large enough to expose failure modes, small enough to iterate.
- There are recurring error patterns a harness could fix systematically.

Poor fit: no stable eval loop, or purely subjective quality with no measurable criterion.

The loop (mental model)

seed frontier with the incumbent harness (the thing to beat)
repeat until budget/convergence:
    PROPOSE   k candidate harness variants   (proposer agents write code)
    VALIDATE  each imports / type-checks      (cheap reject of broken candidates)
    SCORE     each on the held-out-protected eval set   ($0 deterministic scorer)
    FRONTIER  Pareto-merge (quality up, cost down), floor-respecting
FINAL: score the frontier once on the untouched TEST split

The proposer is the mutation+crossover operator. The frontier is the persistent search memory. The held-out test split is touched exactly once, at the end — never during the search.
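
The FRONTIER step above can be sketched in a few lines of Python. This is a minimal illustration, not the contents of scripts/pareto.py: the `Result` record, `dominates`, `pareto_merge`, and the candidate names and numbers are all hypothetical, chosen only to show a floor-respecting merge on two axes (quality up, cost down).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    name: str       # candidate identifier (hypothetical)
    quality: float  # higher is better
    cost: float     # lower is better


def dominates(a: Result, b: Result) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    return (a.quality >= b.quality and a.cost <= b.cost
            and (a.quality > b.quality or a.cost < b.cost))


def pareto_merge(frontier: list[Result], scored: list[Result],
                 floor: float) -> list[Result]:
    """Merge newly scored candidates into the frontier, respecting the floor."""
    # Anything below the quality floor is rejected before dominance is checked.
    pool = [r for r in frontier + scored if r.quality >= floor]
    kept = [r for r in pool if not any(dominates(o, r) for o in pool)]
    return sorted(kept, key=lambda r: r.cost)


# Seed with the incumbent, then merge one round of scored candidates.
frontier = [Result("incumbent", quality=0.80, cost=100.0)]
round_1 = [
    Result("cheaper", quality=0.80, cost=60.0),    # dominates the incumbent
    Result("empty_ctx", quality=0.40, cost=5.0),   # cheap, but below the floor
    Result("richer", quality=0.90, cost=120.0),    # better quality, higher cost
]
frontier = pareto_merge(frontier, round_1, floor=0.75)
print([r.name for r in frontier])  # ['cheaper', 'richer']
```

Applying the floor before the dominance check is what keeps a very cheap, low-quality candidate from occupying a frontier slot.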

The five things YOU supply (everything else is native)

The orchestration is native; the domain is yours. Build these five — templates in assets/, how-to in references/building-blocks.md:

1. Candidate interface: one clean, swappable boundary (an ABC / Protocol). A candidate is a drop-in implementation. If your harness logic is tangled into one big function, extract the boundary first. → assets/candidate_base-template.py
2. A $0 deterministic scorer + rubric: the inner loop. It must vary with the candidate (see the frozen-replay trap below) and run with no LLM / no network so you can call it hundreds of times for free. → assets/scorer-template.py
3. An eval corpus with a held-out split: the tasks/records candidates are graded on, split so the test set shares no leaky structure with the search set.
4. A proposer prior: a short mini-SKILL the proposer agents load that steers them toward mechanism-level changes (not constant-tuning) and enforces anti-leakage. → assets/proposer-prior-template.md
5. A frontier + run log: the state carried across iterations (a JSON/JSONL pair, or just workflow variables). → scripts/pareto.py computes the frontier deterministically.
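
As a rough illustration of the first two blocks, here is one way a candidate boundary and a $0 scorer could look. This is a sketch under assumptions, not the shipped templates: the `Candidate` protocol, its `build_context` method, the two example candidates, the task fields (`query`, `memory`, `relevant`), and the recall/character-count rubric are all hypothetical.

```python
from __future__ import annotations

from typing import Protocol


class Candidate(Protocol):
    """Hypothetical boundary: choose which memory items to show the model."""

    def build_context(self, query: str, memory: list[str]) -> list[str]: ...


class KeepAll:
    """Trivial anchor: show everything."""

    def build_context(self, query: str, memory: list[str]) -> list[str]:
        return list(memory)


class KeywordFilter:
    """Keep only items that share at least one word with the query."""

    def build_context(self, query: str, memory: list[str]) -> list[str]:
        words = set(query.lower().split())
        return [m for m in memory if words & set(m.lower().split())]


def score(candidate: Candidate, tasks: list[dict]) -> dict:
    """Deterministic, offline scorer: mean recall (quality) and mean size (cost).

    Assumes every task lists at least one relevant item.
    """
    recalls, costs = [], []
    for t in tasks:
        ctx = candidate.build_context(t["query"], t["memory"])
        relevant = set(t["relevant"])
        recalls.append(len(relevant & set(ctx)) / len(relevant))
        costs.append(sum(len(item) for item in ctx))  # characters shown
    n = len(tasks)
    return {"quality": sum(recalls) / n, "cost": sum(costs) / n}


# One made-up task, purely for illustration.
tasks = [{
    "query": "dragon lair location",
    "memory": ["the dragon lair is north", "the inn sells ale",
               "bandits hold the bridge"],
    "relevant": ["the dragon lair is north"],
}]
keep_all_score = score(KeepAll(), tasks)
filter_score = score(KeywordFilter(), tasks)
print(keep_all_score, filter_score)
```

Because both axes are computed from what the candidate actually returns, the score changes when the candidate changes, which is the property the scorer needs.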

Non-negotiable guardrails — read before you run

These are where naive harness searches silently fail. Full treatment in references/method.md.

1. The frozen-replay defect (the #1 trap). If your eval replays cached outputs (a recorded run, a frozen trace), then a scaffolding candidate cannot change the recorded result; only the cost axis moves. A naive Pareto search then "wins" by emptying the context while the frozen quality score never drops, producing a confident, meaningless frontier. Fix: grade a quantity that genuinely varies with the candidate (retrieval relevance, compression fidelity, decision counterfactuals), and/or run quality as a one-sided do-no-harm floor rather than a maximize axis.
2. Held-out discipline. The proposer must see only the search-set results and the frontier, never the test split. Score test once, at the end.
3. Anti-Goodhart floor. The proposer is the most capable optimizer you have; it will exploit a soft metric. Put a hard floor on quality (and fix any known reward bugs) so it cannot win by degrading the thing you actually care about.
4. Anti-leakage. Forbid candidates from hardcoding any value from the eval set. Candidates must generalize to unseen tasks.
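
One cheap way to catch the frozen-replay defect before a search starts is to score a few deliberately different probe candidates, including a degenerate one, and confirm the quality axis moves. The sketch below is illustrative only: `quality_varies`, both quality functions, the probe candidates, and the task fields are hypothetical names, not part of this skill's assets.

```python
from __future__ import annotations

from typing import Callable

# Hypothetical candidate shape: maps memory items to the context shown.
Candidate = Callable[[list[str]], list[str]]


def keep_all(memory: list[str]) -> list[str]:
    return list(memory)


def empty_context(memory: list[str]) -> list[str]:
    return []  # degenerate probe: a sound eval must punish this


def frozen_quality(candidate: Candidate, task: dict) -> float:
    """Defective: replays a recorded score and never consults the candidate."""
    return task["recorded_quality"]


def live_quality(candidate: Candidate, task: dict) -> float:
    """Recall of relevant items in the context the candidate actually builds."""
    ctx = set(candidate(task["memory"]))
    relevant = set(task["relevant"])
    return len(ctx & relevant) / len(relevant)


def quality_varies(quality_fn: Callable[[Candidate, dict], float],
                   probes: list[Candidate], tasks: list[dict]) -> bool:
    """True if mean quality differs across the probe candidates."""
    means = {sum(quality_fn(c, t) for t in tasks) / len(tasks) for c in probes}
    return len(means) > 1


# One made-up task, purely for illustration.
tasks = [{"memory": ["a", "b", "c"], "relevant": ["a"], "recorded_quality": 0.9}]
probes = [keep_all, empty_context]
print(quality_varies(frozen_quality, probes, tasks))  # False: frozen replay
print(quality_varies(live_quality, probes, tasks))    # True
```

If the check returns False for the eval you plan to use, the search can only move the cost axis, and the resulting frontier should not be trusted.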

How to run it natively — pick a mode

| Mode | Use when | How |
| --- | --- | --- |
| Workflow (default) | a real search; want parallel proposers, journaled + resumable | `assets/workflow-template.js` via the `Workflow` tool |
| skill + `/loop` | leanest; you act as the proposer yourself, serially | a mini-skill body looped with `/loop` |
| Team | rarely; for durable, long-lived proposer/scorer/curator roles | `TeamCreate` + tasks + messaging |

Default to Workflow — it is the closest 1:1 to the Python harness and the best for an actual search. The mapping from each Meta-Harness piece to its native equivalent, and full mode details, are in references/native-execution.md.

Procedure

1. Frame the search. Name the fixed model, the harness surface to optimize, the eval, the two Pareto axes (quality, cost), and the budget. Confirm fit against references/method.md. If you cannot name a cheap eval that varies with the candidate, stop and build one first.
2. Build the five blocks from assets/ templates (or reuse an existing scaffold like ~/mh-proteus/). Validate the scorer runs at $0 on the incumbent before going further.
3. Baseline. Score the incumbent harness + a trivial anchor; seed the frontier.
4. Choose a mode (default Workflow). Copy assets/workflow-template.js, set the working dir, candidate count k, rounds/budget, and the floor.
5. Run the search. Proposers write candidates; the $0 scorer ranks them; Pareto-merge each round. Watch the frontier move (quality held at/above floor, cost dropping).
6. Inspect the frontier, not just the best point: the cost/quality tradeoff curve is the product.
7. Promote with re-validation. A frontier winner is a proposal. Before it ships, score it once on the untouched test split, and (if the search used a proxy eval) validate the proxy ranking against the real metric. Never let an unvalidated candidate become the new incumbent.
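
For the proxy check in the final step, one reasonable option (an assumption here, not something the skill prescribes) is a Spearman rank correlation between proxy scores and real-metric scores for the same candidates. The helper names are hypothetical and the numbers are made up for illustration.

```python
from __future__ import annotations


def ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman correlation, computed as the Pearson correlation of the ranks.

    Assumes neither input is constant (otherwise the denominator is zero).
    """
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Made-up scores for four frontier candidates, in the same order in both lists.
proxy = [0.62, 0.71, 0.80, 0.85]
real = [0.55, 0.60, 0.58, 0.70]
rho = spearman(proxy, real)
print(round(rho, 3))  # 0.8
```

A low correlation suggests the search optimized the proxy rather than the real metric; what counts as "low" is a judgment call to fix before looking at the result.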

Files in this skill

- references/method.md: theory, full fit criteria, the frozen-replay defect, all guardrails, how to choose the objective. Read when framing a new search or unsure about fit.
- references/native-execution.md: the Meta-Harness→native mapping table and all three execution modes in depth (Workflow / loop / Team), including how scoring runs inside a Workflow.
- references/building-blocks.md: how to build each of the five blocks, with worked patterns.
- references/proteus-example.md: the end-to-end worked example at ~/mh-proteus/.
- assets/workflow-template.js: the native search loop (the default mode). Parameterized.
- assets/scorer-template.py, assets/candidate_base-template.py, assets/proposer-prior-template.md: templates for the domain blocks you supply.
- scripts/pareto.py: deterministic Pareto-frontier computation over a results JSONL.
