Stats

Actions

Available In

Tags

modelharness

Make every model cheaper or better. Measured on four Claude models — none got worse.

What it actually is: a zero-config Claude Code plugin. On every session start it injects a ≈910-token behavioral core — six working practices distilled from how Fable 5 was trained to operate — plus three on-demand skills and a fresh-context verifier agent. No commands to learn; the model just starts working differently:

✅ Grounded progress — only claims backed by a tool result; "tests fail" said plainly

⚡ Act, don't overplan — enough information means act, not narrate options

🎯 Autonomy calibration — decides minor things itself, asks only on scope or destructive actions

🔍 Self-verification loops — a checkable definition of done, real checks on a cadence, a fresh-context verifier before "done"

🔀 Delegation triggers — explicit rules for when to fan work out to subagents

📝 Cross-session memory — writes lessons and plans to files, so the next session can pick up the work

What the 17 tasks test

🐛 Bug hunts _{4 tasks · 96 runs}
Find and fix planted defects: TTL cache, CSV quoting, rate limiter, date rollover

✨ Features from spec _{4 tasks · 96 runs}
Build to a written spec: retry backoff, config merging, cursor pagination, slugify

♻️ Refactors _{2 tasks · 48 runs}
Restructure code with zero behavior change, verified structurally

🧠 Long-horizon builds _{2 tasks · 48 runs}
Multi-stage pipelines where later steps depend on earlier decisions

🧩 Spec-dense traps _{3 tasks · 72 runs}
18+ interacting rules (discount engine, mini-interpreter) that punish shallow reading

🔁 Session handoffs _{2 tasks · 48 runs}
A fresh session must finish another session's work — memory is the only bridge

17 tasks × 3 attempts × 8 configurations = 408 runs. Grading is hidden and binary: test suites the agent never sees decide pass/fail. No LLM judge. Every task ships with a reference solution proving it solvable.

The results

Cost per task with and without modelharness

The numbers, exactly

Same 17 tasks, 3 runs per configuration. Higher pass rate and lower cost/time are better. 🟢 = better with modelharness, 🔴 = worse (explained in the last row).

What we measured

Fable 5

Opus 4.8 ⭐ _{biggest gain}

Sonnet 4.6

Haiku 4.5

_{plain model}

_{+ modelharness}

_{plain model}

_{+ modelharness}

_{plain model}

_{+ modelharness}

_{plain model}

_{+ modelharness}

Tasks completed successfully

100%

🔴 98%

🟢 100%

Average cost per task

$1.80

🟢 $1.73

$0.89

🟢 $0.77

$0.41

🟢 $0.40

$0.24

_{· bug hunts}

_$1.30

_{🟢 $1.26}

_$0.63

_{🟢 $0.55}

_$0.26

_{🟢 $0.24}

_$0.16

_{🟢 $0.13}

_{· features from spec}

_$1.44

_{🔴 $1.49}

_$0.76

_{🟢 $0.60}

_$0.34

_{🟢 $0.32}

_$0.18

_{🟢 $0.16}

_{· refactors}

_$0.91

_$0.51

modelharness

Make every model cheaper or better. Measured on four Claude models — none got worse.


✅ Grounded progress — only claims backed by a tool result; "tests fail" said plainly	⚡ Act, don't overplan — enough information means act, not narrate options
🎯 Autonomy calibration — decides minor things itself, asks only on scope or destructive actions	🔍 Self-verification loops — a checkable definition of done, real checks on a cadence, a fresh-context verifier before "done"
🔀 Delegation triggers — explicit rules for when to fan work out to subagents	📝 Cross-session memory — writes lessons and plans to files, so the next session can pick up the work

What the 17 tasks test


🐛 Bug hunts _{4 tasks · 96 runs} Find and fix planted defects: TTL cache, CSV quoting, rate limiter, date rollover	✨ Features from spec _{4 tasks · 96 runs} Build to a written spec: retry backoff, config merging, cursor pagination, slugify	♻️ Refactors _{2 tasks · 48 runs} Restructure code with zero behavior change, verified structurally
🧠 Long-horizon builds _{2 tasks · 48 runs} Multi-stage pipelines where later steps depend on earlier decisions	🧩 Spec-dense traps _{3 tasks · 72 runs} 18+ interacting rules (discount engine, mini-interpreter) that punish shallow reading	🔁 Session handoffs _{2 tasks · 48 runs} A fresh session must finish another session's work — memory is the only bridge

The results

Cost per task with and without modelharness

The numbers, exactly

Same 17 tasks, 3 runs per configuration. Higher pass rate and lower cost/time are better. 🟢 = better with modelharness, 🔴 = worse (explained in the last row).

What we measured	Fable 5		Opus 4.8 ⭐ _{biggest gain}		Sonnet 4.6		Haiku 4.5
What we measured	_{plain model}	_{+ modelharness}	_{plain model}	_{+ modelharness}	_{plain model}	_{+ modelharness}	_{plain model}	_{+ modelharness}
Tasks completed successfully	100%	100%	100%	100%	100%	100%	🔴 98%	🟢 100%
Average cost per task	$1.80	🟢 $1.73	$0.89	🟢 $0.77	$0.41	🟢 $0.40	$0.24	$0.24
_{· bug hunts}	_$1.30	_{🟢 $1.26}	_$0.63	_{🟢 $0.55}	_$0.26	_{🟢 $0.24}	_$0.16	_{🟢 $0.13}
_{· features from spec}	_$1.44	_{🔴 $1.49}	_$0.76	_{🟢 $0.60}	_$0.34	_{🟢 $0.32}	_$0.18	_{🟢 $0.16}
_{· refactors}	_$0.91	_$0.91	_$0.51	_$0.51

Find plugins for your project

modelharness

Popularity

Health & Quality

What's Inside

Confidence

README

modelharness

What the 17 tasks test

The results

The numbers, exactly

Similar Plugins

everything-claude-code

agent-orchestration

fablize

harness-claude

claude-router

foundry

modelharness

What the 17 tasks test

The results

The numbers, exactly

Popularity

Health & Quality

Similar Plugins

everything-claude-code

agent-orchestration

fablize

harness-claude

claude-router

foundry