cap-evolve

Optimize AI-agent capabilities (skills, tools/MCP, prompts, system prompts) against your eval, honestly. Skills-native and host-agnostic, with honest train/val/test discipline (sealed test, val-only gate) enforced by core-owned hooks. Exposes every phase/algorithm/capability/optimizer/orchestrate skill as a /cap-evolve:<skill> command, ships read-only diagnoser + writing proposer subagents, and a session-start router (using-cap-evolve) that auto-triggers the pipeline.

cap-evolve

watch capability evolve

cap-evolve is a skills-based, host-agnostic harness that optimizes any agent capability — a system prompt, its tools/MCP, or a whole skill package — against your eval, with honesty enforced in code and every iteration git-versioned.

You wire a tiny adapter once (or let a coding agent write it for you). cap-evolve runs the loop: evaluate → diagnose failures → propose an edit → keep it only if it beats a held-out split by a significant margin → commit → report a single honest number. It optimizes what your agent reads, not its weights.

Contents: Why · Install · Toy example · tau2-bench example · Optimize your own · How it works · Comparison · Skill library · Results · How-to guides · License

Why cap-evolve

Optimizes prompts, tools/MCP, and skill packages — not just prompts. Pick one or several ([system-prompt, tools, mcp-tool, skill-package]) and optimize them jointly.
Code-bearing optimization, not just reword. For tools, the optimizer can edit tool code — add validation/loop/composite tools that enforce a rule or perform a stalled action in code (the fix for a behavioral failure prose can't reach) — and safely swap, never bare-remove, a primitive.
Onboard any benchmark/agent from a single prompt. Paste one intake brief to your coding agent; intake does the full integration — installs the benchmark, wires the adapter (incl. trajectories + a batched fast path), authors a capability-scoped optimizer prompt, and passes the cap-evolve check gate — before any budget is spent. No pre-integration.
Per-task causal feedback. Each iteration the optimizer sees which task ids a prior edit BROKE and FIXED plus the currently-passing set to protect — so it makes large, multi-cluster edits that don't regress the wins.
Honesty enforced in code, not docs. The sealed test split is scored exactly once and a paired, val-only significance gate (Δ > k·SE) decides every acceptance — both live in the cap_evolve core, the only place rewards are aggregated.
Host- and agent-agnostic. The optimizer is any coding-agent CLI (claude-code, codex, gemini, opencode, ibm-bob, …) resolved by one registry row. No framework lock-in.
Git-versioned iterations + optimizer memory — every candidate is a commit; rejected approaches are remembered and never re-proposed.
Per-iteration optimizer $ budget, enforced by the optimizer CLI itself (e.g. claude --max-budget-usd), plus hard total caps and a dry-run estimate.
Live dashboard — per-iteration optimizer & runner cost + time, intake cost, lineage tree, per-iteration diffs, and a tasks × iterations pass/fail heatmap.
Skills-native & trivially extensible — a new capability, algorithm, or optimizer is one folder or one registry row.
Zero runtime dependencies — the core is pure Python stdlib.

Install

Requires Python 3.10+ and git.

git clone <repo> cap-evolve && cd cap-evolve
python3 -m venv .venv && source .venv/bin/activate   # recommended (isolated env)
pip install ./core                  # the honest-eval core (package: cap-evolve-core, CLI: cap-evolve)
pip install ./dashboard/backend     # optional: the live dashboard UI (cap-evolve run --dashboard auto)
./install.sh                        # optional: copy skills into your agent host's skills dir
cap-evolve version                  # verify the install

If your default pip index requires auth, append --index-url https://pypi.org/simple to the pip install lines (cap-evolve-core itself has zero runtime deps).

Optimizing a real agent additionally needs: a coding-agent CLI to act as the optimizer (e.g. claude, codex, gemini) with its credentials, and your runner's model credentials — all in a repo-root .env (e.g. ANTHROPIC_API_KEY, OPENAI_API_KEY, RITS_API_KEY, WATSONX_*). The toy example below needs none of this.

Toy example (zero-API)

cap-evolve

watch capability evolve

Contents: Why · Install · Toy example · tau2-bench example · Optimize your own · How it works · Comparison · Skill library · Results · How-to guides · License

Why cap-evolve

Optimizes prompts, tools/MCP, and skill packages — not just prompts. Pick one or several ([system-prompt, tools, mcp-tool, skill-package]) and optimize them jointly.
Code-bearing optimization, not just reword. For tools, the optimizer can edit tool code — add validation/loop/composite tools that enforce a rule or perform a stalled action in code (the fix for a behavioral failure prose can't reach) — and safely swap, never bare-remove, a primitive.
Onboard any benchmark/agent from a single prompt. Paste one intake brief to your coding agent; intake does the full integration — installs the benchmark, wires the adapter (incl. trajectories + a batched fast path), authors a capability-scoped optimizer prompt, and passes the cap-evolve check gate — before any budget is spent. No pre-integration.
Per-task causal feedback. Each iteration the optimizer sees which task ids a prior edit BROKE and FIXED plus the currently-passing set to protect — so it makes large, multi-cluster edits that don't regress the wins.
Honesty enforced in code, not docs. The sealed test split is scored exactly once and a paired, val-only significance gate (Δ > k·SE) decides every acceptance — both live in the cap_evolve core, the only place rewards are aggregated.
Host- and agent-agnostic. The optimizer is any coding-agent CLI (claude-code, codex, gemini, opencode, ibm-bob, …) resolved by one registry row. No framework lock-in.
Git-versioned iterations + optimizer memory — every candidate is a commit; rejected approaches are remembered and never re-proposed.
Per-iteration optimizer $ budget, enforced by the optimizer CLI itself (e.g. claude --max-budget-usd), plus hard total caps and a dry-run estimate.
Live dashboard — per-iteration optimizer & runner cost + time, intake cost, lineage tree, per-iteration diffs, and a tasks × iterations pass/fail heatmap.
Skills-native & trivially extensible — a new capability, algorithm, or optimizer is one folder or one registry row.
Zero runtime dependencies — the core is pure Python stdlib.

Install

Requires Python 3.10+ and git.

git clone <repo> cap-evolve && cd cap-evolve
python3 -m venv .venv && source .venv/bin/activate   # recommended (isolated env)
pip install ./core                  # the honest-eval core (package: cap-evolve-core, CLI: cap-evolve)
pip install ./dashboard/backend     # optional: the live dashboard UI (cap-evolve run --dashboard auto)
./install.sh                        # optional: copy skills into your agent host's skills dir
cap-evolve version                  # verify the install

If your default pip index requires auth, append --index-url https://pypi.org/simple to the pip install lines (cap-evolve-core itself has zero runtime deps).

cap-evolve

Popularity

Confidence

What's Inside

README

cap-evolve

Why cap-evolve

Install

Toy example (zero-API)

Similar Plugins

nature-skills

ecc

pr-review-toolkit

superpowers

fullstack-dev-skills

claude-md-management

cap-evolve

Why cap-evolve

Install

Toy example (zero-API)

Popularity

Health & Quality

Similar Plugins

nature-skills

ecc

pr-review-toolkit

superpowers

fullstack-dev-skills

claude-md-management