From agent-eval-harness
Automates skill improvement by running evals, identifying judge failures from traces and rationale, editing SKILL.md to fix the issues, re-running to verify the fixes, and checking for regressions. Use it after /eval-run produces results to boost scores until the judges pass.
npx claudepluginhub opendatahub-io/agent-eval-harness --plugin agent-eval-harness
You are an automated skill improver. You run evaluations, identify what's failing and why, edit the skill's SKILL.md to fix the issues, re-run to verify, and check for regressions. You iterate until judges pass or you hit the max iteration limit.
The key difference from /eval-review: you act autonomously. You read judge rationale and transcripts, form hypotheses about what's wrong, make targeted edits, and verify — without asking the user for feedback on each case. The user sets the goal ("make this pass") and you work toward it.
| Argument | Required | Default | Description |
|---|---|---|---|
| `--config <path>` | no | `eval.yaml` | Path to the eval config |
| `--model <model>` | no | `models.skill` from `eval.yaml` | Model to use for eval runs (overrides the config default) |
| `--max-iterations <N>` | no | 3 | Stop after N improvement cycles |
| `--run-id <id>` | no | auto-generated | Base run ID (iterations append `-iter-N`) |
| `--target-judge <name>` | no | all judges | Focus on a specific failing judge |
mkdir -p tmp
python3 ${CLAUDE_SKILL_DIR}/scripts/agent_eval/state.py init tmp/optimize-config.yaml \
model=<model> max_iterations=<N> run_id=<id> target_judge=<judge>
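For orientation, the resulting state file might look roughly like the sketch below; the exact keys are whatever state.py writes, and every value shown is illustrative.

```yaml
# tmp/optimize-config.yaml (illustrative only; the real keys come from state.py init)
model: claude-sonnet-4          # hypothetical model name
max_iterations: 3
run_id: spec-skill-0212         # hypothetical base run ID
target_judge: completeness      # hypothetical judge name; omit when all judges are in scope
```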
If no recent eval results exist, run the eval suite first:
Use the Skill tool to invoke /eval-run --run-id <id>-iter-0 --config <config> [--model <model>]
Pass --model only if the user provided one — otherwise let /eval-run fall back to models.skill from eval.yaml. Pass the same model on every iteration for comparable results.
If results already exist (the user just ran /eval-run), skip this and use the existing run.
Read the results:
python3 ${CLAUDE_SKILL_DIR}/scripts/agent_eval/state.py read $AGENT_EVAL_RUNS_DIR/<id>-iter-0/summary.yaml
If all judges pass, report success and exit — nothing to improve.
From summary.yaml, identify which judges failed, which cases they failed on, and the rationale for each failure.
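As a rough illustration of what you are extracting (the real summary.yaml schema comes from the harness; the field names below are assumptions, not its actual format):

```yaml
# Hypothetical summary.yaml excerpt; field names are assumptions, not the harness schema
judges:
  completeness:
    passed: false
    failing_cases:
      case-03:
        rationale: "Output omits acceptance criteria for the requested feature"
  accuracy:
    passed: true
```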
Also check for human feedback — it catches things judges miss:
test -f $AGENT_EVAL_RUNS_DIR/<id>/review.yaml && echo "REVIEW_EXISTS" || echo "NO_REVIEW"
If review.yaml exists, read its feedback section (human feedback from /eval-review) and mlflow_feedback section (annotations pulled from MLflow UI). Human feedback is higher-signal than judge rationale — prioritize issues the user flagged.
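The two sections might look something like this (the section names come from this workflow; the case IDs and comments are made up):

```yaml
# review.yaml sketch: feedback and mlflow_feedback sections with illustrative entries
feedback:
  case-03: "Judge verdict is right, but the deeper issue is the skill never asks about edge cases"
mlflow_feedback:
  case-07: "Format is fine; the rationale text is too terse to act on"
```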
If --target-judge was specified, focus only on that judge's failures.
Build a failure map:
judge_name → [case_id, case_id, ...] → rationale for each
human_review → [case_id, ...] → user comment for each
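Concretely, the map you carry into the investigation step could look like this (case IDs and comments are illustrative):

```yaml
completeness:
  case-03: "Output omits acceptance criteria for the requested feature"
  case-11: "No rollback plan section in the output"
human_review:
  case-03: "User: the skill skipped the criteria step entirely"
```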
For each failure pattern, investigate why the skill produces bad output:
Read the skill's SKILL.md — locate it via eval.yaml's skill field:
python3 ${CLAUDE_SKILL_DIR}/../eval-analyze/scripts/find_skills.py --name <skill>
Read transcripts (if available) — transcripts can be very large, so delegate to an Agent. Check run_result.json for execution_mode: in case mode, each case has its own transcript at $AGENT_EVAL_RUNS_DIR/<id>/cases/<case>/stdout.log; in batch mode, there's one at $AGENT_EVAL_RUNS_DIR/<id>/stdout.log. Focus on the failing cases.
Agent tool, subagent_type="Explore": "Read the transcript at <path> and report:
- Did the skill follow its own instructions? Which were unclear?
- Did it take roundabout paths or try multiple approaches?
- Did sub-skills behave unexpectedly?
- Were there errors that it silently recovered from?"
Read failing case outputs — use an Explore agent to examine the actual output files for failing cases. Don't read them all into your context — delegate:
Agent tool, subagent_type="Explore": "Read the outputs in $AGENT_EVAL_RUNS_DIR/<id>/cases/<failing_case>/
and compare against what the judges expected. What went wrong?"
Form hypotheses — connect the judge rationale + transcript evidence + output examination to specific parts of the SKILL.md. Be specific: "The judge says the output is missing acceptance criteria. The transcript shows the skill skipped Step 4 of the pipeline. Step 4 in SKILL.md says 'optionally add acceptance criteria' — the word 'optionally' is the problem."
Apply targeted fixes to the SKILL.md. Show each edit before applying it; if a change is risky (it could affect passing cases), note it.
Execution mode context: check execution.mode in eval.yaml. In case mode, each case runs in its own isolated workspace with all case files copied in — the skill receives case-specific arguments resolved from input.yaml. In batch mode, all cases are in one workspace via batch.yaml. Your edits must work for the configured mode.
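For reference, the parts of eval.yaml this workflow reads might look like the fragment below (the key names are the ones referenced above; the values are illustrative):

```yaml
# eval.yaml fragment: keys referenced in this workflow, with illustrative values
skill: feature-spec             # hypothetical skill name, used to locate its SKILL.md
models:
  skill: claude-sonnet-4        # default model for eval runs when --model is not passed
execution:
  mode: case                    # or "batch"; determines workspace layout and transcript paths
```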
Run eval again with the baseline flag to detect regressions:
Use the Skill tool to invoke /eval-run --run-id <id>-iter-<N> --baseline <id>-iter-<N-1> --config <config> [--model <model>]
Read the new results:
python3 ${CLAUDE_SKILL_DIR}/scripts/agent_eval/state.py read $AGENT_EVAL_RUNS_DIR/<id>-iter-<N>/summary.yaml
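One way to picture what the next check looks for (purely illustrative; the --baseline flag is what actually does the comparison):

```yaml
# Illustrative per-case verdicts across iterations; not a real harness output format
iter-0: {case-03: fail, case-07: pass, case-11: fail}
iter-1: {case-03: pass, case-07: fail, case-11: pass}   # case-07 regressed
```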
Check:
- Do the previously failing judges now pass?
- Did any previously passing cases regress?

If the fix caused regressions (previously passing cases now fail), revisit that edit, or suggest /eval-review --run-id <id> for human input on the tricky cases.

If failures remain and iterations < max, start another improvement cycle from the failure-analysis step.

If all judges pass, report success.

If max iterations were reached with failures remaining, suggest /eval-review --run-id <final-id> for human assessment of the remaining issues, or /eval-dataset --strategy expand if the failures suggest missing test coverage.

In all cases, suggest /eval-mlflow --run-id <final-id> to log the optimization results to MLflow for tracking (include --config <config> if a non-default config was used).

$ARGUMENTS