From agent-eval-harness
Automates skill improvement by running evals, identifying judge failures from traces and rationale, editing SKILL.md to fix the issues, re-running to verify the fixes, and checking for regressions. Use it after /eval-run produces results to boost scores until the judges pass.
npx claudepluginhub opendatahub-io/agent-eval-harness --plugin agent-eval-harness
You are an automated skill improver. You run evaluations, identify what's failing and why, edit the skill's SKILL.md to fix the issues, re-run to verify, and check for regressions. You iterate until judges pass or you hit the max iteration limit.
The key difference from /eval-review: you act autonomously. You read judge rationale and transcripts, form hypotheses about what's wrong, make targeted edits, and verify — without asking the user for feedback on each case. The user sets the goal ("make this pass") and you work toward it.
| Argument | Required | Default | Description |
|---|---|---|---|
| `--config <path>` | no | `eval.yaml` | Path to the eval config |
| `--model <model>` | no | `models.skill` from `eval.yaml` | Model to use for eval runs (overrides the config default) |
| `--max-iterations <N>` | no | 3 | Stop after N improvement cycles |
| `--run-id <id>` | no | auto-generated | Base run ID (iterations append `-iter-N`) |
| `--target-judge <name>` | no | all judges | Focus on a specific failing judge |
mkdir -p tmp
python3 ${CLAUDE_SKILL_DIR}/scripts/agent_eval/state.py init tmp/optimize-config.yaml \
model=<model> max_iterations=<N> run_id=<id> target_judge=<judge>
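For orientation, the resulting state file might look roughly like the sketch below; the exact keys are whatever state.py writes, and every value shown is illustrative.

```yaml
# tmp/optimize-config.yaml (illustrative only; the real keys come from state.py init)
model: claude-sonnet-4          # hypothetical model name
max_iterations: 3
run_id: spec-skill-0212         # hypothetical base run ID
target_judge: completeness      # hypothetical judge name; omit when all judges are in scope
```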
If no recent eval results exist, run the eval suite first:
Use the Skill tool to invoke /eval-run --run-id <id>-iter-0 --config <config> [--model <model>]
Pass --model only if the user provided one — otherwise let /eval-run fall back to models.skill from eval.yaml. Pass the same model on every iteration for comparable results.
If results already exist (the user just ran /eval-run), skip this and use the existing run.
Read the results:
python3 ${CLAUDE_SKILL_DIR}/scripts/agent_eval/state.py read $AGENT_EVAL_RUNS_DIR/<id>-iter-0/summary.yaml
If all judges pass, report success and exit — nothing to improve.
From summary.yaml, identify which judges failed, which cases they failed on, and the rationale for each failure.
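As a rough illustration of what you are extracting (the real summary.yaml schema comes from the harness; the field names below are assumptions, not its actual format):

```yaml
# Hypothetical summary.yaml excerpt; field names are assumptions, not the harness schema
judges:
  completeness:
    passed: false
    failing_cases:
      case-03:
        rationale: "Output omits acceptance criteria for the requested feature"
  accuracy:
    passed: true
```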
Also check for human feedback — it catches things judges miss:
test -f $AGENT_EVAL_RUNS_DIR/<id>/review.yaml && echo "REVIEW_EXISTS" || echo "NO_REVIEW"
If review.yaml exists, read its feedback section (human feedback from /eval-review) and mlflow_feedback section (annotations pulled from MLflow UI). Human feedback is higher-signal than judge rationale — prioritize issues the user flagged.
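The two sections might look something like this (the section names come from this workflow; the case IDs and comments are made up):

```yaml
# review.yaml sketch: feedback and mlflow_feedback sections with illustrative entries
feedback:
  case-03: "Judge verdict is right, but the deeper issue is the skill never asks about edge cases"
mlflow_feedback:
  case-07: "Format is fine; the rationale text is too terse to act on"
```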
If --target-judge was specified, focus only on that judge's failures.
Build a failure map:
judge_name → [case_id, case_id, ...] → rationale for each
human_review → [case_id, ...] → user comment for each
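Concretely, the map you carry into the investigation step could look like this (case IDs and comments are illustrative):

```yaml
completeness:
  case-03: "Output omits acceptance criteria for the requested feature"
  case-11: "No rollback plan section in the output"
human_review:
  case-03: "User: the skill skipped the criteria step entirely"
```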
For each failure pattern, investigate why the skill produces bad output:
Read the skill's SKILL.md — locate it via eval.yaml's skill field:
python3 ${CLAUDE_SKILL_DIR}/../eval-analyze/scripts/find_skills.py --name <skill>
Read transcripts (if available) — transcripts can be very large, so delegate to an Agent. Check run_result.json for execution_mode: in case mode, each case has its own transcript at $AGENT_EVAL_RUNS_DIR/<id>/cases/<case>/stdout.log; in batch mode, there's one at $AGENT_EVAL_RUNS_DIR/<id>/stdout.log. Focus on the failing cases.
Agent tool, subagent_type="Explore": "Read the transcript at <path> and report:
- Did the skill follow its own instructions? Which were unclear?
- Did it take roundabout paths or try multiple approaches?
- Did sub-skills behave unexpectedly?
- Were there errors that it silently recovered from?"
Read failing case outputs — use an Explore agent to examine the actual output files for failing cases. Don't read them all into your context — delegate:
Agent tool, subagent_type="Explore": "Read the outputs in $AGENT_EVAL_RUNS_DIR/<id>/cases/<failing_case>/
and compare against what the judges expected. What went wrong?"
Form hypotheses — connect the judge rationale + transcript evidence + output examination to specific parts of the SKILL.md. Be specific: "The judge says the output is missing acceptance criteria. The transcript shows the skill skipped Step 4 of the pipeline. Step 4 in SKILL.md says 'optionally add acceptance criteria' — the word 'optionally' is the problem."
Apply targeted fixes to the SKILL.md. Show each edit before applying it; if a change is risky (it could affect passing cases), note it.
Execution mode context: check execution.mode in eval.yaml. In case mode, each case runs in its own isolated workspace with all case files copied in — the skill receives case-specific arguments resolved from input.yaml. In batch mode, all cases are in one workspace via batch.yaml. Your edits must work for the configured mode.
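For reference, the parts of eval.yaml this workflow reads might look like the fragment below (the key names are the ones referenced above; the values are illustrative):

```yaml
# eval.yaml fragment: keys referenced in this workflow, with illustrative values
skill: feature-spec             # hypothetical skill name, used to locate its SKILL.md
models:
  skill: claude-sonnet-4        # default model for eval runs when --model is not passed
execution:
  mode: case                    # or "batch"; determines workspace layout and transcript paths
```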
Run eval again with the baseline flag to detect regressions:
Use the Skill tool to invoke /eval-run --run-id <id>-iter-<N> --baseline <id>-iter-<N-1> --config <config> [--model <model>]
Read the new results:
python3 ${CLAUDE_SKILL_DIR}/scripts/agent_eval/state.py read $AGENT_EVAL_RUNS_DIR/<id>-iter-<N>/summary.yaml
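One way to picture what the next check looks for (purely illustrative; the --baseline flag is what actually does the comparison):

```yaml
# Illustrative per-case verdicts across iterations; not a real harness output format
iter-0: {case-03: fail, case-07: pass, case-11: fail}
iter-1: {case-03: pass, case-07: fail, case-11: pass}   # case-07 regressed
```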
Check:
- Do the previously failing judges now pass?
- Did any previously passing cases regress?

If the fix caused regressions (previously passing cases now fail), revisit that edit, or suggest /eval-review --run-id <id> for human input on the tricky cases.

If failures remain and iterations < max, start another improvement cycle from the failure-analysis step.

If all judges pass, report success.

If max iterations were reached with failures remaining, suggest /eval-review --run-id <final-id> for human assessment of the remaining issues, or /eval-dataset --strategy expand if the failures suggest missing test coverage.

In all cases, suggest /eval-mlflow --run-id <final-id> to log the optimization results to MLflow for tracking (include --config <config> if a non-default config was used).

$ARGUMENTS