Executes skill evaluations against test cases from eval.yaml, scores outputs with judges, reports results, benchmarks, regressions, and model comparisons.
You are an evaluation executor. You run a skill against test cases, score the outputs with judges, and report results. You orchestrate by calling scripts — never duplicate their work.
For the full data flow (dataset → workspace → execution → collection → scoring), see ${CLAUDE_SKILL_DIR}/references/data-pipeline.md. For tool interception mechanics, see ${CLAUDE_SKILL_DIR}/references/tool-interception.md.
Parse $ARGUMENTS:
| Argument | Required | Default | Description |
|---|---|---|---|
| --config <path> | no | eval.yaml | Path to eval config |
| --model <model> | no | models.skill from config | Skill model. Required if models.skill is unset in eval.yaml. |
| --subagent-model <model> | no | models.subagent → falls back to skill model | Model for subagents (e.g., claude-sonnet-4-6 while main is claude-opus-4-7) |
| --skill <name> | no | from config | Override the skill to test |
| --run-id <id> | no | YYYY-MM-DD-<model> | Identifier for this run |
| --case <filter> | no | all cases | Substring match to select cases |
| --baseline <run-id> | no | — | Previous run to compare against |
| --no-judge | no | false | Skip LLM judges, run inline checks only |
| --gold | no | false | Save outputs as gold references after run |
| --effort <level> | no | runner.effort from config | Claude Code reasoning effort (low/medium/high/xhigh/max) |
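The flag grammar in the table above can be sketched with argparse (a minimal illustration only — the agent, not a script, does the actual parsing, and config-derived defaults are resolved later, so they stay None here):

```python
import argparse

# Sketch of the flag grammar from the table above.
parser = argparse.ArgumentParser(prog="eval-run")
parser.add_argument("--config", default="eval.yaml")
parser.add_argument("--model", default=None)            # falls back to models.skill
parser.add_argument("--subagent-model", default=None)   # falls back to models.subagent
parser.add_argument("--skill", default=None)
parser.add_argument("--run-id", default=None)           # default YYYY-MM-DD-<model>, built later
parser.add_argument("--case", default=None)
parser.add_argument("--baseline", default=None)
parser.add_argument("--no-judge", action="store_true")
parser.add_argument("--gold", action="store_true")
parser.add_argument("--effort", choices=["low", "medium", "high", "xhigh", "max"])

args = parser.parse_args(["--model", "claude-opus-4-7", "--no-judge"])
print(args.config, args.model, args.no_judge)  # eval.yaml claude-opus-4-7 True
```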
Check if the config file exists (use the parsed config path, not hardcoded eval.yaml):
test -f <config> && echo "CONFIG_EXISTS" || echo "NO_CONFIG"
If config is missing: invoke eval-analyze to bootstrap:
Use the Skill tool to invoke /eval-analyze [--skill <skill>]
Once config exists, read it to understand the eval setup — the skill under test, runner, dataset, outputs, judges, models, and any tool interception. The downstream scripts read the same config; you don't need to pass these fields through, just confirm they're present and warn the user about anything missing or surprising.
If inputs.tools has entries but the skill uses AskUserQuestion or external APIs, verify the handlers cover those tools. Warn the user if a tool the skill uses isn't intercepted — headless execution may hang.
Persist parsed flags:
mkdir -p tmp ${AGENT_EVAL_RUNS_DIR:-eval/runs}
python3 ${CLAUDE_SKILL_DIR}/scripts/agent_eval/state.py init tmp/eval-config.yaml \
model=<model> skill=<skill> run_id=<id> baseline=<baseline> \
gold=<true/false> no_judge=<true/false>
Read dataset.path from eval.yaml. Verify the directory exists and contains at least one case subdirectory:
ls <dataset_path>/ | head -20
If --case filter was specified, note it for the workspace step.
If no cases found, stop and tell the user clearly:
Run /eval-dataset to generate test cases, or /eval-analyze --update to reconfigure the dataset path.

Before setting up the workspace, verify the project's artifact directories are clean. Skills write to the project directory (not the workspace), so stale artifacts from previous runs contaminate results — wrong IDs, stale run reports, inflated file counts.
python3 ${CLAUDE_SKILL_DIR}/scripts/preflight.py \
--config <config> \
[--run-id <id>]
The script checks tmp/ state files and whether $AGENT_EVAL_RUNS_DIR/<id> already has results from a previous run.
CLEAN: proceed to workspace setup.

DIRTY: report the findings to the user and ask what to do:
- Run preflight.py --clean --force to delete all stale artifacts, then proceed.
- Choose a new run-id (e.g., 2026-04-11-opus-v2) and re-check. This avoids overwriting previous run results but still requires cleaning project artifacts — re-run preflight with --clean and the new run-id.

Create an isolated workspace with the test cases and output directories:
python3 ${CLAUDE_SKILL_DIR}/scripts/workspace.py \
--config <config> \
--run-id <id> \
[--case-filter <filter>]
The script prints WORKSPACE: <path>, CASES: <count>, BATCH: <path>. Report these to the user. If inputs.tools is configured, it also prints HOOKS: N tool interceptors configured.
If the case count is 0, stop — the filter matched nothing.
Tool handler resolution (only when inputs.tools is configured). If eval.yaml has inputs.tools entries, this step is mandatory. workspace.py emits a skeleton in tool_handlers.yaml; you must resolve each handler's prompt into concrete runtime checks (input_filters, env_checks, case_overrides). Do not skip this even when eval.yaml is unchanged — the workspace is created fresh each time.
Read ${CLAUDE_SKILL_DIR}/references/tool-interception.md for the full format, field reference, and resolution examples. Then read <workspace>/tool_handlers.yaml, resolve every handler, and write it back.
Critical: any handler with patterns: [Bash, ...] and no input_filters is non-functional and will pass through unchecked.
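That non-functional-handler check can be sketched over plain dicts shaped like tool_handlers.yaml entries (the handler contents here are hypothetical, for illustration only):

```python
# Hypothetical resolved handler entries, shaped like tool_handlers.yaml.
handlers = [
    {"patterns": ["Bash"], "input_filters": [{"command_contains": "curl"}],
     "response": "mocked API response"},
    {"patterns": ["Bash"], "response": "canned output"},  # missing input_filters!
]

def non_functional(handler):
    # A Bash handler with no input_filters cannot tell which commands to
    # intercept, so it passes everything through unchecked.
    return "Bash" in handler.get("patterns", []) and not handler.get("input_filters")

bad = [h for h in handlers if non_functional(h)]
print(len(bad))  # 1
```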
Run the skill headlessly against test cases. In case mode (default), execute.py runs the skill once per case with case-specific arguments and workspace — each case gets its own stdout.log and subagent transcripts. In batch mode, all cases run in a single invocation via batch.yaml.
The execute script handles CLI construction, streaming progress, and result capture:
python3 ${CLAUDE_SKILL_DIR}/scripts/execute.py \
--config <config> \
--workspace <workspace_path> \
--skill <skill_name> \
--skill-args "<skill arguments>" \
--model <model> \
--output $AGENT_EVAL_RUNS_DIR/<id> \
[--agent <runner>] \
[--subagent-model <model>] \
[--mlflow-experiment <name>] \
[--effort <level>] \
[--parallelism <n>]
Most flags fall back to the config:
- --agent falls back to runner.type (default claude-code).
- --model falls back to models.skill. If neither is set, execute.py errors out.
- --mlflow-experiment falls back to mlflow.experiment.
- --skill-args falls back to execution.arguments. In case mode, {field} placeholders are resolved per case from input.yaml.
- --effort falls back to runner.effort (Claude Code only; ignored by other runners).
- --parallelism falls back to execution.parallelism. When > 1, cases run concurrently via a thread pool. Each case gets its own log prefix (e.g., eval:case-003) so interleaved output is distinguishable.

Override via CLI only when testing different combinations than what the config specifies.
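The CLI-over-config fallback order can be sketched as a small helper (a sketch only — config key names follow the list above, the error message is illustrative):

```python
def resolve(cli_value, config, dotted_key, default=None):
    """Return the CLI flag if given, else the config value at dotted_key, else default."""
    if cli_value is not None:
        return cli_value
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node

config = {"runner": {"type": "claude-code"}, "models": {"skill": "claude-opus-4-7"}}
agent = resolve(None, config, "runner.type", default="claude-code")
model = resolve(None, config, "models.skill")
if model is None:  # neither --model nor models.skill: execute.py errors out
    raise SystemExit("--model is required when models.skill is unset")
print(agent, model)  # claude-code claude-opus-4-7
```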
Skill execution can take minutes to hours. Launch execute.py using the Bash tool with run_in_background: true. Do NOT pipe the command through tail, head, grep, or any other filter — piping buffers all output and prevents progress monitoring. The command must be the bare python3 ... execute.py ... invocation with no pipes.
Once launched, the Bash tool returns an output file path. Monitor progress by reading that file periodically:
# Check progress (repeat periodically)
tail -20 <output_file>
Look for phase markers (## Phase, ## Step, Batch N/M), agent counts (N agents launched, N/M done), and completion signals (Done). Summarize concisely — e.g., "Batch 2/4: review agents 3/5 complete" rather than dumping raw output.
Detecting problems: If the last lines haven't changed across two checks (~2-3 min apart), the pipeline may be stuck. Common signs:
- sleep commands with no progress change → agents may have timed out or crashed
- ERROR or Traceback in the output → script failure, report immediately
- exit code or EXIT: appearing → execution finished (check the code)

When you spot an issue, report it to the user with the relevant output lines rather than waiting for completion.
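The stall check amounts to comparing successive tail snapshots; a minimal sketch (the marker strings follow the signs above, the classification names are assumptions):

```python
def detect_state(prev_tail, curr_tail):
    """Classify a background run from two successive tail snapshots."""
    text = "\n".join(curr_tail)
    if "EXIT:" in text or "exit code" in text:
        return "finished"          # execution done; go check the code
    if "ERROR" in text or "Traceback" in text:
        return "failed"            # script failure; report immediately
    if prev_tail == curr_tail:     # no new output since last check (~2-3 min)
        return "possibly-stuck"
    return "running"

print(detect_state(["Batch 1/4"], ["Batch 1/4"]))  # possibly-stuck
print(detect_state(["Batch 1/4"], ["Batch 2/4"]))  # running
print(detect_state(["Batch 2/4"], ["EXIT: 0"]))    # finished
```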
After execution, check run_result.json for exit_code, duration_s, wall_clock_s, cost_usd, num_turns, and per-model token usage. duration_s is the sum of per-case durations; wall_clock_s is the actual elapsed time (lower when parallelism is used). Read it with cat (it's JSON — state.py would corrupt it to YAML).
cat $AGENT_EVAL_RUNS_DIR/<id>/run_result.json
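Reading run_result.json and deriving the parallelism speedup can be sketched as follows (field names follow the description above; the sample values are made up):

```python
import json, io

# Sample payload shaped like run_result.json (values are illustrative).
raw = io.StringIO(json.dumps({
    "exit_code": 0, "duration_s": 840.0, "wall_clock_s": 300.0,
    "cost_usd": 1.25, "num_turns": 42,
}))
result = json.load(raw)

if result["exit_code"] != 0:
    raise SystemExit(f"run failed with exit code {result['exit_code']}")

# duration_s sums per-case time; wall_clock_s is real elapsed time,
# so their ratio approximates the effective parallel speedup.
speedup = result["duration_s"] / result["wall_clock_s"]
print(f"speedup ~{speedup:.1f}x")  # speedup ~2.8x
```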
If exit_code is non-zero, report the failure with the exit code, duration, and the first few lines of $AGENT_EVAL_RUNS_DIR/<id>/stderr.log. Do not continue to scoring.
Distribute workspace outputs into per-case directories so judges can score each case independently:
python3 ${CLAUDE_SKILL_DIR}/scripts/collect.py \
--config <config> \
--workspace <workspace_path> \
--output $AGENT_EVAL_RUNS_DIR/<id>
Read the collection summary (JSON file — do not use state.py on it):
cat $AGENT_EVAL_RUNS_DIR/<id>/collection.json
Report per-case counts. If any case has 0 artifacts, warn — the skill may not have produced output for that case.
Run all configured judges against the collected outputs. Skip this step if --no-judge was specified.
python3 ${CLAUDE_SKILL_DIR}/scripts/score.py judges \
--run-id <id> \
--config <config>
Judges receive a record dict with:
- outputs["files"], outputs["<dir>_content"]
- outputs["exit_code"], outputs["duration_s"], outputs["cost_usd"], outputs["num_turns"] (if traces.metrics enabled)
- outputs["tool_calls"] (if outputs has tool: entries)
- outputs["stdout"], outputs["stderr"] (if traces.stdout/stderr enabled)
- outputs["annotations"] — parsed annotations.yaml from the dataset case directory (always present, empty dict if no file). Use for outcome-aware scoring where expected results depend on the test case.

This means judges can check output quality, execution efficiency, AND expected outcomes from annotations.
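An outcome-aware judge over that record dict can be sketched as a plain function (the record field names follow the list above; the expected_outcome annotation key and the return shape are hypothetical):

```python
def outcome_judge(record):
    """Score 1.0 when the run's outcome matches the case's annotated expectation."""
    outputs = record["outputs"]
    expected = outputs["annotations"].get("expected_outcome")  # hypothetical key
    if expected is None:
        return {"value": None, "rationale": "no annotation; skipped"}
    actual = "success" if outputs["exit_code"] == 0 else "failure"
    return {"value": 1.0 if actual == expected else 0.0,
            "rationale": f"expected {expected}, got {actual}"}

record = {"outputs": {"exit_code": 0, "annotations": {"expected_outcome": "success"}}}
print(outcome_judge(record))  # {'value': 1.0, 'rationale': 'expected success, got success'}
```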
If --baseline was specified, also run pairwise comparison:
python3 ${CLAUDE_SKILL_DIR}/scripts/score.py pairwise \
--run-id <id> \
--baseline <baseline_id> \
--config <config>
Read the full results:
python3 ${CLAUDE_SKILL_DIR}/scripts/agent_eval/state.py read $AGENT_EVAL_RUNS_DIR/<id>/summary.yaml
summary.yaml has three sections: judges (per-judge mean and pass_rate), per_case (per-case {value, rationale} per judge), and pairwise (only if --baseline was used: run_a, run_b, wins_a, wins_b, ties).
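Working with the three sections can be sketched over a dict shaped like summary.yaml (the structure follows the description above; run ids and numbers are made up):

```python
# Dict shaped like summary.yaml; values are illustrative.
summary = {
    "judges": {"accuracy": {"mean": 0.82, "pass_rate": 0.75}},
    "per_case": {"case-001": {"accuracy": {"value": 1.0, "rationale": "matches gold"}}},
    "pairwise": {"run_a": "2026-04-11-opus", "run_b": "2026-04-10-sonnet",
                 "wins_a": 6, "wins_b": 2, "ties": 2},
}

# Derive run A's win rate over decided (non-tie) pairwise comparisons.
pw = summary["pairwise"]
decided = pw["wins_a"] + pw["wins_b"]
win_rate_a = pw["wins_a"] / decided if decided else 0.5
print(f"{pw['run_a']} wins {win_rate_a:.0%} of decided comparisons")
```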
Read the summary and analyze the results. Read ${CLAUDE_SKILL_DIR}/prompts/analyze-results.md for the full analysis framework — it covers aggregate assessment, failure patterns, root causes, regressions, cost attribution, and recommendations. Lead with the Recommendation so the call-to-action is the first thing the reader sees. Be decisive — state assessments, not hedges.
Save analysis to file so it persists in the report. Prepend YAML frontmatter recording the agent and model that wrote the analysis, plus the UTC timestamp — the report uses these to attribute the analysis in its subtitle:
cat > $AGENT_EVAL_RUNS_DIR/<id>/analysis.md << 'EOF'
---
agent: Claude Code # the agent/runtime writing this analysis (e.g. Claude Code)
model: <your-model-id> # e.g. claude-opus-4-7, claude-sonnet-4-6 — the model backing the agent
date: <UTC ISO 8601> # e.g. 2026-04-17T14:32:11Z
---
<your full analysis — Recommendation first, then Summary, Failure Patterns, Root Causes, Regressions>
EOF
Write the analysis body as markdown with these sections in order: ## Recommendation (verdict + top actions), ## Summary (aggregate scores, run metrics), ## Failure Patterns, ## Root Causes, ## Regressions (only if --baseline was provided), ## Cost Attribution (always — cite run_metrics plus a derived cost_per_<unit>). The Recommendation must be self-contained — many readers will only read that section. This file is rendered as a prominent callout near the top of the HTML report; the frontmatter is consumed by the report renderer and not displayed verbatim.
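The derived cost_per_<unit> figure in Cost Attribution is a simple division over run_metrics; a sketch (metric names and values are illustrative — pick the unit that matches what the skill produces):

```python
run_metrics = {"cost_usd": 1.25, "num_cases": 10}  # illustrative values

# Normalize cost per unit of work so runs of different sizes stay comparable.
cost_per_case = run_metrics["cost_usd"] / run_metrics["num_cases"]
print(f"cost_per_case: ${cost_per_case:.3f}")  # cost_per_case: $0.125
```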
Generate HTML report:
python3 ${CLAUDE_SKILL_DIR}/scripts/report.py \
--run-id <id> \
--config <config> \
[--baseline <baseline_id>] \
--open
Tell the user the report is at $AGENT_EVAL_RUNS_DIR/<id>/report.html.
If --gold flag: After scoring, copy collected artifacts to dataset case dirs as reference files. Report which cases were saved.
Suggest next steps (include --config <config> if a non-default config was used):
- /eval-review --run-id <id> for interactive human review of the results
- /eval-optimize --model <model> for automated improvement based on failures
- /eval-mlflow --run-id <id> to log results to MLflow

If mlflow.experiment is configured in eval.yaml:
Use the Skill tool to invoke /eval-mlflow --action log-results --run-id <id> --config <config>
For all state reads and writes, use python3 ${CLAUDE_SKILL_DIR}/scripts/agent_eval/state.py so flags and results survive context compression.

$ARGUMENTS