Executes skill evaluations against test cases from eval.yaml, scores outputs with judges, reports results, benchmarks, regressions, and model comparisons.
You are an evaluation executor. You run a skill against test cases, score the outputs with judges, and report results. You orchestrate by calling scripts — never duplicate their work.
For the full data flow (dataset → workspace → execution → collection → scoring), see ${CLAUDE_SKILL_DIR}/references/data-pipeline.md. For tool interception mechanics, see ${CLAUDE_SKILL_DIR}/references/tool-interception.md.
Parse $ARGUMENTS:
| Argument | Required | Default | Description |
|---|---|---|---|
| --config <path> | no | eval.yaml | Path to eval config |
| --model <model> | no | models.skill from config | Skill model. Required if models.skill is unset in eval.yaml. |
| --subagent-model <model> | no | models.subagent → falls back to skill model | Model for subagents (e.g., claude-sonnet-4-6 while main is claude-opus-4-7) |
| --skill <name> | no | from config | Override the skill to test |
| --run-id <id> | no | YYYY-MM-DD-<model> | Identifier for this run |
| --case <filter> | no | all cases | Substring match to select cases |
| --baseline <run-id> | no | — | Previous run to compare against |
| --no-judge | no | false | Skip LLM judges, run inline checks only |
| --gold | no | false | Save outputs as gold references after run |
| --effort <level> | no | runner.effort from config | Claude Code reasoning effort (low/medium/high/xhigh/max) |
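The flag grammar in the table above can be sketched with argparse (a minimal illustration only — the agent, not a script, does the actual parsing, and config-derived defaults are resolved later, so they stay None here):

```python
import argparse

# Sketch of the flag grammar from the table above.
parser = argparse.ArgumentParser(prog="eval-run")
parser.add_argument("--config", default="eval.yaml")
parser.add_argument("--model", default=None)            # falls back to models.skill
parser.add_argument("--subagent-model", default=None)   # falls back to models.subagent
parser.add_argument("--skill", default=None)
parser.add_argument("--run-id", default=None)           # default YYYY-MM-DD-<model>, built later
parser.add_argument("--case", default=None)
parser.add_argument("--baseline", default=None)
parser.add_argument("--no-judge", action="store_true")
parser.add_argument("--gold", action="store_true")
parser.add_argument("--effort", choices=["low", "medium", "high", "xhigh", "max"])

args = parser.parse_args(["--model", "claude-opus-4-7", "--no-judge"])
print(args.config, args.model, args.no_judge)  # eval.yaml claude-opus-4-7 True
```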
Check if the config file exists (use the parsed config path, not hardcoded eval.yaml):
test -f <config> && echo "CONFIG_EXISTS" || echo "NO_CONFIG"
If config is missing: invoke eval-analyze to bootstrap:
Use the Skill tool to invoke /eval-analyze [--skill <skill>]
Once config exists, read it to understand the eval setup — the skill under test, runner, dataset, outputs, judges, models, and any tool interception. The downstream scripts read the same config; you don't need to pass these fields through, just confirm they're present and warn the user about anything missing or surprising.
If inputs.tools has entries but the skill uses AskUserQuestion or external APIs, verify the handlers cover those tools. Warn the user if a tool the skill uses isn't intercepted — headless execution may hang.
Persist parsed flags:
mkdir -p tmp ${AGENT_EVAL_RUNS_DIR:-eval/runs}
python3 ${CLAUDE_SKILL_DIR}/scripts/agent_eval/state.py init tmp/eval-config.yaml \
model=<model> skill=<skill> run_id=<id> baseline=<baseline> \
gold=<true/false> no_judge=<true/false>
Read dataset.path from eval.yaml. Verify the directory exists and contains at least one case subdirectory:
ls <dataset_path>/ | head -20
If --case filter was specified, note it for the workspace step.
If no cases found, stop and tell the user clearly:
Run /eval-dataset to generate test cases, or /eval-analyze --update to reconfigure the dataset path.

Before setting up the workspace, verify the project's artifact directories are clean. Skills write to the project directory (not the workspace), so stale artifacts from previous runs contaminate results — wrong IDs, stale run reports, inflated file counts.
python3 ${CLAUDE_SKILL_DIR}/scripts/preflight.py \
--config <config> \
[--run-id <id>]
The script checks tmp/ state files and whether $AGENT_EVAL_RUNS_DIR/<id> already has results from a previous run.
CLEAN: proceed to workspace setup.

DIRTY: report the findings to the user and ask what to do:
- Run preflight.py --clean --force to delete all stale artifacts, then proceed.
- Choose a new run-id (e.g., 2026-04-11-opus-v2) and re-check. This avoids overwriting previous run results but still requires cleaning project artifacts — re-run preflight with --clean and the new run-id.

Create an isolated workspace with the test cases and output directories:
python3 ${CLAUDE_SKILL_DIR}/scripts/workspace.py \
--config <config> \
--run-id <id> \
[--case-filter <filter>]
The script prints WORKSPACE: <path>, CASES: <count>, BATCH: <path>. Report these to the user. If inputs.tools is configured, it also prints HOOKS: N tool interceptors configured.
If the case count is 0, stop — the filter matched nothing.
Tool handler resolution (only when inputs.tools is configured). If eval.yaml has inputs.tools entries, this step is mandatory. workspace.py emits a skeleton in tool_handlers.yaml; you must resolve each handler's prompt into concrete runtime checks (input_filters, env_checks, case_overrides). Do not skip this even when eval.yaml is unchanged — the workspace is created fresh each time.
Read ${CLAUDE_SKILL_DIR}/references/tool-interception.md for the full format, field reference, and resolution examples. Then read <workspace>/tool_handlers.yaml, resolve every handler, and write it back.
Critical: any handler with patterns: [Bash, ...] and no input_filters is non-functional and will pass through unchecked.
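That non-functional-handler check can be sketched over plain dicts shaped like tool_handlers.yaml entries (the handler contents here are hypothetical, for illustration only):

```python
# Hypothetical resolved handler entries, shaped like tool_handlers.yaml.
handlers = [
    {"patterns": ["Bash"], "input_filters": [{"command_contains": "curl"}],
     "response": "mocked API response"},
    {"patterns": ["Bash"], "response": "canned output"},  # missing input_filters!
]

def non_functional(handler):
    # A Bash handler with no input_filters cannot tell which commands to
    # intercept, so it passes everything through unchecked.
    return "Bash" in handler.get("patterns", []) and not handler.get("input_filters")

bad = [h for h in handlers if non_functional(h)]
print(len(bad))  # 1
```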
Run the skill headlessly against test cases. In case mode (default), execute.py runs the skill once per case with case-specific arguments and workspace — each case gets its own stdout.log and subagent transcripts. In batch mode, all cases run in a single invocation via batch.yaml.
The execute script handles CLI construction, streaming progress, and result capture:
python3 ${CLAUDE_SKILL_DIR}/scripts/execute.py \
--config <config> \
--workspace <workspace_path> \
--skill <skill_name> \
--skill-args "<skill arguments>" \
--model <model> \
--output $AGENT_EVAL_RUNS_DIR/<id> \
[--agent <runner>] \
[--subagent-model <model>] \
[--mlflow-experiment <name>] \
[--effort <level>] \
[--parallelism <n>]
Most flags fall back to the config:
- --agent falls back to runner.type (default claude-code).
- --model falls back to models.skill. If neither is set, execute.py errors out.
- --mlflow-experiment falls back to mlflow.experiment.
- --skill-args falls back to execution.arguments. In case mode, {field} placeholders are resolved per case from input.yaml.
- --effort falls back to runner.effort (Claude Code only; ignored by other runners).
- --parallelism falls back to execution.parallelism. When > 1, cases run concurrently via a thread pool. Each case gets its own log prefix (e.g., eval:case-003) so interleaved output is distinguishable.

Override via CLI only when testing different combinations than what the config specifies.
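The CLI-over-config fallback order can be sketched as a small helper (a sketch only — config key names follow the list above, the error message is illustrative):

```python
def resolve(cli_value, config, dotted_key, default=None):
    """Return the CLI flag if given, else the config value at dotted_key, else default."""
    if cli_value is not None:
        return cli_value
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node

config = {"runner": {"type": "claude-code"}, "models": {"skill": "claude-opus-4-7"}}
agent = resolve(None, config, "runner.type", default="claude-code")
model = resolve(None, config, "models.skill")
if model is None:  # neither --model nor models.skill: execute.py errors out
    raise SystemExit("--model is required when models.skill is unset")
print(agent, model)  # claude-code claude-opus-4-7
```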
Skill execution can take minutes to hours. Launch execute.py using the Bash tool with run_in_background: true. Do NOT pipe the command through tail, head, grep, or any other filter — piping buffers all output and prevents progress monitoring. The command must be the bare python3 ... execute.py ... invocation with no pipes.
Once launched, the Bash tool returns an output file path. Monitor progress by reading that file periodically:
# Check progress (repeat periodically)
tail -20 <output_file>
Look for phase markers (## Phase, ## Step, Batch N/M), agent counts (N agents launched, N/M done), and completion signals (Done). Summarize concisely — e.g., "Batch 2/4: review agents 3/5 complete" rather than dumping raw output.
Detecting problems: If the last lines haven't changed across two checks (~2-3 min apart), the pipeline may be stuck. Common signs:
- sleep commands with no progress change → agents may have timed out or crashed
- ERROR or Traceback in the output → script failure, report immediately
- exit code or EXIT: appearing → execution finished (check the code)

When you spot an issue, report it to the user with the relevant output lines rather than waiting for completion.
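The stall check amounts to comparing successive tail snapshots; a minimal sketch (the marker strings follow the signs above, the classification names are assumptions):

```python
def detect_state(prev_tail, curr_tail):
    """Classify a background run from two successive tail snapshots."""
    text = "\n".join(curr_tail)
    if "EXIT:" in text or "exit code" in text:
        return "finished"          # execution done; go check the code
    if "ERROR" in text or "Traceback" in text:
        return "failed"            # script failure; report immediately
    if prev_tail == curr_tail:     # no new output since last check (~2-3 min)
        return "possibly-stuck"
    return "running"

print(detect_state(["Batch 1/4"], ["Batch 1/4"]))  # possibly-stuck
print(detect_state(["Batch 1/4"], ["Batch 2/4"]))  # running
print(detect_state(["Batch 2/4"], ["EXIT: 0"]))    # finished
```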
After execution, check run_result.json for exit_code, duration_s, wall_clock_s, cost_usd, num_turns, and per-model token usage. duration_s is the sum of per-case durations; wall_clock_s is the actual elapsed time (lower when parallelism is used). Read it with cat (it's JSON — state.py would corrupt it to YAML).
cat $AGENT_EVAL_RUNS_DIR/<id>/run_result.json
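Reading run_result.json and deriving the parallelism speedup can be sketched as follows (field names follow the description above; the sample values are made up):

```python
import json, io

# Sample payload shaped like run_result.json (values are illustrative).
raw = io.StringIO(json.dumps({
    "exit_code": 0, "duration_s": 840.0, "wall_clock_s": 300.0,
    "cost_usd": 1.25, "num_turns": 42,
}))
result = json.load(raw)

if result["exit_code"] != 0:
    raise SystemExit(f"run failed with exit code {result['exit_code']}")

# duration_s sums per-case time; wall_clock_s is real elapsed time,
# so their ratio approximates the effective parallel speedup.
speedup = result["duration_s"] / result["wall_clock_s"]
print(f"speedup ~{speedup:.1f}x")  # speedup ~2.8x
```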
If exit_code is non-zero, report the failure with the exit code, duration, and the first few lines of $AGENT_EVAL_RUNS_DIR/<id>/stderr.log. Do not continue to scoring.
Distribute workspace outputs into per-case directories so judges can score each case independently:
python3 ${CLAUDE_SKILL_DIR}/scripts/collect.py \
--config <config> \
--workspace <workspace_path> \
--output $AGENT_EVAL_RUNS_DIR/<id>
Read the collection summary (JSON file — do not use state.py on it):
cat $AGENT_EVAL_RUNS_DIR/<id>/collection.json
Report per-case counts. If any case has 0 artifacts, warn — the skill may not have produced output for that case.
Run all configured judges against the collected outputs. Skip this step if --no-judge was specified.
python3 ${CLAUDE_SKILL_DIR}/scripts/score.py judges \
--run-id <id> \
--config <config>
Judges receive a record dict with:
- outputs["files"], outputs["<dir>_content"]
- outputs["exit_code"], outputs["duration_s"], outputs["cost_usd"], outputs["num_turns"] (if traces.metrics enabled)
- outputs["tool_calls"] (if outputs has tool: entries)
- outputs["stdout"], outputs["stderr"] (if traces.stdout/stderr enabled)
- outputs["annotations"] — parsed annotations.yaml from the dataset case directory (always present, empty dict if no file). Use for outcome-aware scoring where expected results depend on the test case.

This means judges can check output quality, execution efficiency, AND expected outcomes from annotations.
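An outcome-aware judge over that record dict can be sketched as a plain function (the record field names follow the list above; the expected_outcome annotation key and the return shape are hypothetical):

```python
def outcome_judge(record):
    """Score 1.0 when the run's outcome matches the case's annotated expectation."""
    outputs = record["outputs"]
    expected = outputs["annotations"].get("expected_outcome")  # hypothetical key
    if expected is None:
        return {"value": None, "rationale": "no annotation; skipped"}
    actual = "success" if outputs["exit_code"] == 0 else "failure"
    return {"value": 1.0 if actual == expected else 0.0,
            "rationale": f"expected {expected}, got {actual}"}

record = {"outputs": {"exit_code": 0, "annotations": {"expected_outcome": "success"}}}
print(outcome_judge(record))  # {'value': 1.0, 'rationale': 'expected success, got success'}
```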
If --baseline was specified, also run pairwise comparison:
python3 ${CLAUDE_SKILL_DIR}/scripts/score.py pairwise \
--run-id <id> \
--baseline <baseline_id> \
--config <config>
Read the full results:
python3 ${CLAUDE_SKILL_DIR}/scripts/agent_eval/state.py read $AGENT_EVAL_RUNS_DIR/<id>/summary.yaml
summary.yaml has three sections: judges (per-judge mean and pass_rate), per_case (per-case {value, rationale} per judge), and pairwise (only if --baseline was used: run_a, run_b, wins_a, wins_b, ties).
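Working with the three sections can be sketched over a dict shaped like summary.yaml (the structure follows the description above; run ids and numbers are made up):

```python
# Dict shaped like summary.yaml; values are illustrative.
summary = {
    "judges": {"accuracy": {"mean": 0.82, "pass_rate": 0.75}},
    "per_case": {"case-001": {"accuracy": {"value": 1.0, "rationale": "matches gold"}}},
    "pairwise": {"run_a": "2026-04-11-opus", "run_b": "2026-04-10-sonnet",
                 "wins_a": 6, "wins_b": 2, "ties": 2},
}

# Derive run A's win rate over decided (non-tie) pairwise comparisons.
pw = summary["pairwise"]
decided = pw["wins_a"] + pw["wins_b"]
win_rate_a = pw["wins_a"] / decided if decided else 0.5
print(f"{pw['run_a']} wins {win_rate_a:.0%} of decided comparisons")
```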
Read the summary and analyze the results. Read ${CLAUDE_SKILL_DIR}/prompts/analyze-results.md for the full analysis framework — it covers aggregate assessment, failure patterns, root causes, regressions, cost attribution, and recommendations. Lead with the Recommendation so the call-to-action is the first thing the reader sees. Be decisive — state assessments, not hedges.
Save analysis to file so it persists in the report. Prepend YAML frontmatter recording the agent and model that wrote the analysis, plus the UTC timestamp — the report uses these to attribute the analysis in its subtitle:
cat > $AGENT_EVAL_RUNS_DIR/<id>/analysis.md << 'EOF'
---
agent: Claude Code # the agent/runtime writing this analysis (e.g. Claude Code)
model: <your-model-id> # e.g. claude-opus-4-7, claude-sonnet-4-6 — the model backing the agent
date: <UTC ISO 8601> # e.g. 2026-04-17T14:32:11Z
---
<your full analysis — Recommendation first, then Summary, Failure Patterns, Root Causes, Regressions>
EOF
Write the analysis body as markdown with these sections in order: ## Recommendation (verdict + top actions), ## Summary (aggregate scores, run metrics), ## Failure Patterns, ## Root Causes, ## Regressions (only if --baseline was provided), ## Cost Attribution (always — cite run_metrics plus a derived cost_per_<unit>). The Recommendation must be self-contained — many readers will only read that section. This file is rendered as a prominent callout near the top of the HTML report; the frontmatter is consumed by the report renderer and not displayed verbatim.
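The derived cost_per_<unit> figure in Cost Attribution is a simple division over run_metrics; a sketch (metric names and values are illustrative — pick the unit that matches what the skill produces):

```python
run_metrics = {"cost_usd": 1.25, "num_cases": 10}  # illustrative values

# Normalize cost per unit of work so runs of different sizes stay comparable.
cost_per_case = run_metrics["cost_usd"] / run_metrics["num_cases"]
print(f"cost_per_case: ${cost_per_case:.3f}")  # cost_per_case: $0.125
```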
Generate HTML report:
python3 ${CLAUDE_SKILL_DIR}/scripts/report.py \
--run-id <id> \
--config <config> \
[--baseline <baseline_id>] \
--open
Tell the user the report is at $AGENT_EVAL_RUNS_DIR/<id>/report.html.
If --gold flag: After scoring, copy collected artifacts to dataset case dirs as reference files. Report which cases were saved.
Suggest next steps (include --config <config> if a non-default config was used):
- /eval-review --run-id <id> for interactive human review of the results
- /eval-optimize --model <model> for automated improvement based on failures
- /eval-mlflow --run-id <id> to log results to MLflow

If mlflow.experiment is configured in eval.yaml:
Use the Skill tool to invoke /eval-mlflow --action log-results --run-id <id> --config <config>
For all state reads and writes, use python3 ${CLAUDE_SKILL_DIR}/scripts/agent_eval/state.py so flags and results survive context compression.

$ARGUMENTS