Analyzes SKILL.md, sub-skills, scripts, and tests to generate eval.yaml configs for the agent eval harness, including dataset schema, judges, and thresholds.
You analyze a target skill and produce `eval.yaml` — the configuration that `/eval-run` needs. You read the skill deeply (including sub-skills it invokes), explore existing test cases, and generate everything: dataset schema, output descriptions, judges, and thresholds.
The core principle: observe, don't assume. Every field name, file pattern, and directory path in the generated eval.yaml must come from reading actual files. If you can't point to a specific file or field you observed, don't put it in the config.
| Argument | Required | Default | Description |
|---|---|---|---|
| `--skill <name>` | no | auto-detect | Which skill to analyze |
| `--config <path>` | no | `eval.yaml` | Output path for the config |
| `--update` | no | `false` | Fill in missing sections only; preserve user edits |
mkdir -p tmp
python3 ${CLAUDE_SKILL_DIR}/scripts/agent_eval/state.py init tmp/analyze-config.yaml \
skill=<skill> config=<config> update=<true/false>
If --skill was provided, locate its SKILL.md:
python3 ${CLAUDE_SKILL_DIR}/scripts/find_skills.py --name <skill>
If not provided, list all project skills:
python3 ${CLAUDE_SKILL_DIR}/scripts/find_skills.py
This reads .claude-plugin/plugin.json for custom skill paths, falls back to .claude/skills/ and skills/, and excludes eval harness skills. If only one skill is found, use it automatically. If multiple, ask the user which to analyze. If none are found, tell the user — they may need to check their skill directory paths or create a skill first.
If --update and eval.yaml already has a skill field: use that skill. If --skill is also provided and differs, ask the user which they mean — don't silently overwrite.
If eval.yaml already exists and --update was not set:
test -f <config> && echo "CONFIG_EXISTS" || echo "NO_CONFIG"
If it exists, validate it:
python3 ${CLAUDE_SKILL_DIR}/scripts/validate_eval.py config <config>
Then check if eval.md (the cached analysis) is still fresh — meaning the SKILL.md hasn't changed since the last analysis:
python3 ${CLAUDE_SKILL_DIR}/scripts/validate_eval.py memory eval.md
If FRESH and eval.yaml has a non-empty dataset.schema, at least one outputs entry with a schema, at least one judge, and models.skill set, report that config is up to date and exit. No work needed. (An INCOMPLETE config — empty sections, or missing models.skill from a pre-restructure eval.yaml — still needs analysis.)
If STALE, NO_CONFIG, or --update was set, proceed to full analysis.
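As a rough illustration, a config needs at least these pieces filled in before the FRESH check treats it as complete (the key names come from the check above; the nesting is an assumption, since the real layout is defined by the template used in Step 5):

```yaml
# Minimal completeness sketch; key names from the check above, nesting assumed
dataset:
  schema: "Each case is a directory with input.yaml containing ..."   # non-empty dataset.schema
outputs:
  - path: reports/
    schema: "Describe the files the pipeline writes"                  # at least one outputs entry with a schema
judges:
  - name: report_exists                                               # at least one judge
models:
  skill: claude-opus-4-6                                              # models.skill set
```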
This is the most important step — the quality of everything downstream depends on how thoroughly you understand the skill.
Launch an Explore agent to do the analysis:
Read ${CLAUDE_SKILL_DIR}/prompts/analyze-skill.md to get the analysis instructions, and launch the agent with `subagent_type="Explore"`. The analysis is recursive — the agent follows sub-skill chains (Skill tool calls, /skill-name references) until it finds the skills that produce the final artifacts (typically 2-5 levels, capped at 5 to avoid circular references), reading each sub-skill's SKILL.md to trace the full pipeline. The outputs section must describe what the entire pipeline produces, not just the top-level orchestrator.
The agent returns structured YAML with: purpose, inputs, outputs, sub_skills, flags, pipeline, quality_criteria, and suggested_judges. See ${CLAUDE_SKILL_DIR}/prompts/analyze-skill.md for the full schema.
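A sketch of what that response might look like; the top-level keys are the ones listed above, but the nested structure and all values shown here are assumptions (analyze-skill.md defines the real schema):

```yaml
purpose: "Generates architecture review reports from strategy keys"   # illustrative
inputs:
  - "strat_key: the strategy to analyze"
outputs:
  - "reports/<strat_key>.md: final report written by the report-writer sub-skill"
sub_skills:
  - report-writer
flags:
  - "--headless"
pipeline: "orchestrator -> report-writer -> reports/*.md"
quality_criteria:
  - "Report cites the files it analyzed"
suggested_judges:
  - name: report_exists
    code: "len(list(Path('reports').glob('*.md'))) > 0"
```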
Verify the response: check that outputs reference actual directories and file patterns (not placeholders like <output-dir>), that sub_skills lists real skill names, and that suggested_judges include working code snippets. If anything looks fabricated, ask the agent to re-examine specific files.
First check if eval.yaml already has a dataset.path (from a previous run or --update):
ls <dataset_path>/ 2>/dev/null | head -20
If not set or doesn't exist, search the project for test case directories using the Glob tool:
Glob: **/cases/ or **/test-cases/ or **/fixtures/ or **/examples/
Exclude .venv/, .git/, node_modules/ from results.
If nothing found, ask the user where their test cases are (or will be).
If a cases directory exists, read one complete sample case — every file in it. Note the files the case contains and the fields inside each (input fields, expected outputs, annotations).
This is what you'll describe in dataset.schema. If you didn't read the actual files, your schema description will be wrong — and downstream judges will fail because they expect fields that don't exist.
If no test cases exist, note this clearly and suggest running /eval-dataset to generate them. Describe the expected case structure in dataset.schema anyway — eval-dataset uses that description to create matching cases.
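For example, if the sample case you read looked like the sketch below (the file and field names here are illustrative, not prescribed), the dataset.schema description should mirror it exactly:

```yaml
# cases/dedup-simple/input.yaml — a hypothetical observed test case
strat_key: PAYMENTS-42
adr_file: adrs/0007-dedup.md      # optional in this hypothetical case

# cases/dedup-simple/annotations.yaml — expected outcome, later read by judges
dedup_is_duplicate: true
```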
Combine the skill analysis (Step 3) and dataset exploration (Step 4) into a complete eval.yaml. Read the full template and writing guidance at ${CLAUDE_SKILL_DIR}/references/eval-yaml-template.md.
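For orientation, a rough sketch of the overall shape (section names are the ones discussed under Key points below; the exact nesting, defaults, and judge entry format come from eval-yaml-template.md, so treat this as an illustration, not the canonical template):

```yaml
skill: my-skill
execution:
  mode: case                              # or batch; see mode guidance below
  arguments: "{strat_key} {adr_file?}"    # case mode: placeholders from input.yaml
  env:
    JIRA_SERVER: "http://localhost:8080"  # literal test endpoint; use $VAR to resolve from caller env
runner:
  type: claude-code
models:
  skill: claude-opus-4-6
  judge: claude-opus-4-6
  hook: claude-sonnet-4-6                 # only needed for interactive AskUserQuestion skills
mlflow:
  experiment: my-project-eval
permissions:
  allow: ["Skill"]                        # only if the skill invokes sub-skills
dataset:
  path: cases/
  schema: |
    Each case is a directory with input.yaml containing 'strat_key' and
    'project_key' ([EXTERNAL: Jira] — must be a real project key), plus
    annotations.yaml with the expected outcome.
outputs:
  - path: reports/
    schema: "One markdown report per case, citing the analyzed files."
judges:
  - name: report_exists
    type: check
  - name: report_quality
    type: llm
    prompt_file: judges/report_quality.md
```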
Key points:
- Set execution.mode from the skill analysis (Step 3). If the analyzer returned ASK_USER, ask the user which mode to use — explain what the analyzer observed and let them decide. Do not default to case without evidence; a skill that processes collections of items internally (batch-size controls, multi-item iteration, multi-agent fan-out, result aggregation) is batch even if it also accepts a single item. See eval-yaml-template.md for the full mode selection guidance.
- Set execution.arguments. For case mode, build a template with {field} placeholders matching the input.yaml fields you observed in Step 4 (e.g., "{strat_key} {adr_file?}"). For batch mode, use the literal arguments string (e.g., "--input batch.yaml --headless").
- runner.type: claude-code is the default and almost always correct. Only change it if the user has explicitly mentioned another harness.
- Set models.skill to claude-opus-4-6 (the default for eval runs). Set models.judge to claude-opus-4-6 — LLM and pairwise judges need a strong model for accurate scoring. If the skill uses AskUserQuestion interactively (not --headless), set models.hook to claude-sonnet-4-6 for LLM-based question answering (fast enough for picking options, cheaper than Opus). CLI flags override.
- Set mlflow.experiment to <project>-eval (or leave blank — it falls back to the top-level name).
- The dataset.schema and outputs[*].schema fields drive the entire pipeline — be specific, and reference actual file and field names you observed.
- If any input fields must refer to real external systems (e.g., a project key on JIRA_SERVER), annotate those fields in dataset.schema with [EXTERNAL: System] markers (e.g., 'project_key' ([EXTERNAL: Jira] — must be a real project key)). This tells /eval-dataset not to fabricate values for these fields. See eval-yaml-template.md for the convention.
- If the skill's allowed-tools frontmatter includes Skill (meaning it invokes sub-skills), add "Skill" to permissions.allow. The Skill tool requires explicit permission in headless mode — without it, nested skill calls fail silently and the pipeline degrades.
- If the skill needs environment variables (e.g., JIRA_SERVER for a jira-emulator, API keys for test instances), add execution.env entries. Use $VAR syntax for values that should be resolved from the caller's environment (e.g., $JIRA_TOKEN), or literal values for test-only endpoints (e.g., http://localhost:8080).
- If the skill talks to external systems or asks the user questions mid-run, add inputs.tools entries. Use match to describe what to intercept in natural language (e.g., "any Jira interaction via MCP or scripts"), and prompt for how to handle it. The AskUserQuestion hook uses 3-tier answer resolution: exact match from case_overrides, then an LLM call (using models.hook) with the case's input.yaml and answers.yaml as context, then fallback to the first option. If the skill asks domain-specific questions (e.g., "is this a duplicate?"), suggest the user create answers.yaml files per case with guidance for the LLM answerer.
- Judges can read outputs["annotations"] — the parsed annotations.yaml from the dataset case. Use this for outcome-aware scoring where the expected result depends on the test case (e.g., annotations.get("dedup_is_duplicate") determines whether producing no output is correct).
- Favor deterministic check judges plus 1-2 LLM prompt judges. Start lean.
- With --update: preserve everything already in the file and only add missing top-level keys (e.g., add a models: block if the user is upgrading from an older config that lacked it).

After writing eval.yaml, validate that all references are correct:
python3 ${CLAUDE_SKILL_DIR}/scripts/validate_eval.py config <config>
This checks dataset path exists, output paths are relative, judge prompt_file/context/module references resolve, and runner.settings exists.
Errors (exit code 1): fix before proceeding — broken file references, absolute paths, missing modules.
Warnings (exit code 0): may be expected — empty dataset (user hasn't created cases yet), missing judges (will be added later). Report them to the user but don't block.
The eval.md caches the skill analysis so it doesn't need to be repeated. The hash tracks only the top-level SKILL.md — if sub-skills change, the user should run /eval-analyze --update to refresh. Compute the skill hash:
python3 -c "import hashlib; from pathlib import Path; print(hashlib.sha256(Path('<skill-path>/SKILL.md').read_bytes()).hexdigest()[:12])"
Read the template at ${CLAUDE_SKILL_DIR}/prompts/generate-eval-md.md. Write eval.md with YAML frontmatter (skill, analyzed_at, skill_hash) and a markdown narrative of the analysis.
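For instance, the frontmatter might look like this (the keys are the ones listed above; the values are placeholders):

```yaml
---
skill: my-skill
analyzed_at: "2025-01-01T00:00:00Z"
skill_hash: "a1b2c3d4e5f6"   # first 12 hex chars of sha256 over SKILL.md, per the command above
---
```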
Tell the user what was generated:
- eval.yaml written to <path> (M cases found)
- eval.md written (skill hash <hash>)
- Next: run /eval-dataset to generate test cases (required before eval-run)
- Then: run /eval-run --model <model> to execute the evaluation

If validation produced warnings, list them so the user knows what's incomplete.
$ARGUMENTS