From agent-eval-harness
Generates realistic evaluation test cases for skills by analyzing eval.md and eval.yaml. Bootstraps starter datasets or expands coverage for /eval-run.
```
npx claudepluginhub opendatahub-io/agent-eval-harness --plugin agent-eval-harness
```
You generate evaluation test cases for a skill. You read the skill analysis (eval.md) and eval config (eval.yaml) to understand what the skill does, then create realistic test cases that match the dataset schema. The goal is giving /eval-run something meaningful to test against.
| Argument | Required | Default | Description |
|---|---|---|---|
| `--config <path>` | no | `eval.yaml` | Path to eval config |
| `--count <N>` | no | 5 | Number of cases to generate |
| `--strategy <type>` | no | `bootstrap` | Generation strategy (see Step 3) |
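For example, to bootstrap a small starter set and later grow it (the config path below is illustrative, not a required location):

```
/eval-dataset --count 5
/eval-dataset --strategy expand --count 10 --config evals/my-skill/eval.yaml
```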
Read eval.yaml and eval.md to understand:
- execution.mode (case or batch) and execution.arguments (the argument template). In case mode, {field} placeholders in the arguments are resolved per case from input.yaml — every field referenced in the template (e.g., {strat_key}, {prompt}) must exist in the generated input.yaml files.
- dataset.schema describes the case structure (files, fields, formats)
- outputs[*].schema describes what the skill produces (informs what reference outputs look like)
- check snippets reveal exact validation logic — what fields are accessed, what thresholds are used, what conditions trigger pass/fail
- prompt / prompt_file text describes quality dimensions (completeness, accuracy, etc.)
- description summarizes what each judge evaluates

Build a list of judge-driven requirements — these are the concrete things judges will check. Each test case should be designed to exercise at least one of these requirements. For example:
- `len(content) >= 100` → include a case with minimal input that might produce short output

If eval.yaml doesn't exist, ask the user which skill to evaluate, then invoke /eval-analyze to create the config:
Use the Skill tool to invoke /eval-analyze --skill <skill-name>
Wait for the analysis to complete, then re-read eval.yaml. If /eval-analyze fails or the user skips it, you cannot generate meaningful cases — stop and explain why.
If eval.md doesn't exist, you can still work from eval.yaml's schema descriptions, but the cases will be less targeted.
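As a rough sketch of what this step reads, a hypothetical eval.yaml fragment might look like the following (the skill path, field names, and threshold are illustrative, not taken from a real config):

```yaml
# Hypothetical eval.yaml fragment; names and values are illustrative.
execution:
  mode: case
  arguments: "--strategy {strat_key} --prompt {prompt}"  # every {field} must exist in input.yaml
dataset:
  path: evals/my-skill/dataset
  schema: >
    Each case directory contains input.yaml with required fields
    'strat_key' and 'prompt', and optionally a context.md file.
judges:
  - name: length-check
    description: Output should not be trivially short
    check: |
      len(content) >= 100
```

From a fragment like this, one judge-driven requirement would be "output must be at least 100 characters", so at least one case should use minimal input that risks a short output.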
Read dataset.schema and extract a concrete checklist:
1. Required files — what files each case directory must contain (e.g., input.yaml, reference.md)
2. Required fields per file — for structured files like YAML/JSON, which fields are mandatory
3. Optional fields — fields described with "optionally" or "if available" — vary these across cases (include in some, omit in others) to test the skill's handling of missing optional context
4. Field semantics — what kind of content each field expects (e.g., "problem statement", "clarifying context", "priority level"). Use these descriptions to generate realistic content, not generic placeholders
5. Naming patterns — any file naming conventions mentioned (e.g., "named NNN-slug.md")
6. Argument fields — if execution.mode is case, parse execution.arguments for {field} placeholders. Every placeholder must appear as a required field in input.yaml. Cross-check against items 1-2 above — if {strat_key} is in the arguments but not in the schema, add it as a required field.
7. External-state fields — look for fields marked with [EXTERNAL: System] in the schema description. These reference real resources in external systems (Jira projects, GitHub repos, API endpoints) that must exist at execution time. Do NOT invent values for these fields — fabricated values (e.g., a Jira project key derived from the repo directory name) cause silent failures when the skill queries the external system and gets zero results. Mark these in your generation template as requiring TODO_ placeholder values (see Step 5).
This checklist is your generation template. Every case must satisfy items 1-2 and 6. Items 3-4 guide content variety.
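A distilled generation template for a hypothetical skill might look like this (file names, fields, and the external system are assumptions for illustration, not the required shape):

```yaml
# Hypothetical generation template; adapt to what dataset.schema actually describes.
required_files: [input.yaml, reference.md]
required_fields:
  input.yaml: [prompt, strat_key]       # includes every {field} from execution.arguments
optional_fields:
  input.yaml: [context, priority]       # include in some cases, omit in others
field_semantics:
  prompt: "problem statement in the user's own words"
  priority: "priority level (low/medium/high)"
naming_pattern: "NNN-slug"
external_state_fields:
  input.yaml: [project_key]             # [EXTERNAL: Jira]; use a TODO_ placeholder, never invent
```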
Check what already exists:
```bash
ls <dataset_path>/ 2>/dev/null | head -20
```
Count existing cases and read one to understand the current structure, noting the naming convention and the highest case number.
bootstrap (default) — Generate N cases from scratch. Use this when starting from zero or when fewer than 5 cases exist.
Design cases to cover:
- Simple, typical inputs
- Complex inputs with multiple requirements
- Edge cases (empty or minimal context)
- Long, detailed inputs
- Ambiguous phrasing
- The judge-driven requirements identified when reading the config
expand — Read existing cases, identify gaps, generate cases that fill them. Use this when cases exist but coverage is thin.
Read each existing case's input file to understand what's already covered. Then look for gaps by comparing against the coverage dimensions above and the judge-driven requirements from the config.
Avoid duplicating existing scenarios — each new case should test something distinct that isn't already covered. Number new cases continuing from the highest existing case number.
from-traces — Extract real inputs from MLflow traces and turn them into test cases. Use this when the skill has been used in production and traces are available.
Run the extraction script:
```bash
python3 ${CLAUDE_SKILL_DIR}/../eval-mlflow/scripts/from_traces.py \
  --config <config> \
  --count <N>
```
This outputs YAML with extracted trace inputs (prompt text, tool interactions). Read the output and create case directories following the generation template from Step 2. The trace inputs give you realistic content for the input fields — but you still need to structure the files according to dataset.schema.
If the script exits with code 2 (no traces found) or MLflow is not configured, tell the user and fall back to expand strategy.
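The exact output format of the script is not documented here; assuming it emits entries roughly like the sketch below, map each extracted prompt onto the required input fields from the generation template:

```yaml
# Assumed shape of the extraction output; the real script may differ.
- trace_id: "tr-001"
  prompt: "Summarize the failed checks in the latest pipeline run and propose fixes."
  tool_interactions:
    - tool: Read
      input: "pipeline.log"
```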
For each case, create a directory under dataset.path following the structure described in dataset.schema.
Naming: Use descriptive directory names that indicate what the case tests:
```
case-001-simple-basic-input/
case-002-complex-multi-requirement/
case-003-edge-empty-context/
case-004-long-detailed-input/
case-005-ambiguous-phrasing/
```
Content: Use the generation template from Step 2. Every case must include all required files and fields. Vary optional fields across cases — include them in some, omit in others. Use the field semantics to generate realistic content appropriate to each field's purpose.
Realism: Cases should look like something a real user would encounter. Don't generate lorem ipsum or obviously synthetic inputs. Use realistic names, scenarios, and domain language appropriate to the skill.
External-state placeholders: For fields marked [EXTERNAL: System] in the schema, use TODO_<SYSTEM>_<FIELD> as the value (e.g., project_key: "TODO_JIRA_PROJECT_KEY"). If you want to show a plausible real value, put it in a YAML comment (e.g., # replace with real key, such as MYPROJECT). The TODO_ prefix signals that this must be replaced with a real value from the target system before execution. List all placeholders in Step 7 so the user knows what needs manual review.
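Putting these rules together, a single case's input.yaml might look like the sketch below (field names are hypothetical and must follow whatever dataset.schema actually requires):

```yaml
# case-002-complex-multi-requirement/input.yaml; field names are illustrative.
strat_key: conservative
prompt: >
  Our model registry rejects uploads when the signature file is missing.
  Add a pre-upload validation step and document the new workflow for the team.
priority: high
project_key: "TODO_JIRA_PROJECT_KEY"  # replace with a real key, such as MYPROJECT
```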
Answers for interactive skills: If eval.yaml has inputs.tools entries for AskUserQuestion, the skill asks questions during execution. The hook uses LLM-based answering (via models.hook) that reads input.yaml and answers.yaml from each case as context. Create answers.yaml with guidance that tells the LLM how to answer domain-specific questions for this case:
```yaml
# answers.yaml — LLM answerer guidance for this case
dedup_is_duplicate: true
dedup_guidance: >
  This RFE is intentionally a rephrased version of an existing RFE
  about model signature verification. If asked whether existing RFEs
  cover this need, the answer is yes.
```
The LLM reads these fields alongside the question and options to pick the right answer. For general clarifying questions, the LLM uses input.yaml context — no answers.yaml needed. Only create answers.yaml when the case has domain-specific decisions (e.g., "is this a duplicate?", "should this be split?") where the correct answer depends on the test scenario.
If unsure what questions the skill asks, you can leave answers.yaml out — the hook still calls the LLM using input.yaml context and the handler prompt, falling back to the first option only if the LLM call fails.
Annotations for outcome-aware judges: Judges receive outputs["annotations"] — the parsed annotations.yaml from each case. If the eval config has judges that check expected outcomes (e.g., annotations.get("dedup_is_duplicate") to determine whether no output is correct), add the relevant fields to each case's annotations.yaml:
```yaml
# annotations.yaml — fields for outcome-aware judges
dedup_is_duplicate: true  # or false — tells judges whether no RFE is expected
tags: [dedup, high-overlap]
known_issues:
  - dedup should flag this as overlapping with RHAIRFE-1001
```
Check the eval.yaml judges section for any check snippets that access outputs.get("annotations", {}) — those fields must exist in annotations.yaml for the judge to work correctly.
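For instance, a judge entry shaped roughly like the one below (names and logic are hypothetical) only works if dedup_is_duplicate exists in the case's annotations.yaml:

```yaml
# Hypothetical outcome-aware judge; adapt to the real check snippets in eval.yaml.
judges:
  - name: dedup-outcome
    description: No new RFE should be created when the case is a known duplicate
    check: |
      expected_dup = outputs.get("annotations", {}).get("dedup_is_duplicate", False)
      created = bool(outputs.get("rfe"))
      (not created) if expected_dup else created
```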
Companion files: If eval.md lists companion_files (files the skill reads from disk at runtime — e.g., strategy.md, adr.md), each test case must include them. In case mode, the harness copies all case files into the workspace, so the skill will find them at their expected relative paths. Generate realistic content for these files appropriate to each case's scenario.
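As a sketch, a fully populated case directory for such a skill might contain the following (every file beyond input.yaml is only needed if the schema or eval.md calls for it):

```
case-002-complex-multi-requirement/
  input.yaml        # required fields from dataset.schema
  strategy.md       # companion file the skill reads at runtime
  answers.yaml      # only if the case needs domain-specific answers
  annotations.yaml  # outcome fields for outcome-aware judges
  reference.md      # optional gold reference output
```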
Reference outputs: Only include gold standard reference files if you can confidently produce a correct output. It's better to leave references out (the user can generate them later with /eval-run --gold) than to include incorrect ones that mislead judges.
After generating, verify the cases:
- Does each case directory contain every required file and field that dataset.schema describes?
- If execution.mode is case, verify that input.yaml contains all fields referenced by {field} placeholders in execution.arguments
- Spot-check one case: `ls <dataset_path>/case-001-*/`
Tell the user what was created:
- The dataset path `<path>` and how many cases were created
- If any TODO_ placeholder values were generated, list each one with which case it's in, which external system it references, and what kind of value is needed (e.g., "case-001/input.yaml TODO_JIRA_PROJECT_KEY — needs a real Jira project key from your test instance"). These MUST be replaced with real values before running /eval-run.
- Suggested next steps (append --config <config> if a non-default config was used):
  - /eval-run --model <model> to test the skill against these cases
  - /eval-run --model <model> --gold to generate gold references from the best outputs
  - /eval-dataset --strategy expand --count 10 to add more cases later

Match the schema exactly: if dataset.schema says "input.yaml with a 'prompt' field", create input.yaml with a prompt field. Not input.json, not query.yaml.

Use descriptive case names: case-003-edge-empty-context is better than case-003. The name should indicate what scenario is being tested.

$ARGUMENTS