From eval-runner
Design a custom eval from scratch, or remix an existing benchmark. Use when the user wants to define the eval itself — task framing, dataset composition, scoring rubric, and reporting format — rather than simply wiring up a framework. Produces a fully specified eval definition ready to be run.
Install: `npx claudepluginhub danielrosehill/claude-eval-runner-plugin`
Slash command: `/eval-runner:create-eval`
Where `setup-eval` wires up a framework, `create-eval` designs the eval's substance: what's tested, how it's scored, what "good" looks like. Use together — create first, then setup.
`$ARGUMENTS` may include:

- `--inspired-by=<ref>` — an existing benchmark/eval to remix (e.g. MMLU, HumanEval, MT-Bench, a HF dataset, a paper).
- `--mode=<scratch|remix>` — defaults to `scratch` unless `--inspired-by` is set.

Interview the user. Before writing anything, settle the key design questions in conversation or record the answers in `BRIEF.md`.
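A full invocation might look like this; the flag values are hypothetical picks from the examples above:

```
/eval-runner:create-eval --inspired-by=MT-Bench --mode=remix
```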
Task spec. Write `evals/<slug>/TASK.md`:
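For illustration only, a task spec might cover the following; none of these headings are prescribed by the skill:

```markdown
<!-- Hypothetical skeleton: headings are illustrative, not mandated by create-eval -->
# Task: <slug>

## Capability under test
One sentence naming exactly what the eval measures.

## Input format
What each example presents to the model (prompt shape, context, constraints).

## Expected output
What a correct response looks like, including any format requirements.

## Out of scope
Behaviors this eval deliberately does not measure.
```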
Dataset design. Write `evals/<slug>/DATASET.md`:
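Again purely as a sketch, a dataset design doc might record:

```markdown
<!-- Hypothetical skeleton: fields are illustrative -->
# Dataset: <slug>

- Size and splits: how many examples, and any dev/test separation
- Sources: hand-written, sampled from real usage, or remixed from an existing set
- Difficulty spread: how easy, medium, and hard cases are balanced
- Contamination notes: known or likely overlap with model training data
```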
Scoring rubric. Write `evals/<slug>/RUBRIC.md`:
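A rubric sketch with a made-up three-point scale, just to show the shape:

```markdown
<!-- Hypothetical skeleton: criteria and scale are illustrative -->
# Rubric: <slug>

Each example is scored 0–2 per criterion; the eval score is the mean.

| Criterion   | 0 (fail)            | 1 (partial)               | 2 (pass)             |
|-------------|---------------------|---------------------------|----------------------|
| Correctness | Wrong answer        | Right idea, flawed detail | Fully correct        |
| Format      | Ignores constraints | Minor deviations          | Matches spec exactly |
```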
Remix mode. If `--inspired-by` is set, explicitly record in `TASK.md`:
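For instance (the field names and source benchmark below are invented for illustration):

```markdown
<!-- Hypothetical provenance block for TASK.md -->
## Provenance

- Inspired by: MT-Bench (the --inspired-by reference)
- Kept: multi-turn task framing and judge-based scoring
- Changed: new hand-written prompts; no items copied from the source set
- Attribution: link the source benchmark and note its license
```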
Reporting format. Write `evals/<slug>/REPORT_TEMPLATE.md` — the structure results will be written into, so that `publish-eval` / `document-eval` have a fixed target.
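One possible shape, with section names chosen as assumptions so those downstream skills have stable anchors:

```markdown
<!-- Hypothetical skeleton: section names are illustrative -->
# Results: <slug>

- Model: <name and version>
- Run date: <date>
- Overall score: <aggregate, as defined in RUBRIC.md>

## Per-criterion breakdown
<one row or bullet per rubric criterion>

## Notable failures
<two or three representative examples with short commentary>
```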
Handoff. Suggest running `/eval-runner:setup-eval` to wire the design to a framework, or writing a minimal custom runner if nothing in `data/awesome-ai-evaluations-tools.md` fits.