Scaffolding, running, documenting, and publishing AI evaluations. Ships skills and commands for setting up eval workspaces, creating custom evals (or adapting existing frameworks/benchmarks), running them, and publishing evals or datasets. Bundles a curated ground-truth list of open-source eval tools and benchmarks as a reference data source.
Install from the marketplace:

    npx claudepluginhub danielrosehill/claude-eval-runner-plugin

Commands:

- Design a custom eval (or remix an existing benchmark) — task spec, dataset plan, scoring rubric, reporting format.
- Write up the rationale and findings of an eval into a durable document under docs/.
- Provision a new eval-runner workspace (scaffold + optional GitHub repo).
- Publish an eval dataset to Hugging Face Hub (or GitHub), with dataset card, content hash, and license.
- Publish an eval (definition + results) to GitHub, Hugging Face Space, or a local bundle.
Skills:

- Design a custom eval from scratch, or remix an existing benchmark. Use when the user wants to define the eval itself — task framing, dataset composition, scoring rubric, and reporting format — rather than simply wiring up a framework. Produces a fully specified eval definition ready to be run.
- Provision a new eval-runner workspace on disk. Use when the user wants to start a new evaluation project — scaffolds evals/, datasets/, results/, and docs/ directories, personalises CLAUDE.md, and (by default) creates a GitHub repo.
- Publish an eval dataset to Hugging Face Hub (or GitHub as a fallback). Use when the user wants to share the inputs/labels used by an eval — with a dataset card, licensing, splits, and a content hash so downstream runs can verify integrity.
- Publish an eval (definition + results) so others can reproduce it. Use when the user wants to share an eval publicly — as a GitHub repo, Hugging Face space, or a standalone writeup. Produces a clean, self-contained bundle with README, task spec, rubric, dataset pointer, and a run report.
- Execute an eval defined in the current workspace and capture results with full metadata. Use when the user wants to actually run an eval (one or many SUTs), collect scored outputs under results/, and produce a run manifest so findings are reproducible and comparable over time.
A Claude Code plugin for setting up, running, documenting, and publishing AI evaluations — whether you're using an existing eval framework, adapting an existing benchmark, or designing a custom one from scratch.
Evaluating AI systems is the unglamorous half of AI work. This plugin is a harness for the harness work: scaffold an eval workspace, pick (or remix) a framework, design a rubric, run the thing, write down what you found, and publish it when it's worth sharing.
    # from inside Claude Code
    /plugin marketplace add <your-marketplace>
    /plugin install eval-runner
Or add as a local plugin by pointing Claude Code at this directory.
| Command | Purpose |
|---|---|
| `/eval-runner:new-workspace <name>` | Provision a new eval workspace (Train-Case name; optional `--private` / `--local-only`). |
| `/eval-runner:create-eval <slug>` | Design a custom eval — task spec, dataset plan, rubric. Use `--inspired-by` to remix an existing benchmark. |
| `/eval-runner:setup-eval <slug>` | Wire an eval to a framework. Use `--framework=` to pin, or let the plugin suggest based on `--type=`. |
| `/eval-runner:run-eval <slug>` | Execute an eval across one or more SUTs (`--sut=`), with manifest + summary under results/. |
| `/eval-runner:document-eval <slug>` | Write up rationale and findings into docs/. |
| `/eval-runner:publish-eval <slug>` | Publish the eval (and optionally results) to GitHub, a Hugging Face Space, or a local bundle. |
| `/eval-runner:publish-dataset <id>` | Publish a dataset to Hugging Face Hub with a dataset card and content hash. |
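A typical end-to-end session, strung together from the commands above, might look like the following sketch. The workspace name, eval slug, framework, and SUT value are illustrative placeholders, not defaults:

```
# inside Claude Code, from wherever you keep projects
/eval-runner:new-workspace My-Eval-Workspace --local-only

# define the eval: task spec, dataset plan, rubric
/eval-runner:create-eval rag-legal-docs

# wire it to a framework (name illustrative; omit --framework= to get a suggestion)
/eval-runner:setup-eval rag-legal-docs --framework=promptfoo

# run against a system under test, then document and publish
/eval-runner:run-eval rag-legal-docs --sut=claude-sonnet-4
/eval-runner:document-eval rag-legal-docs
/eval-runner:publish-eval rag-legal-docs
```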
`eval-engineer` — autonomous subagent that coordinates the full design → setup → run → document → publish loop for non-trivial eval work.

Bundled under data/awesome-ai-evaluations-tools.md is a snapshot of the Awesome AI Evaluations & Benchmarks list — a curated canon of open-source eval frameworks, benchmarks, and observability platforms. The plugin reads this before recommending any tool, and flags when it goes off-list. See data/README.md for refresh instructions.
    <workspace>/
    ├── CLAUDE.md
    ├── README.md
    ├── evals/<slug>/            # BRIEF.md, TASK.md, RUBRIC.md, config, run.sh, judges/
    ├── datasets/<id>/           # data/, CARD.md, HASH
    ├── results/<slug>/<run>/    # manifest.yaml, SUMMARY.md, raw/
    └── docs/                    # writeups and findings
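Each dataset directory carries a HASH file so downstream runs can verify they scored the same inputs. The plugin's exact hashing scheme isn't documented here, but a deterministic content hash over a dataset's files can be produced with standard tools; a minimal sketch, assuming SHA-256 over the sorted file list:

```
# Illustrative only: hash every file under data/ in stable order,
# then hash the resulting digest list into a single content hash.
cd datasets/my-dataset    # "my-dataset" stands in for a real <id>
find data -type f -print0 | sort -z | xargs -0 sha256sum | sha256sum > HASH
```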
Eval slugs are short, descriptive kebab-case identifiers (e.g. `translation-he-en-quality`, `rag-legal-docs`). Each run directory under results/ is named `YYYY-MM-DD-HHMM-<slug>-<shortsha>`. MIT licensed.