Scaffolding, running, documenting, and publishing AI evaluations. Ships skills and commands for setting up eval workspaces, creating custom evals (or adapting existing frameworks/benchmarks), running them, and publishing evals or datasets. Bundles a curated ground-truth list of open-source eval tools and benchmarks as a reference data source.
Install from the marketplace:

    npx claudepluginhub danielrosehill/claude-eval-runner-plugin

Commands:

- Design a custom eval (or remix an existing benchmark) — task spec, dataset plan, scoring rubric, reporting format.
- Write up the rationale and findings of an eval into a durable document under docs/.
- Provision a new eval-runner workspace (scaffold + optional GitHub repo).
- Publish an eval dataset to Hugging Face Hub (or GitHub), with dataset card, content hash, and license.
- Publish an eval (definition + results) to GitHub, Hugging Face Space, or a local bundle.
Skills:

- Design a custom eval from scratch, or remix an existing benchmark. Use when the user wants to define the eval itself — task framing, dataset composition, scoring rubric, and reporting format — rather than simply wiring up a framework. Produces a fully specified eval definition ready to be run.
- Provision a new eval-runner workspace on disk. Use when the user wants to start a new evaluation project — scaffolds evals/, datasets/, results/, and docs/ directories, personalises CLAUDE.md, and (by default) creates a GitHub repo.
- Publish an eval dataset to Hugging Face Hub (or GitHub as a fallback). Use when the user wants to share the inputs/labels used by an eval — with a dataset card, licensing, splits, and a content hash so downstream runs can verify integrity.
- Publish an eval (definition + results) so others can reproduce it. Use when the user wants to share an eval publicly — as a GitHub repo, Hugging Face space, or a standalone writeup. Produces a clean, self-contained bundle with README, task spec, rubric, dataset pointer, and a run report.
- Execute an eval defined in the current workspace and capture results with full metadata. Use when the user wants to actually run an eval (one or many SUTs), collect scored outputs under results/, and produce a run manifest so findings are reproducible and comparable over time.
A Claude Code plugin for setting up, running, documenting, and publishing AI evaluations — whether you're using an existing eval framework, adapting an existing benchmark, or designing a custom one from scratch.
Evaluating AI systems is the unglamorous half of AI work. This plugin is a harness for the harness work: scaffold an eval workspace, pick (or remix) a framework, design a rubric, run the thing, write down what you found, and publish it when it's worth sharing.
    # from inside Claude Code
    /plugin marketplace add <your-marketplace>
    /plugin install eval-runner
Or add as a local plugin by pointing Claude Code at this directory.
| Command | Purpose |
|---|---|
| `/eval-runner:new-workspace <name>` | Provision a new eval workspace (Train-Case name; optional `--private` / `--local-only`). |
| `/eval-runner:create-eval <slug>` | Design a custom eval — task spec, dataset plan, rubric. Use `--inspired-by` to remix an existing benchmark. |
| `/eval-runner:setup-eval <slug>` | Wire an eval to a framework. Use `--framework=` to pin, or let the plugin suggest based on `--type=`. |
| `/eval-runner:run-eval <slug>` | Execute an eval across one or more SUTs (`--sut=`), with manifest + summary under results/. |
| `/eval-runner:document-eval <slug>` | Write up rationale and findings into docs/. |
| `/eval-runner:publish-eval <slug>` | Publish the eval (and optionally results) to GitHub, a Hugging Face Space, or a local bundle. |
| `/eval-runner:publish-dataset <id>` | Publish a dataset to Hugging Face Hub with a dataset card and content hash. |
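A typical end-to-end session, strung together from the commands above, might look like the following sketch. The workspace name, eval slug, framework, and SUT value are illustrative placeholders, not defaults:

```
# inside Claude Code, from wherever you keep projects
/eval-runner:new-workspace My-Eval-Workspace --local-only

# define the eval: task spec, dataset plan, rubric
/eval-runner:create-eval rag-legal-docs

# wire it to a framework (name illustrative; omit --framework= to get a suggestion)
/eval-runner:setup-eval rag-legal-docs --framework=promptfoo

# run against a system under test, then document and publish
/eval-runner:run-eval rag-legal-docs --sut=claude-sonnet-4
/eval-runner:document-eval rag-legal-docs
/eval-runner:publish-eval rag-legal-docs
```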
`eval-engineer` — autonomous subagent that coordinates the full design → setup → run → document → publish loop for non-trivial eval work.

Bundled under data/awesome-ai-evaluations-tools.md is a snapshot of the Awesome AI Evaluations & Benchmarks list — a curated canon of open-source eval frameworks, benchmarks, and observability platforms. The plugin reads this before recommending any tool, and flags when it goes off-list. See data/README.md for refresh instructions.
    <workspace>/
    ├── CLAUDE.md
    ├── README.md
    ├── evals/<slug>/            # BRIEF.md, TASK.md, RUBRIC.md, config, run.sh, judges/
    ├── datasets/<id>/           # data/, CARD.md, HASH
    ├── results/<slug>/<run>/    # manifest.yaml, SUMMARY.md, raw/
    └── docs/                    # writeups and findings
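Each dataset directory carries a HASH file so downstream runs can verify they scored the same inputs. The plugin's exact hashing scheme isn't documented here, but a deterministic content hash over a dataset's files can be produced with standard tools; a minimal sketch, assuming SHA-256 over the sorted file list:

```
# Illustrative only: hash every file under data/ in stable order,
# then hash the resulting digest list into a single content hash.
cd datasets/my-dataset    # "my-dataset" stands in for a real <id>
find data -type f -print0 | sort -z | xargs -0 sha256sum | sha256sum > HASH
```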
Eval slugs are short, descriptive kebab-case identifiers (e.g. `translation-he-en-quality`, `rag-legal-docs`). Each run directory under results/ is named `YYYY-MM-DD-HHMM-<slug>-<shortsha>`. MIT licensed.