From the eval-runner plugin.
Execute an eval defined in the current workspace and capture results with full metadata. Use when the user wants to actually run an eval (one or many SUTs), collect scored outputs under results/, and produce a run manifest so findings are reproducible and comparable over time.
```
npx claudepluginhub danielrosehill/claude-eval-runner-plugin
```

How this skill is triggered — by the user, by Claude, or both.

Slash command: `/eval-runner:run-eval`

The summary Claude sees in its skill listing, used to decide when to auto-load this skill:
Executes an eval from `evals/<slug>/`, captures outputs and scores under `results/<slug>/<run-id>/`, and writes a run manifest.
Arguments:

- `<slug>` — the eval to run, a directory name under `evals/`.
- `--sut=<name>[,<name>...]` — models / prompts / pipelines under test. Required unless the eval has a default SUT list.
- `--n=<int>` — sample-size override (for smoke runs).
- `--tag=<label>` — free-form label recorded in the manifest (e.g. `baseline`, `post-prompt-fix`).
- `--dry-run` — validate config, print the plan, do not call any model.

Workflow:

Locate the eval. Read `evals/<slug>/BRIEF.md`, `TASK.md` (if present), `RUBRIC.md`, and the framework config. If any are missing, stop and suggest `create-eval` / `setup-eval`.
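Before committing to a full run, the flags above could be exercised with a dry run that validates the config without calling any model; the slug `my-eval` and the positional-argument syntax are assumptions for illustration, not something the plugin docs spell out:

```
/eval-runner:run-eval my-eval --dry-run
```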
Pre-flight checks. Check the dataset against `DATASET.md` (if recorded).

Assign a run id. `YYYY-MM-DD-HHMM-<slug>-<shortsha>`, where `shortsha` is `git rev-parse --short HEAD` if the workspace is a repo, else a random 6-character nonce.
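A minimal shell sketch of that naming scheme, with an illustrative slug:

```bash
slug="my-eval"                                   # hypothetical slug
stamp="$(date +%Y-%m-%d-%H%M)"
if git rev-parse --short HEAD >/dev/null 2>&1; then
  shortsha="$(git rev-parse --short HEAD)"
else
  # Not a git repo (or no commits yet): fall back to a random 6-char nonce.
  shortsha="$(LC_ALL=C tr -dc '0-9a-f' < /dev/urandom | head -c 6)"
fi
run_id="${stamp}-${slug}-${shortsha}"
echo "$run_id"                                   # e.g. 2026-02-14-0930-my-eval-1a2b3c
```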
Execute. Invoke the framework's runner (prefer `evals/<slug>/run.sh`). Stream outputs to `results/<slug>/<run-id>/raw/`.
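A sketch of what this step might look like in the shell, assuming `run.sh` accepts the SUT name as its first argument (the plugin does not specify its interface, so treat that as an assumption):

```bash
slug="my-eval"                                   # hypothetical slug
run_id="2026-02-14-0930-my-eval-1a2b3c"          # from the previous step
run_dir="results/${slug}/${run_id}"
mkdir -p "${run_dir}/raw"

# One pass per system under test; tee keeps output visible and captured.
for sut in model-a model-b; do                   # hypothetical SUT names
  bash "evals/${slug}/run.sh" "$sut" 2>&1 | tee "${run_dir}/raw/${sut}.log"
done
```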
Score. Apply the rubric. For LLM-as-judge: call the judge model on each item, cache judgments, and store the full trace (not just the score).
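A minimal sketch of that caching behaviour in the shell, where `judge_item` is a hypothetical stand-in for whatever judge-model call the framework actually provides:

```bash
run_dir="results/my-eval/2026-02-14-0930-my-eval-1a2b3c"   # as above
mkdir -p "${run_dir}/judgments"

for item in "${run_dir}"/raw/*.log; do
  out="${run_dir}/judgments/$(basename "$item" .log).json"
  [ -s "$out" ] && continue        # cached: don't re-call the judge
  # judge_item (hypothetical) should emit the full trace (reasoning + score),
  # not just the number, so judgments stay auditable later.
  judge_item "$item" > "$out"
done
```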
Write the manifest. `results/<slug>/<run-id>/manifest.yaml`:
```yaml
run_id: ...
eval_slug: ...
started_at: ...
finished_at: ...
suts: [ ... ]
judge_model: ...
dataset_hash: ...
framework: ...
framework_version: ...
sample_size: ...
cost_usd: ...
tag: ...
git_sha: ...
overall_scores: { sut_a: 0.82, sut_b: 0.79 }
```
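Of these fields, `git_sha` and `dataset_hash` can be computed mechanically; the dataset path below is an assumption, since the docs do not fix one:

```bash
git_sha="$(git rev-parse HEAD 2>/dev/null || echo none)"
# Hash the dataset so a later run can prove it scored the same data.
dataset_hash="$(sha256sum "evals/my-eval/dataset.jsonl" | cut -d' ' -f1)"
```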
Write a short summary. `results/<slug>/<run-id>/SUMMARY.md` — headline number, per-stratum breakdown, notable failures (link to `raw/`), comparison to prior runs if any exist.
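Put together, a finished run directory might look roughly like this; the `judgments/` folder comes from the scoring sketch above rather than from the plugin itself:

```
results/
  my-eval/
    2026-02-14-0930-my-eval-1a2b3c/
      raw/              # per-SUT outputs
      judgments/        # cached judge traces (illustrative)
      manifest.yaml
      SUMMARY.md
```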
Offer follow-ups. Suggest `/eval-runner:document-eval` to write up findings, or `/eval-runner:publish-eval` to share.