From eval-framework
Design evaluation criteria and a 1-5 scoring rubric for a task or LLM output. Use this skill when asked to "create an eval", "define evaluation criteria", "build a scoring rubric", or "design how to measure quality" for any output.
npx claudepluginhub ats-kinoshita-iso/agent-workshop --plugin eval-frameworkThis skill uses the workspace's default tool permissions.
Design a structured evaluation framework with explicit scoring criteria for the given task.
Guides strict Test-Driven Development (TDD): write failing tests first for features, bugfixes, refactors before any production code. Enforces red-green-refactor cycle.
Guides systematic root cause investigation for bugs, test failures, unexpected behavior, performance issues, and build failures before proposing fixes.
Guides A/B test setup with mandatory gates for hypothesis validation, metrics definition, sample size calculation, and execution readiness checks.
Design a structured evaluation framework with explicit scoring criteria for the given task.
Read the task description and identify:
Create 3–6 independent dimensions that together cover output quality. For each dimension:
Dimension: <name>
Weight: <percentage of total score, must sum to 100>
Question: <one question an evaluator answers to score this dimension>
Common dimensions for LLM evaluation:
For each dimension, define what each score level looks like:
Score 5 (Excellent): <concrete description of a 5/5 response>
Score 4 (Good): <concrete description of a 4/5 response>
Score 3 (Acceptable): <concrete description of a 3/5 response>
Score 2 (Poor): <concrete description of a 2/5 response>
Score 1 (Failing): <concrete description of a 1/5 response>
Anchored rubrics reduce inter-rater variance. Each level must be distinguishable — a reader seeing two outputs should be able to consistently assign different scores.
Specify:
Run a quick sanity check:
Present the final rubric as a Markdown table:
| Dimension | Weight | Score 1 | Score 3 | Score 5 |
|---|---|---|---|---|
| ... | ...% | ... | ... | ... |
Include pass/fail thresholds below the table.