From eval-framework
Execute a structured evaluation against a set of LLM outputs and produce a scored report. Use this skill when asked to "run the eval", "score these outputs", "evaluate this response", or "generate an evaluation report".
npx claudepluginhub ats-kinoshita-iso/agent-workshop --plugin eval-framework

This skill uses the workspace's default tool permissions.
Apply a scoring rubric to one or more LLM outputs and produce a structured scored report.
Read or request:
/eval-design to create one)

For each output and each rubric dimension:
Record results in a scoring matrix:
Output: <id or label>
Dimension: Correctness Score: 4/5 "The answer covers all main points but omits X."
Dimension: Format Score: 5/5 "JSON schema matches the spec exactly."
Dimension: Clarity Score: 3/5 "Second paragraph is ambiguous about Y."
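One way to hold the scoring matrix in memory before writing it out is sketched below. The `DimensionScore` class and `format_matrix` helper are hypothetical names introduced for illustration only; the skill itself does not prescribe a data structure. The sketch assumes each entry carries a dimension name, a score, the scale maximum, and a one-sentence rationale, as in the example above.

```python
from dataclasses import dataclass


@dataclass
class DimensionScore:
    """One cell of the scoring matrix (hypothetical structure)."""
    dimension: str   # rubric dimension, e.g. "Correctness"
    score: int       # points awarded
    max_score: int   # top of the scale, 5 in the example above
    rationale: str   # one-sentence justification for the score


def format_matrix(output_id, rows):
    """Render the matrix for one output in the plain-text layout shown above."""
    lines = [f"Output: {output_id}"]
    for r in rows:
        lines.append(
            f'Dimension: {r.dimension} Score: {r.score}/{r.max_score} "{r.rationale}"'
        )
    return "\n".join(lines)


rows = [
    DimensionScore("Correctness", 4, 5, "The answer covers all main points but omits X."),
    DimensionScore("Format", 5, 5, "JSON schema matches the spec exactly."),
]
print(format_matrix("A", rows))
```

Keeping the rationale next to the score makes it harder to record a number without the evidence behind it.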
For each output:
If evaluating multiple outputs:
Output a Markdown report with:
## Evaluation Report
**Task**: <original prompt summary>
**Rubric**: <rubric name/version>
**Outputs evaluated**: <count>
### Scores
| Output | Correctness | Format | Clarity | ... | Total | Pass? |
|--------|-------------|--------|---------|-----|-------|-------|
| A | 4 | 5 | 3 | ... | 3.9 | YES |
| B | 2 | 3 | 2 | ... | 2.4 | NO |
### Key findings
- <summary of strengths across outputs>
- <summary of common weaknesses>
- <recommended next action>
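The Scores table in the template above could be generated along these lines. This is a minimal sketch under stated assumptions: `render_scores_table` is a hypothetical helper, the total is an unweighted mean, and the pass threshold of 3.5 is invented for the example. The rubric in use may define weights and a different threshold, so the totals here will not necessarily match the sample table.

```python
def render_scores_table(scores, dimensions, pass_threshold=3.5):
    """Build a Markdown table with one row per output.

    scores: {output_id: {dimension: score}}
    pass_threshold: assumed cut-off; take the real value from the rubric.
    """
    header = "| Output | " + " | ".join(dimensions) + " | Total | Pass? |"
    # One divider cell per column: Output, each dimension, Total, Pass?
    divider = "|" + "|".join(["---"] * (len(dimensions) + 3)) + "|"
    rows = [header, divider]
    for output_id, dims in scores.items():
        values = [dims[d] for d in dimensions]
        total = sum(values) / len(values)  # unweighted mean (assumption)
        verdict = "YES" if total >= pass_threshold else "NO"
        cells = [output_id] + [str(v) for v in values] + [f"{total:.1f}", verdict]
        rows.append("| " + " | ".join(cells) + " |")
    return "\n".join(rows)


scores = {
    "A": {"Correctness": 4, "Format": 5, "Clarity": 3},
    "B": {"Correctness": 2, "Format": 3, "Clarity": 2},
}
print(render_scores_table(scores, ["Correctness", "Format", "Clarity"]))
```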
For each failing output, list the top 2 dimensions dragging the score down and suggest concrete rewrites or prompt improvements that would raise those scores.
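Picking the two dimensions to focus on can be sketched as below. `weakest_dimensions` is a hypothetical helper that ranks by raw score only; if the rubric weights its dimensions, ranking by weighted shortfall may be the better choice.

```python
def weakest_dimensions(dims, n=2):
    """Return the n lowest-scoring dimension names.

    dims: {dimension: score} for a single output.
    sorted() is stable, so tied dimensions keep their rubric order.
    """
    return sorted(dims, key=lambda d: dims[d])[:n]


failing = {"Correctness": 2, "Format": 3, "Clarity": 2}
print(weakest_dimensions(failing))
```

The returned names can then be paired with the rationales recorded during scoring to draft the concrete rewrite or prompt suggestions.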