From eval-framework
Compare two LLM outputs on the same evaluation criteria and recommend a winner with justification. Use this skill when asked to "compare these outputs", "which response is better", "A/B eval", or "pick the best candidate".
```
npx claudepluginhub ats-kinoshita-iso/agent-workshop --plugin eval-framework
```

This skill uses the workspace's default tool permissions.
Compare two candidate outputs on shared evaluation criteria and produce a justified recommendation.
Read or request the two candidate outputs and an evaluation rubric (use /eval-design to create one, or define ad-hoc dimensions). If no rubric exists, generate a minimal one on the spot:
Score Output A across all dimensions first, then score Output B. Do NOT read Output B while scoring Output A — this prevents anchoring bias.
Record scores in a comparison table:
| Dimension | Weight | Score A | Score B | Notes |
|---|---|---|---|---|
| ... | ...% | ... | ... | ... |
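The weighted totals behind this table can be computed mechanically. A minimal Python sketch, where the dimension names, weights, and scores are illustrative placeholders rather than anything prescribed by the skill:

```python
# Illustrative rubric: dimension -> (weight, score for A, score for B).
# All names and numbers here are hypothetical examples.
rubric = {
    "Correctness":       (0.40, 4.5, 3.0),
    "Completeness":      (0.20, 4.0, 3.5),
    "Clarity":           (0.20, 3.0, 4.0),
    "Format compliance": (0.20, 3.0, 4.0),
}

def weighted_total(output_index):
    """Sum weight * score for one output (index 1 = A, index 2 = B)."""
    return sum(row[0] * row[output_index] for row in rubric.values())

total_a = weighted_total(1)
total_b = weighted_total(2)
print(f"A: {total_a:.2f}  B: {total_b:.2f}")
```

Keeping weights normalized to sum to 1 makes the two totals directly comparable on the rubric's score scale.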
For each dimension, mark which output wins (A, B, or Tie):
Report trade-offs explicitly when one output wins on some dimensions but loses on others:
Output A is stronger on: Correctness (+1.5 pts), Completeness (+0.5 pts)
Output B is stronger on: Clarity (+1.0 pts), Format compliance (+1.0 pts)
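The per-dimension point differences quoted above are just weighted score deltas. A sketch of that arithmetic (dimension names, weights, and scores are hypothetical examples):

```python
# Per-dimension weighted deltas: positive means A is stronger on that
# dimension. All weights and scores below are hypothetical examples.
weights  = {"Correctness": 0.4, "Completeness": 0.2, "Clarity": 0.2, "Format compliance": 0.2}
scores_a = {"Correctness": 4.5, "Completeness": 4.0, "Clarity": 3.0, "Format compliance": 3.0}
scores_b = {"Correctness": 3.0, "Completeness": 3.5, "Clarity": 4.0, "Format compliance": 4.0}

deltas = {d: round(w * (scores_a[d] - scores_b[d]), 2) for d, w in weights.items()}
for dim, delta in deltas.items():
    side = "A" if delta > 0 else "B" if delta < 0 else "Tie"
    print(f"{dim}: {side} stronger by {abs(delta)} weighted pts")
```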
Ask the user: which dimensions matter most for this use case? If the user provides a priority, recompute with adjusted weights and verify the winner is unchanged.
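The re-check under user-adjusted weights can be sketched as a small helper. The scores and the two weight maps below are hypothetical; the second map stands in for whatever priorities the user states:

```python
# Hypothetical per-dimension scores for the two candidate outputs.
scores_a = {"Correctness": 4.5, "Clarity": 3.0}
scores_b = {"Correctness": 3.0, "Clarity": 4.0}

def winner(weights):
    """Return 'A', 'B', or 'Tie' under a weight map (weights sum to 1)."""
    total_a = sum(w * scores_a[d] for d, w in weights.items())
    total_b = sum(w * scores_b[d] for d, w in weights.items())
    if abs(total_a - total_b) < 1e-9:
        return "Tie"
    return "A" if total_a > total_b else "B"

default_winner  = winner({"Correctness": 0.5, "Clarity": 0.5})
adjusted_winner = winner({"Correctness": 0.8, "Clarity": 0.2})  # user prioritizes correctness
print(f"default: {default_winner}, adjusted: {adjusted_winner}")
if default_winner != adjusted_winner:
    print("Note: the recommendation flips under the adjusted weights")
```

If the winner does flip, report the flip and the weight threshold rather than silently switching recommendations.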
Write a concise recommendation:
```markdown
## Recommendation: Output <A|B>

**Winner**: Output <A|B> (score: <X.X> vs <Y.Y>)

**Primary reasons**:
- <reason 1 — cite specific text from the winning output>
- <reason 2>

**Caveats**:
- <where the losing output does better, if relevant>
- <conditions under which the recommendation would flip>

**Suggested improvement for the winner**:
- <one concrete change that would make the winning output even better>
```
Before presenting, run a quick sanity check: