From evaluation
Guides A/B testing, side-by-side comparisons, preference ranking, paired comparisons, and Elo ratings for evaluating AI outputs and detecting subtle quality differences missed by absolute scores.
Slash command
/evaluation:comparative-evaluation
Absolute quality scores are useful but limited. Comparative evaluation — putting outputs side by side and asking which is better — often reveals quality differences that rubrics miss.
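As a rough illustration of how side-by-side judgments can be turned into a ranking, here is a minimal Elo sketch. It is not taken from the skill itself: the function names, the starting rating of 1000, and the K-factor of 32 are conventional choices assumed for the example, and the variant labels are hypothetical.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_from_comparisons(comparisons, k: float = 32.0, initial: float = 1000.0):
    """Compute Elo ratings from pairwise preference judgments.

    comparisons: iterable of (a, b, outcome) where outcome is
        1.0 if a was preferred, 0.0 if b was preferred, 0.5 for a tie.
    k and initial are assumed defaults, not values from the skill.
    """
    ratings = {}
    for a, b, outcome in comparisons:
        r_a = ratings.setdefault(a, initial)
        r_b = ratings.setdefault(b, initial)
        exp_a = expected_score(r_a, r_b)
        # Zero-sum update: whatever A gains, B loses.
        delta = k * (outcome - exp_a)
        ratings[a] = r_a + delta
        ratings[b] = r_b - delta
    return ratings


# Hypothetical judgments between three prompt variants.
judgments = [
    ("variant_a", "variant_b", 1.0),
    ("variant_b", "variant_c", 1.0),
    ("variant_a", "variant_c", 0.5),
]
ranking = sorted(elo_from_comparisons(judgments).items(),
                 key=lambda item: item[1], reverse=True)
for name, rating in ranking:
    print(f"{name}: {rating:.1f}")
```

Sequential Elo updates depend on the order in which judgments arrive, so with a small number of comparisons it is worth shuffling the order and averaging over several passes before reading much into small rating gaps.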
A/B testing AI is different from A/B testing UI:
For human evaluation of AI outputs:
npx claudepluginhub owl-listener/ai-design-skills --plugin evaluation
Implements LLM-as-judge techniques for evaluating LLM outputs via direct scoring, pairwise comparison, rubrics, and bias mitigation including position and length bias.
Builds production-grade LLM-as-judge evaluation systems: direct scoring, pairwise comparison, rubric calibration, bias mitigation, and confidence scoring.