From evaluation
Guides A/B testing, side-by-side comparisons, preference ranking, paired comparisons, and Elo ratings for evaluating AI outputs and detecting subtle quality differences missed by absolute scores.
npx claudepluginhub owl-listener/ai-design-skills --plugin evaluation
This skill uses the workspace's default tool permissions.
Absolute quality scores are useful but limited. Comparative evaluation — putting outputs side by side and asking which is better — often reveals quality differences that rubrics miss.
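A minimal sketch of what a pairwise comparison looks like in code; `call_judge`, `compare`, and the prompt wording are illustrative placeholders, not part of the skill itself:

```python
# Pairwise comparison sketch. `call_judge` stands in for whatever LLM
# client you use; everything here is illustrative.

def call_judge(prompt: str) -> str:
    """Placeholder: send `prompt` to your judge model and return its reply."""
    raise NotImplementedError("plug in your LLM client here")

def compare(task: str, output_a: str, output_b: str) -> str:
    """Ask the judge which output better satisfies the task. Returns 'A', 'B', or 'TIE'."""
    prompt = (
        "You are judging two responses to the same task.\n"
        f"Task: {task}\n\n"
        f"Response A:\n{output_a}\n\n"
        f"Response B:\n{output_b}\n\n"
        "Which response is better overall? Answer with exactly one of: A, B, TIE."
    )
    verdict = call_judge(prompt).strip().upper()
    return verdict if verdict in {"A", "B", "TIE"} else "TIE"  # treat malformed replies as ties
```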
Implements LLM-as-a-Judge techniques: direct scoring, pairwise comparison, rubric generation, and bias mitigations for position, length, and verbosity. For building evaluation systems, comparing model outputs, and setting AI quality standards in automated pipelines.
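Position bias, where the judge favors whichever response appears first, is the most common failure mode. A standard mitigation is to judge each pair twice with the order swapped and only accept verdicts that agree; this sketch builds on the hypothetical `compare` helper above:

```python
def compare_debiased(task: str, output_a: str, output_b: str) -> str:
    """Judge the pair in both orders; only a consistent verdict counts as a win."""
    first = compare(task, output_a, output_b)    # A shown first
    second = compare(task, output_b, output_a)   # B shown first
    second_flipped = {"A": "B", "B": "A", "TIE": "TIE"}[second]
    if first == second_flipped:
        return first   # both orderings agree
    return "TIE"       # disagreement usually signals position bias
    # Length/verbosity bias is handled separately, e.g. by instructing the
    # judge to ignore response length in the comparison prompt.
```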
A/B testing AI is different from A/B testing UI:
For human evaluation of AI outputs:
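Paired comparisons, whether from human raters or an LLM judge, can be aggregated into Elo ratings so that small but consistent preferences show up as rating gaps. A minimal sketch of the standard Elo update; the K-factor and starting rating are conventional defaults, not values prescribed by the skill:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """score_a is 1.0 if A won, 0.0 if B won, 0.5 for a tie."""
    expected_a = expected_score(rating_a, rating_b)
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b

# Usage: start every model at the same rating and fold in each judgment.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
ratings["model_a"], ratings["model_b"] = elo_update(
    ratings["model_a"], ratings["model_b"], score_a=1.0
)
```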