Provides rubrics and guidelines for evaluating AI outputs on accuracy, relevance, completeness, helpfulness, clarity, tone appropriateness, and safety. Includes weighting, calibration, and design artifacts.
```
npx claudepluginhub owl-listener/ai-design-skills --plugin evaluation
```

This skill uses the workspace's default tool permissions.
Without a rubric, quality evaluation is subjective and inconsistent. A rubric defines what "good" means in concrete, measurable terms — so different evaluators reach the same conclusions.
For each dimension, define a scale where every level is anchored to a concrete description. Example: Accuracy scored 1-5, with each score mapped to an explicit statement of what that level looks like, as sketched below.
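As a hedged illustration, an anchored Accuracy scale might look like the following Python snippet. The anchor wording is an assumption for this sketch, not the skill's official text:

```python
# Hypothetical anchored scale for one rubric dimension. The level
# descriptions are illustrative assumptions, not prescribed anchors.
ACCURACY_SCALE = {
    1: "Mostly incorrect; contains major factual errors",
    2: "Significant errors that would mislead a reader",
    3: "Generally correct with minor inaccuracies",
    4: "Accurate; at most trivial imprecision",
    5: "Fully accurate and verifiable against sources",
}

def describe(dimension: str, score: int, scale: dict[int, str]) -> str:
    """Map a numeric score back to its concrete anchor text."""
    return f"{dimension} = {score}: {scale[score]}"

print(describe("Accuracy", 4, ACCURACY_SCALE))
```

Anchoring each number to a description is what makes the scale measurable rather than a gut feeling: two evaluators looking at the same output should land on the same anchor.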
Not all dimensions matter equally for every use case: assign each dimension a weight that reflects its importance, then report the weighted average of the per-dimension scores as the overall score, as in the sketch below.
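A minimal sketch of weighted aggregation, assuming every dimension is scored on the same 1-5 scale. The weights shown are placeholders, not recommendations:

```python
# Hypothetical weights; tune per use case and keep the sum at 1.0.
WEIGHTS = {
    "accuracy": 0.30,
    "relevance": 0.20,
    "completeness": 0.15,
    "helpfulness": 0.15,
    "clarity": 0.10,
    "tone": 0.05,
    "safety": 0.05,
}

def weighted_score(scores: dict[str, int]) -> float:
    """Combine per-dimension 1-5 scores into a single weighted 1-5 score."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

print(weighted_score({
    "accuracy": 4, "relevance": 5, "completeness": 3,
    "helpfulness": 4, "clarity": 5, "tone": 5, "safety": 5,
}))  # -> 4.25
```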
A rubric is only useful if evaluators use it consistently: calibrate by having evaluators score the same sample set independently, measuring how often they agree, and tightening anchor wording wherever they diverge, as sketched below.
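One simple calibration check, assuming two evaluators have scored the same outputs on one dimension. This uses a plain agreement rate as a sketch; it is not the skill's prescribed calibration method:

```python
# Hypothetical scores from two evaluators on the same five outputs.
rater_a = [4, 3, 5, 2, 4]
rater_b = [4, 4, 5, 2, 3]

def agreement_rate(a: list[int], b: list[int], tolerance: int = 0) -> float:
    """Fraction of items where the two scores agree within a tolerance."""
    assert len(a) == len(b), "both raters must score the same items"
    hits = sum(abs(x - y) <= tolerance for x, y in zip(a, b))
    return hits / len(a)

print(agreement_rate(rater_a, rater_b))               # exact agreement: 0.6
print(agreement_rate(rater_a, rater_b, tolerance=1))  # within one point: 1.0
```

Low exact agreement with high within-one-point agreement usually means the anchors are roughly right but the boundaries between adjacent levels need sharper wording.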