Provides rubrics and guidelines for evaluating AI outputs on accuracy, relevance, completeness, helpfulness, clarity, tone appropriateness, and safety. Includes weighting, calibration, and design artifacts.
```
npx claudepluginhub owl-listener/ai-design-skills --plugin evaluation
```

This skill uses the workspace's default tool permissions.
Without a rubric, quality evaluation is subjective and inconsistent. A rubric defines what "good" means in concrete, measurable terms — so different evaluators reach the same conclusions.
For each dimension, define a scale where every level is anchored to a concrete description. Example: Accuracy scored 1-5, with each score mapped to an explicit statement of what that level looks like, as sketched below.
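As a hedged illustration, an anchored Accuracy scale might look like the following Python snippet. The anchor wording is an assumption for this sketch, not the skill's official text:

```python
# Hypothetical anchored scale for one rubric dimension. The level
# descriptions are illustrative assumptions, not prescribed anchors.
ACCURACY_SCALE = {
    1: "Mostly incorrect; contains major factual errors",
    2: "Significant errors that would mislead a reader",
    3: "Generally correct with minor inaccuracies",
    4: "Accurate; at most trivial imprecision",
    5: "Fully accurate and verifiable against sources",
}

def describe(dimension: str, score: int, scale: dict[int, str]) -> str:
    """Map a numeric score back to its concrete anchor text."""
    return f"{dimension} = {score}: {scale[score]}"

print(describe("Accuracy", 4, ACCURACY_SCALE))
```

Anchoring each number to a description is what makes the scale measurable rather than a gut feeling: two evaluators looking at the same output should land on the same anchor.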
Not all dimensions matter equally for every use case: assign each dimension a weight that reflects its importance, then report the weighted average of the per-dimension scores as the overall score, as in the sketch below.
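A minimal sketch of weighted aggregation, assuming every dimension is scored on the same 1-5 scale. The weights shown are placeholders, not recommendations:

```python
# Hypothetical weights; tune per use case and keep the sum at 1.0.
WEIGHTS = {
    "accuracy": 0.30,
    "relevance": 0.20,
    "completeness": 0.15,
    "helpfulness": 0.15,
    "clarity": 0.10,
    "tone": 0.05,
    "safety": 0.05,
}

def weighted_score(scores: dict[str, int]) -> float:
    """Combine per-dimension 1-5 scores into a single weighted 1-5 score."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

print(weighted_score({
    "accuracy": 4, "relevance": 5, "completeness": 3,
    "helpfulness": 4, "clarity": 5, "tone": 5, "safety": 5,
}))  # -> 4.25
```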
A rubric is only useful if evaluators use it consistently: calibrate by having evaluators score the same sample set independently, measuring how often they agree, and tightening anchor wording wherever they diverge, as sketched below.
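One simple calibration check, assuming two evaluators have scored the same outputs on one dimension. This uses a plain agreement rate as a sketch; it is not the skill's prescribed calibration method:

```python
# Hypothetical scores from two evaluators on the same five outputs.
rater_a = [4, 3, 5, 2, 4]
rater_b = [4, 4, 5, 2, 3]

def agreement_rate(a: list[int], b: list[int], tolerance: int = 0) -> float:
    """Fraction of items where the two scores agree within a tolerance."""
    assert len(a) == len(b), "both raters must score the same items"
    hits = sum(abs(x - y) <= tolerance for x, y in zip(a, b))
    return hits / len(a)

print(agreement_rate(rater_a, rater_b))               # exact agreement: 0.6
print(agreement_rate(rater_a, rater_b, tolerance=1))  # within one point: 1.0
```

Low exact agreement with high within-one-point agreement usually means the anchors are roughly right but the boundaries between adjacent levels need sharper wording.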