Use this skill when the user asks to "design a human evaluation", "human eval process", "annotation guidelines", "how to set up human review of AI outputs", "how to get humans to evaluate AI quality", "build a labeling process", "create annotation criteria", or wants to set up a structured process for humans to evaluate AI output quality.
Install: `npx claudepluginhub productfculty-aipm/pm-copilot-by-product-faculty`. This skill uses the workspace's default tool permissions.
You are designing a human evaluation process for AI outputs — the gold standard for evaluating quality that LLM-as-judge systems are calibrated against. Human evals are slower and more expensive, but they're the source of ground truth for every other evaluation method.
Framework: Hamel Husain + Shreya Shankar (Building eval systems, 2025), annotation methodology best practices.
Read memory/user-profile.md for the AI feature being evaluated. Understand: what are the failure categories from error analysis? What is the principal domain expert's quality bar?
Run human evals when:
Do NOT run human evals when:
Binary annotation (Recommended for most cases): Each evaluator sees one (input, output) pair and marks it Thumbs Up (Good) or Thumbs Down (Bad), plus one mandatory free-text reason for every "Bad" selection (required for the error analysis feedback loop).
Pros: Fast (< 30 seconds per annotation), high agreement, easy to aggregate. Cons: No nuance — doesn't tell you HOW bad something is.
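As a sketch of how binary annotations roll up, you can aggregate them into a pass rate plus a tally of "Bad" reasons that feeds error analysis (the record fields and reason strings below are hypothetical, not from the source):

```python
from collections import Counter

# Hypothetical annotation records: (input_id, label, reason).
annotations = [
    ("q1", "good", None),
    ("q2", "bad", "hallucinated metric"),
    ("q3", "good", None),
    ("q4", "bad", "hallucinated metric"),
]

def summarize(annotations):
    """Return the pass rate and a tally of 'Bad' reasons for error analysis."""
    total = len(annotations)
    good = sum(1 for _, label, _ in annotations if label == "good")
    reasons = Counter(r for _, label, r in annotations if label == "bad")
    return good / total, reasons

pass_rate, bad_reasons = summarize(annotations)
```

The reason tally is the bridge back to error analysis: the most frequent reasons become (or confirm) your failure categories.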
Rubric annotation (Use when more granularity is needed): Each evaluator rates 3–5 criteria on a 1–5 scale.
Example rubric for a PRD-writing feature:
| Criterion | 1 (Poor) | 3 (Acceptable) | 5 (Excellent) |
|---|---|---|---|
| Problem clarity | Vague or missing | Present but imprecise | Sharp, specific, evidence-based |
| User story quality | Generic or feature-framed | Adequate | Demand-side, outcome-oriented |
| Acceptance criteria | Missing or not testable | Partially testable | All binary and independently testable |
| Success metrics | Output-oriented | Mixed | All outcome-oriented and measurable |
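Rubric scores are usually averaged per criterion so you can see which dimension is dragging quality down. A minimal sketch, assuming each annotation is a dict of 1–5 ratings keyed by criterion (the keys mirror the example table but are assumptions):

```python
from statistics import mean

# Hypothetical 1-5 ratings from two annotations of the same feature.
ratings = [
    {"problem_clarity": 4, "acceptance_criteria": 2},
    {"problem_clarity": 5, "acceptance_criteria": 3},
]

def criterion_means(ratings):
    """Average each criterion across annotations to spot weak dimensions."""
    criteria = ratings[0].keys()
    return {c: mean(r[c] for r in ratings) for c in criteria}

scores = criterion_means(ratings)
```

A low per-criterion mean (here, acceptance criteria) tells you where to focus prompt or pipeline fixes, which binary labels alone cannot.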
Write clear, specific annotation guidelines. The most common source of low agreement is ambiguous guidelines.
For each criterion, define:
Golden examples: Before the actual annotation, provide 5–10 examples with the "correct" answer labeled. This calibrates the annotator to the quality bar before they start.
Inter-annotator agreement: If using multiple annotators, measure the percentage of cases where both annotators assign the same label.
≥ 85% agreement: The guidelines are clear and the task is well-defined
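Raw percent agreement can look inflated when one label dominates, so it is common to also report Cohen's kappa, which corrects for chance agreement. A minimal sketch for binary labels (the "good"/"bad" strings are illustrative):

```python
def agreement_and_kappa(labels_a, labels_b):
    """Raw agreement plus Cohen's kappa for two annotators' binary labels."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both independently pick the same label.
    p_a = labels_a.count("good") / n
    p_b = labels_b.count("good") / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return observed, kappa

obs, kappa = agreement_and_kappa(
    ["good", "good", "bad", "good"],
    ["good", "bad", "bad", "good"],
)
```

If raw agreement is high but kappa is low, your label distribution is skewed and the "agreement" is mostly chance; tighten the guidelines before trusting the numbers.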
When to use one annotator vs. two:
For different scales:
Under 200 annotations/week: Google Sheets or Notion database. Simple, free, easy to set up.
200–2,000 annotations/week: Label Studio (free, open source) or Argilla. More structure, better workflows.
2,000+ annotations/week: Scale AI, Labelbox, or similar professional annotation platforms.
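For the Label Studio tier, the binary thumbs up/down task can be expressed as a labeling config along these lines (a sketch, assuming each task's data carries an `output` field; the element names shown are standard Label Studio tags, but the field names are illustrative):

```xml
<View>
  <Text name="model_output" value="$output"/>
  <Choices name="verdict" toName="model_output" choice="single" required="true">
    <Choice value="Good"/>
    <Choice value="Bad"/>
  </Choices>
  <TextArea name="reason" toName="model_output"
            placeholder="Required when Bad: what failed?"/>
</View>
```

The reason field is shown unconditionally here; enforcing it only for "Bad" selections is left to the review workflow.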
The annotation process is only valuable if it feeds back into improvement:
This is the continuous improvement flywheel.
Produce: