Use this skill when the user asks to "design a human evaluation", "human eval process", "annotation guidelines", "how to set up human review of AI outputs", "how to get humans to evaluate AI quality", "build a labeling process", "create annotation criteria", or wants to set up a structured process for humans to evaluate AI output quality.
Install: `npx claudepluginhub productfculty-aipm/pm-copilot-by-product-faculty`. This skill uses the workspace's default tool permissions.
You are designing a human evaluation process for AI outputs — the gold standard for evaluating quality that LLM-as-judge systems are calibrated against. Human evals are slower and more expensive, but they're the source of ground truth for every other evaluation method.
Framework: Hamel Husain + Shreya Shankar (Building eval systems, 2025), annotation methodology best practices.
Read memory/user-profile.md for the AI feature being evaluated. Understand: what are the failure categories from error analysis? What is the principal domain expert's quality bar?
Run human evals when:
Do NOT run human evals when:
Binary annotation (Recommended for most cases): Each evaluator sees one (input, output) pair and marks it Thumbs Up (Good) or Thumbs Down (Bad), plus one mandatory free-text reason for every "Bad" selection (required for the error analysis feedback loop).
Pros: Fast (< 30 seconds per annotation), high agreement, easy to aggregate. Cons: No nuance — doesn't tell you HOW bad something is.
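As a sketch of how binary annotations roll up, you can aggregate them into a pass rate plus a tally of "Bad" reasons that feeds error analysis (the record fields and reason strings below are hypothetical, not from the source):

```python
from collections import Counter

# Hypothetical annotation records: (input_id, label, reason).
annotations = [
    ("q1", "good", None),
    ("q2", "bad", "hallucinated metric"),
    ("q3", "good", None),
    ("q4", "bad", "hallucinated metric"),
]

def summarize(annotations):
    """Return the pass rate and a tally of 'Bad' reasons for error analysis."""
    total = len(annotations)
    good = sum(1 for _, label, _ in annotations if label == "good")
    reasons = Counter(r for _, label, r in annotations if label == "bad")
    return good / total, reasons

pass_rate, bad_reasons = summarize(annotations)
```

The reason tally is the bridge back to error analysis: the most frequent reasons become (or confirm) your failure categories.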
Rubric annotation (Use when more granularity is needed): Each evaluator rates 3–5 criteria on a 1–5 scale.
Example rubric for a PRD-writing feature:
| Criterion | 1 (Poor) | 3 (Acceptable) | 5 (Excellent) |
|---|---|---|---|
| Problem clarity | Vague or missing | Present but imprecise | Sharp, specific, evidence-based |
| User story quality | Generic or feature-framed | Adequate | Demand-side, outcome-oriented |
| Acceptance criteria | Missing or not testable | Partially testable | All binary and independently testable |
| Success metrics | Output-oriented | Mixed | All outcome-oriented and measurable |
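Rubric scores are usually averaged per criterion so you can see which dimension is dragging quality down. A minimal sketch, assuming each annotation is a dict of 1–5 ratings keyed by criterion (the keys mirror the example table but are assumptions):

```python
from statistics import mean

# Hypothetical 1-5 ratings from two annotations of the same feature.
ratings = [
    {"problem_clarity": 4, "acceptance_criteria": 2},
    {"problem_clarity": 5, "acceptance_criteria": 3},
]

def criterion_means(ratings):
    """Average each criterion across annotations to spot weak dimensions."""
    criteria = ratings[0].keys()
    return {c: mean(r[c] for r in ratings) for c in criteria}

scores = criterion_means(ratings)
```

A low per-criterion mean (here, acceptance criteria) tells you where to focus prompt or pipeline fixes, which binary labels alone cannot.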
Write clear, specific annotation guidelines. The most common source of low agreement is ambiguous guidelines.
For each criterion, define:
Golden examples: Before the actual annotation, provide 5–10 examples with the "correct" answer labeled. This calibrates the annotator to the quality bar before they start.
Inter-annotator agreement: If using multiple annotators, measure the percentage of cases where both annotators assign the same label.
≥ 85% agreement: The guidelines are clear and the task is well-defined
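Raw percent agreement can look inflated when one label dominates, so it is common to also report Cohen's kappa, which corrects for chance agreement. A minimal sketch for binary labels (the "good"/"bad" strings are illustrative):

```python
def agreement_and_kappa(labels_a, labels_b):
    """Raw agreement plus Cohen's kappa for two annotators' binary labels."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both independently pick the same label.
    p_a = labels_a.count("good") / n
    p_b = labels_b.count("good") / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return observed, kappa

obs, kappa = agreement_and_kappa(
    ["good", "good", "bad", "good"],
    ["good", "bad", "bad", "good"],
)
```

If raw agreement is high but kappa is low, your label distribution is skewed and the "agreement" is mostly chance; tighten the guidelines before trusting the numbers.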
When to use one annotator vs. two:
For different scales:
Under 200 annotations/week: Google Sheets or Notion database. Simple, free, easy to set up.
200–2,000 annotations/week: Label Studio (free, open source) or Argilla. More structure, better workflows.
2,000+ annotations/week: Scale AI, Labelbox, or similar professional annotation platforms.
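For the Label Studio tier, the binary thumbs up/down task can be expressed as a labeling config along these lines (a sketch, assuming each task's data carries an `output` field; the element names shown are standard Label Studio tags, but the field names are illustrative):

```xml
<View>
  <Text name="model_output" value="$output"/>
  <Choices name="verdict" toName="model_output" choice="single" required="true">
    <Choice value="Good"/>
    <Choice value="Bad"/>
  </Choices>
  <TextArea name="reason" toName="model_output"
            placeholder="Required when Bad: what failed?"/>
</View>
```

The reason field is shown unconditionally here; enforcing it only for "Bad" selections is left to the review workflow.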
The annotation process is only valuable if it feeds back into improvement:
This is the continuous improvement flywheel.
Produce: