From pm-copilot
Use this skill when the user asks to "design a human evaluation", "human eval process", "annotation guidelines", "how to set up human review of AI outputs", "how to get humans to evaluate AI quality", "build a labeling process", "create annotation criteria", or wants to set up a structured process for humans to evaluate AI output quality.
npx claudepluginhub productfculty-aipm/pm-copilot-by-product-faculty
How this skill is triggered: slash command
/pm-copilot:human-eval-design
Summary Claude sees in its skill listing, used to decide when to auto-load this skill:
You are designing a human evaluation process for AI outputs — the gold standard for evaluating quality that LLM-as-judge systems are calibrated against. Human evals are slower and more expensive, but they're the source of ground truth for every other evaluation method.
Framework: Hamel Husain + Shreya Shankar (Building eval systems, 2025), annotation methodology best practices.
Read memory/user-profile.md for the AI feature being evaluated. Understand: what are the failure categories from error analysis? What is the principal domain expert's quality bar?
Run human evals:
Do NOT run human evals:
Binary annotation (Recommended for most cases): Each evaluator sees one (input, output) pair and marks it Thumbs Up (Good) or Thumbs Down (Bad). Require one reason for every "Bad" selection; the error analysis feedback loop depends on it.
Pros: Fast (< 30 seconds per annotation), high agreement, easy to aggregate. Cons: No nuance — doesn't tell you HOW bad something is.
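A binary annotation record can be sketched as below. This is a minimal illustration rather than part of the skill itself: the class name `BinaryAnnotation`, its field names, and the `pass_rate` helper are all hypothetical. The only rule it encodes from the text is that a "Bad" mark must carry a reason.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BinaryAnnotation:
    """One evaluator's judgment of a single (input, output) pair (hypothetical schema)."""
    input_text: str
    output_text: str
    is_good: bool                 # True = Thumbs Up, False = Thumbs Down
    reason: Optional[str] = None  # mandatory only when is_good is False

    def __post_init__(self) -> None:
        # Enforce the rule from the text: every "Bad" selection needs a reason.
        if not self.is_good and not (self.reason and self.reason.strip()):
            raise ValueError('a "Bad" annotation requires a reason')


def pass_rate(annotations: list[BinaryAnnotation]) -> float:
    """Share of annotations marked Good, as a fraction between 0 and 1."""
    if not annotations:
        raise ValueError("no annotations to aggregate")
    return sum(a.is_good for a in annotations) / len(annotations)
```

Rejecting a reasonless "Bad" at construction time keeps incomplete records out of the dataset, so every failure can later be grouped by reason during error analysis.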
Rubric annotation (Use when more granularity is needed): Each evaluator rates 3–5 criteria on a 1–5 scale.
Example rubric for a PRD-writing feature:
| Criterion | 1 (Poor) | 3 (Acceptable) | 5 (Excellent) |
|---|---|---|---|
| Problem clarity | Vague or missing | Present but imprecise | Sharp, specific, evidence-based |
| User story quality | Generic or feature-framed | Adequate | Demand-side, outcome-oriented |
| Acceptance criteria | Missing or not testable | Partially testable | All binary and independently testable |
| Success metrics | Output-oriented | Mixed | All outcome-oriented and measurable |
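Rubric ratings like the ones in the table can be validated and averaged per criterion. The sketch below assumes each rating is stored as a dictionary keyed by criterion; the snake_case keys and the function names are hypothetical, chosen only to mirror the four example criteria.

```python
from statistics import mean

# Criteria from the example rubric; the snake_case keys are hypothetical.
CRITERIA = (
    "problem_clarity",
    "user_story_quality",
    "acceptance_criteria",
    "success_metrics",
)


def validate_rating(rating: dict[str, int]) -> None:
    """Check that one evaluator scored every criterion on the 1-5 scale."""
    missing = set(CRITERIA) - set(rating)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    for name in CRITERIA:
        if not 1 <= rating[name] <= 5:
            raise ValueError(f"{name} must be 1-5, got {rating[name]}")


def mean_by_criterion(ratings: list[dict[str, int]]) -> dict[str, float]:
    """Average each criterion across all ratings (at least one rating required)."""
    if not ratings:
        raise ValueError("no ratings to aggregate")
    for rating in ratings:
        validate_rating(rating)
    return {name: mean(r[name] for r in ratings) for name in CRITERIA}
```

Reporting a mean per criterion, rather than one blended score, keeps it visible which dimension of the output is weak.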
Write clear, specific annotation guidelines. The most common source of low agreement is ambiguous guidelines.
For each criterion, define:
Golden examples: Before the actual annotation, provide 5–10 examples with the "correct" answer labeled. This calibrates the annotator to the quality bar before they start.
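One way to use the golden examples is to score a new annotator against them before real annotation begins. The sketch below assumes binary labels keyed by an example ID; the function name `calibration_report` and the ID scheme are hypothetical.

```python
def calibration_report(
    golden: dict[str, bool], annotator: dict[str, bool]
) -> tuple[float, list[str]]:
    """Compare an annotator's labels with the golden labels.

    Returns (accuracy, sorted IDs of examples where the annotator
    disagreed with the golden label).
    """
    if not golden:
        raise ValueError("golden set is empty")
    unlabeled = set(golden) - set(annotator)
    if unlabeled:
        raise ValueError(f"annotator skipped examples: {sorted(unlabeled)}")
    misses = sorted(
        ex_id for ex_id, label in golden.items() if annotator[ex_id] != label
    )
    accuracy = (len(golden) - len(misses)) / len(golden)
    return accuracy, misses
```

The list of missed IDs is the more useful half of the result: reviewing those examples with the annotator shows where their reading of the guidelines differs from the intended quality bar.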
Inter-annotator agreement: If using multiple annotators, measure the percentage of cases where both annotators agree.
85% agreement or higher: The guidelines are clear and the task is well-defined
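Percent agreement between two annotators can be computed as below. This is a minimal sketch that assumes both annotators labeled the same items in the same order. The function names are hypothetical, and treating 85% as a minimum target is an assumption drawn from the figure mentioned in the text.

```python
AGREEMENT_TARGET = 85.0  # assumption: the 85% figure is treated as a minimum


def percent_agreement(labels_a: list[bool], labels_b: list[bool]) -> float:
    """Percentage of items on which both annotators gave the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    if not labels_a:
        raise ValueError("no items to compare")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100 * matches / len(labels_a)


def meets_target(labels_a: list[bool], labels_b: list[bool]) -> bool:
    """True when agreement reaches the assumed 85% target."""
    return percent_agreement(labels_a, labels_b) >= AGREEMENT_TARGET
```

For example, two annotators who match on 3 of 4 items have 75% agreement, which falls below the assumed target and suggests the guidelines need another pass.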
When to use one annotator vs. two:
For different scales:
Under 200 annotations/week: Google Sheets or Notion database. Simple, free, easy to set up.
200–2,000 annotations/week: Label Studio (free, open source) or Argilla. More structure, better workflows.
2,000+ annotations/week: Scale AI, Labelbox, or similar professional annotation platforms.
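The volume tiers above can be expressed as a small lookup. This is only an illustrative sketch: the function name is hypothetical, and because the text lists 2,000 in both the middle and top tiers, assigning exactly 2,000 annotations per week to the top tier is an assumption.

```python
def suggest_tooling(annotations_per_week: int) -> str:
    """Map weekly annotation volume to the tooling tier listed in the text."""
    if annotations_per_week < 0:
        raise ValueError("volume cannot be negative")
    if annotations_per_week < 200:
        return "Google Sheets or Notion database"
    if annotations_per_week < 2000:  # assumption: exactly 2,000 goes to the top tier
        return "Label Studio or Argilla"
    return "Scale AI, Labelbox, or similar professional annotation platform"
```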
The annotation process is only valuable if it feeds back into improvement:
This is the continuous improvement flywheel.
Produce: