From pm-copilot
Use this skill when the user asks to "design an eval suite", "build evals for my AI feature", "create an evaluation framework", "how do I evaluate my AI", "what evals should I run", "build an eval system", or wants to create a systematic evaluation framework for an AI-powered product feature. Typically run after error-analysis has identified the failure categories to prioritize.
Install: `npx claudepluginhub productfculty-aipm/pm-copilot-by-product-faculty`

This skill uses the workspace's default tool permissions.
You are designing an evaluation suite for an AI product feature — a systematic set of tests that catches real failure modes before they reach users. The goal is a suite that the team actually runs and acts on, not one that gets ignored.
Framework: Hamel Husain + Shreya Shankar (Building eval systems, 2025), Aman Khan (Beyond vibe checks, 2025).
Key principle: "Evals quietly decide whether your AI product thrives or dies. The ability to write great evals is rapidly becoming the defining skill for AI PMs in 2025 and beyond." — Aman Khan, Lenny's Newsletter (2025)
Read the error analysis output (from the error-analysis skill or user input) to understand which failure categories to target. Read memory/user-profile.md for the AI feature context.
For each failure category, select the appropriate eval type:
Type 1 — Code-based evals (deterministic): Best for: failures with objectively correct / incorrect answers, format compliance, and structural checks. Examples: exact-match or regex checks, JSON schema validation, length and formatting limits.
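A code-based eval can be a plain function that asserts structural properties of the output. A minimal sketch in Python, assuming (hypothetically) that the feature must return valid JSON with a non-empty `summary` field:

```python
import json

def eval_json_format(output: str) -> bool:
    """Code-based eval: PASS if the output is valid JSON containing
    a non-empty 'summary' string (hypothetical schema, for illustration)."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data.get("summary"), str) and len(data["summary"]) > 0

# Deterministic: the same output always gets the same verdict.
print(eval_json_format('{"summary": "Refund issued."}'))  # True
print(eval_json_format("not json"))                       # False
```

Because the check is deterministic and fast, it can run on every code change without any model calls.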
Type 2 — Human evals: Best for: subjective quality, domain-specific correctness, complex reasoning, and newly discovered failure categories. Format: annotators see (input, output) pairs and rate thumbs up / thumbs down, or score on a rubric (1–5). Pros: highest accuracy; catches nuanced failures. Cons: slow, expensive, can't scale; requires clear annotation guidelines. Use for: calibration, sampling for quality assurance, training LLM-as-judge.
Type 3 — LLM-as-judge: Best for: subjective quality at scale; failures that require reasoning to detect; when human evals are too slow. Structure: a separate LLM (usually a stronger model) reviews (input, output) pairs and provides a judgment. Pros: scalable; can evaluate complex quality; can explain its reasoning. Cons: not perfectly reliable; needs calibration against human evals; can be biased.
For each top failure category (from error analysis):
Name: [Failure category name]
Eval type: [Code-based / Human / LLM-as-judge]
What to test: [Specific aspect of the output being evaluated]
Test cases needed: [How many? Where do they come from?]
Pass/fail criteria: [What counts as pass? What counts as fail?]
Automation plan: [When does this eval run — on every PR? Daily? Weekly?]
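To make the template concrete, here is a filled-in spec for a hypothetical "malformed JSON output" failure category; every value below is illustrative, not prescribed by the framework:

```python
# Hypothetical eval spec following the template above; all values are illustrative.
eval_spec = {
    "name": "Malformed JSON output",
    "eval_type": "Code-based",
    "what_to_test": "Response parses as JSON and contains the required keys",
    "test_cases": "50 inputs sampled from production logs plus 10 handwritten edge cases",
    "pass_fail": "PASS if json.loads succeeds and required keys are present; FAIL otherwise",
    "automation": "Runs on every PR (Layer 1, pre-commit)",
}

for key, value in eval_spec.items():
    print(f"{key}: {value}")
```

Writing the spec as structured data makes it easy to keep specs next to the eval code itself.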
If using LLM-as-judge, write the judge prompt following best practices:
You are evaluating an AI assistant's response for [failure type].
**Input to the AI assistant:**
{input}
**AI assistant's response:**
{response}
**Evaluation criteria:**
[Criterion 1]: [Clear definition of what good looks like]
[Criterion 2]: [Clear definition of what good looks like]
**Scoring:**
- PASS: The response [specific pass condition]
- FAIL: The response [specific fail condition]
**Your output:**
First, briefly explain your reasoning (1–2 sentences).
Then output: PASS or FAIL
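Operationally, the judge prompt above is filled in per example and the judge's reply is parsed into a verdict. A sketch, assuming a hypothetical `call_model(prompt) -> str` wrapper around whatever LLM API you use:

```python
def parse_verdict(judge_reply: str) -> str:
    """Extract PASS/FAIL from the judge's reply. The verdict is the last
    PASS or FAIL token, since the reasoning is asked for first."""
    verdict = None
    for token in judge_reply.replace("\n", " ").split():
        word = token.strip(".,:;*").upper()
        if word in ("PASS", "FAIL"):
            verdict = word
    if verdict is None:
        raise ValueError("Judge reply contained no PASS/FAIL verdict")
    return verdict

def judge(example: dict, call_model) -> str:
    """Fill the judge prompt for one (input, response) pair and return PASS or FAIL.
    `call_model(prompt) -> str` is a hypothetical wrapper around your LLM API."""
    prompt = (
        f"You are evaluating an AI assistant's response for {example['failure_type']}.\n\n"
        f"**Input to the AI assistant:**\n{example['input']}\n\n"
        f"**AI assistant's response:**\n{example['response']}\n\n"
        "First, briefly explain your reasoning (1-2 sentences).\n"
        "Then output: PASS or FAIL"
    )
    return parse_verdict(call_model(prompt))
```

Parsing the last occurrence of PASS/FAIL tolerates judges that mention the labels while reasoning before stating the final verdict.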
Key principles for judge prompts: use a binary PASS/FAIL verdict rather than a numeric scale; define each criterion concretely enough that two annotators would agree; ask for brief reasoning before the verdict; and calibrate the judge against human labels before trusting its scores.
From Hamel Husain: designate one "benevolent dictator" for quality — one person whose judgment defines what PASS/FAIL means for subjective evals. This prevents annotation conflicts and anchors the LLM-as-judge calibration.
This person writes the annotation guidelines, labels the calibration examples the LLM-as-judge is measured against, and resolves labeling disagreements.
Design the full suite as three layers:
Layer 1 — Pre-commit (fast): Code-based evals only. Run on every code change. Must complete in < 60 seconds. Catches format and structural failures.
Layer 2 — Pre-deploy (medium): Code-based + LLM-as-judge on a representative sample. Run before any deployment. Should complete in < 10 minutes.
Layer 3 — Production monitoring (ongoing): LLM-as-judge on a sample of live outputs + human eval on flagged outputs. Run continuously or weekly.
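The three layers can be wired into a single runner that stops at the first failing layer. A minimal sketch with placeholder checks — the layer contents and time budgets come from the plan above, while the check function names are hypothetical:

```python
import time

def run_layer(name: str, checks, budget_seconds: float) -> bool:
    """Run every check in a layer; report pass count and elapsed time
    against the layer's time budget."""
    start = time.monotonic()
    failures = [check.__name__ for check in checks if not check()]
    elapsed = time.monotonic() - start
    print(f"{name}: {len(checks) - len(failures)}/{len(checks)} passed "
          f"in {elapsed:.1f}s (budget {budget_seconds}s)")
    return not failures and elapsed <= budget_seconds

# Hypothetical placeholder checks; in practice these call the evals above.
def format_check(): return True
def judged_sample_check(): return True

layers = [
    ("Layer 1 - pre-commit", [format_check], 60),
    ("Layer 2 - pre-deploy", [format_check, judged_sample_check], 600),
]

for name, checks, budget in layers:
    if not run_layer(name, checks, budget):
        print(f"Stopping: {name} failed")
        break
```

Gating each layer on the previous one keeps the fast, cheap checks as the first line of defense and reserves model calls for changes that already pass them.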
Produce: