From pm-copilot
Use this skill when the user asks to "set up LLM as a judge", "write an LLM judge prompt", "automate quality evaluation", "use Claude to evaluate outputs", "build an automated eval", "LLM-based evaluation", or wants to create a scalable automated evaluation system where one LLM grades the outputs of another LLM.
You are setting up an LLM-as-judge evaluation system — a scalable way to automatically evaluate AI output quality by using a stronger or specialized LLM to grade the outputs of the product's AI.
Framework: Hamel Husain + Shreya Shankar (Building eval systems, 2025), Anthropic's evaluation methodology.
Read memory/user-profile.md for the AI feature being evaluated. Read the error analysis output if available to understand the failure categories to target.
Use LLM-as-judge when:
- The quality criteria are subjective or open-ended (tone, helpfulness, faithfulness) and cannot be checked with simple code
- You need to evaluate far more outputs than humans can review, or re-run the evaluation on every prompt change

Do NOT use LLM-as-judge when:
- A deterministic check (exact match, regex, schema validation) can measure the criterion directly
- The stakes are high enough that every verdict requires human review
Option A — Binary (Recommended for most cases): Judge outputs PASS or FAIL with a brief explanation. Simple, reliable, easy to aggregate.
Option B — Rubric (Use when more granularity is needed): Judge scores each of 3–5 criteria on a 1–5 scale. Good for quality tracking over time.
Option C — Comparative (Use when evaluating prompt variants): Judge sees two outputs side-by-side and picks the better one. Best for A/B testing prompt changes.
Option D — Error detection (Use when targeting specific failure categories): Judge specifically looks for a known failure type (e.g., "does this output contain any unsupported factual claims?").
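Option C is the easiest design to get wrong, because judges often favor whichever response appears first. A minimal sketch of one comparative trial that randomizes presentation order so that preference averages out; `judge_fn` is a hypothetical callable that shows the judge two responses and returns `"FIRST"` or `"SECOND"`:

```python
import random

def comparative_trial(user_input, output_a, output_b, judge_fn):
    """Run one Option-C comparison with randomized presentation order.

    judge_fn is a hypothetical callable: it presents the judge with two
    responses and returns "FIRST" or "SECOND" for the preferred one.
    Returns "A" or "B" mapped back to the original outputs."""
    swapped = random.random() < 0.5
    first, second = (output_b, output_a) if swapped else (output_a, output_b)
    winner = judge_fn(user_input, first, second)
    if winner == "FIRST":
        return "B" if swapped else "A"
    return "A" if swapped else "B"
```

Because the order is re-randomized per trial, a judge that blindly prefers the first slot produces a roughly 50/50 split instead of a spurious winner.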
Structure every judge prompt with these sections:
# Evaluation Task
You are evaluating the quality of an AI assistant's response. Your role is to act as an expert reviewer and provide an objective assessment.
## Context
[Brief description of what the AI assistant is supposed to do]
## What you're evaluating for
[Specific quality criteria — be precise about what PASS and FAIL mean]
## The evaluation
**User input:**
{{input}}
**AI assistant's response:**
{{output}}
## Instructions
1. Analyze the response against the criteria above
2. Write your reasoning in 2–3 sentences
3. State your verdict: PASS or FAIL
## Output format
Reasoning: [Your 2–3 sentence analysis]
Verdict: [PASS or FAIL]
Before deploying LLM-as-judge in production, calibrate it:
1. Hand-label a sample of outputs as PASS or FAIL (a few dozen is usually enough)
2. Run the judge on the same sample
3. Measure agreement between the judge's verdicts and your labels

If agreement is < 85%:
- Review the disagreements and tighten the criteria definitions in the judge prompt, then re-run
- If agreement stays low after iteration, the criterion may be too ambiguous for automated judging
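Agreement here means the fraction of calibration examples where the judge's verdict matches the human label. A minimal sketch, assuming you have the two verdict lists paired up by example:

```python
def judge_human_agreement(judge_verdicts, human_verdicts):
    """Fraction of calibration examples where the judge's PASS/FAIL
    verdict matches the human label for the same example."""
    if len(judge_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must be paired, one entry per example")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(judge_verdicts)
```

For example, a judge that agrees with the human on 3 of 4 examples scores 0.75, below the 85% bar.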
Position bias: In comparative evaluations, the judge tends to favor the response shown in a particular slot (often the first). Fix: randomize the order of the two responses, or run each comparison in both orders and keep only consistent winners.
Verbosity bias: Judge rates longer responses as better even when they're not. Fix: explicitly state in the prompt "Length is not a quality signal. Evaluate based on [specific criteria]."
Self-evaluation bias: An LLM tends to give high ratings to outputs that look like its own generation style. Fix: use a different model as judge than the model being evaluated.
Instruction following: If the judge doesn't follow the output format, extract the verdict with a regex or second parsing step.
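That second parsing step can be a short regex sketch; defaulting to FAIL means a malformed judge reply never counts as a pass:

```python
import re

def extract_verdict(judge_text: str) -> str:
    """Find a 'Verdict: PASS' or 'Verdict: FAIL' line anywhere in the
    judge's reply, case-insensitively; fall back to FAIL when no
    verdict line is present."""
    match = re.search(r"Verdict:\s*(PASS|FAIL)", judge_text, re.IGNORECASE)
    return match.group(1).upper() if match else "FAIL"
```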
Provide a Python pseudocode template for integrating LLM-as-judge into the evaluation pipeline:
```python
import anthropic

def evaluate_output(user_input: str, ai_output: str, judge_prompt: str) -> dict:
    client = anthropic.Anthropic()
    # Fill the template placeholders with the example under evaluation
    judge_input = judge_prompt.replace("{{input}}", user_input).replace("{{output}}", ai_output)
    response = client.messages.create(
        model="claude-opus-4-6",  # Use strongest model as judge
        max_tokens=500,
        messages=[{"role": "user", "content": judge_input}],
    )
    verdict_text = response.content[0].text
    verdict = "PASS" if "Verdict: PASS" in verdict_text else "FAIL"
    return {"verdict": verdict, "reasoning": verdict_text, "input": user_input, "output": ai_output}

# Run on a sample
results = [evaluate_output(inp, out, JUDGE_PROMPT) for inp, out in test_cases]
pass_rate = sum(1 for r in results if r["verdict"] == "PASS") / len(results)
print(f"Pass rate: {pass_rate:.1%}")
```
Produce:
1. A judge prompt tailored to the feature being evaluated, following the template above
2. A calibration plan comparing judge verdicts against human labels
3. Pipeline integration code based on the template above