From pm-copilot
Use this skill when the user asks to "set up LLM as a judge", "write an LLM judge prompt", "automate quality evaluation", "use Claude to evaluate outputs", "build an automated eval", "LLM-based evaluation", or wants to create a scalable automated evaluation system where one LLM grades the outputs of another LLM.
npx claudepluginhub productfculty-aipm/pm-copilot-by-product-faculty
Slash command: /pm-copilot:llm-as-judge
You are setting up an LLM-as-judge evaluation system — a scalable way to automatically evaluate AI output quality by using a stronger or specialized LLM to grade the outputs of the product's AI.
Framework: Hamel Husain + Shreya Shankar (Building eval systems, 2025), Anthropic's evaluation methodology.
Read memory/user-profile.md for the AI feature being evaluated. Read the error analysis output if available to understand the failure categories to target.
Use LLM-as-judge when:
Do NOT use LLM-as-judge when:
Option A — Binary (Recommended for most cases): Judge outputs PASS or FAIL with a brief explanation. Simple, reliable, easy to aggregate.
Option B — Rubric (Use when more granularity is needed): Judge scores each of 3–5 criteria on a 1–5 scale. Good for quality tracking over time.
Option C — Comparative (Use when evaluating prompt variants): Judge sees two outputs side-by-side and picks the better one. Best for A/B testing prompt changes.
Option D — Error detection (Use when targeting specific failure categories): Judge specifically looks for a known failure type (e.g., "does this output contain any unsupported factual claims?").
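For Option B, the per-criterion scores have to be parsed and aggregated before they can be tracked over time. The following is a minimal sketch, assuming the judge is instructed to emit one `criterion: score` line per criterion. That line format, the criterion names, and both helper functions are illustrative assumptions, not part of the skill.

```python
import re
from statistics import mean

# Hypothetical criteria; replace with the 3-5 criteria from your own rubric.
CRITERIA = ["accuracy", "completeness", "tone"]

def parse_rubric_scores(judge_reply: str, criteria: list[str]) -> dict[str, int]:
    """Extract one 1-5 score per criterion from lines like 'accuracy: 4'."""
    scores = {}
    for name in criteria:
        match = re.search(
            rf"^{re.escape(name)}\s*:\s*([1-5])\b",
            judge_reply,
            flags=re.IGNORECASE | re.MULTILINE,
        )
        if match is None:
            raise ValueError(f"no score found for criterion {name!r}")
        scores[name] = int(match.group(1))
    return scores

def mean_score_per_criterion(all_scores: list[dict[str, int]]) -> dict[str, float]:
    """Average each criterion across many judged outputs."""
    return {name: mean(s[name] for s in all_scores) for name in all_scores[0]}

# Made-up judge replies, for demonstration only.
replies = [
    "accuracy: 4\ncompleteness: 3\ntone: 5",
    "Accuracy: 2\nCompleteness: 5\nTone: 5",
]
parsed = [parse_rubric_scores(r, CRITERIA) for r in replies]
summary = mean_score_per_criterion(parsed)
print(summary)
```

Raising on a missing score, rather than defaulting to a value, keeps malformed judge replies from silently skewing the averages.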
Structure every judge prompt with these sections:
# Evaluation Task
You are evaluating the quality of an AI assistant's response. Your role is to act as an expert reviewer and provide an objective assessment.
## Context
[Brief description of what the AI assistant is supposed to do]
## What you're evaluating for
[Specific quality criteria — be precise about what PASS and FAIL mean]
## The evaluation
**User input:**
{{input}}
**AI assistant's response:**
{{output}}
## Instructions
1. Analyze the response against the criteria above
2. Write your reasoning in 2–3 sentences
3. State your verdict: PASS or FAIL
## Output format
Reasoning: [Your 2–3 sentence analysis]
Verdict: [PASS or FAIL]
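Because the template fixes a `Reasoning:` / `Verdict:` output format, the judge's reply can be parsed mechanically. The sketch below is one way to do that. The `parse_judge_reply` name, the `JudgeResult` type, and the choice to return `None` for an unparseable verdict are assumptions made for illustration.

```python
import re
from typing import NamedTuple, Optional

class JudgeResult(NamedTuple):
    reasoning: str
    verdict: Optional[str]  # "PASS", "FAIL", or None when the format was not followed

def parse_judge_reply(reply: str) -> JudgeResult:
    """Split a judge reply into its reasoning text and PASS/FAIL verdict."""
    # \W* tolerates markdown emphasis or extra spaces, e.g. "**Verdict:** PASS"
    verdict_match = re.search(r"Verdict:\W*(PASS|FAIL)\b", reply, flags=re.IGNORECASE)
    reasoning_match = re.search(
        r"Reasoning:(.*?)(?:Verdict:|\Z)", reply, flags=re.IGNORECASE | re.DOTALL
    )
    verdict = verdict_match.group(1).upper() if verdict_match else None
    reasoning = reasoning_match.group(1).strip(" *\n\t") if reasoning_match else ""
    return JudgeResult(reasoning=reasoning, verdict=verdict)

result = parse_judge_reply(
    "Reasoning: The answer cites the policy correctly.\nVerdict: PASS"
)
print(result.verdict)    # PASS
print(result.reasoning)  # The answer cites the policy correctly.
```

Returning `None` instead of defaulting to FAIL makes format failures visible, so they can be counted separately from genuine quality failures.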
Before deploying LLM-as-judge in production, calibrate it:
If agreement is < 85%:
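The agreement figure referred to here can be read as the share of examples on which the judge's verdict matches a human label. The sketch below assumes you already have a set of human-labeled examples. The 85% threshold comes from the text above, while the function names and the sample labels are made up for illustration.

```python
def agreement_rate(human_labels: list[str], judge_labels: list[str]) -> float:
    """Fraction of examples where the judge's verdict equals the human label."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        raise ValueError("need at least one labeled example")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)

def disagreement_indices(human_labels: list[str], judge_labels: list[str]) -> list[int]:
    """Positions where judge and human differ; these are the cases worth reading."""
    return [i for i, (h, j) in enumerate(zip(human_labels, judge_labels)) if h != j]

# Made-up labels, for demonstration only.
human = ["PASS", "FAIL", "PASS", "PASS", "FAIL"]
judge = ["PASS", "FAIL", "FAIL", "PASS", "FAIL"]

rate = agreement_rate(human, judge)
needs_recalibration = rate < 0.85
print(f"Agreement: {rate:.0%}, recalibrate: {needs_recalibration}")
print("Review examples:", disagreement_indices(human, judge))
```

Reading the disagreeing examples is usually more informative than the rate alone, since they show which part of the judge prompt is ambiguous.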
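Position bias: Judge tends to prefer a response because of where it appears (for example, the first of two outputs in a side-by-side comparison) rather than because of its quality. Fix: randomize or swap the order of the outputs, and of the criteria in the prompt, and run the same evaluation multiple times.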
Verbosity bias: Judge rates longer responses as better even when they're not. Fix: explicitly state in the prompt "Length is not a quality signal. Evaluate based on [specific criteria]."
Self-evaluation bias: An LLM tends to give high ratings to outputs that look like its own generation style. Fix: use a different model as judge than the model being evaluated.
Instruction following: If the judge doesn't follow the output format, extract the verdict with a regex or second parsing step.
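For comparative judging (Option C), one widely used guard against order effects is to run each comparison twice with the two responses swapped, and to accept a winner only when both runs agree. The sketch below assumes a hypothetical `judge` callable that wraps your judge model and returns "A" or "B" for the response it prefers. The stand-in judges exist only to make the example runnable.

```python
from typing import Callable, Optional

# judge(user_input, first, second) -> "A" if it prefers `first`, "B" if it prefers `second`
Judge = Callable[[str, str, str], str]

def compare_with_swap(judge: Judge, user_input: str, out_1: str, out_2: str) -> Optional[str]:
    """Return "out_1" or "out_2" if the judge is consistent across both orders, else None."""
    first_run = judge(user_input, out_1, out_2)
    second_run = judge(user_input, out_2, out_1)  # same pair, order swapped
    if first_run == "A" and second_run == "B":
        return "out_1"
    if first_run == "B" and second_run == "A":
        return "out_2"
    return None  # verdict flipped with order: treat as a tie or flag for human review

# Stand-in judges for demonstration only; a real judge would call an LLM.
def always_first(user_input: str, first: str, second: str) -> str:
    return "A"

def prefers_longer(user_input: str, first: str, second: str) -> str:
    return "A" if len(first) >= len(second) else "B"

print(compare_with_swap(always_first, "q", "short", "a longer answer"))    # None
print(compare_with_swap(prefers_longer, "q", "short", "a longer answer"))  # out_2
```

The second stand-in is consistent across orders but still length-biased, which shows that swapping addresses position effects only and does not replace the verbosity instruction in the prompt.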
Provide a Python pseudocode template for integrating LLM-as-judge into the evaluation pipeline:
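import re

import anthropic

def evaluate_output(user_input: str, ai_output: str, judge_prompt: str) -> dict:
    """Ask the judge model to grade one (input, output) pair."""
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    # Fill the {{input}} and {{output}} placeholders in the judge prompt
    judge_input = judge_prompt.replace("{{input}}", user_input).replace("{{output}}", ai_output)
    response = client.messages.create(
        model="claude-opus-4-6",  # Use strongest model as judge
        max_tokens=500,
        messages=[{"role": "user", "content": judge_input}],
    )
    verdict_text = response.content[0].text
    # Extract the verdict with a regex; a reply that ignores the output format
    # is marked UNPARSED instead of being silently counted as FAIL
    match = re.search(r"Verdict:\W*(PASS|FAIL)", verdict_text)
    verdict = match.group(1) if match else "UNPARSED"
    return {"verdict": verdict, "reasoning": verdict_text, "input": user_input, "output": ai_output}

# Run on a sample. JUDGE_PROMPT (the judge prompt template) and test_cases
# (a non-empty list of (input, output) pairs) must be defined before this point.
results = [evaluate_output(inp, out, JUDGE_PROMPT) for inp, out in test_cases]
pass_rate = sum(1 for r in results if r["verdict"] == "PASS") / len(results)
print(f"Pass rate: {pass_rate:.1%}")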
Produce: