Use an LLM as an evaluator for open-ended outputs — rubrics, pairwise comparison, calibration with human labels, bias mitigation. Covers when LLM-judge works, when it fails, and how to trust its scores. Use this skill when evaluating generative outputs at scale, building eval pipelines, or replacing expensive human review for non-critical judgments. Activate when: LLM as judge, LLM evaluator, automated evaluation, pairwise comparison, rubric evaluation, eval model.
```
npx claudepluginhub latestaiagents/agent-skills --plugin skills-authoring
```

This skill uses the workspace's default tool permissions.
**Use a strong LLM to evaluate another LLM's output. Done right, it's fast, cheap, and correlates with human judgment. Done wrong, it's biased, inconsistent, and misleading.**
Implements LLM-as-judge techniques for evaluating LLM outputs via direct scoring, pairwise comparison, rubrics, and bias mitigation for position, length, and verbosity biases.
If outputs have a single correct answer, `exact_match` is cheaper; use a judge for open-ended outputs. In direct scoring, the judge rates one output against explicit criteria on a 1-5 scale.
```ts
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Rubric prompt for a single (query, response) pair.
const prompt = `You are evaluating a response. Rate it 1-5 on each criterion.

<user_query>${query}</user_query>
<response>${response}</response>

Criteria:
- accuracy: factually correct?
- helpfulness: addresses what the user asked?
- conciseness: no unnecessary verbosity?

Return JSON: {"accuracy": N, "helpfulness": N, "conciseness": N, "reasoning": "..."}`;

const judgment = await client.messages.create({
  model: "claude-opus-4-6",
  max_tokens: 500,
  messages: [{ role: "user", content: prompt }],
});
```
Use a stronger model as judge than the one you're evaluating. Opus judges Sonnet; Sonnet judges Haiku.
Show the judge two outputs and have it pick the better one. This is the most reliable pattern.
```ts
const prompt = `Compare two responses to the same query. Pick which is better overall.

<query>${query}</query>
<response_A>${responseA}</response_A>
<response_B>${responseB}</response_B>

Return JSON: {"winner": "A" | "B" | "tie", "reasoning": "..."}`;
```
To control for position bias, run each pair TWICE with the order swapped and combine the two judgments, as in the sketch below.
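A minimal sketch of the swap-and-combine loop. `judgePair` is a hypothetical helper that sends the pairwise prompt above and returns the parsed winner:

```ts
// Hypothetical helper: sends the pairwise prompt above, returns the parsed winner.
declare function judgePair(
  query: string,
  first: string,
  second: string
): Promise<"A" | "B" | "tie">;

async function comparePair(query: string, responseA: string, responseB: string) {
  const pass1 = await judgePair(query, responseA, responseB);

  // Second pass with the order swapped; map its verdict back to the original labels.
  const swapped = await judgePair(query, responseB, responseA);
  const pass2 = swapped === "A" ? "B" : swapped === "B" ? "A" : "tie";

  // Agreement across both orders is a real verdict; disagreement is position bias.
  return pass1 === pass2 ? pass1 : "tie";
}
```

Treating disagreement as a tie is one way to average categorical verdicts; for numeric scores, average the two passes instead.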
Compare output to a gold-standard reference:
```ts
const prompt = `Is the generated answer equivalent to the reference answer?

<reference>${reference}</reference>
<generated>${generated}</generated>

"Equivalent" means factually consistent — wording can differ.

Return: {"equivalent": true|false, "reasoning": "..."}`;
```
Cheaper than a rubric, but it requires good references.
| Bias | Description | Mitigation |
|---|---|---|
| Position bias | Judge prefers first or second option | Randomize; run pairs twice with swapped order |
| Length bias | Judge prefers longer responses | Include "conciseness" in rubric; normalize by length |
| Self-preference | Judge prefers its own model's style | Use a DIFFERENT model family as judge |
| Verbosity bias | Judge prefers confident/flowery language | Rubric explicitly penalizes vagueness |
| Format bias | Prefers markdown/bullets over prose | Rubric targets content, not format |
Naming the biases in your judge prompt reduces them: "Do not prefer longer responses; judge only on accuracy."
Don't trust judge scores in isolation. Calibrate against a sample of human labels:
```python
from sklearn.metrics import cohen_kappa_score

# Agreement between human and judge labels on the same samples.
kappa = cohen_kappa_score(human_labels, judge_labels)
```
Re-calibrate quarterly or whenever you change judge prompts or models.
Get judgments as JSON so you can aggregate:
```ts
const judgment = await client.messages.create({
  model: "claude-opus-4-6",
  max_tokens: 500,
  messages: [
    { role: "user", content: judgePrompt },
    // Prefill the assistant turn so the model continues a JSON object.
    { role: "assistant", content: "{" },
  ],
});

// The reply continues after the prefill, so restore the leading "{" before parsing.
const block = judgment.content[0];
const parsed = JSON.parse("{" + (block.type === "text" ? block.text : ""));
```
Prefilling "{" nudges valid JSON. Validate with a schema before aggregating, as in the sketch below.
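For example with zod (an assumption here, not something this skill bundles), matching the rubric fields from the direct-scoring prompt:

```ts
import { z } from "zod";

// Schema for the rubric judgment; rejects missing fields and out-of-range scores.
const Judgment = z.object({
  accuracy: z.number().int().min(1).max(5),
  helpfulness: z.number().int().min(1).max(5),
  conciseness: z.number().int().min(1).max(5),
  reasoning: z.string(),
});

const validated = Judgment.parse(parsed); // throws if the judge's JSON is malformed
```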
Per dataset, report confidence intervals (bootstrap); a 2% score gap on 100 samples is likely noise.
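A minimal percentile-bootstrap sketch over per-sample judge scores (1000 resamples is a convention, not a requirement):

```ts
// 95% percentile-bootstrap confidence interval for the mean judge score.
function bootstrapCI(scores: number[], resamples = 1000): [number, number] {
  const means: number[] = [];
  for (let i = 0; i < resamples; i++) {
    let sum = 0;
    for (let j = 0; j < scores.length; j++) {
      // Resample with replacement, then record the resample's mean.
      sum += scores[Math.floor(Math.random() * scores.length)];
    }
    means.push(sum / scores.length);
  }
  means.sort((a, b) => a - b);
  return [means[Math.floor(resamples * 0.025)], means[Math.floor(resamples * 0.975)]];
}
```

If two systems' intervals overlap heavily, the gap is noise; collect more samples before declaring a winner.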
Judging is expensive. Reduce cost by caching the static judge prompt prefix with prompt caching (see prompt-caching-ttl).
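A sketch of that, using the Messages API's `cache_control` on a system block. `RUBRIC_INSTRUCTIONS` is a hypothetical constant holding the static rubric text, so only the per-sample query/response pair is sent uncached:

```ts
const judgment = await client.messages.create({
  model: "claude-opus-4-6",
  max_tokens: 500,
  system: [
    {
      type: "text",
      text: RUBRIC_INSTRUCTIONS, // static rubric shared by every call (hypothetical constant)
      cache_control: { type: "ephemeral" }, // cached prefix: repeat reads are discounted
    },
  ],
  messages: [
    {
      role: "user",
      content: `<user_query>${query}</user_query>\n<response>${response}</response>`,
    },
  ],
});
```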