From outputai
Designs effective LLM judge .prompt files for evaluators. Use when creating judgeVerdict/judgeScore/judgeLabel prompts or when existing judges produce unreliable results.
Slash command: `/outputai:output-eval-judge-prompt`
An LLM judge evaluates workflow output for a **single, specific failure mode** identified during error analysis. This skill covers how to design the `.prompt` file that powers `judgeVerdict()`, `judgeScore()`, or `judgeLabel()` calls. For the file format basics, see `output-dev-prompt-file`. For error analysis, see `output-eval-error-analysis`.
Before writing a judge prompt:
- Complete error analysis to identify the failure mode (see `output-eval-error-analysis`)
- Have `ground_truth` labels in dataset YAML files
- Confirm that the `Verdict.*` helpers (`contains`, `matches`, `gte`, etc.) cannot reliably detect this failure

Every effective judge prompt has exactly four components.
State the single failure mode being evaluated. Be specific and observable.
Good criteria (specific, observable):
Bad criteria (vague, holistic):
Define exactly what constitutes pass and fail. Always binary — no Likert scales, no 1-5 ratings, no "partially meets criteria."
```
PASS: The blog post uses formal language throughout. Professional vocabulary,
complete sentences, no slang, no contractions, no first-person casual asides.

FAIL: The blog post contains one or more instances of casual language: slang,
contractions ("don't", "can't"), informal asides ("pretty cool", "super important"),
or conversational filler ("honestly", "basically").
```
Why binary: Likert scales create ambiguous boundaries (what's the difference between a 3 and a 4?). Binary forces precise definitions that LLMs can apply consistently and that you can validate against human labels.
Include at least three labeled examples: one clear pass, one clear fail, and one borderline case. Borderline examples are the most valuable — they teach the judge where the decision boundary lies.
Draw examples from your training split only (see output-eval-validate-judge). Never use dev or test examples as few-shot — that's data leakage.
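The selection rule above can be sketched as a small helper. This is only an illustration: the `LabeledExample` shape, its `split` and `borderline` fields, and `selectFewShot` are hypothetical names, not part of the library.

```typescript
// Hypothetical shape for a human-labeled example; field names are illustrative.
type Split = "train" | "dev" | "test";

interface LabeledExample {
  id: string;
  split: Split;
  output: string;
  label: "pass" | "fail";
  borderline: boolean;
}

// Pick one clear pass, one clear fail, and one borderline case,
// looking only at the training split so dev/test never leak into the prompt.
function selectFewShot(examples: LabeledExample[]): LabeledExample[] {
  const train = examples.filter((e) => e.split === "train");
  const clearPass = train.find((e) => e.label === "pass" && !e.borderline);
  const clearFail = train.find((e) => e.label === "fail" && !e.borderline);
  const borderline = train.find((e) => e.borderline);
  const picked = [clearPass, clearFail, borderline];
  if (picked.some((e) => e === undefined)) {
    throw new Error("Training split lacks a clear pass, clear fail, or borderline example");
  }
  return picked as LabeledExample[];
}

const dataset: LabeledExample[] = [
  { id: "t1", split: "train", output: "...", label: "pass", borderline: false },
  { id: "d1", split: "dev", output: "...", label: "fail", borderline: false },
  { id: "t2", split: "train", output: "...", label: "fail", borderline: false },
  { id: "t3", split: "train", output: "...", label: "fail", borderline: true },
];

// The dev example d1 is skipped even though it is a clear fail.
const fewShot = selectFewShot(dataset);
```

Throwing when a category is missing makes the gap visible early, rather than silently shipping a prompt without a borderline example.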
Each example must include:
Request JSON output with critique before verdict. This forces the judge to reason before deciding, which improves accuracy.
```json
{
  "critique": "Detailed analysis of the output against the criterion...",
  "verdict": "pass"
}
```
Always put critique first in the schema. If verdict comes first, the judge commits to a decision before reasoning.
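If you ever handle raw judge output yourself, a strict check of this shape might look like the sketch below. `parseJudgeOutput` and `JudgeResult` are hypothetical names for illustration; the judge helpers may already perform equivalent parsing internally.

```typescript
type Verdict = "pass" | "fail";

interface JudgeResult {
  critique: string;
  verdict: Verdict;
}

// Parse the judge's raw JSON and reject anything that is not
// exactly a non-empty critique plus a binary verdict.
function parseJudgeOutput(raw: string): JudgeResult {
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("Judge output is not a JSON object");
  }
  const { critique, verdict } = parsed as Record<string, unknown>;
  if (typeof critique !== "string" || critique.trim() === "") {
    throw new Error("Judge output is missing a non-empty critique");
  }
  if (verdict !== "pass" && verdict !== "fail") {
    throw new Error(`Unexpected verdict: ${String(verdict)}`);
  }
  return { critique, verdict };
}

const result = parseJudgeOutput(
  '{"critique": "Uses contractions in paragraph two.", "verdict": "fail"}'
);
```

Rejecting values such as `"partial"` keeps the binary definition from eroding over time.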
## `.prompt` File Example

A judge for the "tone mismatch" failure mode:
```
# tests/evals/judge_tone@v1.prompt
---
provider: anthropic
# current as of 2026-05-04; run output-dev-model-selection for the latest
model: claude-haiku-4-5-20251001
temperature: 0
maxTokens: 1500
---

<system>
You are an evaluation judge. Your task is to determine whether a blog post maintains the requested tone throughout.

## Criterion

Assess whether the blog post consistently uses the requested tone. A single paragraph that breaks tone is a failure.

## Definitions

PASS: The blog post maintains the requested tone in every paragraph. Word choice, sentence structure, and rhetorical style all align with the requested tone.

FAIL: The blog post contains one or more paragraphs where the tone shifts away from what was requested. Common failures include:
- Formal request but casual language appears ("pretty cool", "super important", contractions)
- Professional request but opinionated editorializing appears
- Technical request but oversimplified explanations appear

## Examples

### Example 1: PASS
Requested tone: formal
Blog excerpt: "The implications of quantum computing for cryptographic security are substantial. Current encryption standards rely on the computational infeasibility of factoring the product of two large prime numbers, a guarantee that quantum algorithms may undermine."
Critique: The excerpt uses professional vocabulary ("implications", "computational infeasibility"), complete sentences, no contractions, and maintains an academic register. Consistent formal tone throughout.
Verdict: pass

### Example 2: FAIL
Requested tone: formal
Blog excerpt: "Quantum computing is basically going to break all our encryption. It's pretty wild when you think about it: everything we thought was secure might not be."
Critique: The excerpt contains multiple casual markers: "basically", "pretty wild", a contraction ("It's"), and conversational filler ("when you think about it"). This directly violates the formal tone request.
Verdict: fail

### Example 3: BORDERLINE (fail)
Requested tone: formal
Blog excerpt: "Quantum computing represents a paradigm shift in computational capability. The technology is incredibly promising, though it's important to note the current limitations in qubit stability and error correction."
Critique: Mostly formal, but contains "incredibly promising" (informal intensifier) and "it's" (contraction). While the overall register is professional, these lapses break the formal tone requirement. Even minor inconsistencies constitute a failure.
Verdict: fail

## Output Format

Return a JSON object with exactly two fields:
- "critique": A detailed analysis (3-5 sentences) citing specific evidence from the blog post
- "verdict": Either "pass" or "fail"
</system>

<user>
Requested tone: {{ requested_tone }}
Blog title: {{ blog_title }}

Blog post:
{{ blog_post }}

Evaluate whether this blog post consistently maintains the requested tone.
</user>
```
## `judgeVerdict()`

After creating the `.prompt` file, wire it to an evaluator using `verify()` and `judgeVerdict()`:
```typescript
// tests/evals/evaluators.ts
import { verify, judgeVerdict } from '@outputai/evals';
import { z } from '@outputai/core';
import { blogInput, blogOutput } from './schemas.js';

export const checkTone = verify(
  {
    name: 'check_tone',
    input: blogInput,
    output: blogOutput
  },
  async ({ input, output, context }) =>
    judgeVerdict({
      prompt: 'judge_tone@v1',
      variables: {
        // Prefer the per-case ground truth, then the workflow input, then a default.
        requested_tone: String(context.ground_truth.expected_tone ?? input.tone ?? 'professional'),
        blog_title: output.title,
        blog_post: output.blog_post
      }
    })
);
```
Then add it to the eval workflow:
```typescript
// tests/evals/workflow.ts
import { evalWorkflow } from '@outputai/evals';
import { checkTone } from './evaluators.js';

export default evalWorkflow({
  name: 'blog_generator_eval',
  evals: [
    {
      evaluator: checkTone,
      criticality: 'required',
      interpret: { type: 'verdict' }
    }
  ]
});
```
Feed the judge only what it needs to evaluate the criterion. Extra context adds noise and cost.
| Failure Mode | Required Variables | Not Needed |
|---|---|---|
| Tone mismatch | requested_tone, blog_post | topic, input constraints |
| Off-topic drift | topic, blog_post | tone, length requirements |
| Hallucinated claims | blog_post, source_material | topic, tone |
| Faithfulness | summary, original_document | formatting requirements |
| Missing requirements | requirements_list, blog_post | topic (unless relevant) |
Use context.ground_truth for expected values that vary per dataset. Use input.* for values from the workflow input. Use output.* for the workflow output being evaluated.
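One way to keep the variable set tight is to compare the placeholders in the prompt template against the variables you pass. The sketch below is an assumption-heavy illustration: `placeholderNames` and `checkVariables` are hypothetical helpers, and the regular expression only recognizes simple `{{ name }}` output tags, not the full Liquid syntax.

```typescript
// Collect the names used in simple {{ name }} placeholders.
// Filters, dotted paths, and Liquid tags are deliberately ignored.
function placeholderNames(template: string): string[] {
  const names = new Set<string>();
  const pattern = /\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(template)) !== null) {
    names.add(match[1]);
  }
  return Array.from(names);
}

// Report variables the template needs but did not receive (missing),
// and variables that were passed but never referenced (unused).
function checkVariables(
  template: string,
  variables: Record<string, string>
): { missing: string[]; unused: string[] } {
  const used = placeholderNames(template);
  const missing = used.filter((name) => !(name in variables));
  const unused = Object.keys(variables).filter((name) => used.indexOf(name) === -1);
  return { missing, unused };
}

const toneTemplate =
  "Requested tone: {{ requested_tone }}\nBlog post:\n{{ blog_post }}";

// "topic" is not needed for a tone judge, so it shows up as unused.
const report = checkVariables(toneTemplate, {
  requested_tone: "formal",
  blog_post: "...",
  topic: "quantum computing",
});
```

An `unused` entry is a hint that the evaluator is passing context the judge never sees; a `missing` entry means the prompt would render with a blank.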
## `judgeScore()` Variant

Use `judgeScore()` when you need a numeric quality score rather than binary pass/fail. Apply the same four-component design.
### `.prompt` file for scoring

```
# tests/evals/judge_quality@v1.prompt
---
provider: anthropic
# current as of 2026-05-04; run output-dev-model-selection for the latest
model: claude-haiku-4-5-20251001
temperature: 0
maxTokens: 1500
---

<system>
You are an evaluation judge. Score the overall writing quality of a blog post on a scale of 0.0 to 1.0.

## Scoring Criteria

- 0.0-0.3: Major issues: incoherent, riddled with errors, or completely off-topic
- 0.4-0.6: Mediocre: readable but has significant quality issues (poor structure, weak arguments, factual gaps)
- 0.7-0.8: Good: well-structured, clear, minor issues only
- 0.9-1.0: Excellent: polished, engaging, publication-ready

## Output Format

Return a JSON object with:
- "critique": Detailed analysis of quality strengths and weaknesses (3-5 sentences)
- "score": A number between 0.0 and 1.0
</system>

<user>
Topic: {{ topic }}
Blog title: {{ blog_title }}

Blog post:
{{ blog_post }}

Score the writing quality of this blog post.
</user>
```
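### `judgeScore()`

```typescript
// Continues tests/evals/evaluators.ts. judgeScore is assumed to be
// exported from '@outputai/evals' alongside verify and judgeVerdict.
export const checkQuality = verify(
  { name: 'check_quality', input: blogInput, output: blogOutput },
  async ({ input, output }) =>
    judgeScore({
      prompt: 'judge_quality@v1',
      variables: {
        topic: input.topic,
        blog_title: output.title,
        blog_post: output.blog_post
      }
    })
);
```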
In the eval workflow, use interpret: { type: 'number' } with thresholds:
```typescript
{
  evaluator: checkQuality,
  criticality: 'required',
  interpret: { type: 'number', pass: 0.7, partial: 0.4 }
}
```
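To make the thresholds concrete, here is a sketch of how a score could map to an outcome. It assumes `pass` and `partial` are inclusive lower bounds and that anything below `partial` fails; the source does not state how the library applies them, and `interpretScore` is a hypothetical helper.

```typescript
type ScoreOutcome = "pass" | "partial" | "fail";

interface NumberThresholds {
  pass: number;
  partial: number;
}

// Assumption: thresholds are inclusive lower bounds on a 0.0-1.0 score.
function interpretScore(score: number, thresholds: NumberThresholds): ScoreOutcome {
  if (!Number.isFinite(score) || score < 0 || score > 1) {
    throw new RangeError(`Score out of range: ${score}`);
  }
  if (score >= thresholds.pass) return "pass";
  if (score >= thresholds.partial) return "partial";
  return "fail";
}

const thresholds: NumberThresholds = { pass: 0.7, partial: 0.4 };
const strong = interpretScore(0.85, thresholds);  // "pass"
const middling = interpretScore(0.5, thresholds); // "partial"
const weak = interpretScore(0.2, thresholds);     // "fail"
```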
## `judgeLabel()` Variant

Use `judgeLabel()` when you need classification into named categories.
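```typescript
// Continues tests/evals/evaluators.ts. judgeLabel is assumed to be
// exported from '@outputai/evals' alongside verify and judgeVerdict.
export const checkToneLabel = verify(
  { name: 'check_tone_label', input: blogInput, output: blogOutput },
  async ({ output }) =>
    judgeLabel({
      prompt: 'judge_tone_label@v1',
      variables: {
        blog_title: output.title,
        blog_post: output.blog_post
      }
    })
);
```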
In the eval workflow, use interpret: { type: 'string' } with label lists:
```typescript
{
  evaluator: checkToneLabel,
  criticality: 'informational',
  interpret: { type: 'string', pass: ['professional', 'formal'], partial: ['casual'] }
}
```
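The label lists can be read as a lookup. The sketch below assumes that a label in neither list counts as a failure and that matching ignores case and surrounding whitespace; both are assumptions, and `interpretLabel` is a hypothetical helper rather than the library's implementation.

```typescript
type LabelOutcome = "pass" | "partial" | "fail";

interface LabelLists {
  pass: string[];
  partial: string[];
}

// Assumptions: unknown labels fail; comparison is case-insensitive and trimmed.
function interpretLabel(label: string, lists: LabelLists): LabelOutcome {
  const normalized = label.trim().toLowerCase();
  if (lists.pass.indexOf(normalized) !== -1) return "pass";
  if (lists.partial.indexOf(normalized) !== -1) return "partial";
  return "fail";
}

const toneLists: LabelLists = { pass: ["professional", "formal"], partial: ["casual"] };
const formalOutcome = interpretLabel("Formal", toneLists);     // "pass"
const casualOutcome = interpretLabel("casual", toneLists);     // "partial"
const unknownOutcome = interpretLabel("sarcastic", toneLists); // "fail"
```

Listing the allowed labels in the judge prompt itself reduces the chance of the judge inventing a label that falls through to the default.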
Run `output-dev-model-selection` to resolve each tier below to a current model ID.

| Tier | When to Use | Cost |
|---|---|---|
| Smallest in family (speed/cost priority) | Default for most judges. Fast, cheap, good at following structured instructions. | Low |
| Mid-tier (balance priority) | Complex reasoning required (faithfulness checking, multi-step logical analysis). | Medium |
| Top-tier (reasoning priority) | Only if mid-tier fails validation. Rarely needed. | High |
Always set temperature: 0 for judges. Reproducibility matters more than creativity.
Escalation strategy: start with the smallest tier. If the judge fails validation (TPR/TNR below 80%), move up one tier before rewriting the prompt — the model upgrade alone often fixes it.
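The validation check behind this decision can be sketched as follows. The sketch treats "pass" as the positive class, so TPR is the share of human-labeled passes the judge also passes and TNR is the share of human-labeled fails the judge also fails; that convention, the `LabeledPair` shape, and both function names are assumptions for illustration.

```typescript
type BinaryVerdict = "pass" | "fail";

interface LabeledPair {
  human: BinaryVerdict;
  judge: BinaryVerdict;
}

// Compute true positive rate and true negative rate of the judge
// against human labels, with "pass" as the positive class.
function agreementRates(pairs: LabeledPair[]): { tpr: number; tnr: number } {
  let tp = 0;
  let fn = 0;
  let tn = 0;
  let fp = 0;
  for (const { human, judge } of pairs) {
    if (human === "pass") {
      if (judge === "pass") tp += 1;
      else fn += 1;
    } else {
      if (judge === "fail") tn += 1;
      else fp += 1;
    }
  }
  if (tp + fn === 0 || tn + fp === 0) {
    throw new Error("Need at least one human-labeled pass and one human-labeled fail");
  }
  return { tpr: tp / (tp + fn), tnr: tn / (tn + fp) };
}

// Escalate to the next tier when either rate is below the 80% bar.
function shouldEscalate(rates: { tpr: number; tnr: number }, minimum = 0.8): boolean {
  return rates.tpr < minimum || rates.tnr < minimum;
}

const pairs: LabeledPair[] = [
  { human: "pass", judge: "pass" },
  { human: "pass", judge: "pass" },
  { human: "pass", judge: "pass" },
  { human: "pass", judge: "pass" },
  { human: "pass", judge: "fail" },
  { human: "fail", judge: "fail" },
  { human: "fail", judge: "fail" },
  { human: "fail", judge: "fail" },
  { human: "fail", judge: "pass" },
  { human: "fail", judge: "pass" },
];

const rates = agreementRates(pairs); // tpr = 4/5, tnr = 3/5
const escalate = shouldEscalate(rates);
```

Reporting the two rates separately matters because a judge that passes everything has a perfect TPR and a TNR of zero, which a single accuracy number can hide on an imbalanced dataset.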
After writing the judge, validate it against human labels (see `output-eval-validate-judge`).

Related skills:

- `output-eval-error-analysis`: Identify the failure mode this judge targets
- `output-dev-eval-testing`: Implementation reference for `verify()`, `judgeVerdict()`, `evalWorkflow()`
- `output-dev-prompt-file`: `.prompt` file format, Liquid.js templating, provider configuration
- `output-eval-validate-judge`: Validate this judge against human labels after writing it
- `output-eval-dataset-design`: Generate diverse datasets for judge validation