This skill designs binary Pass/Fail LLM-as-Judge prompts for subjective evaluation criteria, such as tone, faithfulness, and relevance, that code checks cannot handle. Each prompt includes a task definition, Pass/Fail definitions, and few-shot examples.
Design a binary Pass/Fail LLM-as-Judge evaluator for one specific failure mode. Each judge checks exactly one thing.
Every judge prompt requires exactly four components:
## 1. Task Definition

State what the judge evaluates. One failure mode per judge.
```
You are an evaluator assessing whether a real estate assistant's email
uses the appropriate tone for the client's persona.
```
Not: "Evaluate whether the email is good" or "Rate the email quality from 1-5."
## 2. Pass/Fail Definitions

Outcomes are strictly binary: Pass or Fail. No Likert scales, no letter grades, no partial credit. Define exactly what constitutes Pass and what constitutes Fail. These definitions come from the failure-mode descriptions in your error analysis.
```
## Definitions
PASS: The email matches the expected communication style for the client persona:
- Luxury Buyers: formal language, emphasis on exclusive features, premium
market positioning, no casual slang
- First-Time Homebuyers: warm and encouraging tone, educational explanations,
avoids jargon, patient and supportive
- Investors: data-driven language, ROI-focused, market analytics, concise
and professional

FAIL: The email uses a tone mismatched to the client persona. Examples:
- Using casual slang ("hey, check out this pad!") for a luxury buyer
- Using heavy financial jargon for a first-time homebuyer
- Using overly emotional language for an investor
```
## 3. Few-Shot Examples

Include Pass and Fail examples drawn from your human-labeled data.
```
## Examples

### Example 1: PASS
Client Persona: Luxury Buyer
Email: "Dear Mr. Harrington, I am pleased to present an exclusive listing
at 1200 Pacific Heights Drive. This distinguished property features..."
Critique: The email opens with a formal salutation and uses language
consistent with luxury positioning — "exclusive listing," "distinguished
property." No casual slang or informal phrasing. The tone matches the
luxury buyer persona throughout.
Result: Pass

### Example 2: FAIL
Client Persona: Luxury Buyer
Email: "Hey! Just found this awesome place you might like. It's got a
pool and stuff, super cool neighborhood..."
Critique: The greeting "Hey!" is informal. Phrases like "awesome place,"
"got a pool and stuff," and "super cool" are casual slang inappropriate
for a luxury buyer. The email reads like a text message, not a
professional communication for a high-end client.
Result: Fail

### Example 3: PASS (borderline)
Client Persona: First-Time Homebuyer
Email: "Hi Sarah, I found a property that might be a great fit for your
first home. The neighborhood has good schools nearby, and the monthly
payment would be similar to what you're currently paying in rent..."
Critique: The greeting is warm but not overly casual. The email explains
the property in relatable terms — comparing mortgage to rent, mentioning
schools — which is educational without being condescending. It avoids
jargon like "amortization" or "LTV ratio." While not deeply technical,
this matches the supportive tone expected for a first-time buyer.
Result: Pass
```
Rules for selecting examples:

- Draw every example from your human-labeled data rather than inventing cases.
- Cover both verdicts: include at least one Pass and one Fail example.
- Include a borderline case, like Example 3 above, to show where the Pass/Fail boundary sits.
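For illustration, here is a minimal sketch of rendering labeled examples into the format above. The `LabeledExample` fields are assumptions keyed to the tone-mismatch judge; adapt them to each failure mode.

```python
from dataclasses import dataclass

@dataclass
class LabeledExample:
    persona: str   # client persona shown to the judge
    email: str     # the generated email being evaluated
    critique: str  # human-written critique from error analysis
    result: str    # "Pass" or "Fail"

def render_examples(examples: list[LabeledExample]) -> str:
    """Render human-labeled examples into the judge prompt's Examples section."""
    lines = ["## Examples"]
    for i, ex in enumerate(examples, start=1):
        lines.append(f"\n### Example {i}: {ex.result.upper()}")
        lines.append(f"Client Persona: {ex.persona}")
        lines.append(f"Email: {ex.email}")
        lines.append(f"Critique: {ex.critique}")
        lines.append(f"Result: {ex.result}")
    return "\n".join(lines)
```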
## 4. Structured Output

Enforce structured output using your LLM provider's schema enforcement (e.g., `response_format` in the OpenAI API, tool definitions in the Anthropic API) or a library such as Instructor or Outlines. If the provider doesn't support schema enforcement, specify the JSON schema in the prompt.
The output must include a critique before the verdict. Placing the critique first forces the judge to articulate its assessment before committing to a decision.
```json
{
  "critique": "string — detailed assessment of the output against the criterion",
  "result": "Pass or Fail"
}
```
Critiques must be detailed, not terse. A good critique explains what specifically was correct or incorrect and references concrete evidence from the output. The critiques in your few-shot examples set the bar for the level of detail the judge will produce.
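As a minimal sketch of schema enforcement, here is one way to do it with Instructor and the OpenAI SDK (the model name and prompt strings are placeholders, not part of the skill):

```python
from typing import Literal

import instructor
from openai import OpenAI
from pydantic import BaseModel, Field

class JudgeVerdict(BaseModel):
    # critique is declared first to encourage the judge to articulate
    # its assessment before committing to a verdict.
    critique: str = Field(description="Detailed assessment of the output against the criterion")
    result: Literal["Pass", "Fail"]

# Placeholders: in practice, use the full four-component prompt and real judge input.
judge_prompt = "You are an evaluator assessing whether a real estate assistant's email uses the appropriate tone for the client's persona. ..."
judge_input = "Client Persona: Luxury Buyer\n\nEmail:\nDear Mr. Harrington, ..."

client = instructor.from_openai(OpenAI())
verdict = client.chat.completions.create(
    model="gpt-4o",  # placeholder: start with your most capable model
    response_model=JudgeVerdict,
    messages=[
        {"role": "system", "content": judge_prompt},
        {"role": "user", "content": judge_input},
    ],
)
print(verdict.result, verdict.critique)
```

Instructor validates the response against `JudgeVerdict`, so `verdict.result` is guaranteed to be exactly "Pass" or "Fail".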
## What to Feed the Judge

Feed only what the judge needs for an accurate decision:
| Failure Mode | What the Judge Needs |
|---|---|
| Tone mismatch | Client persona + generated email |
| Answer faithfulness | Retrieved context + generated answer |
| SQL correctness | User query + generated SQL + schema |
| Instruction following | System prompt rules + generated response |
| Tool call justification | Conversation history + tool call + tool result |
For long documents, feed only the relevant snippet, not the entire document.
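As an illustrative sketch (function names and formatting are assumptions), per-failure-mode input builders might look like:

```python
def tone_judge_input(persona: str, email: str) -> str:
    # Tone-mismatch judge: client persona + generated email, nothing else.
    return f"Client Persona: {persona}\n\nEmail:\n{email}"

def faithfulness_judge_input(snippet: str, answer: str) -> str:
    # Faithfulness judge: retrieved context + generated answer.
    # Pass only the relevant snippet, not the whole document.
    return f"Retrieved Context:\n{snippet}\n\nGenerated Answer:\n{answer}"
```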
## Model Selection

Start with the most capable model available. The same model used for the main task can serve as the judge, since the judge performs a different, narrower task. Optimize for cost later, once alignment is confirmed.
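One simple way to confirm alignment, as a sketch: measure raw agreement between the judge's verdicts and your human labels (checking agreement on Pass and Fail separately is also worthwhile).

```python
def judge_alignment(judge_verdicts: list[str], human_labels: list[str]) -> float:
    """Fraction of labeled items where the judge's verdict ("Pass"/"Fail")
    matches the human label from error analysis."""
    assert len(judge_verdicts) == len(human_labels)
    agree = sum(j == h for j, h in zip(judge_verdicts, human_labels))
    return agree / len(human_labels)

# judge_alignment(["Pass", "Fail", "Pass"], ["Pass", "Fail", "Fail"]) returns 2/3
```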