From ai-agent-evals
Opinionated patterns for evaluating AI agents and LLM outputs (2026). Use this skill whenever writing, designing, or reviewing evals for AI agents, LLM pipelines, or model outputs. Trigger when the user mentions "evals", "evaluation", "LLM judge", "grading", "rubric", "test cases for AI", "benchmark", "agent testing", "pydantic-evals", "model quality", or asks how to measure whether an agent or LLM output is good. Also trigger when writing pytest tests that assess LLM-generated content, or when the user is iterating on prompts/skills and wants to know if they're improving.
npx claudepluginhub maxnoller/claude-code-plugins --plugin ai-agent-evals

This skill uses the workspace's default tool permissions.
Evals answer one question: is this getting better or worse? Without them you're guessing. These patterns share a philosophy: start with what you can check deterministically, add LLM judges only for what code can't measure, and let real failures — not speculation — drive what you test.
Every eval should exist because something broke or because you need to prove something works. Don't write evals for imaginary scenarios.
Always reach for deterministic checks before LLM-based grading. Deterministic checks are fast, free, perfectly consistent, and easy to debug. Use them for everything they can cover: output format and schema validation, exact-match and regex assertions, length limits, and the presence of required fields.
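A minimal sketch of what that looks like in practice. The schema, field name, character limit, and order-number pattern below are illustrative assumptions, not requirements from this skill:

```python
import json
import re

def check_response(raw: str) -> list[str]:
    """Run deterministic checks on an LLM output; return a list of failures."""
    failures = []
    # Structural check: output must be valid JSON with the required field.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if "answer" not in data:
        return ["missing 'answer' field"]
    answer = data["answer"]
    # Content checks: cheap, exact, and perfectly repeatable.
    if len(answer) > 500:
        failures.append("answer exceeds 500 characters")
    if not re.search(r"#\d+", answer):
        failures.append("no order number referenced")
    return failures

good = '{"answer": "We have started the refund for order #123."}'
assert check_response(good) == []
assert check_response("not json") == ["output is not valid JSON"]
```

Every one of these checks runs in microseconds and fails the same way every time, which is exactly what you want from the first layer.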
LLM judges handle what code can't: tone, relevance, whether an explanation actually makes sense, hallucination detection against source material. But they're slow, expensive, and non-deterministic — so layer them on top of deterministic checks, not instead of them.
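The layering can be as simple as a gate: run the judge only when the cheap checks pass. The function bodies here are stubs standing in for real checks and a real model call:

```python
def evaluate(output: str) -> dict:
    # Cheap, deterministic gate first.
    problems = deterministic_checks(output)   # e.g. schema, regex, length
    if problems:
        return {"passed": False, "reasons": problems, "judge_ran": False}
    # Only now pay for the slow, non-deterministic judge.
    verdict = llm_judge(output)               # placeholder for an LLM call
    return {"passed": verdict, "reasons": [], "judge_ran": True}

# Stub implementations so the sketch is runnable:
def deterministic_checks(output: str) -> list[str]:
    return [] if output.strip() else ["empty output"]

def llm_judge(output: str) -> bool:
    return True  # stand-in; a real judge would call a model with a rubric

assert evaluate("")["judge_ran"] is False
assert evaluate("Refund started for #123.")["passed"] is True
```

Beyond cost, the gate also keeps judge noise out of failures that code could have caught unambiguously.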
```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import LLMJudge

dataset = Dataset(
    cases=[Case(name="handles refund", inputs="I want a refund for order #123")],
    evaluators=[
        # LLM judge only for what code can't check
        LLMJudge(
            rubric="Response acknowledges the order number, expresses empathy, and provides clear next steps",
            include_input=True,
        )
    ],
)
```
See references/deterministic-before-judges.md for the full layering approach and when to use which.
Evaluation is easier than generation — a model that struggles to produce good output can often reliably judge it, because judging with both question and answer visible is a narrower task. Use this asymmetry deliberately.
There are two types of judge, binary pass/fail assertions and numeric scores, and picking the wrong one wastes effort.
Write rubrics that specify both what success looks like AND what failure looks like. Always request reasoning (include_reason=True) so you can debug when the judge disagrees with you.
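One way to structure such a rubric, shown as a plain string for illustration (this particular wording is an example, not from the skill's references):

```python
# A rubric that names both the pass and the fail conditions, so the judge
# (and you, when debugging its reasoning) has an explicit decision boundary.
RUBRIC = (
    "PASS if the response acknowledges the customer's order number, "
    "expresses empathy, and gives concrete next steps. "
    "FAIL if the response invents order details not present in the input, "
    "commits to a specific refund timeline, or ignores the request entirely."
)

# Sanity check: both sides of the boundary are spelled out.
assert "PASS if" in RUBRIC and "FAIL if" in RUBRIC
```

A rubric that only describes success leaves the judge to improvise the failure boundary, which is where judges drift between runs.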
See references/llm-as-a-judge.md for rubric writing, context levels, evaluator selection, and the pydantic-evals implementation.
Don't invent eval scenarios. Collect them from real usage: production failures, user-reported problems, and transcripts where the agent went wrong.
Start with 10-20 cases covering: explicit triggers, implicit triggers, edge cases, and negative controls (inputs that should NOT trigger the behavior). Keep test cases permanently — they prevent regressions when models or prompts change.
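A sketch of how such a case set might be organized. The category labels and the stub trigger function below are hypothetical stand-ins for the real agent's routing decision:

```python
# Each case records why it exists; negative controls assert the behavior
# does NOT fire, which catches over-triggering regressions.
CASES = [
    {"name": "explicit refund request", "input": "I want a refund", "should_trigger": True},
    {"name": "implicit refund request", "input": "This order arrived broken, what now?", "should_trigger": True},
    {"name": "edge: refund mentioned in passing",
     "input": "Last year I got a refund; today I just have a question", "should_trigger": False},
    {"name": "negative control: unrelated question", "input": "What are your opening hours?", "should_trigger": False},
]

def triggers_refund_flow(text: str) -> bool:
    """Stub classifier standing in for the real agent's routing decision."""
    t = text.lower()
    return ("refund" in t or "broken" in t) and "last year" not in t

failures = [c["name"] for c in CASES if triggers_refund_flow(c["input"]) != c["should_trigger"]]
assert failures == [], failures
```

Because the cases carry their expected outcome, the same list works unchanged when you swap the stub for the real agent or re-run after a model upgrade.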
See references/eval-driven-iteration.md for the full workflow from failure to test case to confidence.