LLM outputs are probabilistic. The same prompt produces different outputs across runs. Traditional test assertions — exact equality, deterministic return values — break almost immediately. This creates a gap most teams fill with the wrong things: vanity metrics, under-specified judges, and evaluation frameworks bought before the basics work.
This skill covers what actually works, ranked by when to use it, and what to skip.
For implementation patterns and code, see reference.md.
Build validation from cheap and deterministic to expensive and probabilistic. Each layer catches different failures.
Layer 1 — Structural (deterministic, fast, no LLM calls): Schema validation, regex, length bounds, required fields, JSON syntax, SQL syntax. These should run in milliseconds and have zero false positives. If an output fails structural validation, don't bother with semantic or quality checks.
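A Layer 1 gate needs nothing beyond the standard library. In this sketch the required fields, length bound, and date format are illustrative assumptions, not a prescribed schema:

```python
import json
import re

def structural_check(raw: str) -> bool:
    """Layer 1: deterministic, millisecond checks with zero LLM calls."""
    # JSON must parse at all
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    # Top level must be an object with the required fields
    if not isinstance(data, dict) or not all(k in data for k in ("summary", "date")):
        return False
    # Length bounds on free text
    if not isinstance(data["summary"], str) or not 1 <= len(data["summary"]) <= 2000:
        return False
    # Format check via regex: ISO 8601 date
    return bool(re.fullmatch(r"\d{4}-\d{2}-\d{2}", str(data["date"])))
```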
Layer 2 — Semantic (model-assisted, moderate cost): NLI-based entailment for faithfulness, embedding cosine similarity, fuzzy matching. This layer checks whether outputs mean the right thing, not just whether they look right.
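A sketch of a Layer 2 similarity check; `embed` is a hypothetical stand-in for your embeddings provider or a local sentence-transformer, and the 0.85 threshold is an assumption to calibrate on labeled pairs:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical embedding call; substitute your provider's
    embeddings endpoint or a local model."""
    raise NotImplementedError

def semantically_close(output: str, reference: str,
                       threshold: float = 0.85) -> bool:
    """Layer 2: does the output mean the right thing? Cosine similarity
    between output and reference embeddings, gated on a tuned threshold."""
    a, b = embed(output), embed(reference)
    cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return cosine >= threshold
```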
Layer 3 — Quality (LLM-graded, highest cost): LLM-as-judge with rubrics for subjective dimensions: tone, completeness, helpfulness. Run this last, on outputs that pass layers 1 and 2.
The ordering matters. Most production failures are structural or semantic. Running a quality judge on outputs with broken JSON is waste.
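One way to make the ordering concrete is a short-circuiting runner that walks checks from cheapest to most expensive; the stub lambdas below stand in for the layer functions sketched above:

```python
from typing import Callable

def evaluate(raw: str, checks: list[tuple[str, Callable[[str], bool]]]) -> str:
    """Run checks in cost order and stop at the first failure,
    so the expensive judge never sees structurally broken output."""
    for name, check in checks:
        if not check(raw):
            return f"fail:{name}"
    return "pass"

# Cheapest first: structural (ms, free) -> semantic (one embedding call)
# -> quality (full LLM judge call). Stubs shown for illustration.
result = evaluate(
    '{"summary": "ok", "date": "2024-01-01"}',
    [
        ("structural", lambda s: True),
        ("semantic", lambda s: True),
        ("quality", lambda s: True),
    ],
)
```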
The method works. GPT-4 reaches about 85% agreement with human experts on MT-Bench (Zheng et al., 2023, NeurIPS), exceeding the 81% agreement among humans themselves. But the headline number hides important caveats.
Biases to account for:
- Position bias: judges favor the first response in pairwise comparisons (Zheng et al., 2023); swap positions and average.
- Verbosity bias: longer answers score higher regardless of quality.
- Self-enhancement bias: models rate their own outputs more favorably.
When LLM-as-judge works:
- Subjective dimensions with a clear rubric: tone, helpfulness, completeness.
- Review at a scale humans can't reach, once the judge is calibrated against a sample of human labels.
When it doesn't:
- Properties that are objectively checkable (schema, syntax, exact values), where layers 1 and 2 are cheaper and more reliable.
- Grading math and complex reasoning, a documented weakness of judge models (Zheng et al., 2023).
Implementation rule: Use binary pass/fail, not Likert scales. Multi-point scales (1-5) produce noisy, unactionable data — annotators and judges don't agree on what "3" means. Binary forces clarity and enables precision/recall measurement. Hamel Husain: "people don't know what to do with a 3 or 4."
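A minimal sketch of a binary judge under those rules; the rubric text is an illustrative assumption and `call_llm` is a hypothetical stand-in for your provider's client:

```python
JUDGE_RUBRIC = """You are grading a customer-support reply.
Answer PASS only if ALL of the following hold:
- It answers the user's actual question.
- The tone is professional and non-dismissive.
- It invents no policies or facts.
Respond with exactly one word: PASS or FAIL."""

def call_llm(system_prompt: str, user_content: str) -> str:
    """Hypothetical LLM call; wire up your provider's client here."""
    raise NotImplementedError

def judge_passes(output: str) -> bool:
    """Layer 3: binary verdict only. No 1-5 scale, so the judge's own
    precision and recall can be measured against human labels."""
    verdict = call_llm(JUDGE_RUBRIC, output).strip().upper()
    return verdict == "PASS"
```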
Treat prompts as versioned assets. A prompt change is a code change — it should go through CI, run against a golden test set, and fail the build if quality drops.
The four-component pipeline:
1. Versioned prompts stored alongside code and reviewed like code.
2. A golden test set of representative inputs with expected properties.
3. An evaluation runner that applies the layered checks above.
4. A CI gate that fails the build when the pass rate drops below threshold.
The most common anti-pattern: jumping to LLM-based evaluation before building basic assertions. String matching and schema validation catch 60-70% of real failures and cost nothing.
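Tying the four components together, a pytest-style golden-set gate might look like the following; the paths, the 0.90 floor, and both helper functions are assumptions for illustration:

```python
# test_prompt_regression.py: runs in CI on every prompt change.
import json
import pathlib

GOLDEN = pathlib.Path("evals/golden_set.jsonl")  # assumed location
PASS_RATE_FLOOR = 0.90  # assumed threshold; the build fails below it

def run_prompt(case: dict) -> str:
    """Hypothetical: render the versioned prompt and call the model."""
    raise NotImplementedError

def checks_pass(output: str, expected: str) -> bool:
    """Stand-in for the layered checks (structural, then semantic)."""
    raise NotImplementedError

def test_golden_set_pass_rate():
    cases = [json.loads(line) for line in GOLDEN.read_text().splitlines()]
    passed = sum(checks_pass(run_prompt(c), c["expected"]) for c in cases)
    assert passed / len(cases) >= PASS_RATE_FLOOR
```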
These metrics appear in papers and dashboards but don't correlate with what matters:
- BLEU and ROUGE for general text quality: they measure n-gram overlap, not meaning.
- BERTScore as a stand-in for factual consistency.
- Averaged Likert scores (see the binary pass/fail rule above).
- Generic vector similarity for classification tasks.
If someone asks for "BLEU scores" on general text quality, push back.
Match the metric to the task:
| Task | Use | Don't Use |
|---|---|---|
| RAG / Q&A | RAGAS faithfulness (0.95 human agreement), answer relevance | BLEU, ROUGE |
| Summarization | NLI-based factual consistency (ROC-AUC 0.85) | ROUGE, BERTScore |
| Code generation | pass@k (execution-based), test suite pass rate | BLEU |
| Classification | Precision, recall, F1, ROC-AUC | Vector similarity |
| Translation | COMET, chrF | BLEU |
| Creative writing | Human evaluation | Any automated metric |
| General quality | Binary pass/fail + critique | Likert scales |
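Of the metrics in the table, pass@k is the easiest to compute incorrectly. Chen et al. (2021) give an unbiased estimator over n samples per problem, c of which pass the test suite:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability
    that at least one of k samples drawn from n passes, given that
    c of the n pass the test suite: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every draw of k must include a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. n=200 samples, c=30 passing: pass@1 = 0.15, pass@10 ~ 0.81
```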
Static test suites catch regressions only if you run them. In production, you need continuous monitoring on real traffic.
Alerting thresholds (from research):
Key insight: Don't alert on individual sample failures. Alert on statistically significant drops in aggregate metrics.
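A sketch of that rule as a one-sided two-proportion z-test, comparing a recent window's pass rate against a baseline window; the alpha and the example window sizes are assumptions to tune:

```python
from math import sqrt
from statistics import NormalDist

def drift_alert(baseline_pass: int, baseline_n: int,
                recent_pass: int, recent_n: int,
                alpha: float = 0.01) -> bool:
    """Fire only when the recent pass rate is significantly below
    baseline, never on an individual sample failure."""
    p_base = baseline_pass / baseline_n
    p_recent = recent_pass / recent_n
    pooled = (baseline_pass + recent_pass) / (baseline_n + recent_n)
    se = sqrt(pooled * (1 - pooled) * (1 / baseline_n + 1 / recent_n))
    if se == 0:
        return False
    z = (p_recent - p_base) / se
    return NormalDist().cdf(z) < alpha  # one-sided: only drops alert

# e.g. baseline 9300/10000 vs last hour 430/500 (93% -> 86%): alert fires
```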
Constitutional Classifiers (Anthropic, 2025) reduced jailbreak success from 86% to 4.4% with only 0.38% increase in false refusals on legitimate queries. Runtime guardrails work at this scale when built on synthetically generated training data aligned to a constitution.
Defense in depth: input validation (PII detection, jailbreak detection, topic bounds) → model inference (Constitutional AI training) → output validation (toxicity, schema) → business logic checks. No single layer is sufficient.
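A sketch of the wiring with every guard left abstract; the shapes here are assumptions, not the API of any particular guardrails library:

```python
from typing import Callable, Optional

# A guard returns None to pass, or a rejection reason.
Guard = Callable[[str], Optional[str]]

def guarded_respond(user_input: str,
                    input_guards: list[Guard],
                    generate: Callable[[str], str],
                    output_guards: list[Guard]) -> str:
    """Defense in depth: input guards (PII, jailbreak, topic bounds),
    then the model, then output guards (toxicity, schema). Any layer
    can reject; none is trusted alone."""
    for guard in input_guards:
        if (reason := guard(user_input)) is not None:
            return f"rejected at input: {reason}"
    draft = generate(user_input)
    for guard in output_guards:
        if (reason := guard(draft)) is not None:
            return f"rejected at output: {reason}"
    return draft
```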
For implementation patterns, framework comparisons, and code examples, see reference.md.