LLM outputs are probabilistic. The same prompt produces different outputs across runs. Traditional test assertions — exact equality, deterministic return values — break almost immediately. This creates a gap most teams fill with the wrong things: vanity metrics, under-specified judges, and evaluation frameworks bought before the basics work.
This skill covers what actually works, ranked by when to use it, and what to skip.
For implementation patterns and code, see reference.md.
Build validation from cheap and deterministic to expensive and probabilistic. Each layer catches different failures.
Layer 1 — Structural (deterministic, fast, no LLM calls): Schema validation, regex, length bounds, required fields, JSON syntax, SQL syntax. These should run in milliseconds and have zero false positives. If an output fails structural validation, don't bother with semantic or quality checks.
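A Layer 1 gate needs nothing beyond the standard library. In this sketch the required fields, length bound, and date format are illustrative assumptions, not a prescribed schema:

```python
import json
import re

def structural_check(raw: str) -> bool:
    """Layer 1: deterministic, millisecond checks with zero LLM calls."""
    # JSON must parse at all
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    # Top level must be an object with the required fields
    if not isinstance(data, dict) or not all(k in data for k in ("summary", "date")):
        return False
    # Length bounds on free text
    if not isinstance(data["summary"], str) or not 1 <= len(data["summary"]) <= 2000:
        return False
    # Format check via regex: ISO 8601 date
    return bool(re.fullmatch(r"\d{4}-\d{2}-\d{2}", str(data["date"])))
```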
Layer 2 — Semantic (model-assisted, moderate cost): NLI-based entailment for faithfulness, embedding cosine similarity, fuzzy matching. This layer checks whether outputs mean the right thing, not just whether they look right.
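A sketch of a Layer 2 similarity check; `embed` is a hypothetical stand-in for your embeddings provider or a local sentence-transformer, and the 0.85 threshold is an assumption to calibrate on labeled pairs:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical embedding call; substitute your provider's
    embeddings endpoint or a local model."""
    raise NotImplementedError

def semantically_close(output: str, reference: str,
                       threshold: float = 0.85) -> bool:
    """Layer 2: does the output mean the right thing? Cosine similarity
    between output and reference embeddings, gated on a tuned threshold."""
    a, b = embed(output), embed(reference)
    cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return cosine >= threshold
```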
Layer 3 — Quality (LLM-graded, highest cost): LLM-as-judge with rubrics for subjective dimensions: tone, completeness, helpfulness. Run this last, on outputs that pass layers 1 and 2.
The ordering matters. Most production failures are structural or semantic. Running a quality judge on outputs with broken JSON is waste.
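One way to make the ordering concrete is a short-circuiting runner that walks checks from cheapest to most expensive; the stub lambdas below stand in for the layer functions sketched above:

```python
from typing import Callable

def evaluate(raw: str, checks: list[tuple[str, Callable[[str], bool]]]) -> str:
    """Run checks in cost order and stop at the first failure,
    so the expensive judge never sees structurally broken output."""
    for name, check in checks:
        if not check(raw):
            return f"fail:{name}"
    return "pass"

# Cheapest first: structural (ms, free) -> semantic (one embedding call)
# -> quality (full LLM judge call). Stubs shown for illustration.
result = evaluate(
    '{"summary": "ok", "date": "2024-01-01"}',
    [
        ("structural", lambda s: True),
        ("semantic", lambda s: True),
        ("quality", lambda s: True),
    ],
)
```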
The method works. GPT-4 reaches about 85% agreement with human experts on MT-Bench (Zheng et al., 2023, NeurIPS), exceeding the 81% agreement among humans themselves. But the headline number hides important caveats.
Biases to account for:
- Position bias: judges favor the first response in pairwise comparisons (Zheng et al., 2023); swap positions and average.
- Verbosity bias: longer answers score higher regardless of quality.
- Self-enhancement bias: models rate their own outputs more favorably.
When LLM-as-judge works:
- Subjective dimensions with a clear rubric: tone, helpfulness, completeness.
- Review at a scale humans can't reach, once the judge is calibrated against a sample of human labels.
When it doesn't:
- Properties that are objectively checkable (schema, syntax, exact values), where layers 1 and 2 are cheaper and more reliable.
- Grading math and complex reasoning, a documented weakness of judge models (Zheng et al., 2023).
Implementation rule: Use binary pass/fail, not Likert scales. Multi-point scales (1-5) produce noisy, unactionable data — annotators and judges don't agree on what "3" means. Binary forces clarity and enables precision/recall measurement. Hamel Husain: "people don't know what to do with a 3 or 4."
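A minimal sketch of a binary judge under those rules; the rubric text is an illustrative assumption and `call_llm` is a hypothetical stand-in for your provider's client:

```python
JUDGE_RUBRIC = """You are grading a customer-support reply.
Answer PASS only if ALL of the following hold:
- It answers the user's actual question.
- The tone is professional and non-dismissive.
- It invents no policies or facts.
Respond with exactly one word: PASS or FAIL."""

def call_llm(system_prompt: str, user_content: str) -> str:
    """Hypothetical LLM call; wire up your provider's client here."""
    raise NotImplementedError

def judge_passes(output: str) -> bool:
    """Layer 3: binary verdict only. No 1-5 scale, so the judge's own
    precision and recall can be measured against human labels."""
    verdict = call_llm(JUDGE_RUBRIC, output).strip().upper()
    return verdict == "PASS"
```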
Treat prompts as versioned assets. A prompt change is a code change — it should go through CI, run against a golden test set, and fail the build if quality drops.
The four-component pipeline:
1. Versioned prompts stored alongside code and reviewed like code.
2. A golden test set of representative inputs with expected properties.
3. An evaluation runner that applies the layered checks above.
4. A CI gate that fails the build when the pass rate drops below threshold.
The most common anti-pattern: jumping to LLM-based evaluation before building basic assertions. String matching and schema validation catch 60-70% of real failures and cost nothing.
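Tying the four components together, a pytest-style golden-set gate might look like the following; the paths, the 0.90 floor, and both helper functions are assumptions for illustration:

```python
# test_prompt_regression.py: runs in CI on every prompt change.
import json
import pathlib

GOLDEN = pathlib.Path("evals/golden_set.jsonl")  # assumed location
PASS_RATE_FLOOR = 0.90  # assumed threshold; the build fails below it

def run_prompt(case: dict) -> str:
    """Hypothetical: render the versioned prompt and call the model."""
    raise NotImplementedError

def checks_pass(output: str, expected: str) -> bool:
    """Stand-in for the layered checks (structural, then semantic)."""
    raise NotImplementedError

def test_golden_set_pass_rate():
    cases = [json.loads(line) for line in GOLDEN.read_text().splitlines()]
    passed = sum(checks_pass(run_prompt(c), c["expected"]) for c in cases)
    assert passed / len(cases) >= PASS_RATE_FLOOR
```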
These metrics appear in papers and dashboards but don't correlate with what matters:
- BLEU and ROUGE for general text quality: they measure n-gram overlap, not meaning.
- BERTScore as a stand-in for factual consistency.
- Averaged Likert scores (see the binary pass/fail rule above).
- Generic vector similarity for classification tasks.
If someone asks for "BLEU scores" on general text quality, push back.
Match the metric to the task:
| Task | Use | Don't Use |
|---|---|---|
| RAG / Q&A | RAGAS faithfulness (0.95 human agreement), answer relevance | BLEU, ROUGE |
| Summarization | NLI-based factual consistency (ROC-AUC 0.85) | ROUGE, BERTScore |
| Code generation | pass@k (execution-based), test suite pass rate | BLEU |
| Classification | Precision, recall, F1, ROC-AUC | Vector similarity |
| Translation | COMET, chrF | BLEU |
| Creative writing | Human evaluation | Any automated metric |
| General quality | Binary pass/fail + critique | Likert scales |
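Of the metrics in the table, pass@k is the easiest to compute incorrectly. Chen et al. (2021) give an unbiased estimator over n samples per problem, c of which pass the test suite:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability
    that at least one of k samples drawn from n passes, given that
    c of the n pass the test suite: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every draw of k must include a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. n=200 samples, c=30 passing: pass@1 = 0.15, pass@10 ~ 0.81
```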
Static test suites catch regressions only if you run them. In production, you need continuous monitoring on real traffic.
Alerting thresholds (from research):
Key insight: Don't alert on individual sample failures. Alert on statistically significant drops in aggregate metrics.
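A sketch of that rule as a one-sided two-proportion z-test, comparing a recent window's pass rate against a baseline window; the alpha and the example window sizes are assumptions to tune:

```python
from math import sqrt
from statistics import NormalDist

def drift_alert(baseline_pass: int, baseline_n: int,
                recent_pass: int, recent_n: int,
                alpha: float = 0.01) -> bool:
    """Fire only when the recent pass rate is significantly below
    baseline, never on an individual sample failure."""
    p_base = baseline_pass / baseline_n
    p_recent = recent_pass / recent_n
    pooled = (baseline_pass + recent_pass) / (baseline_n + recent_n)
    se = sqrt(pooled * (1 - pooled) * (1 / baseline_n + 1 / recent_n))
    if se == 0:
        return False
    z = (p_recent - p_base) / se
    return NormalDist().cdf(z) < alpha  # one-sided: only drops alert

# e.g. baseline 9300/10000 vs last hour 430/500 (93% -> 86%): alert fires
```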
Constitutional Classifiers (Anthropic, 2025) reduced jailbreak success from 86% to 4.4% with only 0.38% increase in false refusals on legitimate queries. Runtime guardrails work at this scale when built on synthetically generated training data aligned to a constitution.
Defense in depth: input validation (PII detection, jailbreak detection, topic bounds) → model inference (Constitutional AI training) → output validation (toxicity, schema) → business logic checks. No single layer is sufficient.
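A sketch of the wiring with every guard left abstract; the shapes here are assumptions, not the API of any particular guardrails library:

```python
from typing import Callable, Optional

# A guard returns None to pass, or a rejection reason.
Guard = Callable[[str], Optional[str]]

def guarded_respond(user_input: str,
                    input_guards: list[Guard],
                    generate: Callable[[str], str],
                    output_guards: list[Guard]) -> str:
    """Defense in depth: input guards (PII, jailbreak, topic bounds),
    then the model, then output guards (toxicity, schema). Any layer
    can reject; none is trusted alone."""
    for guard in input_guards:
        if (reason := guard(user_input)) is not None:
            return f"rejected at input: {reason}"
    draft = generate(user_input)
    for guard in output_guards:
        if (reason := guard(draft)) is not None:
            return f"rejected at output: {reason}"
    return draft
```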
For implementation patterns, framework comparisons, and code examples, see reference.md.