Audits LLM eval pipelines for issues like missing error analysis, unvalidated judges, and vanity metrics. Produces prioritized findings with fixes when inheriting systems or verifying trustworthiness.
Install with `npx claudepluginhub hamelsmu/evals-skills --plugin evals-skills`. This skill uses the workspace's default tool permissions.
Inspect an LLM eval pipeline and produce a prioritized list of problems with concrete next steps.
Access to eval artifacts (traces, evaluator configs, judge prompts, labeled data) via an observability MCP server or local files. If none exist, skip to "No Eval Infrastructure."
Check whether the user has an observability MCP server connected (Phoenix, Braintrust, LangSmith, Truesight or similar). If available, use it to pull traces, evaluator definitions, and experiment results. If not, ask for local files: CSVs, JSON trace exports, notebooks, or evaluation scripts.
Work through each area below. Inspect available artifacts, determine whether the problem exists, and record a finding if it does.
Prioritize findings by impact on the user's product. Present the most impactful findings first.
Check: Has the user done systematic error analysis on real or synthetic traces?
Look for: labeled trace datasets, failure category definitions, notes from trace review. If evaluators exist but failure categories were never documented, error analysis was likely skipped.
Finding if missing: Evaluators built without error analysis measure generic qualities ("helpfulness", "coherence") instead of actual failure modes. Start with error-analysis, or generate-synthetic-data first if no traces exist.
See: Your AI Product Needs Evals, LLM Evals FAQ
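When checking for labeled trace datasets, it helps to know what one record might look like. A minimal sketch, assuming a dict-per-trace shape; the field names here are illustrative, not a required schema:

```python
# Hypothetical record produced during error analysis: a binary verdict
# plus an application-grounded failure category observed from the trace.
labeled_trace = {
    "trace_id": "t-0042",
    "input": "Find 2-bedroom apartments under $2000 near downtown.",
    "output": "Here are some great 3-bedroom homes over your budget...",
    "verdict": "Fail",                                 # binary, not a 1-5 score
    "failure_category": "missing query constraints",   # observed, not borrowed from research
    "notes": "Ignored bedroom count and price cap.",
}
```

The presence (or absence) of records like this is the fastest signal for whether error analysis actually happened.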
Check: Were failure categories brainstormed or observed?
Generic labels borrowed from research ("hallucination score", "toxicity", "coherence") suggest brainstorming. Application-grounded categories ("missing query constraints", "wrong client tone", "fabricated property features") suggest observation.
Finding if brainstormed: Generic categories miss application-specific failures and produce evaluators that score well on paper but miss real problems. Re-do with error-analysis, starting from traces.
See: Who Validates the Validators?
Check: Are evaluators binary pass/fail?
Flag any that use Likert scales (1-5), letter grades (A-F), or numeric scores without a clear pass/fail threshold.
Finding if not binary: Likert scales are difficult to calibrate. Annotators disagree on the difference between a 3 and a 4, and judges inherit that noise. Consider converting to binary pass/fail with explicit definitions using write-judge-prompt.
See: Creating an LLM Judge That Drives Business Results
Check: Do LLM judge prompts target specific failure modes?
Flag any that evaluate holistically ("Is this response helpful?", "Rate the quality of this output").
Finding if vague: Holistic judges produce unactionable verdicts. Each judge should check exactly one failure mode with explicit pass/fail definitions and few-shot examples. Use write-judge-prompt.
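For contrast with a holistic judge, here is a sketch of a prompt scoped to exactly one failure mode. The failure mode, wording, and `render_judge_prompt` helper are illustrative assumptions, not output of write-judge-prompt:

```python
# Hypothetical single-failure-mode judge prompt: one check, explicit
# pass/fail definitions, one few-shot example, binary verdict.
JUDGE_PROMPT = """\
You are checking ONE thing: does the response mention property features
that do not appear in the listing data?

Pass: every feature mentioned in the response appears in the listing data.
Fail: any mentioned feature is absent from the listing data.

Example (Fail):
Listing: 2 bed, 1 bath, no parking.
Response: "...includes a private garage..."
Verdict: Fail

Listing data:
{listing}

Response to evaluate:
{response}

Answer with exactly one word: Pass or Fail."""


def render_judge_prompt(listing: str, response: str) -> str:
    return JUDGE_PROMPT.format(listing=listing, response=response)
```

A verdict from a prompt like this maps directly to one failure category, so a Fail is immediately actionable.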
Check: Are code-based checks used where possible?
Flag LLM judges used for objectively checkable criteria: format validation, constraint satisfaction, keyword presence, schema conformance.
Finding if over-relying on judges: Replace objective checks with code (regex, parsing, schema validation, execution tests). Reserve LLM judges for criteria requiring interpretation.
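As a sketch of what "replace objective checks with code" means in practice (the required keys and placeholder pattern are assumptions for illustration):

```python
import json
import re


def check_format(output: str) -> dict:
    """Code-based checks for objectively verifiable criteria --
    none of these need an LLM judge."""
    checks = {}
    # Schema conformance: output must be valid JSON with required keys.
    try:
        data = json.loads(output)
        checks["valid_json"] = True
        checks["has_required_keys"] = {"answer", "sources"} <= data.keys()
    except json.JSONDecodeError:
        checks["valid_json"] = False
        checks["has_required_keys"] = False
    # Constraint satisfaction: no template placeholders leaked to the user.
    checks["no_placeholders"] = re.search(r"\{\{.*?\}\}", output) is None
    return checks


result = check_format('{"answer": "42", "sources": ["doc1"]}')
```

Checks like these are deterministic, free, and never drift, which is exactly why they should be exhausted before reaching for a judge.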
Check: Are similarity metrics used as primary evaluation?
Flag ROUGE, BERTScore, cosine similarity, or embedding distance used as the main evaluator for generation quality.
Finding if present: These metrics measure surface-level overlap, not correctness. They suit retrieval ranking but not generation evaluation. Replace with binary evaluators grounded in specific failure modes.
See: LLM Evals FAQ
Check: Are LLM judges validated against human labels?
Look for: confusion matrices, TPR/TNR measurements, alignment scores. A judge running in production with no validation data is a critical finding.
Finding if unvalidated: An unvalidated judge may consistently miss failures or flag passing traces. Measure alignment using TPR and TNR on a held-out test set. Use validate-evaluator.
See: Creating an LLM Judge That Drives Business Results
Check: Is alignment measured with TPR/TNR or with raw accuracy?
Flag "accuracy", "percent agreement", or Cohen's Kappa as the primary alignment metric.
Finding if using accuracy: With class imbalance, raw accuracy is misleading: a judge that always says "Pass" gets 90% accuracy when 90% of traces pass but catches zero failures. Use TPR and TNR, which map directly to bias correction. Use validate-evaluator.
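The always-pass failure mode above can be made concrete. A minimal sketch, treating "Fail" as the positive class (a convention assumed here, not mandated by the source):

```python
def alignment_metrics(human: list, judge: list):
    """human/judge: parallel 'Pass'/'Fail' labels on the same traces."""
    pairs = list(zip(human, judge))
    tp = sum(h == "Fail" and j == "Fail" for h, j in pairs)  # failures caught
    fn = sum(h == "Fail" and j == "Pass" for h, j in pairs)  # failures missed
    tn = sum(h == "Pass" and j == "Pass" for h, j in pairs)
    fp = sum(h == "Pass" and j == "Fail" for h, j in pairs)
    acc = (tp + tn) / len(pairs)
    tpr = tp / (tp + fn) if tp + fn else 0.0  # recall on real failures
    tnr = tn / (tn + fp) if tn + fp else 0.0  # recall on real passes
    return acc, tpr, tnr


# 90% of traces pass; a judge that always says "Pass" looks accurate.
human = ["Pass"] * 90 + ["Fail"] * 10
judge = ["Pass"] * 100
acc, tpr, tnr = alignment_metrics(human, judge)  # acc=0.9, tpr=0.0, tnr=1.0
```

The 90% accuracy hides a TPR of zero: the judge catches no failures at all, which is the entire point of reporting TPR/TNR separately.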
Check: Is there a proper train/dev/test split?
Check whether few-shot examples in judge prompts come from the same data used to measure judge performance.
Finding if leaking: Using evaluation data as few-shot examples inflates alignment scores and hides real judge failures. Split into train (few-shot source), dev (iteration), and test (final measurement). Use validate-evaluator.
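One way to audit for leakage is to check that the split is actually enforced in code. A minimal sketch, assuming labeled traces arrive as a flat list; the 20/30/50 fractions are an illustrative choice:

```python
import random


def split_labeled_traces(traces: list, seed: int = 0, frac=(0.2, 0.3, 0.5)):
    """Split labeled traces so few-shot examples (train) never overlap
    with the data used to measure judge performance (test)."""
    rng = random.Random(seed)  # fixed seed: split is reproducible
    shuffled = traces[:]
    rng.shuffle(shuffled)
    n_train = int(frac[0] * len(shuffled))
    n_dev = n_train + int(frac[1] * len(shuffled))
    return (
        shuffled[:n_train],        # train: source of few-shot examples
        shuffled[n_train:n_dev],   # dev: prompt iteration
        shuffled[n_dev:],          # test: final TPR/TNR measurement, touched once
    )
```

If a trace in the judge prompt's few-shot examples also appears in the test set, the alignment number is not trustworthy.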
Check: Who is reviewing traces?
Determine whether domain experts or outsourced annotators are labeling data.
Finding if outsourced without domain expertise: General annotators catch formatting errors but miss domain-specific failures (wrong medical dosage, incorrect legal citation, mismatched property features). Involve a domain expert.
See: A Field Guide to Improving AI Products
Check: Are reviewers seeing full traces or just final outputs?
Finding if output-only: Reviewing only the final output hides where the pipeline broke. Show the full trace: input, intermediate steps, tool calls, retrieved context, and final output.
Check: How is data displayed to reviewers?
Flag raw JSON, unformatted text, or spreadsheets with trace data in cells.
Finding if raw format: Reviewers spend effort parsing data instead of judging quality. Format in natural representation: render markdown, syntax-highlight code, display tables as tables. Use build-review-interface.
See: LLM Evals FAQ
Check: Is there enough labeled data?
For error analysis, ~100 traces is the rough target for saturation. For judge validation, ~50 Pass and ~50 Fail examples are needed for reliable TPR/TNR.
Finding if insufficient: Small datasets produce unreliable failure rates and wide confidence intervals. Collect more labeled data through targeted trace sampling, or supplement with generate-synthetic-data.
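The "wide confidence intervals" point can be quantified. A sketch using the Wilson score interval (one reasonable choice among several for a binomial proportion):

```python
import math


def wilson_interval(failures: int, n: int, z: float = 1.96):
    """~95% Wilson score confidence interval for an observed failure rate."""
    if n == 0:
        return (0.0, 1.0)
    p = failures / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (center - half, center + half)


# Same 20% observed failure rate, very different certainty:
lo_small, hi_small = wilson_interval(2, 10)    # wide interval
lo_big, hi_big = wilson_interval(20, 100)      # much tighter
```

With 10 traces the plausible failure rate spans roughly 6% to 51%; with 100 it narrows to roughly 13% to 29%. This is why small labeled sets cannot support confident claims about failure rates.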
Check: Is error analysis re-run after significant changes?
Check when error analysis was last performed relative to model switches, prompt rewrites, new features, or production incidents.
Finding if stale: Failure modes shift after pipeline changes, and evaluators built for the old pipeline miss new failure types. Re-run error analysis after every significant change.
Check: Are evaluators maintained?
Look for periodic re-validation of judges or refreshed evaluation datasets.
Finding if set-and-forget: Evaluators degrade as the pipeline evolves. Re-validate judges against fresh human labels and update eval datasets to reflect current usage.
If the user has no eval artifacts (no traces, no evaluators, no labeled data):
- If real traces can be collected from the running system, start with error-analysis on a sample of real traces.
- If not, use generate-synthetic-data to create test inputs, run them through the pipeline, then apply error-analysis to the resulting traces.

Present findings ordered by impact. For each:
### [Problem Title]
**Status:** [Problem exists / OK / Cannot determine]
[1-2 sentence explanation of the specific problem found]
**Fix:** [Concrete action, referencing a skill or article]
Group under the six diagnostic areas. Omit areas where no problems were found.