By hamelsmu
Build robust LLM evaluation pipelines by auditing setups for issues, conducting error analysis on traces, generating synthetic test data, designing and validating LLM-as-judge prompts, evaluating RAG with custom metrics, and creating browser-based UIs for human annotation and labeling.
npx claudepluginhub hamelsmu/evals-skills --plugin evals-skills

**build-review-interface**: Build a custom browser-based annotation interface tailored to your data for reviewing LLM traces and collecting structured feedback. Use when you need to build an annotation tool, review traces, or collect human labels.
**error-analysis**: Help the user systematically identify and categorize failure modes in an LLM pipeline by reading traces. Use when starting a new eval project, after significant pipeline changes (new features, model switches, prompt rewrites), when production metrics drop, or after incidents.
**eval-audit**: Audit an LLM eval pipeline and surface problems: missing error analysis, unvalidated judges, vanity metrics, etc. Use when inheriting an eval system, when unsure whether evals are trustworthy, or as a starting point when no eval infrastructure exists. Do NOT use when the goal is to build a new evaluator from scratch (use error-analysis, write-judge-prompt, or validate-evaluator instead).
**evaluate-rag**: Guide evaluation of retrieval and generation quality in a RAG pipeline. Use when evaluating a retrieval-augmented generation system, measuring retrieval quality, assessing generation faithfulness or relevance, generating synthetic QA pairs for retrieval testing, or optimizing chunking strategies. A minimal retrieval-metric sketch appears after these skill descriptions.
**generate-synthetic-data**: Create diverse synthetic test inputs for LLM pipeline evaluation using dimension-based tuple generation (sketched below). Use when bootstrapping an eval dataset, when real user data is sparse, or when stress-testing specific failure hypotheses. Do NOT use when you already have 100+ representative real traces (use stratified sampling instead), or when the task is collecting production logs.
**validate-evaluator**: Calibrate an LLM judge against human labels using data splits, TPR/TNR, and bias correction (sketched below). Use after writing a judge prompt (write-judge-prompt) when you need to verify alignment before trusting its outputs. Do NOT use for code-based evaluators (those are deterministic; test with standard unit tests).
**write-judge-prompt**: Design LLM-as-Judge evaluators for subjective criteria that code-based checks cannot handle. Use when a failure mode requires interpretation (tone, faithfulness, relevance, completeness). Do NOT use when the failure mode can be checked with code (regex, schema validation, execution tests). Do NOT use when you need to validate or calibrate the judge; use validate-evaluator instead.
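To make the retrieval-quality side of evaluate-rag concrete, here is a minimal recall@k sketch computed over synthetic QA pairs. The `retrieve` callable and the QA-pair fields are placeholders for whatever your pipeline exposes; the skill does not prescribe this exact code.

```python
# Minimal recall@k sketch for retrieval evaluation (illustrative only).
# `retrieve` and the QA-pair fields are hypothetical stand-ins for your pipeline.
from typing import Callable

def recall_at_k(
    qa_pairs: list[dict],                       # e.g. [{"question": ..., "source_id": ...}, ...]
    retrieve: Callable[[str, int], list[str]],  # question, k -> ranked list of document ids
    k: int = 5,
) -> float:
    """Fraction of questions whose known source document shows up in the top-k results."""
    if not qa_pairs:
        return 0.0
    hits = sum(
        pair["source_id"] in retrieve(pair["question"], k)
        for pair in qa_pairs
    )
    return hits / len(qa_pairs)
```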
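The dimension-based tuple generation that generate-synthetic-data refers to can be pictured roughly as: pick the dimensions that drive variation in your inputs, enumerate or sample combinations, then have an LLM expand each tuple into a realistic input. The dimensions and prompt wording below are illustrative, not part of the skill.

```python
# Dimension-based tuple generation, roughly: dimensions -> tuples -> LLM prompts.
# The dimension names and values here are made up for illustration.
import itertools
import random

dimensions = {
    "persona":    ["new user", "power user", "frustrated customer"],
    "intent":     ["refund request", "feature question", "bug report"],
    "complexity": ["single issue", "multiple issues in one message"],
}

# Cartesian product of the dimension values gives every candidate tuple.
all_tuples = list(itertools.product(*dimensions.values()))

# Sample a manageable subset instead of generating every combination.
for persona, intent, complexity in random.sample(all_tuples, k=min(10, len(all_tuples))):
    prompt = (
        f"Write a realistic support message from a {persona} "
        f"with a {intent}, involving {complexity}."
    )
    # Send `prompt` to your LLM of choice to produce the synthetic test input.
    print(prompt)
```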
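For a feel of the TPR/TNR and bias-correction step in validate-evaluator, here is a rough sketch assuming binary pass/fail labels: measure the judge's true-positive and true-negative rates on a held-out human-labeled split, then adjust its raw pass rate on unlabeled traces with a standard sensitivity/specificity correction. The function names and numbers are illustrative, not the skill's API.

```python
# Judge calibration sketch (illustrative): TPR/TNR vs. human labels, then a
# sensitivity/specificity (Rogan-Gladen style) correction of the judge's pass rate.
# Assumes both pass and fail examples appear in the labeled split.

def tpr_tnr(human: list[bool], judge: list[bool]) -> tuple[float, float]:
    """True-positive and true-negative rates of the judge against human labels."""
    tp = sum(h and j for h, j in zip(human, judge))
    tn = sum((not h) and (not j) for h, j in zip(human, judge))
    positives = sum(human)
    negatives = len(human) - positives
    return tp / positives, tn / negatives

def corrected_pass_rate(observed_pass: float, tpr: float, tnr: float) -> float:
    """Estimate the true pass rate from the judge's raw pass rate on unlabeled traces."""
    return (observed_pass + tnr - 1) / (tpr + tnr - 1)

# Example: the judge passes 80% of unlabeled traces; on a held-out labeled split
# it has TPR = 0.90 and TNR = 0.75, so the corrected estimate is about 0.846.
print(corrected_pass_rate(0.80, 0.90, 0.75))
```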
Skills that guide AI coding agents to help you build LLM evaluations.
These skills guard against common mistakes I've seen while helping 50+ companies and teaching students in our AI Evals course. If you're new to evals, see questions.md for free resources on the fundamentals.
If you are new to evals, start with the eval-audit skill. Give your coding agent these instructions:
Install the eval skills plugin from https://github.com/hamelsmu/evals-skills, then run /evals-skills:eval-audit on my eval pipeline. Investigate each diagnostic area using a separate subagent in parallel, then synthesize the findings into a single report. Use other skills in the plugin as recommended by the audit.
The audit isn't a complete solution, but it will catch common problems we've seen in evals. It will also recommend other skills to use to fix the problems.
In Claude Code, run these two commands:
# Step 1: Register the plugin repository
/plugin marketplace add hamelsmu/evals-skills
# Step 2: Install the plugin
/plugin install evals-skills@hamelsmu-evals-skills
To upgrade:
/plugin update evals-skills@hamelsmu-evals-skills
After installation, restart Claude Code. The skills will appear as /evals-skills:<skill-name>.
If you use the open Skills CLI, install from this repo with:
npx skills add https://github.com/hamelsmu/evals-skills
Install one skill only:
npx skills add https://github.com/hamelsmu/evals-skills --skill eval-audit
Check for updates:
npx skills check
npx skills update
| Skill | What it does |
|---|---|
| eval-audit | Audit an eval pipeline and surface problems with prioritized severity |
| error-analysis | Guide the user through reading traces and categorizing failures |
| generate-synthetic-data | Create diverse synthetic test inputs using dimension-based tuple generation |
| write-judge-prompt | Design LLM-as-Judge evaluators for subjective quality criteria |
| validate-evaluator | Calibrate LLM judges against human labels using data splits, TPR/TNR, and bias correction |
| evaluate-rag | Evaluate retrieval and generation quality in RAG pipelines |
| build-review-interface | Build custom annotation interfaces for human trace review |
Invoke a skill with /evals-skills:skill-name, e.g., /evals-skills:error-analysis.
These skills are a starting point and only encode common mistakes that generalize across projects. Skills grounded in your stack, your domain, and your data will outperform them. Start here, then write your own.
The meta-skill can help you ground custom skills.
These skills handle the parts of eval work that generalize across projects. Much of the process doesn't generalize: production monitoring, CI/CD integration, data analysis, and much more. The course covers all of it.