By hamelsmu
Build and validate LLM evaluation pipelines: design judge prompts, calibrate against human labels, generate synthetic test data, audit pipeline trustworthiness, analyze failure modes, evaluate RAG systems, and collect human annotations via a browser UI.
npx claudepluginhub hamelsmu/evals-skills --plugin evals-skills
Build a custom browser-based annotation interface tailored to your data for reviewing LLM traces and collecting structured feedback. Use when you need to build an annotation tool, review traces, or collect human labels.
Help the user systematically identify and categorize failure modes in an LLM pipeline by reading traces. Use when starting a new eval project, after significant pipeline changes (new features, model switches, prompt rewrites), when production metrics drop, or after incidents.
Audit an LLM eval pipeline and surface problems: missing error analysis, unvalidated judges, vanity metrics, etc. Use when inheriting an eval system, when unsure whether evals are trustworthy, or as a starting point when no eval infrastructure exists. Do NOT use when the goal is to build a new evaluator from scratch (use error-analysis, write-judge-prompt, or validate-evaluator instead).
Guides evaluation of RAG pipeline retrieval and generation quality. Use when evaluating a retrieval-augmented generation system, measuring retrieval quality, assessing generation faithfulness or relevance, generating synthetic QA pairs for retrieval testing, or optimizing chunking strategies.
Create diverse synthetic test inputs for LLM pipeline evaluation using dimension-based tuple generation. Use when bootstrapping an eval dataset, when real user data is sparse, or when stress-testing specific failure hypotheses. Do NOT use when you already have 100+ representative real traces (use stratified sampling instead), or when the task is collecting production logs.
Skills that guide AI coding agents to help you build LLM evaluations.
These skills guard against common mistakes I've seen helping 50+ companies and teaching students in our AI Evals course. If you're new to evals, see questions.md for free resources on the fundamentals.
If you are new to evals, start with the eval-audit skill. Give your coding agent these instructions:
Install the eval skills plugin from https://github.com/hamelsmu/evals-skills, then run /evals-skills:eval-audit on my eval pipeline. Investigate each diagnostic area using a separate subagent in parallel, then synthesize the findings into a single report. Use other skills in the plugin as recommended by the audit.
The audit isn't a complete solution, but it will catch common problems we've seen in evals. It will also recommend other skills to use to fix the problems.
In Claude Code, run these two commands:
# Step 1: Register the plugin repository
/plugin marketplace add hamelsmu/evals-skills
# Step 2: Install the plugin
/plugin install evals-skills@hamelsmu-evals-skills
To upgrade:
/plugin update evals-skills@hamelsmu-evals-skills
After installation, restart Claude Code. The skills will appear as /evals-skills:<skill-name>.
If you use the open Skills CLI, install from this repo with:
npx skills add https://github.com/hamelsmu/evals-skills
Install one skill only:
npx skills add https://github.com/hamelsmu/evals-skills --skill eval-audit
Check for updates:
npx skills check
npx skills update
| Skill | What it does |
|---|---|
| eval-audit | Audit an eval pipeline and surface problems with prioritized severity |
| error-analysis | Guide the user through reading traces and categorizing failures |
| generate-synthetic-data | Create diverse synthetic test inputs using dimension-based tuple generation |
| write-judge-prompt | Design LLM-as-Judge evaluators for subjective quality criteria |
| validate-evaluator | Calibrate LLM judges against human labels using data splits, TPR/TNR, and bias correction |
| evaluate-rag | Evaluate retrieval and generation quality in RAG pipelines |
| build-review-interface | Build custom annotation interfaces for human trace review |
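To make "dimension-based tuple generation" from the table concrete, here is a minimal Python sketch. It is an illustration only, not the skill's actual implementation: the dimensions, their values, and the helper names `generate_tuples` and `tuple_to_prompt` are all hypothetical. The idea is to name a few dimensions along which inputs vary, take combinations of their values, and turn each combination into an instruction for an LLM that writes the test input.

```python
from itertools import product
import random

# Hypothetical dimensions for a customer-support assistant. Replace them with
# dimensions drawn from your own product and failure hypotheses.
DIMENSIONS = {
    "persona": ["new user", "power user", "frustrated customer"],
    "intent": ["refund request", "billing question", "feature how-to"],
    "complexity": ["single-step", "multi-step"],
}


def generate_tuples(dimensions, sample_size=None, seed=0):
    """Return one dict per combination of dimension values.

    If sample_size is smaller than the full cross product, draw that many
    combinations at random without replacement (seeded for repeatability).
    """
    names = list(dimensions)
    combos = [dict(zip(names, values)) for values in product(*dimensions.values())]
    if sample_size is not None and sample_size < len(combos):
        combos = random.Random(seed).sample(combos, sample_size)
    return combos


def tuple_to_prompt(t):
    """Turn one tuple into an instruction for an LLM that writes the test input."""
    return (
        f"Write a realistic message from a {t['persona']} "
        f"with a {t['complexity']} {t['intent']}."
    )


tuples = generate_tuples(DIMENSIONS, sample_size=5)
for t in tuples:
    print(tuple_to_prompt(t))
```

Sampling from the cross product keeps the dataset small while still spreading coverage across dimensions. Reviewing the sampled tuples by hand before generating text is a cheap way to discard combinations that make no sense.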
Invoke a skill with /evals-skills:<skill-name>, for example /evals-skills:error-analysis.
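The validate-evaluator row in the table mentions TPR/TNR and bias correction. The sketch below shows one way these quantities can be computed for a binary pass/fail judge. It is an assumption-laden illustration, not the skill's own code: the function names and sample labels are hypothetical, and the correction shown is the Rogan-Gladen estimator, a common choice that this README does not confirm the skill uses.

```python
def judge_rates(human_labels, judge_labels):
    """Compute TPR and TNR of a binary judge, treating human labels as truth.

    Labels are booleans: True means pass, False means fail.
    """
    pairs = list(zip(human_labels, judge_labels))
    tp = sum(h and j for h, j in pairs)              # judge agrees on a pass
    tn = sum((not h) and (not j) for h, j in pairs)  # judge agrees on a fail
    pos = sum(human_labels)
    neg = len(human_labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one human-labeled pass and one fail")
    return tp / pos, tn / neg


def corrected_pass_rate(observed_rate, tpr, tnr):
    """Adjust a judge-reported pass rate for the judge's known error rates.

    Uses the Rogan-Gladen estimator (an assumed choice of correction):
    (observed + TNR - 1) / (TPR + TNR - 1), clipped to [0, 1].
    """
    denom = tpr + tnr - 1
    if denom <= 0:
        raise ValueError("judge is no better than chance; correction undefined")
    estimate = (observed_rate + tnr - 1) / denom
    return min(1.0, max(0.0, estimate))


# Hypothetical held-out labels: 5 human passes and 5 human fails.
human = [True, True, True, True, True, False, False, False, False, False]
judge = [True, True, True, True, False, False, False, False, False, True]

tpr, tnr = judge_rates(human, judge)
# Suppose the judge passes 70% of unlabeled production traces.
print(tpr, tnr, corrected_pass_rate(0.7, tpr, tnr))
```

Measure TPR and TNR on a labeled split that was not used to tune the judge prompt; otherwise the rates, and any correction built on them, will look better than they are.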
These skills are a starting point and only encode common mistakes that generalize across projects. Skills grounded in your stack, your domain, and your data will outperform them. Start here, then write your own.
The meta-skill can help you ground custom skills.
These skills handle the parts of eval work that generalize across projects. Much of the process doesn't: production monitoring, CI/CD integration, data analysis, and much more. The course covers all of it.