Design eval datasets that actually measure model quality — coverage, difficulty distribution, labeling consistency, and avoiding contamination. Covers sourcing, stratification, label quality, and when to generate vs curate. Use this skill when building a new eval set, realizing your current evals don't catch regressions, or labeling is inconsistent. Activate when: eval dataset, benchmark, test set, eval coverage, label quality, synthetic eval, dataset design.
```
npx claudepluginhub latestaiagents/agent-skills --plugin skills-authoring
```
**Your evals are only as good as the dataset they run on. Miss a user scenario and you'll never catch regressions on it.**
Best to worst: real user data, synthetic generations, adversarial examples, public benchmarks.
A good eval set mixes all four. Typical split: 60% real, 20% synthetic, 15% adversarial, 5% benchmark.
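As a sketch of how that mix might be assembled, assuming each source is already a list of items (the function and its parameters are illustrative, not part of any library):

```python
import random

def assemble_eval_set(real, synthetic, adversarial, benchmark, n=200, seed=0):
    """Sample the 60/20/15/5 source mix from four item pools."""
    rng = random.Random(seed)
    mix = [(real, 0.60), (synthetic, 0.20), (adversarial, 0.15), (benchmark, 0.05)]
    dataset = []
    for pool, frac in mix:
        k = min(round(n * frac), len(pool))  # don't oversample a small pool
        dataset.extend(rng.sample(pool, k))
    rng.shuffle(dataset)  # avoid source-ordered runs in the eval file
    return dataset
```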
Split your dataset by categories that matter:
```yaml
dataset:
  categories:
    simple_qa: 100            # easy, high-frequency
    multi_step_reasoning: 50  # medium
    ambiguous_queries: 30     # hard
    edge_cases: 20            # adversarial
    rare_domains: 20          # coverage of the long tail
```
Report metrics per stratum, not just the aggregate. A model can improve on average while regressing on edge cases — you'll only see it stratified.
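A minimal per-stratum report, assuming each result row carries `category` and `passed` fields (the field names are assumptions):

```python
from collections import defaultdict

def report_by_stratum(results):
    """Print pass rate per category, then the aggregate."""
    by_cat = defaultdict(list)
    for r in results:
        by_cat[r["category"]].append(r["passed"])
    for cat, passes in sorted(by_cat.items()):
        print(f"{cat:24s} {sum(passes) / len(passes):6.1%}  (n={len(passes)})")
    overall = [p for ps in by_cat.values() for p in ps]
    print(f"{'aggregate':24s} {sum(overall) / len(overall):6.1%}")
```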
Two people label the same 50 items independently. Compute inter-annotator agreement:
```python
from sklearn.metrics import cohen_kappa_score

# labeler_a, labeler_b: the two annotators' labels for the same 50 items,
# in the same order
kappa = cohen_kappa_score(labeler_a, labeler_b)
```
Target:

- κ ≥ 0.8: strong agreement; labels are trustworthy
- 0.6–0.8: acceptable; tighten the rubric where annotators diverged
- κ < 0.6: the rubric is ambiguous; rewrite it before labeling more
Resolve disagreements with a tiebreaker, then update the rubric based on what caused disagreement.
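Continuing the snippet above, one way to pull out the disputed items for the tiebreaker:

```python
# Each resolved disagreement is a candidate rubric example.
disagreements = [
    (i, a, b)
    for i, (a, b) in enumerate(zip(labeler_a, labeler_b))
    if a != b
]
```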
Write explicit guidelines with examples:
```markdown
### Label: helpful

**Definition**: Response addresses the user's question directly and accurately.

**Examples**:
- Query: "How do I loop in Python?" / Response: Shows `for` loop → YES
- Query: "How do I loop in Python?" / Response: General loop theory → NO (dodges the specific language)
- Query: "Fix this bug" / Response: Points out the bug + fix → YES
- Query: "Fix this bug" / Response: "I'll need more info" (bug is in the code) → NO
```
If two labelers disagree, add their disputed case as a rubric example.
If your eval is in the model's training data, scores are inflated. Check: was the dataset ever posted publicly, is it derived from a public benchmark, and can the model complete held-out items verbatim (a sign of memorization)?
For production evals, rotate the dataset yearly and keep a private held-out set.
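If you can sample text from the suspect corpus (often you can't, so treat this as a rough screen, not proof of cleanliness), a simple verbatim-overlap check looks like this; the 8-gram window is an assumption:

```python
def ngrams(text, n=8):
    """Sliding word n-grams, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_contaminated(eval_items, corpus_texts, n=8):
    """Return eval items sharing any long n-gram with the corpus sample."""
    corpus_grams = set()
    for text in corpus_texts:
        corpus_grams |= ngrams(text, n)
    return [item for item in eval_items if ngrams(item, n) & corpus_grams]
```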
Track difficulty via model pass rate:
- < 30% pass: too hard; models improve but you can't measure it
- 30-80% pass: the useful range
- > 95% pass: too easy; the dataset has plateaued

Prune items that reach 100% for several consecutive model generations; they no longer discriminate.
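A sketch of that pruning rule, assuming `history` maps each item id to its pass rates across model generations, oldest first (the data shape is an assumption):

```python
def dead_items(history, generations=3):
    """Item ids stuck at 100% pass for the last N model generations."""
    dead = []
    for item_id, pass_rates in history.items():
        recent = pass_rates[-generations:]
        if len(recent) == generations and all(rate == 1.0 for rate in recent):
            dead.append(item_id)
    return dead
```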
When you need more coverage:
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const prompt = `Generate 20 diverse user queries that a customer support bot might receive.
Cover: billing (5), technical issues (5), account access (5), general FAQ (5).
Vary wording: formal, casual, angry, confused.
Return JSON array.`;

const response = await client.messages.create({
  model: "claude-opus-4-6",
  max_tokens: 4000,
  messages: [{ role: "user", content: prompt }],
});
```
Then: deduplicate near-identical items (a minimal sketch follows below), have a human review every item, and label before adding anything to the set.
Quality > quantity. 200 well-labeled items beat 5000 noisy ones.
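The dedupe step can be as cheap as word-overlap Jaccard between generated items; the 0.8 threshold here is a guess to tune, not a standard:

```python
def jaccard(a, b):
    """Word-overlap similarity between two strings."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa or wb else 0.0

def dedupe(items, threshold=0.8):
    """Keep an item only if it isn't near-identical to one already kept."""
    kept = []
    for item in items:
        if all(jaccard(item, k) < threshold for k in kept):
            kept.append(item)
    return kept
```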
Treat datasets like code:
```
evals/
  customer_support/
    v1/
      dataset.jsonl
      rubric.md
      CHANGELOG.md
    v2/
      dataset.jsonl
      rubric.md
      CHANGELOG.md
```
Never silently edit. Version bumps communicate "scores before v2 are not comparable to scores after".
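One way to enforce that: make the version travel with every loaded item, so result rows from different dataset versions can't be confused. A sketch against the layout above (the loader itself is hypothetical):

```python
import json
from pathlib import Path

def load_dataset(name, version):
    """Load a versioned eval set, tagging each item with its version."""
    path = Path("evals") / name / version / "dataset.jsonl"
    items = [json.loads(line) for line in path.read_text().splitlines() if line.strip()]
    for item in items:
        item["dataset_version"] = version  # carried into every result row
    return items

# load_dataset("customer_support", "v2")
```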
Keep 100-200 items never published, never used for prompt iteration. Only for:

- final sign-off before a release
- spot-checking that public-set scores haven't drifted from true quality through overfitting
Rotate a fraction yearly.