Design eval datasets that actually measure model quality — coverage, difficulty distribution, labeling consistency, and avoiding contamination. Covers sourcing, stratification, label quality, and when to generate vs curate. Use this skill when building a new eval set, realizing your current evals don't catch regressions, or labeling is inconsistent. Activate when: eval dataset, benchmark, test set, eval coverage, label quality, synthetic eval, dataset design.
```
npx claudepluginhub latestaiagents/agent-skills --plugin skills-authoring
```
**Your evals are only as good as the dataset they run on. Miss a user scenario and you'll never catch regressions on it.**
Best to worst: real user data, synthetic generations, adversarial examples, public benchmarks.
A good eval set mixes all four. Typical split: 60% real, 20% synthetic, 15% adversarial, 5% benchmark.
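As a sketch of how that mix might be assembled, assuming each source is already a list of items (the function and its parameters are illustrative, not part of any library):

```python
import random

def assemble_eval_set(real, synthetic, adversarial, benchmark, n=200, seed=0):
    """Sample the 60/20/15/5 source mix from four item pools."""
    rng = random.Random(seed)
    mix = [(real, 0.60), (synthetic, 0.20), (adversarial, 0.15), (benchmark, 0.05)]
    dataset = []
    for pool, frac in mix:
        k = min(round(n * frac), len(pool))  # don't oversample a small pool
        dataset.extend(rng.sample(pool, k))
    rng.shuffle(dataset)  # avoid source-ordered runs in the eval file
    return dataset
```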
Split your dataset by categories that matter:
```yaml
dataset:
  categories:
    simple_qa: 100            # easy, high-frequency
    multi_step_reasoning: 50  # medium
    ambiguous_queries: 30     # hard
    edge_cases: 20            # adversarial
    rare_domains: 20          # coverage of the long tail
```
Report metrics per stratum, not just the aggregate. A model can improve on average while regressing on edge cases — you'll only see it stratified.
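A minimal per-stratum report, assuming each result row carries `category` and `passed` fields (the field names are assumptions):

```python
from collections import defaultdict

def report_by_stratum(results):
    """Print pass rate per category, then the aggregate."""
    by_cat = defaultdict(list)
    for r in results:
        by_cat[r["category"]].append(r["passed"])
    for cat, passes in sorted(by_cat.items()):
        print(f"{cat:24s} {sum(passes) / len(passes):6.1%}  (n={len(passes)})")
    overall = [p for ps in by_cat.values() for p in ps]
    print(f"{'aggregate':24s} {sum(overall) / len(overall):6.1%}")
```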
Two people label the same 50 items independently. Compute inter-annotator agreement:
```python
from sklearn.metrics import cohen_kappa_score

# labeler_a, labeler_b: the two annotators' labels for the same 50 items,
# in the same order
kappa = cohen_kappa_score(labeler_a, labeler_b)
```
Target:

- κ ≥ 0.8: strong agreement; labels are trustworthy
- 0.6–0.8: acceptable; tighten the rubric where annotators diverged
- κ < 0.6: the rubric is ambiguous; rewrite it before labeling more
Resolve disagreements with a tiebreaker, then update the rubric based on what caused disagreement.
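Continuing the snippet above, one way to pull out the disputed items for the tiebreaker:

```python
# Each resolved disagreement is a candidate rubric example.
disagreements = [
    (i, a, b)
    for i, (a, b) in enumerate(zip(labeler_a, labeler_b))
    if a != b
]
```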
Write explicit guidelines with examples:
```markdown
### Label: helpful

**Definition**: Response addresses the user's question directly and accurately.

**Examples**:
- Query: "How do I loop in Python?" / Response: Shows `for` loop → YES
- Query: "How do I loop in Python?" / Response: General loop theory → NO (dodges the specific language)
- Query: "Fix this bug" / Response: Points out the bug + fix → YES
- Query: "Fix this bug" / Response: "I'll need more info" (bug is in the code) → NO
```
If two labelers disagree, add their disputed case as a rubric example.
If your eval is in the model's training data, scores are inflated. Check: was the dataset ever posted publicly, is it derived from a public benchmark, and can the model complete held-out items verbatim (a sign of memorization)?
For production evals, rotate the dataset yearly and keep a private held-out set.
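If you can sample text from the suspect corpus (often you can't, so treat this as a rough screen, not proof of cleanliness), a simple verbatim-overlap check looks like this; the 8-gram window is an assumption:

```python
def ngrams(text, n=8):
    """Sliding word n-grams, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_contaminated(eval_items, corpus_texts, n=8):
    """Return eval items sharing any long n-gram with the corpus sample."""
    corpus_grams = set()
    for text in corpus_texts:
        corpus_grams |= ngrams(text, n)
    return [item for item in eval_items if ngrams(item, n) & corpus_grams]
```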
Track difficulty via model pass rate:
- < 30% pass: too hard; models improve but you can't measure it
- 30-80% pass: the useful range
- > 95% pass: too easy; the dataset has plateaued

Prune items that reach 100% for several consecutive model generations; they no longer discriminate.
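A sketch of that pruning rule, assuming `history` maps each item id to its pass rates across model generations, oldest first (the data shape is an assumption):

```python
def dead_items(history, generations=3):
    """Item ids stuck at 100% pass for the last N model generations."""
    dead = []
    for item_id, pass_rates in history.items():
        recent = pass_rates[-generations:]
        if len(recent) == generations and all(rate == 1.0 for rate in recent):
            dead.append(item_id)
    return dead
```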
When you need more coverage:
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const prompt = `Generate 20 diverse user queries that a customer support bot might receive.
Cover: billing (5), technical issues (5), account access (5), general FAQ (5).
Vary wording: formal, casual, angry, confused.
Return JSON array.`;

const response = await client.messages.create({
  model: "claude-opus-4-6",
  max_tokens: 4000,
  messages: [{ role: "user", content: prompt }],
});
```
Then: deduplicate near-identical items (a minimal sketch follows below), have a human review every item, and label before adding anything to the set.
Quality > quantity. 200 well-labeled items beat 5000 noisy ones.
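The dedupe step can be as cheap as word-overlap Jaccard between generated items; the 0.8 threshold here is a guess to tune, not a standard:

```python
def jaccard(a, b):
    """Word-overlap similarity between two strings."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa or wb else 0.0

def dedupe(items, threshold=0.8):
    """Keep an item only if it isn't near-identical to one already kept."""
    kept = []
    for item in items:
        if all(jaccard(item, k) < threshold for k in kept):
            kept.append(item)
    return kept
```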
Treat datasets like code:
```
evals/
  customer_support/
    v1/
      dataset.jsonl
      rubric.md
      CHANGELOG.md
    v2/
      dataset.jsonl
      rubric.md
      CHANGELOG.md
```
Never silently edit. Version bumps communicate "scores before v2 are not comparable to scores after".
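One way to enforce that: make the version travel with every loaded item, so result rows from different dataset versions can't be confused. A sketch against the layout above (the loader itself is hypothetical):

```python
import json
from pathlib import Path

def load_dataset(name, version):
    """Load a versioned eval set, tagging each item with its version."""
    path = Path("evals") / name / version / "dataset.jsonl"
    items = [json.loads(line) for line in path.read_text().splitlines() if line.strip()]
    for item in items:
        item["dataset_version"] = version  # carried into every result row
    return items

# load_dataset("customer_support", "v2")
```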
Keep 100-200 items never published, never used for prompt iteration. Only for:

- final sign-off before a release
- spot-checking that public-set scores haven't drifted from true quality through overfitting
Rotate a fraction yearly.