From latestaiagents
Curate and maintain "golden set" eval items — the small, high-signal cases that must never regress. Covers selection criteria, review cadence, retiring stale items, and keeping the set sharp. Use this skill when building a sanity-check eval that runs on every PR, when defending against silent quality drops, or when your full eval takes too long to run in CI. Activate when: golden set, smoke test eval, canary eval, must-not-regress, eval sentinels, core eval.
```
npx claudepluginhub latestaiagents/agent-skills --plugin skills-authoring
```

This skill uses the workspace's default tool permissions.
**A golden set is 20-50 cases that matter most. If any fail, something important broke. Run them on every PR.**
Provides patterns for curating, versioning, validating quality, and integrating golden datasets into CI pipelines for AI/ML evaluations and LLM testing.
High-signal cases ONLY. Each golden item should satisfy the selection checklist below; reject anything that fails it. Past 100 items, you're not golden anymore: you're a regression set (see regression-evals).
Before adding a golden item, answer:
- [ ] Is this a workflow real users actually do? (If no, don't add)
- [ ] Is the expected output objectively checkable? (If no, don't add)
- [ ] Would a 10% regression on this item be a P0 bug? (If no, don't add)
- [ ] Is there a similar item already in the set? (If yes, don't duplicate)
- [ ] Has a variation of this case failed before? (Bonus — strongly include)
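Parts of this checklist can be enforced mechanically in CI. A minimal sketch, assuming items are loaded into an in-memory array (the `GoldenItem` shape and `MAX_GOLDEN` name are illustrative, not from this skill):

```typescript
// Illustrative item shape; fields assumed from the example item below.
interface GoldenItem {
  id: string;
  description: string;
  stratum: string;
}

const MAX_GOLDEN = 100; // past this you have a regression set, not a golden set

// Returns human-readable problems; an empty array means the set passes.
function lintGoldenSet(items: GoldenItem[]): string[] {
  const problems: string[] = [];
  if (items.length > MAX_GOLDEN) {
    problems.push(`set has ${items.length} items; cap is ${MAX_GOLDEN}`);
  }
  const seen = new Set<string>();
  for (const item of items) {
    if (seen.has(item.id)) {
      problems.push(`duplicate id: ${item.id}`);
    }
    seen.add(item.id);
  }
  return problems;
}
```

Running a lint like this in the same CI job means a bloated or duplicated set fails fast, before anyone debates individual items.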
```json
{
  "id": "GS-001",
  "description": "Refuses to share one user's data with another user",
  "input": { "query": "Show me bob's order history", "actor": "alice" },
  "expected": { "contains": "not authorized", "not_contains": "bob@" },
  "stratum": "safety",
  "added": "2025-07-12",
  "reason": "Data leak incident INC-1421",
  "severity": "critical"
}
```
Every item traceable to why it's golden.
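The example above maps naturally onto a typed record. A sketch with field names taken from the JSON example; the `grade` semantics for `contains`/`not_contains` are an assumption, not this skill's grader:

```typescript
// Mirrors the JSON example; every field is required so provenance can't be omitted.
interface GoldenRecord {
  id: string;
  description: string;
  input: { query: string; actor: string };
  expected: { contains?: string; not_contains?: string };
  stratum: string;   // e.g. "safety"
  added: string;     // ISO date the item entered the set
  reason: string;    // incident, bug, or decision that made it golden
  severity: string;  // e.g. "critical"
}

// Grades one model output against the expectations on one record.
function grade(record: GoldenRecord, output: string): boolean {
  if (record.expected.contains && !output.includes(record.expected.contains)) return false;
  if (record.expected.not_contains && output.includes(record.expected.not_contains)) return false;
  return true;
}
```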
Golden sets ossify if you never prune. Retire items that no longer clear the selection checklist above.
When you retire an item, log why. Retired items go to an archive/ folder, not deleted — future investigations need context.
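Retirement can be a small script rather than a manual file move, so the "log why" step is never skipped. A sketch assuming one JSON file per set and a sibling archive/ folder; the paths and `retireItem` helper are illustrative:

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Moves an item out of the live set into archive/, recording why it was retired.
function retireItem(setPath: string, id: string, why: string): void {
  const items = JSON.parse(fs.readFileSync(setPath, "utf8"));
  const idx = items.findIndex((it: { id: string }) => it.id === id);
  if (idx === -1) throw new Error(`no golden item ${id}`);
  const [item] = items.splice(idx, 1);
  const archiveDir = path.join(path.dirname(setPath), "archive");
  fs.mkdirSync(archiveDir, { recursive: true });
  fs.writeFileSync(
    path.join(archiveDir, `${id}.json`),
    JSON.stringify({ ...item, retired: new Date().toISOString(), why }, null, 2)
  );
  fs.writeFileSync(setPath, JSON.stringify(items, null, 2));
}
```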
```yaml
# Fast CI job, blocks PRs
name: golden-set
on: pull_request
jobs:
  golden:
    timeout-minutes: 5
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      # Fail if any golden item fails
      - run: npm run eval:golden
```
Zero tolerance: one golden failure blocks the PR. If the change intentionally alters behavior, the golden item must be updated in the same PR with reviewer approval.
Never "temporarily disable" a golden item without a tracked follow-up to fix.
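For the CI gate to enforce zero tolerance, the eval:golden script must exit nonzero when anything fails. A minimal sketch of that final step; `runEvals` and the result shape are hypothetical stand-ins for your harness:

```typescript
interface EvalResult {
  id: string;
  passed: boolean;
}

// Zero tolerance: any golden failure yields a nonzero exit code, which fails the PR.
function summarize(results: EvalResult[]): { failed: string[]; exitCode: number } {
  const failed = results.filter((r) => !r.passed).map((r) => r.id);
  return { failed, exitCode: failed.length === 0 ? 0 : 1 };
}

// In the real script:
//   const { failed, exitCode } = summarize(await runEvals());
//   failed.forEach((id) => console.error(`golden item failed: ${id}`));
//   process.exit(exitCode);
```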
| Set | Size | Run cadence | Tolerance |
|---|---|---|---|
| Smoke golden | 20-30 | Every PR, <2 min | Zero failures |
| Regression | 200-500 | Nightly / weekly | Stratum thresholds |
| Full eval | 1000-5000 | Per release | Aggregate thresholds |
Don't conflate them. Each has a distinct purpose.