Help us improve
Share bugs, ideas, or general feedback.
How this skill is triggered — by the user, by Claude, or both
Slash command
/kernel:evalThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
<skill id="eval">
Implements eval-driven development (EDD) framework for Claude Code sessions with capability/regression evals, pass@k metrics, and code/model/human graders for agent reliability.
Implements eval-driven development (EDD) for Claude Code sessions: capability/regression evals, code/model/human graders, pass@k metrics. Use to define success before coding.
Builds AI agent evaluations using Anthropic patterns: code/model/human graders, tasks, trials, benchmarks for coding, conversational, research agents.
Share bugs, ideas, or general feedback.
<core_principles>
<eval_types>
[CAPABILITY EVAL: semantic-search]
Task: Search markets using natural language
Success Criteria:
- [ ] Returns relevant results for query
- [ ] Handles empty query gracefully
- [ ] Falls back when vector DB unavailable
Expected: Top 5 results match query intent
[REGRESSION EVAL: auth-flow]
Baseline: commit abc123
Tests:
- login-with-valid-creds: PASS
- login-with-invalid-creds: PASS
- session-persistence: PASS
Result: 3/3 passed (unchanged)
</eval_types>
<grader_types>
grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL"
npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"
npm run build && echo "PASS" || echo "FAIL"
[MODEL GRADER]
Evaluate: Does this code solve the stated problem?
Criteria: correctness, structure, edge case handling
Score: 1-5
Reasoning: [required]
[HUMAN REVIEW REQUIRED]
Change: Added payment processing
Reason: Security-critical, requires human verification
Risk: HIGH
</grader_types>
pass@k: "At least one success in k attempts" - pass@1: First attempt success rate - pass@3: Success within 3 attempts (typical target: > 90%)pass^k: "All k trials succeed"
<eval_report_format>
EVAL REPORT: feature-xyz
========================
Capability Evals:
create-user: PASS (pass@1)
validate-email: PASS (pass@2)
hash-password: PASS (pass@1)
Overall: 3/3
Regression Evals:
login-flow: PASS
session-mgmt: PASS
Overall: 2/2
Metrics:
pass@1: 67% (2/3)
pass@3: 100% (3/3)
Status: READY FOR REVIEW
</eval_report_format>
<anti_patterns> Writing evals after implementation tests existing bugs, not requirements. Model-based grading is slow and probabilistic. Prefer code graders. Every change must pass regression evals. No exceptions. Evals that take > 30s get skipped. Keep them fast. Track pass@k over time. Declining reliability is a signal. </anti_patterns>
<on_complete> agentdb write-end '{"skill":"eval","eval_type":"capability|regression","pass_at_1":"<X%>","pass_at_3":"<Y%>","failures":[""]}'
Record eval type, pass rates, and any failures for future reference. </on_complete>