Eval-Driven Development (EDD) for AI workflows. pass@k metrics, capability evals, regression evals. Triggers: eval, edd, pass@k, capability, regression, benchmark.
reference/eval-research.md
<core_principles>
<eval_types>
<!-- Capability Eval: Can it do something new? -->
[CAPABILITY EVAL: semantic-search]
Task: Search markets using natural language
Success Criteria:
- [ ] Returns relevant results for query
- [ ] Handles empty query gracefully
- [ ] Falls back when vector DB unavailable
Expected: Top 5 results match query intent
<!-- Regression Eval: Does existing functionality still work? -->
[REGRESSION EVAL: auth-flow]
Baseline: commit abc123
Tests:
- login-with-valid-creds: PASS
- login-with-invalid-creds: PASS
- session-persistence: PASS
Result: 3/3 passed (unchanged)
</eval_types>
<grader_types>
<!-- Code-Based Grader (preferred) -->
grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL"
npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"
npm run build && echo "PASS" || echo "FAIL"
<!-- Model-Based Grader (open-ended outputs) -->
[MODEL GRADER]
Evaluate: Does this code solve the stated problem?
Criteria: correctness, structure, edge case handling
Score: 1-5
Reasoning: [required]
<!-- Human Grader (security, UX) -->
[HUMAN REVIEW REQUIRED]
Change: Added payment processing
Reason: Security-critical, requires human verification
Risk: HIGH
</grader_types>
<metrics>
pass@k: "At least one success in k attempts"
- pass@1: first-attempt success rate
- pass@3: success within 3 attempts (typical target: > 90%)
pass^k: "All k trials succeed"
</metrics>
</core_principles>
<eval_report_format>
EVAL REPORT: feature-xyz
========================
Capability Evals:
create-user: PASS (pass@1)
validate-email: PASS (pass@2)
hash-password: PASS (pass@1)
Overall: 3/3
Regression Evals:
login-flow: PASS
session-mgmt: PASS
Overall: 2/2
Metrics:
pass@1: 67% (2/3)
pass@3: 100% (3/3)
Status: READY FOR REVIEW
</eval_report_format>
<anti_patterns>
<block id="eval_after_code">Writing evals after implementation tests existing bugs, not requirements.</block>
<block id="model_grader_overuse">Model-based grading is slow and probabilistic. Prefer code graders.</block>
<block id="skip_regression">Every change must pass regression evals. No exceptions.</block>
<block id="slow_evals">Evals that take > 30s get skipped. Keep them fast.</block>
<block id="no_tracking">Track pass@k over time. Declining reliability is a signal.</block>
</anti_patterns>
<on_complete> agentdb write-end '{"skill":"eval","eval_type":"capability|regression","pass_at_1":"<X%>","pass_at_3":"<Y%>","failures":["<list>"]}'
Record eval type, pass rates, and any failures for future reference. </on_complete>
</skill>