A formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles.
Implements eval-driven development by defining, running, and tracking code evaluations.
/plugin marketplace add https://www.claudepluginhub.com/api/plugins/aventerica89-claude-codex/marketplace.json/plugin install aventerica89-claude-codex@cpd-aventerica89-claude-codexThis skill inherits all available tools. When active, it can use any tool Claude has access to.
A formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles.
Eval-Driven Development treats evals as the "unit tests of AI development":
Test if Claude can do something it couldn't before:
[CAPABILITY EVAL: feature-name]
Task: Description of what Claude should accomplish
Success Criteria:
- [ ] Criterion 1
- [ ] Criterion 2
- [ ] Criterion 3
Expected Output: Description of expected result
Ensure changes don't break existing functionality:
[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:
- existing-test-1: PASS/FAIL
- existing-test-2: PASS/FAIL
- existing-test-3: PASS/FAIL
Result: X/Y passed (previously Y/Y)
Deterministic checks using code:
# Check if file contains expected pattern
grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL"
# Check if tests pass
npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"
# Check if build succeeds
npm run build && echo "PASS" || echo "FAIL"
Use Claude to evaluate open-ended outputs:
[MODEL GRADER PROMPT]
Evaluate the following code change:
1. Does it solve the stated problem?
2. Is it well-structured?
3. Are edge cases handled?
4. Is error handling appropriate?
Score: 1-5 (1=poor, 5=excellent)
Reasoning: [explanation]
Flag for manual review:
[HUMAN REVIEW REQUIRED]
Change: Description of what changed
Reason: Why human review is needed
Risk Level: LOW/MEDIUM/HIGH
"At least one success in k attempts"
"All k trials succeed"
## EVAL DEFINITION: feature-xyz
### Capability Evals
1. Can create new user account
2. Can validate email format
3. Can hash password securely
### Regression Evals
1. Existing login still works
2. Session management unchanged
3. Logout flow intact
### Success Metrics
- pass@3 > 90% for capability evals
- pass^3 = 100% for regression evals
Write code to pass the defined evals.
# Run capability evals
[Run each capability eval, record PASS/FAIL]
# Run regression evals
npm test -- --testPathPattern="existing"
# Generate report
EVAL REPORT: feature-xyz
========================
Capability Evals:
create-user: PASS (pass@1)
validate-email: PASS (pass@2)
hash-password: PASS (pass@1)
Overall: 3/3 passed
Regression Evals:
login-flow: PASS
session-mgmt: PASS
logout-flow: PASS
Overall: 3/3 passed
Metrics:
pass@1: 67% (2/3)
pass@3: 100% (3/3)
Status: READY FOR REVIEW
/eval define feature-name
Creates eval definition file at .claude/evals/feature-name.md
/eval check feature-name
Runs current evals and reports status
/eval report feature-name
Generates full eval report
Store evals in project:
.claude/
evals/
feature-xyz.md # Eval definition
feature-xyz.log # Eval run history
baseline.json # Regression baselines
## EVAL: add-authentication
### Phase 1: Define (10 min)
Capability Evals:
- [ ] User can register with email/password
- [ ] User can login with valid credentials
- [ ] Invalid credentials rejected with proper error
- [ ] Sessions persist across page reloads
- [ ] Logout clears session
Regression Evals:
- [ ] Public routes still accessible
- [ ] API responses unchanged
- [ ] Database schema compatible
### Phase 2: Implement (varies)
[Write code]
### Phase 3: Evaluate
Run: /eval check add-authentication
### Phase 4: Report
EVAL REPORT: add-authentication
==============================
Capability: 5/5 passed (pass@3: 100%)
Regression: 3/3 passed (pass^3: 100%)
Status: SHIP IT
Activates when the user asks about AI prompts, needs prompt templates, wants to search for prompts, or mentions prompts.chat. Use for discovering, retrieving, and improving prompts.
Search, retrieve, and install Agent Skills from the prompts.chat registry using MCP tools. Use when the user asks to find skills, browse skill catalogs, install a skill for Claude, or extend Claude's capabilities with reusable AI agent components.
Creating algorithmic art using p5.js with seeded randomness and interactive parameter exploration. Use this when users request creating art using code, generative art, algorithmic art, flow fields, or particle systems. Create original algorithmic art rather than copying existing artists' work to avoid copyright violations.