From ftitos-claude-code
Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles.
Slash command
/ftitos-claude-code:eval-harness
Eval-Driven Development treats evals as the "unit tests of AI development".
Capability evals test whether Claude can do something it couldn't before:
[CAPABILITY EVAL: feature-name]
Task: Description of what Claude should accomplish
Success Criteria:
- [ ] Criterion 1
- [ ] Criterion 2
Expected Output: Description of expected result
Regression evals ensure that changes don't break existing functionality:
[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:
- existing-test-1: PASS/FAIL
- existing-test-2: PASS/FAIL
Result: X/Y passed (previously Y/Y)
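The `Result:` line of the regression template can be derived mechanically from two test runs. The following is a minimal sketch, assuming each run is available as a mapping from test name to pass/fail; the function name `regression_result` and the input shape are hypothetical and not part of the skill.

```python
from __future__ import annotations


def regression_result(
    baseline: dict[str, bool], current: dict[str, bool]
) -> tuple[str, list[str]]:
    """Summarise the current run against the baseline and list regressed tests.

    A test missing from the current run is treated as failing.
    """
    total = len(baseline)
    previously_passed = sum(baseline.values())
    now_passed = sum(current.get(name, False) for name in baseline)
    # A regression is a test that passed at the baseline and fails now.
    regressed = [
        name
        for name, passed in baseline.items()
        if passed and not current.get(name, False)
    ]
    summary = (
        f"Result: {now_passed}/{total} passed "
        f"(previously {previously_passed}/{total})"
    )
    return summary, regressed


if __name__ == "__main__":
    # Hypothetical results, using the test names from the template.
    baseline = {"existing-test-1": True, "existing-test-2": True}
    current = {"existing-test-1": True, "existing-test-2": False}
    summary, regressed = regression_result(baseline, current)
    print(summary)
    print("Regressed:", regressed)
```

Listing the regressed test names separately makes it easier to see which checks changed state since the baseline, rather than only the totals.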
Run deterministic checks using code:
# Check if tests pass
npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"
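The same exit-code check can be wrapped in a small function when results need to be collected programmatically. This is only a sketch: `run_code_grader` and its timeout default are assumptions for illustration, and the demo uses stand-in Python commands so it runs without an npm project.

```python
from __future__ import annotations

import subprocess
import sys


def run_code_grader(command: list[str], timeout_s: float = 300.0) -> str:
    """Run a check command and map its exit status to PASS or FAIL."""
    try:
        completed = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung or missing command is counted as a failed check.
        return "FAIL"
    return "PASS" if completed.returncode == 0 else "FAIL"


if __name__ == "__main__":
    # Stand-in commands; in a real project the command might instead be
    # ["npm", "test", "--", "--testPathPattern=auth"].
    print(run_code_grader([sys.executable, "-c", "raise SystemExit(0)"]))
    print(run_code_grader([sys.executable, "-c", "raise SystemExit(1)"]))
```

Passing the command as a list avoids shell quoting issues, and treating timeouts as failures keeps the grader deterministic.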
Use Claude to evaluate open-ended outputs:
Evaluate the following code change:
1. Does it solve the stated problem?
2. Is it well-structured?
3. Are edge cases handled?
Score: 1-5
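A model grader's reply still has to be turned into a pass/fail decision. The sketch below builds the prompt from the rubric above and parses a `Score:` line from the reply; it does not call any model API. The pass threshold of 4 and the `NEEDS HUMAN REVIEW` fallback are assumptions, not values defined by the skill.

```python
from __future__ import annotations

import re

GRADER_PROMPT = """Evaluate the following code change:
1. Does it solve the stated problem?
2. Is it well-structured?
3. Are edge cases handled?
Score: 1-5

{change}"""

# Match "Score: N" at the start of a line, but not the echoed range "Score: 1-5".
_SCORE_RE = re.compile(r"^\s*Score:\s*([1-5])(?![\d-])", re.MULTILINE)


def build_grader_prompt(change: str) -> str:
    """Fill the rubric with the code change to be evaluated."""
    return GRADER_PROMPT.format(change=change)


def parse_score(reply: str) -> int | None:
    """Return the last score found in the grader's reply, or None."""
    matches = _SCORE_RE.findall(reply)
    return int(matches[-1]) if matches else None


def grade(reply: str, threshold: int = 4) -> str:
    """Map a grader reply to PASS/FAIL; the threshold is an assumed cut-off."""
    score = parse_score(reply)
    if score is None:
        # No usable score: hand the case to a person instead of guessing.
        return "NEEDS HUMAN REVIEW"
    return "PASS" if score >= threshold else "FAIL"


if __name__ == "__main__":
    print(grade("Solves the problem; edge cases covered.\nScore: 4"))
    print(grade("Misses the empty-input case.\nScore: 2/5"))
    print(grade("I cannot evaluate this change."))
```

Taking the last `Score:` line tolerates replies that restate the rubric before giving a verdict.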
Flag for manual review when automated grading is insufficient.
pass@k: "At least one success in k attempts"
pass^k: "All k trials succeed"
## EVAL DEFINITION: feature-xyz
### Capability Evals
1. Can create new user account
2. Can validate email format
### Regression Evals
1. Existing login still works
2. Session management unchanged
### Success Metrics
- pass@3 > 90% for capability evals
- pass^3 = 100% for regression evals
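The success metrics above can be checked mechanically once each eval's trials are recorded. This is a minimal sketch, reading pass@k as "at least one success in k attempts" and pass^k as "all k trials succeed"; the function names and the trial data are hypothetical.

```python
from __future__ import annotations

from typing import Callable


def pass_at_k(trials: list[bool], k: int) -> bool:
    """pass@k: at least one of the first k attempts succeeded."""
    return any(trials[:k])


def pass_hat_k(trials: list[bool], k: int) -> bool:
    """pass^k: all k trials succeeded (fewer than k recorded trials fails)."""
    return len(trials) >= k and all(trials[:k])


def rate(
    evals: dict[str, list[bool]],
    k: int,
    metric: Callable[[list[bool], int], bool],
) -> float:
    """Fraction of evals for which the metric holds."""
    if k < 1 or not evals:
        raise ValueError("need k >= 1 and at least one eval")
    return sum(metric(trials, k) for trials in evals.values()) / len(evals)


def meets_success_metrics(
    capability: dict[str, list[bool]],
    regression: dict[str, list[bool]],
    k: int = 3,
) -> bool:
    """Apply the thresholds above: pass@k > 90% and pass^k = 100%."""
    return (
        rate(capability, k, pass_at_k) > 0.90
        and rate(regression, k, pass_hat_k) == 1.0
    )


if __name__ == "__main__":
    # Hypothetical trial outcomes, one list of attempts per eval.
    capability = {"create-user": [True], "validate-email": [False, True]}
    regression = {
        "login-flow": [True, True, True],
        "session-mgmt": [True, True, True],
    }
    print("pass@1:", rate(capability, 1, pass_at_k))
    print("pass@3:", rate(capability, 3, pass_at_k))
    print("ready:", meets_success_metrics(capability, regression))
```

The asymmetry is deliberate in this sketch: a capability eval may stop after its first success, while a regression eval must record all k trials before it can count as stable.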
Write code to pass the defined evals.
Run each eval, record PASS/FAIL.
EVAL REPORT: feature-xyz
Capability Evals:
create-user: PASS (pass@1)
validate-email: PASS (pass@2)
Overall: 2/2 passed
Regression Evals:
login-flow: PASS
session-mgmt: PASS
Overall: 2/2 passed
Metrics:
pass@1: 50% (1/2)
pass@3: 100% (2/2)
Status: READY FOR REVIEW
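A report in this shape can be generated from recorded results instead of being written by hand. The sketch below assumes capability results are lists of attempts and regression results are single booleans; `render_report` and the `NEEDS WORK` status are hypothetical, since the source shows only the `READY FOR REVIEW` case.

```python
from __future__ import annotations


def first_passing_attempt(trials: list[bool]) -> int | None:
    """Return the 1-based index of the first successful attempt, if any."""
    for attempt, passed in enumerate(trials, start=1):
        if passed:
            return attempt
    return None


def render_report(
    feature: str,
    capability: dict[str, list[bool]],
    regression: dict[str, bool],
    ks: tuple[int, ...] = (1, 3),
) -> str:
    """Render an eval report in the layout shown above."""
    if not capability:
        raise ValueError("at least one capability eval is required")
    lines = [f"EVAL REPORT: {feature}", "Capability Evals:"]
    cap_passed = 0
    for name, trials in capability.items():
        attempt = first_passing_attempt(trials)
        if attempt is None:
            lines.append(f"{name}: FAIL")
        else:
            cap_passed += 1
            lines.append(f"{name}: PASS (pass@{attempt})")
    lines.append(f"Overall: {cap_passed}/{len(capability)} passed")

    lines.append("Regression Evals:")
    for name, passed in regression.items():
        lines.append(f"{name}: {'PASS' if passed else 'FAIL'}")
    reg_passed = sum(regression.values())
    lines.append(f"Overall: {reg_passed}/{len(regression)} passed")

    lines.append("Metrics:")
    for k in ks:
        hits = sum(any(trials[:k]) for trials in capability.values())
        percent = 100 * hits / len(capability)
        lines.append(f"pass@{k}: {percent:.0f}% ({hits}/{len(capability)})")

    ready = cap_passed == len(capability) and reg_passed == len(regression)
    # "NEEDS WORK" is an assumed label for the not-ready case.
    lines.append(f"Status: {'READY FOR REVIEW' if ready else 'NEEDS WORK'}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(
        render_report(
            "feature-xyz",
            {"create-user": [True], "validate-email": [False, True]},
            {"login-flow": True, "session-mgmt": True},
        )
    )
```

With the sample inputs, where `create-user` passes on its first attempt and `validate-email` on its second, the output reproduces the report above line for line.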
npx claudepluginhub nassimbf/ftitos-claude-code