A formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles.
Eval-Driven Development treats evals as the "unit tests of AI development."
Test if Claude can do something it couldn't before:
[CAPABILITY EVAL: feature-name]
Task: Description of what Claude should accomplish
Success Criteria:
- [ ] Criterion 1
- [ ] Criterion 2
- [ ] Criterion 3
Expected Output: Description of expected result
Ensure changes don't break existing functionality:
[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:
- existing-test-1: PASS/FAIL
- existing-test-2: PASS/FAIL
- existing-test-3: PASS/FAIL
Result: X/Y passed (previously Y/Y)
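The regression template above can be filled in mechanically once test statuses are available. The following is a minimal sketch, assuming results are held as plain name-to-status dictionaries; the function names `find_regressions` and `summarize` are hypothetical and not part of the skill.

```python
def find_regressions(baseline: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return tests that passed at the baseline but do not pass now.

    A test missing from `current` is treated as a regression.
    """
    return sorted(
        name
        for name, status in baseline.items()
        if status == "PASS" and current.get(name) != "PASS"
    )


def summarize(baseline: dict[str, str], current: dict[str, str]) -> str:
    """Build the 'Result:' line of the regression eval template."""
    total = len(baseline)
    passed_now = sum(1 for name in baseline if current.get(name) == "PASS")
    passed_before = sum(1 for status in baseline.values() if status == "PASS")
    return f"Result: {passed_now}/{total} passed (previously {passed_before}/{total})"


# Illustrative data only: the test names come from the template, the statuses are made up.
baseline = {"existing-test-1": "PASS", "existing-test-2": "PASS", "existing-test-3": "PASS"}
current = {"existing-test-1": "PASS", "existing-test-2": "FAIL", "existing-test-3": "PASS"}

print(find_regressions(baseline, current))
print(summarize(baseline, current))
```

Any non-empty list from `find_regressions` means the change broke something that worked at the baseline.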
Deterministic checks using code. Use when outputs are predictable and verifiable programmatically.
When to use: Deterministic outputs, parsing tasks, structured data extraction, code generation with testable behavior.
# Check if file contains expected pattern
grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL"
# Check if tests pass
npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"
# Check if build succeeds
npm run build && echo "PASS" || echo "FAIL"
# Exact output comparison
diff <(node generate.js) expected-output.txt && echo "PASS" || echo "FAIL"
Strengths: Fully reproducible, zero cost per eval run, no grader variance. Weaknesses: Cannot assess quality, style, or partial correctness.
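When several such checks need to run from one script, the shell idiom above (exit status 0 means PASS) can be wrapped in a small helper. This is a sketch under the assumption that each check is a single command; `grade_command` is a hypothetical name, and the Python one-liners below are stand-ins for real commands such as `npm run build`.

```python
import subprocess
import sys


def grade_command(cmd: list[str], timeout: float = 300.0) -> str:
    """Run a command and map its exit status to PASS or FAIL."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        # A missing executable or a hung command counts as a failure.
        return "FAIL"
    return "PASS" if result.returncode == 0 else "FAIL"


# Stand-in commands so the sketch runs anywhere Python is installed.
print(grade_command([sys.executable, "-c", "raise SystemExit(0)"]))
print(grade_command([sys.executable, "-c", "raise SystemExit(1)"]))
```

Passing the command as a list avoids shell quoting issues; the timeout keeps a stuck build from blocking the whole eval run.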
Regex and schema constraints for structured validation.
When to use: Outputs that must conform to a schema, format, or pattern but exact content varies.
# JSON schema validation
ajv validate -s schema.json -d output.json && echo "PASS" || echo "FAIL"
# Regex pattern matching
echo "$OUTPUT" | grep -qP '^\d{4}-\d{2}-\d{2}T' && echo "PASS" || echo "FAIL"
# Structural checks (-e makes jq exit non-zero when the result is false or null)
jq -e '.data | length > 0' output.json > /dev/null && echo "PASS" || echo "FAIL"
Strengths: Flexible enough for variable outputs, still deterministic. Weaknesses: Cannot assess semantic meaning or quality.
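The regex and structural checks can also be combined in one function using only the standard library. The sketch below assumes the output is a JSON object with a non-empty `data` list (as in the jq check) and a `timestamp` string; the `timestamp` field and the function name are hypothetical, chosen only to give the date regex something to match.

```python
import json
import re

# Same pattern as the grep check: an ISO-8601 date followed by "T".
ISO_PREFIX = re.compile(r"^\d{4}-\d{2}-\d{2}T")


def grade_structured_output(raw: str) -> str:
    """Return PASS when the JSON parses and satisfies both rules."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return "FAIL"
    if not isinstance(payload, dict):
        return "FAIL"

    data = payload.get("data")
    timestamp = payload.get("timestamp")  # hypothetical field, for illustration

    has_rows = isinstance(data, list) and len(data) > 0
    has_timestamp = isinstance(timestamp, str) and ISO_PREFIX.match(timestamp) is not None
    return "PASS" if has_rows and has_timestamp else "FAIL"


print(grade_structured_output('{"data": [1, 2], "timestamp": "2024-05-01T12:00:00Z"}'))
print(grade_structured_output('{"data": [], "timestamp": "2024-05-01T12:00:00Z"}'))
```

For anything beyond a handful of rules, a full JSON Schema validator is likely a better fit than hand-written checks.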
Use Claude or another LLM to evaluate open-ended outputs against a rubric.
When to use: Subjective quality assessment, creative tasks, code review quality, explanation clarity, tasks where "correct" is a spectrum.
[MODEL GRADER PROMPT]
Evaluate the following code change against these criteria:
1. Does it solve the stated problem? (0-2 points)
2. Is it well-structured and readable? (0-2 points)
3. Are edge cases handled? (0-2 points)
4. Is error handling appropriate? (0-2 points)
5. Would this pass code review? (0-2 points)
Score: X/10
PASS threshold: >= 7/10
Reasoning: [explanation]
Strengths: Can assess nuance, quality, and partial correctness. Weaknesses: Non-deterministic, so run the grader several times and aggregate the scores; adds cost per eval run; the grader itself can be biased.
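Because the grader is non-deterministic, one way to get a stable verdict is to collect several grader replies and take the median score. The sketch below covers only the parsing and aggregation step and makes no model call; it assumes each reply contains a `Score: X/10` line as in the prompt above, and the function names are hypothetical.

```python
import re
from statistics import median
from typing import Optional

SCORE_LINE = re.compile(r"^Score:\s*(\d+)\s*/\s*10\s*$", re.MULTILINE)
PASS_THRESHOLD = 7  # from the rubric: PASS threshold >= 7/10


def parse_score(grader_reply: str) -> Optional[int]:
    """Extract X from a 'Score: X/10' line, or None if absent or out of range."""
    match = SCORE_LINE.search(grader_reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 0 <= score <= 10 else None


def verdict(grader_replies: list[str]) -> str:
    """Aggregate several grader runs by median score."""
    scores = [s for s in map(parse_score, grader_replies) if s is not None]
    if not scores:
        # No parseable score is treated as a failure rather than a pass.
        return "FAIL"
    return "PASS" if median(scores) >= PASS_THRESHOLD else "FAIL"


# Made-up replies for illustration.
replies = ["Score: 8/10\nReasoning: clear fix", "Score: 6/10", "Score: 7/10"]
print(verdict(replies))
```

The median is less sensitive to a single outlier run than the mean, which is why it is used here; a majority vote over per-run PASS/FAIL would be a reasonable alternative.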
Best practices for LLM-as-judge:
Compare output meaning against a reference, tolerating surface-level variation.
When to use: Meaning preservation tasks (summarization, paraphrasing, translation), information extraction where phrasing varies, any task where multiple correct answers exist.
# Using embeddings for semantic comparison
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer('all-MiniLM-L6-v2')
reference = "The function authenticates users via JWT tokens"
output = "User authentication is handled through JSON Web Tokens"
ref_embedding = model.encode(reference)
out_embedding = model.encode(output)
similarity = np.dot(ref_embedding, out_embedding) / (
    np.linalg.norm(ref_embedding) * np.linalg.norm(out_embedding)
)
# Typical thresholds:
# > 0.85: strong match (PASS for paraphrasing)
# > 0.70: reasonable match (PASS for summarization)
# < 0.50: likely different meaning (FAIL)
print("PASS" if similarity > 0.85 else "FAIL")
Strengths: Tolerates rephrasing, catches meaning drift. Weaknesses: Threshold tuning required, embedding model quality matters.
Flag for manual review when automated grading is insufficient.
When to use: Security-sensitive changes, ambiguous requirements, establishing initial baselines before automating.
[HUMAN REVIEW REQUIRED]
Change: Description of what changed
Reason: Why human review is needed
Risk Level: LOW/MEDIUM/HIGH
| Task Type | Recommended Grader | Example |
|---|---|---|
| Code compiles/tests pass | Code-based (exact match) | Build verification |
| Output matches format | Rule-based | API response shape |
| Code quality/readability | LLM-as-judge | Refactoring tasks |
| Summary accuracy | Semantic similarity | Documentation generation |
| Creative writing | LLM-as-judge | Content generation |
| Data extraction | Code-based + rule-based | Parsing structured data |
| Security changes | Human + code-based | Auth modifications |
| Translation quality | Semantic similarity + LLM-as-judge | Localization |
pass@k: "At least one success in k attempts." This is the primary reliability metric for AI-assisted tasks.
How to calculate:
pass@k = 1 - C(n-c, k) / C(n, k)
where:
n = total number of samples
c = number of correct samples
k = number of attempts considered
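The formula translates directly into code with `math.comb`. This is a minimal sketch; the function name `pass_at_k` and the input validation are additions for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k = 1 - C(n-c, k) / C(n, k).

    n: total number of samples
    c: number of correct samples
    k: number of attempts considered
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers: 10 samples, 3 of them correct.
print(round(pass_at_k(10, 3, 1), 4))  # 0.3
print(round(pass_at_k(10, 3, 3), 4))  # 0.7083
```

With k = 1 the estimate reduces to c/n, the plain success rate.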
How to interpret:
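Sample size guidance:
pass^k: "All k trials succeed." This is a stricter reliability bar than pass@k.
Relationship: assuming independent trials, pass^k is approximately (pass@1)^k, so it drops quickly as k grows. For example, pass@1 = 0.90 gives pass^3 of roughly 0.73.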
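Both views of pass^k can be computed in a few lines. The first function is the independence approximation stated above. The second is not from the original text: it is the probability that k samples drawn without replacement from n recorded samples are all correct, which mirrors the pass@k formula. Both function names are hypothetical.

```python
from math import comb


def pass_hat_k_independent(pass_at_1: float, k: int) -> float:
    """Approximate pass^k as (pass@1)^k, assuming independent trials."""
    if not 0.0 <= pass_at_1 <= 1.0:
        raise ValueError("pass_at_1 must be a probability")
    if k < 1:
        raise ValueError("k must be at least 1")
    return pass_at_1 ** k


def pass_hat_k_from_samples(n: int, c: int, k: int) -> float:
    """Chance that k of the n recorded samples, drawn without replacement, are all correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    return comb(c, k) / comb(n, k)


print(round(pass_hat_k_independent(0.9, 3), 3))   # 0.729
print(round(pass_hat_k_from_samples(10, 9, 3), 3))  # 0.7
```

Even a 90% single-attempt success rate leaves roughly a one-in-four chance that at least one of three trials fails, which is why pass^k is the stricter bar.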
| Eval Type | Metric | Target | Rationale |
|---|---|---|---|
| Capability evals | pass@3 | >= 0.90 | Agent can do the task reliably with retries |
| Regression evals | pass^3 | = 1.00 | Existing functionality must never break |
| Release-critical | pass@1 | >= 0.95 | High confidence without retries |
| Exploratory/new | pass@3 | >= 0.50 | Still learning, tracking improvement |
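The targets in the table can be encoded as data so that a script decides whether a run meets its bar. This is a sketch only: the dictionary keys and the `meets_target` helper are hypothetical, and the regression target of exactly 1.00 is checked with `>=` because a pass rate cannot exceed 1.

```python
# (metric name, minimum value) per eval type, copied from the table above.
TARGETS = {
    "capability": ("pass@3", 0.90),
    "regression": ("pass^3", 1.00),
    "release-critical": ("pass@1", 0.95),
    "exploratory": ("pass@3", 0.50),
}


def meets_target(eval_type: str, metrics: dict[str, float]) -> bool:
    """Check a measured metric against the recommended threshold."""
    metric_name, minimum = TARGETS[eval_type]
    if metric_name not in metrics:
        raise KeyError(f"{eval_type} evals need the {metric_name} metric")
    return metrics[metric_name] >= minimum


# Made-up measurements for illustration.
print(meets_target("capability", {"pass@3": 0.92}))
print(meets_target("regression", {"pass^3": 0.99}))
```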
## EVAL DEFINITION: feature-xyz
### Capability Evals
1. Can create new user account
2. Can validate email format
3. Can hash password securely
### Regression Evals
1. Existing login still works
2. Session management unchanged
3. Logout flow intact
### Success Metrics
- pass@3 >= 90% for capability evals
- pass^3 = 100% for regression evals
Write code to pass the defined evals.
# Run capability evals
[Run each capability eval, record PASS/FAIL]
# Run regression evals
npm test -- --testPathPattern="existing"
# Generate report
EVAL REPORT: feature-xyz
========================
Capability Evals:
create-user: PASS (pass@1)
validate-email: PASS (pass@2)
hash-password: PASS (pass@1)
Overall: 3/3 passed
Regression Evals:
login-flow: PASS
session-mgmt: PASS
logout-flow: PASS
Overall: 3/3 passed
Metrics:
pass@1: 67% (2/3)
pass@3: 100% (3/3)
Status: READY FOR REVIEW
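The Metrics lines in this report can be derived from the per-eval results. The sketch below assumes that `PASS (pass@2)` means the eval first passed on its second attempt, which is one reading of the report rather than something the skill states; the data structure and function names are hypothetical.

```python
from typing import Optional

# Attempt number on which each capability eval first passed (None = never passed).
first_pass: dict[str, Optional[int]] = {
    "create-user": 1,
    "validate-email": 2,
    "hash-password": 1,
}


def solved_within(results: dict[str, Optional[int]], k: int) -> tuple[int, int]:
    """Count evals that passed within the first k attempts."""
    solved = sum(1 for attempt in results.values() if attempt is not None and attempt <= k)
    return solved, len(results)


def format_metric(k: int, results: dict[str, Optional[int]]) -> str:
    """Render one line of the Metrics section."""
    solved, total = solved_within(results, k)
    return f"pass@{k}: {round(100 * solved / total)}% ({solved}/{total})"


print(format_metric(1, first_pass))  # pass@1: 67% (2/3)
print(format_metric(3, first_pass))  # pass@3: 100% (3/3)
```

Note that this is a fraction of evals solved within k attempts, which is a coarser quantity than the sampled pass@k estimate defined earlier.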
/eval define feature-name
Creates eval definition file at .claude/evals/feature-name.md
/eval check feature-name
Runs current evals and reports status
/eval report feature-name
Generates full eval report
Store evals in the project:
.claude/
  evals/
    feature-xyz.md      # Eval definition
    feature-xyz.log     # Eval run history
    baseline.json       # Regression baselines
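The run history file can be kept append-only so that earlier results are never overwritten. The skill does not specify a log format, so the JSON-lines layout below is purely an assumption, as are the `append_run` helper and the record fields.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def append_run(project_root: Path, feature: str, results: dict[str, str]) -> Path:
    """Append one eval run to .claude/evals/<feature>.log as a JSON line.

    The record layout is hypothetical; the skill does not define one.
    """
    log_path = project_root / ".claude" / "evals" / f"{feature}.log"
    log_path.parent.mkdir(parents=True, exist_ok=True)
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "results": results,
    }
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return log_path


# Demonstrate against a throwaway directory instead of a real project.
with tempfile.TemporaryDirectory() as tmp:
    path = append_run(Path(tmp), "feature-xyz", {"create-user": "PASS"})
    append_run(Path(tmp), "feature-xyz", {"create-user": "FAIL"})
    print(len(path.read_text(encoding="utf-8").splitlines()))
```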
## EVAL: add-authentication
### Phase 1: Define (10 min)
Capability Evals:
- [ ] User can register with email/password
- [ ] User can login with valid credentials
- [ ] Invalid credentials rejected with proper error
- [ ] Sessions persist across page reloads
- [ ] Logout clears session
Regression Evals:
- [ ] Public routes still accessible
- [ ] API responses unchanged
- [ ] Database schema compatible
### Phase 2: Implement (varies)
[Write code]
### Phase 3: Evaluate
Run: /eval check add-authentication
### Phase 4: Report
EVAL REPORT: add-authentication
==============================
Capability: 5/5 passed (pass@3: 100%)
Regression: 3/3 passed (pass^3: 100%)
Status: SHIP IT
.claude/evals/<feature>.md # Eval definition
.claude/evals/<feature>.log # Eval run history
docs/releases/<version>/eval-summary.md # Release snapshot