From team-skills-platform
正式评估框架,实现 eval-driven development (EDD) 原则。 用于定义 pass/fail 标准、测量 pass@k 指标、创建回归测试套件。
npx claudepluginhub colin4k1024/tspThis skill uses the workspace's default tool permissions.
一个用于 Claude Code 会话的正式评估框架,实现 eval-driven development (EDD) 原则。
Generates design tokens/docs from CSS/Tailwind/styled-components codebases, audits visual consistency across 10 dimensions, detects AI slop in UI.
Records polished WebM UI demo videos of web apps using Playwright with cursor overlay, natural pacing, and three-phase scripting. Activates for demo, walkthrough, screen recording, or tutorial requests.
Delivers idiomatic Kotlin patterns for null safety, immutability, sealed classes, coroutines, Flows, extensions, DSL builders, and Gradle DSL. Use when writing, reviewing, refactoring, or designing Kotlin code.
一个用于 Claude Code 会话的正式评估框架,实现 eval-driven development (EDD) 原则。
Eval-Driven Development 将评估视为"AI 开发的单元测试":
测试 Claude 能否做以前不能做的事:
[CAPABILITY EVAL: points-calculation]
Task: 计算用户积分并确定等级
Success Criteria:
- [ ] 积分正确累加
- [ ] 等级边界正确
- [ ] 权益解锁逻辑正确
Expected Output: 用户总积分 = 1500,等级 = L3
确保变更不破坏现有功能:
[REGRESSION EVAL: login-flow]
Baseline: sha-abc123
Tests:
- existing-login: PASS
- session-management: PASS
- logout-flow: PASS
Result: 3/3 passed (previously 3/3)
使用代码的确定性检查:
# 检查文件是否包含预期模式
grep -q "export function handlePoints" src/points.ts && echo "PASS" || echo "FAIL"
# 检查测试是否通过
npm test -- --testPathPattern="points" && echo "PASS" || echo "FAIL"
使用 Claude 评估开放式输出:
[MODEL GRADER PROMPT]
评估以下代码变更:
1. 它是否解决了陈述的问题?
2. 结构是否良好?
3. 边界情况是否处理?
4. 错误处理是否适当?
Score: 1-5 (1=差, 5=优秀)
Reasoning: [解释]
标记为手动审查:
[HUMAN REVIEW REQUIRED]
Change: 描述变更内容
Reason: 为什么需要人工审查
Risk Level: LOW/MEDIUM/HIGH
"k 次尝试中至少一次成功"
"所有 k 次试验都成功"
## EVAL DEFINITION: points-system
### Capability Evals
1. 可以计算用户积分
2. 可以确定用户等级
3. 可以解锁权益
### Regression Evals
1. 现有登录仍然有效
2. 会话管理未改变
3. 登出流程完整
### Success Metrics
- pass@3 > 90% for capability evals
- pass^3 = 100% for regression evals
编写代码通过定义的评估。
# 运行 capability evals
[Run each capability eval, record PASS/FAIL]
# 运行 regression evals
npm test -- --testPathPattern="existing"
# 生成报告
EVAL REPORT: points-system
==========================
Capability Evals:
calculate-points: PASS (pass@1)
determine-level: PASS (pass@2)
unlock-benefits: PASS (pass@1)
Overall: 3/3 passed
Regression Evals:
login-flow: PASS
session-mgmt: PASS
logout-flow: PASS
Overall: 3/3 passed
Metrics:
pass@1: 67% (2/3)
pass@3: 100% (3/3)
Status: READY FOR REVIEW
在项目中存储评估:
.claude/
evals/
points-system.md # 评估定义
points-system.log # 评估运行历史
baseline.json # 回归基线