Stats
Actions
Tags
From Forge
Eval-driven development for AI features. Triggers on /forge:eval or when any LLM-powered feature, prompt, or agent changes. Define pass/fail before implementing.
How this skill is triggered — by the user, by Claude, or both
Slash command
/forge:evalThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
| Phase | Action | Gate |
| Phase | Action | Gate |
|---|---|---|
| DEFINE | Write eval spec before any code | STOP: show user, wait for approval |
| IMPLEMENT | Write code/prompts to pass the spec | n/a |
| GRADE | Run graders, measure pass@k | n/a |
| REPORT | Compare to baseline, flag regressions | Block on any regression |
## EVAL: <feature-name>
### Capability
- [ ] <what must succeed>
### Regression
- [ ] <existing behavior that must not break>
### Metrics
- Capability: pass@3 > 90%
- Regression: pass^3 = 100%
Save at .claude/evals/<feature-name>.md.
| Output | Grader | Example |
|---|---|---|
| Deterministic | Code (bash) | npm test && echo PASS || echo FAIL |
| Open-ended | Model prompt | Rate 1-3: solves problem / correct format / handles edges |
| Metric | Meaning | Target |
|---|---|---|
| pass@1 | First-attempt success | > 70% |
| pass@3 | At least 1 success in 3 | > 90% |
| pass^3 | All 3 succeed | 100% on critical paths |
FORGE EVAL — <date>
Feature Capability Regression Status
<name> X/Y (N%) X/Y (N%) PASS|FAIL|REGRESSED
Baseline delta: <vs last run>
npx claudepluginhub prashantxo/forgeCreates bite-sized, testable implementation plans from specs or requirements, with file structure and task decomposition. Activates before coding multi-step tasks.