```
npx claudepluginhub gregoryho/arcforge --plugin arcforge
```

This skill uses the workspace's default tool permissions.
Measure whether skills, agents, and workflows actually change AI agent behavior. Define scenarios, prepare environments, run trials, grade results, track regressions.
Files:

- agents/eval-analyzer.md
- agents/eval-blind-comparator.md
- agents/eval-grader.md
- dashboard/__tests__/eval-dashboard.test.js
- dashboard/__tests__/ui-test-plan.md
- dashboard/eval-dashboard-ui.html
- dashboard/eval-dashboard.js
- evals/evals.json
- references/audit-workflow.md
- references/cli-and-metrics.md
- references/common-mistakes-catalog.md
- references/grading-and-execution.md
- references/preflight.md
- references/verdict-policy.md
Core principle: "Unit tests for AI agent behavior" — if you can't measure improvement, you can't ship with confidence.
Key distinction: You are evaluating AI agents (LLM + tools), not just LLM text output. Agents use tools, read files, search codebases. Your eval environment must account for this.
Eval is required when the change has a behavioral footprint: a new or modified skill, agent, or workflow.
Not required when: the change has no behavioral footprint (reformatting, typos, metadata-only edits). When in doubt, run the eval — it is cheaper than shipping a regression.
- Does skill X change agent behavior? → delta (improvement between baseline and treatment)
- Does agent Y produce correct output? → pass@k (reliability across k trials)
- Does the full toolkit system improve agent outcomes? → delta, pass^k for critical paths

Unlike skill evals (which vary the prompt), workflow evals vary the environment while keeping the prompt identical across both conditions.
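A rough sketch of how these three metrics relate, assuming independent trials and a plain frequency estimate (the harness's own estimators, e.g. an unbiased pass@k, may differ):

```typescript
// Illustrative estimators only; the harness may compute these differently.
type Trial = { passed: boolean };

// Per-trial pass rate across a set of trials.
const passRate = (trials: Trial[]): number =>
  trials.filter(t => t.passed).length / trials.length;

// pass@k: probability that at least one of k trials passes.
const passAtK = (p: number, k: number): number => 1 - Math.pow(1 - p, k);

// pass^k: probability that all k trials pass (use for critical paths).
const passPowK = (p: number, k: number): number => Math.pow(p, k);

// delta: treatment improvement over baseline.
const delta = (baseline: Trial[], treatment: Trial[]): number =>
  passRate(treatment) - passRate(baseline);
```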
Before designing any scenario, confirm scope:
| Question | Scope | What Varies | Primary Signal |
|---|---|---|---|
| Does this instruction change agent behavior? | skill | Skill present vs absent | delta |
| Can this agent complete the task correctly? | agent | Trial-to-trial execution | pass@k, pass^k |
| Does the toolkit improve outcomes? | workflow | Bare agent vs full toolkit | delta, pass^k |
| Does this component work correctly? | none | N/A | Use unit/integration tests |
Do NOT proceed to scenario design until you can answer the scope question in one sentence.
1. Preflight → validate scenario is still discriminative
2. Define eval → scenario + assertions + grader type
3. Prepare env → set up the trial environment (files, tools, context)
4. Run eval → spawn agent with scenario, capture transcript
5. Grade eval → code grader, model grader, or human grader
6. Track results → pass@k metric over time (JSONL; see the record sketch below)
7. Report → SHIP / NEEDS WORK / BLOCKED / INSUFFICIENT_DATA
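Step 6 appends one result record per trial. A hypothetical shape for those JSONL lines (field names are illustrative, not the harness's actual schema):

```typescript
// Hypothetical shape of one JSONL line; field names are illustrative only.
interface TrialRecord {
  scenario: string;                     // scenario name
  scenarioHash: string;                 // hash preflight uses to detect drift
  condition: "baseline" | "treatment";
  trial: number;                        // 1..k
  grader: "code" | "model" | "human";
  passed: boolean;
  durationMs?: number;                  // operational cost, reported separately
  tokens?: number;
  timestamp: string;                    // ISO 8601
}
// One record per line, appended after each trial completes.
```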
REQUIRED BACKGROUND: references/preflight.md — ceiling threshold (0.8), PASS/BLOCK semantics, scenario hash mechanics.
REQUIRED BACKGROUND: references/verdict-policy.md — full verdict enum (SHIP, NEEDS WORK, BLOCKED, IMPROVED, REGRESSED, NO_CHANGE, INSUFFICIENT_DATA), why k<5 triggers INSUFFICIENT_DATA, asymmetric delta thresholds.
Before writing assertions, complete this checklist:
Scenario validity rules:
arc eval ab owns the A/B loop — it runs the same single-condition scenario twice. See references/grading-and-execution.md for environment setup, trial execution, isolation mechanics, and result tracking. See references/cli-and-metrics.md for CLI commands, metrics, and the scenario template.
Three graders: code (deterministic checks), model (intent/quality/reasoning), human (audience-dependent taste or domain expertise). Match grader to assertion nature — not convenience.
Grader selection principle: Structured output (JSON, typed fields) does not make semantic quality deterministic. An agent can return valid JSON while producing poor analysis. Code-grade structure; model-grade quality.
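A minimal sketch of that split, with hypothetical field names: the code grader accepts or rejects structure only, and deliberately says nothing about quality:

```typescript
// Sketch of the split (hypothetical fields): code-grade structure,
// leave quality to a model or human grader.
function codeGradeStructure(output: string): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(output); // deterministic: is it valid JSON?
  } catch {
    return false;
  }
  const o = parsed as Record<string, unknown>;
  // deterministic: are the typed fields present?
  return typeof o.summary === "string" && Array.isArray(o.findings);
}
// Whether `findings` is a good analysis is not decidable here; that claim
// belongs to a model or human grader working from a rubric.
```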
Model/human grader calibration: One vague model-grader preference is not release evidence. For semantic release claims, use a task-derived rubric with anchors, repeated trials, CI/variance/agreement checks, and blind comparison, human spot-check, or independent adjudication. Treat model-grader output as noisy semantic evidence, not deterministic proof.
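One possible shape for a task-derived rubric with anchors (illustrative, not a prescribed format):

```typescript
// Illustrative rubric shape: each anchor pins a score to observable behavior,
// so repeated model-grader trials can be compared for agreement instead of
// resting on free-floating preference.
const rubric = {
  criterion: "identifies the root cause, not just the symptom",
  anchors: {
    1: "restates the symptom only",
    3: "names a plausible cause with no evidence from the code",
    5: "names the cause and cites the specific code that produces it",
  },
} as const;
```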
Deterministic proxy warning: Keyword, regex, and JSON-schema checks cover facts/fields, not critique quality. If a proxy can pass a shallow or adversarial answer, tighten it with negative fixtures/traps or model/human-grade the quality claim.
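A small illustration of a negative fixture, assuming a keyword proxy (both the proxy and the fixture are hypothetical):

```typescript
// Hypothetical keyword proxy and negative fixture. The fixture is a shallow
// answer the proxy wrongly accepts; the failing assert is the signal that the
// proxy must be tightened or the quality claim moved to a model/human grader.
const keywordProxy = (out: string): boolean => /race condition/i.test(out);

const shallowFixture = "There might be a race condition somewhere."; // names the phrase, no analysis

// This assert fails on purpose: the proxy passes the shallow fixture.
console.assert(!keywordProxy(shallowFixture), "proxy too loose: add traps or model-grade quality");
```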
Report behavior separately from operational cost. A treatment can be correct but slower, more verbose, or pricier. Preserve duration/token/cost deltas when available, and do not hide operational regressions behind a passing behavioral verdict.
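One way to keep the two apart in a report, with hypothetical field names:

```typescript
// Illustrative report shape: the behavioral verdict and the operational
// deltas travel together but are never merged into one score.
interface EvalReport {
  behavioral: {
    verdict: "SHIP" | "NEEDS WORK" | "BLOCKED" | "INSUFFICIENT_DATA";
    delta: number;
  };
  operational: {
    durationDeltaMs?: number;
    tokenDelta?: number;
    costDeltaUsd?: number;
  };
}
```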
| Verdict | Meaning |
|---|---|
| SHIP | Code-graded: pass rate = 100%. Model-graded: CI95 lower bound ≥ 0.8 |
| NEEDS WORK | 60% ≤ pass rate < SHIP threshold |
| BLOCKED | pass rate < 60% |
| INSUFFICIENT_DATA | k < 5 — CI95 cannot be computed. Run more trials. |
Full verdict semantics in references/verdict-policy.md.
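A sketch of the mapping above (the harness's actual CI95 method may differ; a normal approximation is used here purely for illustration):

```typescript
// Sketch of the verdict thresholds in the table above.
type Verdict = "SHIP" | "NEEDS WORK" | "BLOCKED" | "INSUFFICIENT_DATA";

function verdict(passes: number, k: number, grader: "code" | "model"): Verdict {
  if (k < 5) return "INSUFFICIENT_DATA";           // CI95 is meaningless below k = 5
  const p = passes / k;
  if (grader === "code" && p === 1) return "SHIP"; // code-graded: 100% required
  if (grader === "model") {
    const ci95Lower = p - 1.96 * Math.sqrt((p * (1 - p)) / k);
    if (ci95Lower >= 0.8) return "SHIP";           // model-graded: CI95 lower bound ≥ 0.8
  }
  if (p >= 0.6) return "NEEDS WORK";               // 60% ≤ pass rate < SHIP threshold
  return "BLOCKED";                                // pass rate < 60%
}
```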
When pressure builds to skip or shortcut eval, these rationalizations surface. Each is a blocker in disguise.
| Excuse | Reality |
|---|---|
| "This change is too small to eval" | Size does not predict behavioral impact. A one-line prompt change can flip a verdict. Run eval — it takes minutes. |
| "Time pressure, ship now and eval later" | Eval done after shipping is a postmortem, not a gate. Ship with evidence or do not ship. |
| "Preflight blocks — I'll skip it just this once" | Preflight blocked because the scenario is no longer discriminative. Bypassing it means you cannot measure anything. Redesign the scenario. |
| "k=4 is close enough to 5" | The CI95 requires k ≥ 5 to be statistically meaningful. k=4 produces INSUFFICIENT_DATA. Run one more trial. |
| "INSUFFICIENT_DATA is advisory — I'll ship anyway" | INSUFFICIENT_DATA means you have no valid statistical basis for a verdict. Shipping on INSUFFICIENT_DATA is shipping blind. |
| "The grader raised weak_assertions but the pass rate is fine" | weak_assertions signal the assertions are not testing the right thing. A passing score on a poorly designed assertion proves nothing. Redesign the assertion. |
REQUIRED BACKGROUND: references/audit-workflow.md — how promotion and retirement arbitration works for discovered_claims and weak_assertions.
If any of these thoughts surfaces, stop, re-read the skill, and do not proceed.
Top mistakes that waste the most eval runs. Full catalog in references/common-mistakes-catalog.md.
| Mistake | What Happens | Fix |
|---|---|---|
| Scenario before question | Mixing adherence, correctness, and toolkit effects in one noisy test | State the question first: behavior change, task outcome, or toolkit effect |
| Baseline already near ceiling | Both conditions pass, delta stays tiny | Run 2-3 pilot trials first; if baseline exceeds ~0.8, redesign |
| Skill formalizes behavior agent already exhibits | A/B delta is zero — behavior is generic competence, not skill-specific | Ask "would baseline behave differently without this skill?" If no, use workflow or agent eval |
| Prompt leaks the repair pattern | Baseline follows the template and scores high without the skill | Remove explicit grader split or named repair structure from the prompt |
| Code-grading skill adherence via competence proxy | Both conditions pass, delta is zero | Mentally run the code grader against a bare agent — if it still passes, the artifact isn't discriminative |
| Using --skill-file for workflow eval | Varies the prompt instead of the environment | Workflow A/B varies the environment — use eval ab <name> without --skill-file |
| Workflow eval with no plugins installed | Baseline and treatment are identical, delta is always 0 | Ensure toolkit plugin is installed: claude plugin list should show active plugins |
Before:
After:
evals/benchmarks/latest.json

Numeric vs qualitative analysis: Numeric comparison (delta, CI, verdict) is programmatic — the harness computes it. The eval-analyzer agent adds qualitative analysis for model/human-graded A/B results; it does not replace the programmatic verdict.
Reference files:

- references/preflight.md
- references/verdict-policy.md
- references/grading-and-execution.md
- references/cli-and-metrics.md
- references/audit-workflow.md
- references/common-mistakes-catalog.md