Grade implementation work against bead acceptance criteria using a separate judge agent. Use after subagent work passes mechanical gates, as a pre-merge check, or on-demand to evaluate existing features. The evaluator is NOT the orchestrator and NOT the implementer — it only judges. Integrates with browser-qa for runtime verification when CDT MCP is available.
```shell
npx claudepluginhub rbergman/dark-matter-marketplace --plugin dm-work
```

This skill uses the workspace's default tool permissions.
Separate the agent doing work from the agent judging it. This is more tractable than making one agent self-critical.
The orchestrator calls the evaluator in these situations:
- After subagent work passes mechanical gates, as a pre-merge check
- `/dm-work:evaluate <bead-id>` to test an existing feature against its criteria
- `/dm-work:post-merge` to run the evaluator against closed beads

The intent review and evaluator have complementary scope: intent review checks that the code covers the spec, while the evaluator checks runtime behavior against the acceptance criteria. Skip the evaluator when no runtime testing is possible; its value over intent review is then minimal.
```
# Use model="haiku" for code-only evaluation with simple criteria (no browser-qa)
Task(subagent_type="general-purpose", model="opus", description="Evaluate against acceptance criteria", prompt="
ROLE: Evaluator. You judge work against acceptance criteria. You do NOT implement or fix.

BEAD: <id>

ACCEPTANCE CRITERIA (from bead --design field):
<numbered list of criteria>

CODE DIFF:
<git diff output or summary of changes>

EVALUATION PROCESS:
1. Classify each criterion:
   - RUNTIME: requires browser interaction to verify (\"user can...\", \"page shows...\", \"form validates...\")
   - CODE: verifiable from code inspection (\"function exists\", \"type is correct\", \"test passes\")
2. If browser-qa available (CDT MCP connected, app running at <url>):
   - Activate dm-work:browser-qa
   - For each RUNTIME criterion: navigate, interact, assert
   - For each CODE criterion: inspect the diff
3. If browser-qa NOT available:
   - For each CODE criterion: inspect the diff
   - For each RUNTIME criterion: mark UNTESTABLE with reason
   - If ALL criteria are UNTESTABLE: return early with overall: SKIP
4. Grade each criterion: PASS / FAIL / UNTESTABLE
   - PASS: criterion is satisfied (code or runtime evidence)
   - FAIL: criterion is not satisfied (describe what's wrong)
   - UNTESTABLE: cannot verify without runtime / missing prerequisite

SKILLS: dm-work:browser-qa (if CDT MCP available)

OUTPUT FORMAT (JSON to stdout):
{
  \"bead_id\": \"<id>\",
  \"criteria_results\": [
    {
      \"criterion\": 1,
      \"text\": \"User can navigate to /settings\",
      \"type\": \"RUNTIME\",
      \"result\": \"PASS\",
      \"detail\": \"Navigated to /settings, page loads with profile form visible\"
    },
    {
      \"criterion\": 2,
      \"text\": \"Email validates client-side\",
      \"type\": \"RUNTIME\",
      \"result\": \"FAIL\",
      \"detail\": \"Entered invalid email 'notanemail', no validation error shown\"
    }
  ],
  \"overall\": \"FAIL\",
  \"pass_count\": 1,
  \"fail_count\": 1,
  \"untestable_count\": 0,
  \"summary\": \"1/2 criteria pass. Email validation missing on client side.\"
}

RULES:
- Judge ONLY against the listed acceptance criteria. Do not invent requirements.
- PASS means the criterion is satisfied, not that the code is perfect.
- Report what you observed, not what you assumed.
- If a criterion is ambiguous, grade it and note the ambiguity in detail.
- Do NOT modify code, commit, or close beads.
")
```
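The orchestrator side of this contract can be sketched in Python. This is a hedged illustration, not part of the skill: `decide` is a hypothetical helper name, and only the JSON fields from the output format above are assumed.

```python
import json

def decide(evaluator_stdout: str) -> str:
    """Map evaluator JSON to an orchestrator action (sketch).

    Returns "merge", "skip-tested" (all criteria untestable), or "rework".
    """
    report = json.loads(evaluator_stdout)
    overall = report["overall"]
    if overall == "PASS":
        return "merge"        # all gradable criteria satisfied
    if overall == "SKIP":
        return "skip-tested"  # nothing was testable; evaluator adds no signal
    # FAIL: sanity-check that fail_count matches the per-criterion results
    failed = [c for c in report["criteria_results"] if c["result"] == "FAIL"]
    assert report["fail_count"] == len(failed)
    return "rework"
```

The sanity check guards against an evaluator that reports inconsistent counts, which would otherwise silently skew the failure-rate threshold.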
The orchestrator processes evaluator output:
- `overall: PASS` → proceed to merge
- `overall: SKIP` → all criteria untestable; proceed (the evaluator adds no value here)
- `overall: FAIL` → file a bug bead for each failed criterion, then rework:

```shell
bd create --title="Eval: <failed criterion>" --type=bug --priority=2
bd dep add <new-bead> discovered-from:<parent-bead>
```

At >50% failures it is likely a spec problem: escalate to the user, don't iterate.
Circuit breaker: If evaluator fails twice on the same criterion after rework, escalate to user. Don't loop.
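That circuit breaker can be sketched as a small piece of orchestrator state, assuming a per-criterion failure count is kept between evaluation rounds (`FailureTracker` is a hypothetical name, not part of the skill):

```python
from collections import defaultdict

class FailureTracker:
    """Escalate to the user once a criterion fails twice after rework."""

    def __init__(self, max_failures: int = 2):
        self.max_failures = max_failures
        self.failures = defaultdict(int)  # criterion text -> consecutive FAIL count

    def record(self, criterion: str, result: str) -> str:
        """Return the next action for this criterion: "ok", "rework", or "escalate"."""
        if result != "FAIL":
            self.failures[criterion] = 0  # a pass resets the counter
            return "ok"
        self.failures[criterion] += 1
        if self.failures[criterion] >= self.max_failures:
            return "escalate"  # don't loop; hand the decision to the user
        return "rework"
```

Keying on the criterion text rather than its index keeps the counter stable if criteria are reordered between rounds.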
Not all projects use browser-qa. The evaluator should adapt:
| Project type | Verification method | Evaluator behavior |
|---|---|---|
| Standard web app | browser-qa (CDT MCP) | Full runtime evaluation |
| WebGL / Canvas game | Manual screenshots + human verification | Mark runtime criteria UNTESTABLE; take screenshots if CDT available for visual reference, but can't assert on canvas content |
| Native iOS/Android | Maestro or platform-specific tools | Mark runtime criteria UNTESTABLE unless project has automated UI test tooling wired |
| CLI tool | Bash execution + output assertion | Code-only evaluation; test commands via bash, not browser |
| API / backend | curl / httpie + response assertion | Code-only for endpoints; evaluate_script or direct API calls |
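For the CLI-tool row, "bash execution + output assertion" might look like the following sketch: run the command, then grade from exit status and stdout. The function name and the 30-second timeout are illustrative choices, not part of the skill.

```python
import subprocess

def grade_cli_criterion(argv: list[str], expected_substring: str) -> str:
    """Run a CLI command and grade PASS/FAIL/UNTESTABLE from its output (sketch)."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return "UNTESTABLE"  # command missing or hung: cannot verify
    if proc.returncode == 0 and expected_substring in proc.stdout:
        return "PASS"
    return "FAIL"
```

Mapping a missing or hanging command to UNTESTABLE (rather than FAIL) matches the grading rules in the evaluator prompt: absence of evidence is a missing prerequisite, not a failed criterion.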
When runtime verification isn't possible, the evaluator should mark runtime criteria UNTESTABLE with a reason, grade code criteria from the diff, and return `overall: SKIP` if every criterion is untestable.
| Component | How evaluator connects |
|---|---|
| Orchestrator | Calls evaluator as Step 1.5 in post-subagent verification |
| Browser-qa | Evaluator activates browser-qa skill for standard web apps |
| Beads | Reads acceptance criteria from bead; files new beads for failures |
| Sprint contracts | Acceptance criteria in bead ARE the sprint contract |
| Post-merge review | Post-merge command uses evaluator for closed beads |
| Intent review | Complementary: intent checks code coverage, evaluator checks behavior |
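The Beads row (filing new beads for failures) can be sketched by building `bd` invocations from the evaluator report. This is a dry run that returns argv lists rather than executing anything, and it uses only the `bd create` and `bd dep add` forms shown earlier; `NEW_BEAD` is a placeholder for the id `bd create` would emit.

```python
def bead_commands(report: dict, parent_bead: str) -> list[list[str]]:
    """Build bd commands for each failed criterion (dry run; nothing is executed)."""
    commands = []
    for c in report["criteria_results"]:
        if c["result"] != "FAIL":
            continue
        commands.append(
            ["bd", "create", f"--title=Eval: {c['text']}", "--type=bug", "--priority=2"])
        # NEW_BEAD stands in for the id printed by bd create above
        commands.append(
            ["bd", "dep", "add", "NEW_BEAD", f"discovered-from:{parent_bead}"])
    return commands
```

Passing each flag as its own argv element avoids shell quoting issues when a criterion's text contains spaces or quotes.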