From agentheim
Spawns a fresh-context verifier agent to inspect diffs against acceptance criteria after a worker returns SUCCESS, catching plausible-wrong, partial, or scope-drift implementations before commit.
npx claudepluginhub heimeshoff/agentheim --plugin agentheim
How this skill is triggered: by the user, by Claude, or both
Slash command
/agentheim:verification-before-completion
The summary Claude sees in its skill listing, used to decide when to auto-load this skill
The worker just returned `RESULT: SUCCESS`. Trust nothing yet. A worker self-reporting success has the strongest possible incentive to call its own work done — every cognitive bias is aligned against catching its own mistakes. This skill is the structural answer: a separate agent, fresh context, reading the diff against the acceptance criteria as if seeing the work for the first time.
LLM workers fail in three distinctive ways that internal self-checks rarely catch: plausible-but-wrong implementations, partial implementations, and scope drift.
The verifier catches all three because it reads only the task spec and the diff — it has no exposure to the worker's reasoning trail and no investment in the solution.
If the `work` skill ran the checks inline, it would do so in the same context that just dispatched the worker and is about to commit. That context carries momentum toward "ship it". A separately spawned agent reads only what it's handed (the task file, the BC README, the diff, the test output) and produces a verdict without that momentum. This is the load-bearing structural property; do not collapse it into a function call.
The `work` skill spawns the verifier with:
- The task file (in `doing/` for unverified-success tasks)
- The diff (`git diff` plus a list of changed files, or a generated patch attached as text)
- `.agentheim/vision.md`, `.agentheim/context-map.md`, `.agentheim/knowledge/decisions/` (verifier reads on demand)

The verifier is explicitly NOT given the worker's reasoning trail.
In order, stopping at the first failing check:
1. Acceptance criteria coverage. Every `- [ ]` bullet in the task's `## Acceptance criteria` section maps to either (a) an executable test in the diff that would fail without the production code change, or (b) for the legitimate TDD-skip categories, a concrete artifact the verifier can inspect (ADR file, config validation, integration smoke check, manual exercise note).
2. Test execution. If `TESTS_ADDED > 0`, the verifier runs the test suite and confirms that `TESTS_PASSING: yes` is true now, not just at the moment the worker reported it.
3. Scope discipline. The diff touches only files the task implies. Out-of-scope changes are a FAIL even when they look like good ideas; the verifier surfaces them as candidate backlog items rather than approving them.
4. Ubiquitous language. Names introduced in the diff match the BC's README. A new term that doesn't appear in the README is a FAIL with a fix suggestion: add the term to the README first, or rename to match an existing term.
5. BC README sync. If the worker introduced new aggregates, events, commands, or invariants, did it report `BC_README_UPDATED: yes`, and does the README actually reflect them? A `yes` in the return block without a corresponding diff to the README is a FAIL.
6. ADRs for decisions. If the diff embeds a decision a future maintainer would ask about (library choice, pattern choice, an invariant chosen over alternatives), is there a corresponding ADR in `ADRS_WRITTEN`? A missing ADR for a real decision is a FAIL.
7. No protocol or git tampering. The diff must not touch `.agentheim/knowledge/protocol.md` (`work` owns it), and the worker's output must contain no git operations. A violation is a FAIL: the worker broke a protocol rule.
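The "in order, stopping at the first failing check" behaviour can be sketched as a short-circuiting pipeline. This is a minimal illustration, not the verifier's actual implementation: the `Submission` container, the check functions, and `run_checks` are hypothetical names, and only three of the seven checks are shown.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Submission:
    """Hypothetical container for what the verifier inspects."""
    changed_files: list[str]
    allowed_files: set[str]      # files the task implies (assumed known here)
    criteria: list[str]          # acceptance-criteria bullets
    evidence: dict[str, str]     # criterion -> covering test or artifact


# A check returns None on success, or a defect message on failure.
Check = Callable[[Submission], Optional[str]]


def check_coverage(s: Submission) -> Optional[str]:
    missing = [c for c in s.criteria if c not in s.evidence]
    return f"uncovered criteria: {missing}" if missing else None


def check_scope(s: Submission) -> Optional[str]:
    extra = sorted(f for f in s.changed_files if f not in s.allowed_files)
    return f"out-of-scope files: {extra}" if extra else None


def check_no_protocol_tampering(s: Submission) -> Optional[str]:
    if ".agentheim/knowledge/protocol.md" in s.changed_files:
        return "diff touches .agentheim/knowledge/protocol.md"
    return None


# Order matters: the first failing check decides the verdict.
ORDERED_CHECKS: list[tuple[str, Check]] = [
    ("acceptance criteria coverage", check_coverage),
    ("scope discipline", check_scope),
    ("no protocol or git tampering", check_no_protocol_tampering),
]


def run_checks(s: Submission) -> tuple[str, Optional[str]]:
    for name, check in ORDERED_CHECKS:
        defect = check(s)
        if defect is not None:
            return "FAIL", f"{name}: {defect}"   # stop at the first failure
    return "PASS", None
```

Stopping early keeps the FAIL report focused on one defect class per pass, which is what the next worker needs to act on.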
The verifier returns one of three verdicts. Strict format — work parses these deterministically.
VERDICT: PASS
The diff is committable. `work` proceeds to move the task to `done/` (if the worker didn't already), runs `git add` on the `FILE_LIST` and ancillary files, and commits.
VERDICT: PASS
TASK_ID: <id>
EVIDENCE: <one line per acceptance criterion, naming the test or artifact that covers it>
VERDICT: FAIL
The diff is not committable. `work` rolls back the worker's claim of completion and re-dispatches.
VERDICT: FAIL
TASK_ID: <id>
REASONS:
- <one bullet per concrete defect, citing the file/line where possible>
SUGGESTED_FIX: <brief — what the next worker should do>
ITERATION_HINT: <"likely fixable with another worker pass" | "task is under-specified — consider bouncing to backlog">
VERDICT: SKIP
Rare. The task is `type: decision` with an ADR-only deliverable, or the verifier determines that there is nothing executable to verify and that reading the artifact against the spec is a job for the user, not the verifier. `work` treats this as PASS but logs it differently.
VERDICT: SKIP
TASK_ID: <id>
REASON: <why verification doesn't usefully apply to this task>
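Because the verdict format is strict, it can be parsed with a few lines of code. The following is a sketch of one possible parser, not the one `work` actually uses. The function name `parse_verdict` is hypothetical, and it assumes that every field starts a line with an upper-case key followed by a colon and that any other line (such as a `- ` bullet) continues the previous field.

```python
import re

# A field line looks like "KEY: value"; keys are upper-case with underscores.
FIELD = re.compile(r"^([A-Z_]+):\s*(.*)$")

# Fields each verdict must carry, taken from the three templates above.
REQUIRED = {
    "PASS": {"TASK_ID", "EVIDENCE"},
    "FAIL": {"TASK_ID", "REASONS", "SUGGESTED_FIX", "ITERATION_HINT"},
    "SKIP": {"TASK_ID", "REASON"},
}


def parse_verdict(text: str) -> dict[str, list[str]]:
    """Parse a verdict block into {field: [values]}; raise ValueError if malformed."""
    fields: dict[str, list[str]] = {}
    current = None
    for raw in text.strip().splitlines():
        line = raw.strip()
        if not line:
            continue
        match = FIELD.match(line)
        if match:
            current = match.group(1)
            fields[current] = [match.group(2)] if match.group(2) else []
        elif current is not None:
            # Continuation line: a bullet or extra line under the previous field.
            fields[current].append(line.removeprefix("- ").strip())
        else:
            raise ValueError(f"unexpected line before any field: {line!r}")

    verdict = " ".join(fields.get("VERDICT", []))
    if verdict not in REQUIRED:
        raise ValueError(f"unknown or missing verdict: {verdict!r}")
    missing = REQUIRED[verdict] - fields.keys()
    if missing:
        raise ValueError(f"{verdict} verdict is missing fields: {sorted(missing)}")
    return fields
```

Rejecting any block with an unknown verdict or a missing field means a malformed verifier reply is never mistaken for a PASS.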
What `work` does with each verdict
The operational integration lives in `skills/work/SKILL.md`. In short:
- PASS: move the task to `done/` (if needed), commit, log "Task verified and completed" to `protocol.md`.
- FAIL: add a `## Verifier note` block to the task, revert the task's frontmatter `status: done` back to `status: doing`, move it back from `done/` to `doing/` if the worker already moved it, log "Verification failed" to `protocol.md`, and re-dispatch a worker on the same task with the verifier note included in its prompt. Hard cap: 2 re-dispatches per task.
- FAIL after the cap: leave the task in `doing/` with all accumulated verifier notes. Log "Verification failed — escalating to user" to `protocol.md`. Surface at end of batch.

The re-dispatch loop has a cap because beyond two retries you're almost always looking at an under-specified task that needs refinement (the modeller's job), not another execution attempt.
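The capped re-dispatch loop described above can be sketched as follows. This is an illustration under assumptions, not the logic in `skills/work/SKILL.md`: `verify_with_retries`, `dispatch_worker`, and `run_verifier` are hypothetical names, and file moves, frontmatter edits, and logging to `protocol.md` are left out.

```python
from typing import Callable

MAX_REDISPATCHES = 2  # hard cap stated in the text


def verify_with_retries(
    dispatch_worker: Callable[[list[str]], None],
    run_verifier: Callable[[], tuple[str, str]],
) -> tuple[str, list[str]]:
    """Return ("commit" | "escalate", accumulated verifier notes).

    Assumes the first worker has already returned SUCCESS before this is called.
    """
    notes: list[str] = []
    redispatches = 0
    while True:
        verdict, note = run_verifier()
        if verdict in ("PASS", "SKIP"):      # SKIP is treated as PASS
            return "commit", notes
        notes.append(note)                   # keep every verifier note
        if redispatches == MAX_REDISPATCHES:
            return "escalate", notes         # cap reached: surface to the user
        redispatches += 1
        dispatch_worker(list(notes))         # next worker sees the notes so far
```

With a cap of two re-dispatches the verifier runs at most three times per task: once for the original attempt and once after each retry.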
The user can disable verification for a work batch by invoking work with --no-verify or by saying "skip verification this run". This is for exploratory throwaway batches. The default is always verify; the opt-out is never persistent.
work also skips verification automatically when:
- The worker returned `RESULT: BOUNCED` or `RESULT: FAILED` (nothing to verify).
- The task is `type: decision` AND the ADR was the only artifact AND `FILES_CHANGED == 1` (just the ADR file): auto-SKIP without spawning the verifier.

When the worker followed test-driven-development, the verifier's first check (acceptance criteria coverage) becomes trivial: every criterion has a named test. When the worker skipped TDD, the verifier has to re-derive the test space and judge whether non-test evidence covers each criterion. Both flows are supported; TDD makes verification an order of magnitude cheaper.
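Returning to the opt-out and automatic-skip rules, the decision of whether to spawn the verifier at all can be sketched as a small function. The `WorkerReturn` fields and the returned labels are hypothetical; only the conditions themselves come from the text.

```python
from dataclasses import dataclass


@dataclass
class WorkerReturn:
    """Hypothetical summary of a worker's return block."""
    result: str         # "SUCCESS", "BOUNCED", or "FAILED"
    task_type: str      # e.g. "decision"
    files_changed: int  # value of FILES_CHANGED
    adr_only: bool      # True if the ADR was the only artifact


def verification_action(ret: WorkerReturn, no_verify: bool = False) -> str:
    """Decide what to do about verification for one returned task."""
    if no_verify:
        return "skip-user-opt-out"       # per-batch opt-out, never persistent
    if ret.result in ("BOUNCED", "FAILED"):
        return "nothing-to-verify"
    if ret.task_type == "decision" and ret.adr_only and ret.files_changed == 1:
        return "auto-skip"               # ADR-only decision task
    return "spawn-verifier"              # the default is always to verify
```

For example, `verification_action(WorkerReturn("SUCCESS", "feature", 4, False))` returns `"spawn-verifier"`, while the same call with `no_verify=True` returns `"skip-user-opt-out"`.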