From spec-ops
Fourth companion to `write-spec`, `refine-spec`, and `launch-spec`. `write-spec` drafts the spec; `refine-spec` verifies and tightens the **spec document**; `launch-spec` compiles the ready spec into a `/goal` driver that implements it; `verify-spec` verifies **reality against the claims** — whatever asserts them (a spec, an audit, or your own summary). It confirms that what was actually built (the committed code, the live system) matches what was claimed, grounding every claim in real source. It runs over **as many passes as it takes** until every claim has a definitive, evidence-backed verdict and a fresh judge agrees the verification is complete. **It edits nothing** — not code, not the spec — it grounds claims, reports discrepancies, and stops. Do not stop after one pass, and do not start fixing things unless the user asks.
How this skill is triggered — by the user, by Claude, or both
Slash command
/spec-ops:verify-specThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Fourth companion to `write-spec`, `refine-spec`, and `launch-spec`. `write-spec` drafts the spec; `refine-spec` verifies and tightens the **spec document**; `launch-spec` compiles the ready spec into a `/goal` driver that implements it; `verify-spec` verifies **reality against the claims** — whatever asserts them (a spec, an audit, or your own summary). It confirms that what was actually built ...
Fourth companion to write-spec, refine-spec, and launch-spec. write-spec drafts the spec; refine-spec verifies and tightens the spec document; launch-spec compiles the ready spec into a /goal driver that implements it; verify-spec verifies reality against the claims — whatever asserts them (a spec, an audit, or your own summary). It confirms that what was actually built (the committed code, the live system) matches what was claimed, grounding every claim in real source. It runs over as many passes as it takes until every claim has a definitive, evidence-backed verdict and a fresh judge agrees the verification is complete. It edits nothing — not code, not the spec — it grounds claims, reports discrepancies, and stops. Do not stop after one pass, and do not start fixing things unless the user asks.
Arguments: $ARGUMENTS
@-mention, PR / commit range, feature, or claim in the arguments. If none is given, ask with AskUserQuestion: what work or claim should I verify, and against which branch / commit? (That named branch / commit is the HEAD you ground against.)The thing under review is the hypothesis, not the evidence. A spec, audit, checklist, or prior summary is exactly what you are CHECKING — never what you check against. Ground every claim against the actual codebase at branch HEAD, the git history, and (for infra/ops) live read-only system state. Read the target in full first; if you are verifying a spec's implementation, read the spec to extract its claims, but treat it as a list of things to prove, not a source of truth.
When the target is a spec, a prior clean verification may be on record as a drift baseline — a /tmp snapshot of each AC's last verdict + method + evidence and the HEAD sha it was verified at — so a re-run re-grounds only what actually moved and surfaces regressions. The baseline is written automatically by the Stop hook on a clean pass (you never write it); you only read it at the start of a run. It is single-machine and ephemeral (it lives in /tmp, not git) — by design, since implementation and verification run on the same machine.
At the start of a spec verification:
specPath (the spec's absolute path) in the ledger, so this run's result is recorded as the next baseline.python3 "${CLAUDE_PLUGIN_ROOT}/skills/verify-spec/drift_baseline.py" load <abs-spec-path>
Drift mode — classify each baseline criterion before grounding it. Scale the staleness check to the AC's recorded method:
cli-observation → always STALE. Live infra/system state drifts with no commit, so a git diff can't vouch for it — re-ground it.static-read / measurement / exhaustive-check / test-run → run git diff <verifiedAtSHA>..HEAD -- <evidence file> (the file(s) the recorded evidence cites):
evidence + method and noting "unchanged since <verifiedAtSHA>" — the empty diff is the fresh proof the old evidence still holds.<verifiedAtSHA> is unreachable (history was rewritten) or you cannot cleanly identify the evidence file from the recorded evidence, treat the AC as STALE and re-ground it. Never carry a verdict you cannot prove still holds.Then run the normal loop for the STALE criteria and for any AC in the spec that is not in the baseline (the spec may have gained criteria — ground those fresh). FRESH-carried criteria still appear in the ledger with their evidence + method, so the completion gate and the judge treat them like any grounded claim — the gate is satisfied honestly, not bypassed.
Report drift. After grounding, any criterion whose new verdict differs from its baseline verdict has DRIFTED — most importantly confirmed → contradicted, a regression. These are the headline of a drift run; surface them prominently in the handoff.
Run the pass below repeatedly until every enumerated claim has a definitive, evidence-cited verdict and the independent judge attests the verification is complete. Then stop. Announce each pass (e.g. "Pass 2").
Enumerate → Verify → Reconcile → Judge → Re-check ↺
A Stop hook blocks you from ending your turn until the ledger shows a structurally complete, judge-signed verification, so you cannot sign off after a shallow pass. It reads a ledger you maintain at:
/tmp/claude-verify-spec-${CLAUDE_SESSION_ID}.json
At the start of the run, and whenever a verdict or the judge result changes, write the ledger with the Write tool (overwrite it each time — that also keeps it fresh so the loop doesn't expire mid-run). Write strict, valid JSON exactly matching the schema below, using the live session id in the path (an un-expanded ${CLAUDE_SESSION_ID} literal would land the ledger where the hook can't see it and silently disable the gate). The hook validates it and blocks you with a correction message if it is malformed. The /tmp ledger is the ONLY file you may write — never edit code, the spec, or any repo file.
{
"target": "<what you are verifying — a path, feature, or commit range>",
"specPath": "<absolute path of the spec under verification — set it for a spec target so the drift baseline is keyed and written; omit otherwise>",
"claims": [
{
"claim": "short text of one checkable claim",
"verdict": "unchecked",
"evidence": "",
"method": "",
"disposition": ""
}
],
"judge": {
"ran": false,
"verdict": "pending",
"missed": [],
"weakEvidence": []
},
"backwardSweep": {
"ran": false,
"base": "",
"skippedReason": "",
"findings": [
{
"hunk": "file:line / path of a substantive change mapping to NO acceptance criterion",
"evidence": "git sha / file:line proving the change",
"disposition": "intended | unintended | unsure",
"proposedAC": "candidate AC text for the behavior this hunk introduces"
}
]
}
}
specPath is the absolute path of the spec when the target is a spec — set it so the hook can write the drift baseline for this verification on a clean pass (see Drift re-check). Omit it for a non-spec target (a PR range, a bare claim). It is shape-validated only and never gates.verdict is one of: unchecked (not yet verified) · confirmed (true — evidence must cite the source) · contradicted (false — evidence must cite the source and state the actual value) · unverifiable (cannot be grounded — disposition must record the user's call).evidence must be concrete ground truth: a file:line, a git show <sha> hunk, or read-only CLI output. Never the spec, an audit, a checklist, or your own prior output.method records how the claim was grounded (step 2's evidence standard): static-read, measurement, exhaustive-check, cli-observation, or test-run. Required for every confirmed/contradicted claim; it is what makes the evidence standard auditable (a measurable-threshold claim whose method is static-read is under-verified by construction — the judge flags it). Leave empty for unverifiable.judge records the independent judge's result (step 4): ran true once it has run, verdict one of pending / gaps / complete, missed = checkable claims it found were absent from the ledger, weakEvidence = confirmed/contradicted claims whose evidence it found hollow, stale, or doc-based.backwardSweep records the backward-coverage pass (step 2's backward direction): ran true once it has run, base the diff base swept, skippedReason why it was skipped (if it was — e.g. no determinable base when running autonomously), and findings the substantive hunks that map to no AC-id. Findings are report-only — they never block the stop (exactly like contradicted claims). Omit backwardSweep entirely when the target is not a spec implementation with a diff to walk.What the hook enforces vs. what you own. The hook mechanically requires: no claim left unchecked; every confirmed/contradicted claim cites non-empty evidence and records a non-empty method; every unverifiable claim has a disposition; and the judge ran with verdict: "complete" and empty missed/weakEvidence. It checks the method is present, never whether it fits the claim — that the standard matches what the AC asserts is the judge's call. It cannot see whether you enumerated every claim or whether a citation is genuine — that is the judge's job, which is why the judge's sign-off is itself gated. The hook is a backstop against a shallow stop, not a substitute for an honest verification. contradicted claims do not block the stop — they are the findings you report; backwardSweep.findings are the same — reported, never gate-blockers (the hook shape-validates the field but never gates on it; the judge attests the sweep actually ran). If the user redirects to unrelated work, delete the ledger file and stop.
From the target, extract every checkable assertion as a discrete claim: "added X", "removed Y", "the system now does Z", "endpoint A returns B", "table / column / route / config exists", "resource is named N", counts, and every "currently / now X" statement. Add them all to the ledger as unchecked. Be exhaustive — an unenumerated claim is an unverified claim, and missing one is worse than a slow pass.
If the target is a spec with an ## Acceptance Criteria section, seed the ledger with every AC-id as a claim first — each criterion is the contract the implementation is judged against, so a missing AC-id is the worst kind of missed claim. Carry the AC-id in the claim text. Then add any other checkable claims from the body. A criterion that grounds out contradicted is the exact "requirement built wrong / not built" finding this gate exists to surface. If the target genuinely has no checkable claims, say so plainly and stop (record one unverifiable claim noting that, dispositioned "no checkable claims in target") — don't invent claims to fill the ledger.
Dispatch parallel Explore subagents (the Task tool, subagent_type: Explore) to ground the claims — they are read-only and fast. Split the claims by area and scale the agent count to the size of the work. Require, for every claim, a verdict with cited ground-truth evidence:
For each claim return
confirmed/contradicted(state the actual value) /unverifiable, cite the evidence — afile:line, agit show <sha>/git diffhunk, or read-only CLI output — and record themethod(how you grounded it:static-read/measurement/exhaustive-check/cli-observation/test-run). You are forbidden from citing the spec, an audit, a checklist, or any doc as evidence — only real source. If you cannot ground it, returnunverifiableand say why. Do not speculate.
Ground truth, in order of authority: the codebase at branch HEAD; the git history (git log / git diff / git show on the working branch — re-ground against HEAD, since claims drift after out-of-band commits and merges); and, for infra / ops claims, live state via read-only CLI (e.g. aws … describe/list/get, gh api).
The strength of evidence a claim needs is a function of what it asserts — infer it from the claim's text, no type tags required. A bare code citation is not automatically sufficient:
method: measurement.grep/AST sweep proving the absence — not one representative citation that happens to comply → method: exhaustive-check.method: static-read / test-run.aws … get-public-access-block, never forced into a test suite → method: cli-observation.This is the real value of "typing" without the formality: it closes the gap where a perf or security constraint gets rubber-stamped by code-reading. Record the technique in each claim's method; the judge (step 4) flags any whose method falls short of what the assertion demands.
The claim grounding above is the forward direction (every claim/AC-id has evidence). When the target is a spec with an ## Acceptance Criteria section, also run the backward direction: every substantive change in the implementation diff should map to an owning AC-id. A hunk that maps to none is the finding this pass exists to surface — scope creep, silent reinterpretation, or a derived requirement (a real behavior built with no criterion). The forward judge can't see it, because it re-derives ACs forward and never looks at code with no AC.
git merge-base HEAD main then git diff <base>..HEAD; when you are on the trunk, diff against the upstream instead (git diff origin/main..HEAD — the unpushed commits). Record the base used in backwardSweep.base. If no base is determinable: on direct human invocation, ask with AskUserQuestion; running autonomously as the /goal done-gate, set backwardSweep.skippedReason and proceed — never block on it.description / version edits are docs (allowlist); dependency / entrypoint / script / permission edits are substantive. Docs or config that an AC explicitly governs (e.g. an AC "the README documents X") are not a backward finding — that artifact's coverage is a forward concern, checked as its own claim.backwardSweep.findings with its evidence, propose candidate AC text (proposedAC), and triage disposition: intended (→ "add this AC and re-run refine-spec") vs unintended (→ "remove or justify") vs unsure. This is always a non-blocking report — it never holds the gate, you edit nothing, and you never auto-reopen refine-spec. On direct human invocation you may use AskUserQuestion to confirm an ambiguous intent, but never as a gate dependency (the autonomous done-gate has no human to ask).Update each claim's verdict, evidence, and method (how it was grounded) in the ledger. For every contradicted claim, capture a precise discrepancy: claim → expected → actual → evidence (file:line / sha / CLI). For an unverifiable claim, dig further; if it genuinely can't be grounded (e.g. it depends on runtime state you can't observe), ask the user with AskUserQuestion and record their disposition. If many claims come back unverifiable, batch the dispositions into one AskUserQuestion rather than asking per claim. Never guess a verdict to clear the gate.
Also write the backward sweep into the ledger's backwardSweep: set ran true, record the base swept (or skippedReason), and list every unmapped substantive hunk in findings. These are reported, never blockers, and never edited away.
The agent that did the verifying does not get to declare it complete. Before you try to stop, dispatch a fresh verification judge — an independent Task subagent (subagent_type: Explore, so it can re-check source read-only) with no memory of your passes; hand it only the target and the ledger, not your reasoning. Instruct it to be adversarial: independently re-derive the checkable claims from the target and (a) list any that are missing from the ledger (missed) — and if the target has an Acceptance Criteria section, confirm every AC-id appears as a claim, treating any absent one as missed; (b) list any confirmed/contradicted claim whose cited evidence is hollow, stale, doc-based rather than real source, or below the standard the claim demands — a measurable threshold "verified" only by code-reading, a universal invariant by a single example, a method that doesn't match what the assertion needs (weakEvidence); and (c) return verdict: "complete" only if both lists are empty, else "gaps". Write its result into the ledger's judge field. Every entry in missed/weakEvidence becomes another pass.
Also have the judge attest the backward sweep (when the target is a spec implementation): that backwardSweep.ran is honest — it actually walked the diff against the right base, or recorded a legitimate skippedReason — and that no obviously substantive hunk was left out of findings. A backward gap the judge spots (an unmapped hunk that should have been a finding) is its own pass; but the findings themselves never block verdict: "complete" — they are reports, not unresolved claims.
If the judge reported gaps, loop: enumerate the missed claims, re-ground the weakEvidence ones, then re-run the judge. Before stopping, make the ledger reflect reality — every claim verdicted, every verdict cited, every unverifiable dispositioned, and the judge complete. The Stop hook bounces you back here if anything is still unchecked, uncited, undispositioned, or the judge hasn't signed off.
Finish only when all of these hold (the first four + the judge sign-off are hook-enforced; enumeration completeness and evidence authenticity are what the judge attests):
AC-id is present as a claim with a grounded verdict (a contradicted AC is a finding to report, not a blocker).unchecked.confirmed / contradicted verdict cites concrete ground-truth evidence (file:line / git sha / read-only CLI), never a doc.confirmed / contradicted claim records a method (how it was grounded), and that method meets the standard the claim asserts — a measurable threshold has a measurement, a universal invariant an exhaustive check (presence is hook-enforced; fit is judge-attested via weakEvidence).unverifiable claim carries an explicit user disposition.complete — no missed claims, no weak / doc-based evidence.backwardSweep.findings. These are reports, never gate-blockers (judge-attested, not hook-enforced).When the gate passes, report a per-claim verdict table — claim | verdict | method | evidence (file:line / sha / CLI) — plus a short discrepancy summary for every contradicted claim (expected vs actual). State plainly how many claims were confirmed vs contradicted vs unverifiable. When the target is a spec, render it as a compact coverage matrix keyed on AC-id — AC-id | verdict | method | evidence | checklist-item (the implementing Checklist task, from the spec's AC-id citations) — where an empty cell is itself the finding: an AC-id with no evidence is a forward gap, and one no Checklist item cites is a criterion nobody implemented. Keep the matrix generated from the ledger — never an author-maintained document. If the backward sweep ran, add a short backward-coverage section listing every unmapped substantive hunk with its proposed AC and triage (intended / unintended / unsure), or state "every change maps to an AC" when it found nothing (and note it if the sweep was skipped, with the reason). These proposed ACs are also recorded — by the Stop hook, to /tmp — as a spec-amendment handoff that refine-spec ingests on its next run, so the missed requirement reaches the spec without a manual re-key; verify-spec still edits nothing. If this was a drift run, lead the report with a Drift line — counts of FRESH (carried) vs re-grounded vs DRIFTED — and list every DRIFTED criterion as AC-id → baseline verdict → new verdict → evidence, calling out any confirmed → contradicted regression first; if there was no baseline, say so plainly (a full verification ran). Stop there — do not fix the discrepancies unless the user asks; surfacing them is the deliverable (enumerate, then let the user decide). The Stop hook clears the ledger automatically once the gate passes.
Bash and Write, but its contract is strictly read-only:
git log/diff/show, aws … describe/list/get, gh api GETs, grep, cat. Never run anything that writes, creates, deletes, applies, deploys, commits, pushes, or redirects (> / >>, sed -i, rm, mv, terraform apply, aws … create/put/delete, git commit/push/checkout)./tmp ledger. Never write or edit a repo file, the spec, or code. (The drift baseline and the spec-amendment handoff — both in /tmp, never the repo — are written by the Stop hook, never by you; you only read the drift baseline at run start.)refine-spec, and never a gate-blocker. Keep the noise filter tight: behavior-bearing changes only; refactors, formatting, tests, CI, config churn, and docs are allowlisted.contradicted claims means "hand back a bug list", not "go fix N things".npx claudepluginhub zakattack9/agentic-coding --plugin spec-opsCreates bite-sized, testable implementation plans from specs or requirements, with file structure and task decomposition. Activates before coding multi-step tasks.