Verifies code changes by running the app and observing runtime behavior at CLI, API, GUI, or other surfaces. Use for PR validation, fix confirmation, manual testing, or local change checks.
`npx claudepluginhub crouton-labs/crouton-kit --plugin devcore`
This skill uses the workspace's default tool permissions.
**Verification is runtime observation.** You build the app, run it, drive it to where the changed code executes, and capture what you see. That capture is your evidence. Nothing else is.
Don't run tests. Don't typecheck. Not as a warm-up, not "just to be sure," not as a regression sweep after. CI ran both before you got here; running them again only proves you can run CI. The time goes to running the app instead.
Don't import-and-call. `import { foo } from './src/...'` then
`console.log(foo(x))` is a unit test you wrote. The function did what
the function does — you knew that from reading it. The app never ran.
Whatever calls `foo` in the real codebase ends at a CLI, a socket, or
a window. Go there.
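A minimal sketch of the difference, assuming the changed function ultimately surfaces in a CLI; `mytool`, its path, and its flags are stand-ins, not a real binary in this repo:

```bash
# Import-and-call: a unit test you just wrote. The app never ran.
node -e "import('./src/dates.js').then(m => console.log(m.parseFrom('2024-01-01')))"

# Runtime observation: drive the surface the user meets and keep the capture.
./bin/mytool export --from 2024-01-01 > /tmp/export.out 2>&1
echo "exit: $?"
```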
Establish the full range first — a branch may be many commits:
git log --oneline @{u}.. # count commits
git diff @{u}.. --stat # full range, not HEAD~1
gh pr diff # if in a PR context
State the commit count in your report. Large diff truncating? Redirect:
`git diff @{u}.. > /tmp/d`, then Read it. No diff at all → say so, stop.
The diff is ground truth. The PR description is a claim about it. Read both. If they disagree, that's a finding.
The surface is where a user — human or programmatic — meets the change. That's where you observe.
| Change reaches | Surface | You |
|---|---|---|
| CLI / TUI | terminal | type the command, capture the pane |
| Server / API | socket | send the request, capture the response |
| GUI | pixels | drive it under xvfb/Playwright, screenshot |
| Library | package boundary | sample code through the public export — import pkg, not import ./src/... |
| Prompt / agent config | the agent | run the agent, capture its behavior |
| CI workflow | Actions | dispatch it, read the run |
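For the CLI row, for instance, the capture that counts as evidence might look like this; the binary and flag are placeholders, not something in this repo:

```bash
# Type the command a user would type, in an isolated tmux server, and keep the pane.
tmux -L verify new-session -d -s cli
tmux -L verify send-keys -t cli "./bin/mytool export --from 2024-01-01" Enter
sleep 2
tmux -L verify capture-pane -t cli -p > /tmp/cli-pane.txt
```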
Internal function? Not a surface. Something in the repo calls it and that caller ends at one of the rows above. Follow it there. A bash security gate's surface isn't the function's return value — it's the CLI prompting or auto-allowing when you type the command.
No runtime surface at all — docs-only, type declarations with no emit, build config that produces no behavioral diff — report SKIP — no runtime surface: (reason). Don't run tests to fill the space.
Tests in the diff are the author's evidence, not a surface. CI runs them. You'd be re-running CI. Tests-only PR → SKIP, one line. Mixed src+tests → verify the src, ignore the test files. Reading a test to learn what to check is fine — it's a spec. But then go run the app. Checking that assertions match source is code review.
Check .claude/skills/ first — even if you already know how to
build and run. A matching verifier-* skill is the repo's
evidence-capture protocol: it wraps the session in whatever
recording/screenshot mechanism the review pipeline consumes. Drive
the surface without it and you get a verdict with no replay.
Run `ls .claude/skills/` and match what you find:
- verifier-* matching your surface (CLI verifier for a CLI change,
  etc.) → invoke it with the Skill tool and follow its setup.
- Mismatched surface → skip that one, try the next.
- Stale verifier (fails on mechanics unrelated to the change) → ask
  the user whether to patch it; don't FAIL the change for verifier
  rot.
- run-* but no matching verifier → use its build/launch primitives
  as your handle.
- Nothing matching at all → fall back to the /run-skill-generator
  prompt.
- Got through without a verifier → mention /init-verifiers in your
  report so next time is faster.

Plan the smallest path that makes the changed code execute.
Read your plan back before running. If every step is build / typecheck / run test file — you've planned a CI rerun, not a verification. Find a step that reaches the surface or report BLOCKED.
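A sketch of what such a plan can look like for a hypothetical server-side change; the package scripts, port, and route are assumptions, not anything from this repo:

```bash
# Setup, not a step: install and build once.
npm ci && npm run build

# Launch the real server, not a test harness.
node dist/server.js --port 8080 > /tmp/server.log 2>&1 &
sleep 2

# The step that makes the changed code execute: hit the route the diff touched.
curl -sS -i "http://localhost:8080/api/export?from=2024-01-01" | tee /tmp/response.txt

# One probe off the happy path, at the same surface.
curl -sS -i "http://localhost:8080/api/export?from=not-a-date" | tee /tmp/probe.txt
```

Every line after the build touches the running app; nothing in it re-runs CI.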
The verdict is table stakes. Your observations are the signal. A PASS with three sharp "hey, I noticed…" lines is worth more than a bare PASS. You're the only reviewer who actually ran the thing — anything that made you pause, work around, or go "huh" is information the author doesn't have. Don't filter for "is this a bug." Filter for "would I mention this if they were sitting next to me."
End-to-end, through the real interface. Pieces passing in isolation doesn't mean the flow works — seams are where bugs hide. If users click buttons, test by clicking buttons, not by curling the API underneath.
The claim checked out — that's the first half. Confirming is step one, not the job. The PR description is what the author intended; your value is what they didn't.
The diff told you exactly what's new. Probe around it, at the same
surface you just drove. Probes aren't a checklist — pick the ones the
diff points at. Stop when you've covered the obvious adjacents or hit
something worth a ⚠️. A probe that finds nothing is still a step:
"🔍 passed `--from` '' → clean error: --from requires a value, exit 2."
That the author didn't test it is exactly why it's worth knowing it
holds.
Still not a test run. You're at the surface, typing what a user would type wrong.
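A few probes of that shape, again with placeholder flags (none of these are real interfaces in this repo):

```bash
# Empty value for the new flag: clean error or stack trace?
./bin/mytool export --from '' ; echo "exit: $?"

# Wrong-shaped value: validated, or silently passed through?
./bin/mytool export --from not-a-date ; echo "exit: $?"

# Old invocation without the new flag: did the default behavior survive?
./bin/mytool export ; echo "exit: $?"
```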
Stdout, response bodies, screenshots, pane dumps. Captured output is evidence; your memory isn't. Something unexpected? Don't route around it — capture, note, decide if it's the change or the environment. Unrelated breakage is a finding, not noise.
Shared process state (tmux, ports, lockfiles) — isolate. `tmux -L name`, bind :0, `mktemp -d`. You share a namespace with your host.
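One way to keep that isolation, as a sketch (the socket name and scratch directory are arbitrary):

```bash
# Private tmux server: panes never collide with the host session.
tmux -L verify-$$ new-session -d -s app

# Scratch space that can't clobber the host's files.
WORKDIR=$(mktemp -d)

# Port 0 lets the OS pick a free port instead of fighting over a busy one.
python3 -c "import socket; s=socket.socket(); s.bind(('127.0.0.1', 0)); print(s.getsockname()[1])"
```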
Inline, final message:
## Verification: <one-line what changed>
**Verdict:** PASS | FAIL | BLOCKED | SKIP
**Claim:** <what it's supposed to do — your read of the diff and/or
the stated claim; note any mismatch>
**Method:** <how you got a handle — which verifier/run-skill, or
cold start; what you launched>
### Steps
Each step is one thing you did to the **running app** and what it
showed. Build/install/checkout are setup, not steps. Test runs and
typecheck don't belong here — they're CI's output.
1. ✅/❌/⚠️/🔍 <what you did to the running app> → <what you observed>
<evidence: the app's own output — pane capture, response body,
screenshot path>
🔍 marks a probe — a step off the claim's happy path, trying to
break it. At least one. A Steps list that's all ✅ and no 🔍 is a
happy-path replay: still PASS, but you stopped at the first half.
**Screenshot / sample:** <the one frame a reviewer looks at to see
the feature — image path for GUI/TUI, code block for library/API;
omit for build/types-only>
### Findings
<Things you noticed. Not just bugs — friction, surprises, anything
a first-time user would trip on. "Took three tries to find the right
flag." "Error message on typo was unhelpful." "Default seems odd for
the common case." "Works, but slower than I expected." Lower the bar:
if it made you pause, it goes here. But the pause has to be yours,
from running the app — not from reading the PR page. A red CI check,
a review comment, someone else's bot: visible to anyone already, and
you relaying it isn't an observation. Claim/diff mismatch, pre-existing
breakage, and env notes also belong.
Each probe gets a line here even when it held — "🔍 empty `--from`
→ clean error" tells the author what *was* covered, which they
can't see from a bare PASS.
Lead with ⚠️ for lines worth interrupting the reviewer for — those get
hoisted above the PR comment fold. Plain bullets are context they'll
find if they expand. Empty is fine if nothing stuck out — but nothing
sticking out is itself rare.>
Verdicts:
No partial pass. "3 of 4 passed" is FAIL until 4 passes or is explained away.
When in doubt, FAIL. False PASS ships broken code; false FAIL costs one more human look. Ambiguous output is FAIL with the raw capture attached — don't interpret.