From harness-claude
Performs 3-level deep verification (EXISTS, SUBSTANTIVE, WIRED) of completed tasks with fresh evidence only. Use for milestones, PRs, or post-quick-gate audits.
npx claudepluginhub intense-visions/harness-engineering --plugin harness-claude

This skill uses the workspace's default tool permissions.
> 3-level evidence-based verification. No completion claims without fresh evidence. "Should work" is not evidence.
Verifies feature completion by writing automated tests against SPEC.md, running commands for fresh evidence, and confirming outputs per Iron Law of Verification.
Enforces evidence-based verification by running tests, lints, type checks, builds, smoke tests and capturing output before claiming code, configs, docs, or tasks are done or ready.
Runs lint, type-check, tests, and build checks for Node.js/TS, Python, Rust, Go, Java projects to verify code health after changes.
Invoke when on_commit or on_pr triggers fire and verification is needed.

| Tier | Skill | When | What |
|---|---|---|---|
| Quick gate | harness-execution (built-in) | After every task | test + lint + typecheck + build + harness validate |
| Deep audit | harness-verification (this) | Milestones, PRs, on-demand | EXISTS -> SUBSTANTIVE -> WIRED |
Use deep audit for milestone boundaries, before PRs, or when the quick gate passes but something feels wrong. Do NOT invoke after every individual task.
No completion claim may be made without fresh verification evidence collected in THIS session.
Cached results, remembered outcomes, and "it worked last time" are not evidence. Run the checks. Read the output. Report what you observed.
The words "should", "probably", "seems to", and "I believe" are forbidden in verification reports. Replace with "verified: [evidence]" or "not verified: [what is missing]."
When invoked by autopilot (or with explicit arguments), resolve paths before starting:
If a session-slug argument is provided, set {sessionDir} = .harness/sessions/<session-slug>/ and pass it to gather_context({ session: "<session-slug>" }). All handoff writes go to {sessionDir}/handoff.json. When no arguments are provided (standalone invocation), the session slug is unknown: omit it from gather_context and fall back to global .harness/ paths.
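The fallback can be sketched in POSIX shell; the slug arriving as the first positional argument is an assumption made for illustration:

```shell
# Resolve the session directory from an optional session-slug argument;
# with no slug, fall back to the global .harness/ paths.
SLUG="${1:-}"
if [ -n "$SLUG" ]; then
  SESSION_DIR=".harness/sessions/$SLUG"
else
  SESSION_DIR=".harness"
fi
HANDOFF_PATH="$SESSION_DIR/handoff.json"
echo "resolved handoff path: $HANDOFF_PATH"
```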
When you encounter an unknown during verification, classify it immediately:
Do not mark PASS with unstated assumptions. An assumption-laden PASS is a false positive.
Review-never-fixes: Verification identifies gaps. Verification never fills them. If you find a stub, missing test, or unwired artifact, record it as a FAIL with evidence. Do not implement the fix — that is the executor's job. A verifier who fixes is no longer verifying independently.
Before running verification levels, load session context:
gather_context({
path: "<project-root>",
intent: "Verify phase deliverables",
skill: "harness-verification",
session: "<session-slug-if-provided>",
include: ["state", "learnings", "validation"]
})
If a session slug is known, include the session parameter to scope reads to .harness/sessions/<slug>/. Otherwise omit it to fall back to .harness/. Use returned learnings to check for known failures relevant to artifacts being verified.
For every artifact that was supposed to be created or modified:
Confirm it with ls, stat, or a read. Do not assume it exists because you wrote it -- writes can fail silently.

[EXISTS: PASS] path/to/file.ts (247 lines)
[EXISTS: FAIL] path/to/missing-file.ts -- file not found
Do not proceed to Level 2 until all Level 1 checks pass. Missing files must be created first.
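A Level 1 sweep can be sketched as a small shell loop; the /tmp paths here are illustrative only:

```shell
# Level 1 (EXISTS): confirm each expected artifact is present on disk.
# Report PASS with a line count, or FAIL if the file is missing.
check_exists() {
  for path in "$@"; do
    if [ -f "$path" ]; then
      lines=$(wc -l < "$path" | tr -d ' ')
      echo "[EXISTS: PASS] $path ($lines lines)"
    else
      echo "[EXISTS: FAIL] $path -- file not found"
    fi
  done
}

printf 'line1\nline2\n' > /tmp/present.ts
check_exists /tmp/present.ts /tmp/missing.ts
```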
For every artifact that passed Level 1:
Read the full content and scan for stub patterns: TODO/FIXME comments, throw new Error('not implemented'), () => {}, return null/undefined/{} as sole logic, pass (Python), placeholder/stub/mock in non-test code, functions with only a descriptive comment, interfaces defined but never implemented.

[SUBSTANTIVE: PASS] path/to/file.ts -- real implementation, no stubs
[SUBSTANTIVE: FAIL] path/to/file.ts -- TODO on line 34, empty handler on line 67
Do not proceed to Level 3 until all Level 2 checks pass. Stubs must be replaced with real implementations first.
For every artifact that passed Level 2:
Confirm it is imported and exercised by other code, covered by tests that actually run (not skipped via .skip/xit), and passes harness validate; verify integration with project constraints.

[WIRED: PASS] path/to/file.ts -- imported by 3 files, tested in file.test.ts (4 tests, all pass)
[WIRED: FAIL] path/to/file.ts -- exported but not imported by any other file
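One possible sketch of the import trace using grep, assuming TypeScript sources under a src/ directory and relative imports by module basename (both project-specific assumptions; the demo files are illustrative):

```shell
# Level 3 (WIRED): count how many other files import a given module.
# Zero importers for an exported module is dead code -> FAIL.
mkdir -p /tmp/wired-demo/src
cat > /tmp/wired-demo/src/user-service.ts <<'EOF'
export class UserService {}
EOF
cat > /tmp/wired-demo/src/routes.ts <<'EOF'
import { UserService } from './user-service';
EOF

module="user-service"
importers=$(grep -rl --include='*.ts' "from './$module'" /tmp/wired-demo/src \
  | grep -v "/$module\.ts" | wc -l | tr -d ' ')
if [ "$importers" -gt 0 ]; then
  echo "[WIRED: PASS] $module -- imported by $importers file(s)"
else
  echo "[WIRED: FAIL] $module -- exported but not imported by any other file"
fi
```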
Run across all changed files as a final check:
Markers: TODO, FIXME, XXX, HACK, PLACEHOLDER, NOT_IMPLEMENTED
Code: () => {}, return null (sole body), pass, raise NotImplementedError
Tests: .skip, xit, xdescribe, @pytest.mark.skip, pending
Any match is a verification failure. Fix it or document why it is acceptable (e.g., "TODO tracked in issue #123, out of scope").
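The scan can be sketched as a single grep pass; the marker list comes from above, while the /tmp demo directory stands in for the real set of changed files (e.g., those listed by git diff --name-only):

```shell
# Anti-pattern scan: flag stub markers and skipped tests in changed files.
# Any match is a verification failure unless explicitly justified.
mkdir -p /tmp/scan-demo
cat > /tmp/scan-demo/handler.ts <<'EOF'
// TODO: implement validation
export const handler = () => {};
EOF

matches=$(grep -rnE \
  'TODO|FIXME|XXX|HACK|PLACEHOLDER|NOT_IMPLEMENTED|\.skip|xit\(|xdescribe\(|@pytest\.mark\.skip' \
  /tmp/scan-demo || true)
if [ -n "$matches" ]; then
  echo "ANTI-PATTERN MATCHES:"
  echo "$matches"
  echo "verdict: FAIL"
else
  echo "verdict: PASS"
fi
```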
After all three levels, produce a structured gap report:
## Verification Report
### Level 1: EXISTS
- [PASS] path/to/file-a.ts (120 lines)
- [FAIL] path/to/file-c.ts -- not found
### Level 2: SUBSTANTIVE
- [PASS] path/to/file-a.ts -- real implementation
- [FAIL] path/to/file-b.ts -- TODO on line 22
### Level 3: WIRED
- [PASS] path/to/file-a.ts -- imported, tested, harness passes
- [NOT CHECKED] path/to/file-b.ts -- blocked by Level 2 failure
### Anti-Pattern Scan
- path/to/file-b.ts:22 -- TODO: implement validation
### Gaps
1. path/to/file-c.ts must be created
2. path/to/file-b.ts:22 must be implemented (not stub)
### Verdict: INCOMPLETE -- 2 gaps must be resolved
Use severity markers for critical findings: **[CRITICAL]** for blocking issues, **[IMPORTANT]** for non-blocking concerns.
After producing the report, request acceptance:
emit_interaction({
path: "<project-root>",
type: "confirmation",
confirmation: {
text: "Verification report: <VERDICT>. Accept and proceed?",
context: "<N artifacts checked, N gaps found>",
impact: "Accepting proceeds to code review. Declining requires gap resolution.",
risk: "<low if PASS, high if gaps remain>"
}
})
Write handoff to the session-scoped path when session slug is known, otherwise fall back to global:
Session-scoped path: .harness/sessions/<session-slug>/handoff.json. Global fallback: .harness/handoff.json.

[DEPRECATED] Writing to .harness/handoff.json is deprecated. In autopilot sessions, always write to .harness/sessions/<slug>/handoff.json to prevent cross-session contamination.
{
"fromSkill": "harness-verification",
"phase": "COMPLETE",
"summary": "<verdict summary>",
"artifacts": ["<verified file paths>"],
"verdict": "pass | fail",
"gaps": ["<gap descriptions if any>"]
}
Session summary (if session known): update via writeSessionSummary with skill, status ("Verification <PASS|FAIL|INCOMPLETE>. <N> artifacts, <N> gaps."), keyContext, and nextStep.
If verdict is PASS: Call emit_interaction:
{
"type": "transition",
"transition": {
"completedPhase": "verification",
"suggestedNext": "review",
"reason": "Verification passed at all 3 levels",
"artifacts": ["<verified file paths>"],
"requiresConfirmation": false,
"summary": "Verification passed: <N> artifacts. EXISTS, SUBSTANTIVE, WIRED all passed.",
"qualityGate": {
"checks": [
{ "name": "level1-exists", "passed": true },
{ "name": "level2-substantive", "passed": true },
{ "name": "level3-wired", "passed": true },
{ "name": "anti-pattern-scan", "passed": true },
{ "name": "harness-validate", "passed": true }
],
"allPassed": true
}
}
}
Immediately invoke harness-code-review without waiting for user input.
If verdict is FAIL or INCOMPLETE: Do NOT emit a transition. Surface gaps for resolution. Handoff records gaps for future reference.
When verifying a bug fix:
1. Confirm a regression test exists for the bug.
2. Run the test with the fix in place -- it must pass.
3. Remove the fix (git stash or comment out).
4. Run the test without the fix -- it must FAIL.
5. Restore the fix (git stash pop or uncomment) and run the test again -- it must pass.

If step 4 passes (the test does not fail without the fix), the test is invalid. Rewrite it.
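The loop can be simulated end to end; here a flag file stands in for the fix and a stub function for the test command, with comments noting the real-world equivalents (git stash / git stash pop and the project's actual test runner):

```shell
# Simulated fix-removal loop. In a real repo, remove_fix/restore_fix
# would be `git stash` / `git stash pop`, and run_test would invoke
# the project's real test command.
FIX=/tmp/bugfix-demo-fix-applied
touch "$FIX"

run_test() { [ -f "$FIX" ]; }      # test passes only when the fix is present
remove_fix() { rm -f "$FIX"; }     # stands in for: git stash
restore_fix() { touch "$FIX"; }    # stands in for: git stash pop

run_test && echo "step 1: passes with fix"
remove_fix
if run_test; then
  echo "INVALID TEST: passes without the fix -- rewrite it"
else
  echo "step 2: fails without fix (good)"
fi
restore_fix
run_test && echo "step 3: passes with fix restored"
```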
Every pass/fail assertion in the verification report MUST include concrete evidence. "Should", "probably", "seems to" are forbidden by the Iron Law -- this defines HOW to cite.
Every verification claim MUST use one of:
- file:line with observed content (e.g., src/services/user-service.ts:42 -- "create method validates email")
- harness validate and harness check-deps output
- the evidence session section per level via manage_state

When to cite: at every level. L1 cites file reads. L2 cites specific line content. L3 cites imports, test output, and harness output. Each [PASS]/[FAIL] marker must be accompanied by its evidence.
Uncited claims: Any verification assertion without direct evidence is a verification failure. This skill does not use [UNVERIFIED] -- if evidence cannot be produced, verdict is FAIL or INCOMPLETE.
Verification checklists passed to subagents or used internally MUST use compressed single-line format. Each check is one line with pipe-delimited fields:
level|check-name|pass-criterion
Example (Level 2 SUBSTANTIVE rubric):
L2|no-stubs|No TODO/FIXME/throw-not-implemented in production code
L2|no-empty-bodies|No empty function bodies, () => {}, or return null as sole logic
L2|spec-complete|All behaviors specified in spec have corresponding implementation
L2|real-logic|Functions contain meaningful logic, not just hardcoded returns
Why: Verbose checklist prose inflates verification context without improving accuracy. Dense single-line rubrics give the same signal in fewer tokens, leaving more budget for reading and analyzing actual file content.
Rules:
| Flag | Corrective Action |
|---|---|
| "Tests passed earlier, so I just need to check the files exist" | STOP. Iron Law: fresh evidence in THIS session. "Earlier" is cached — run the checks now. |
| "The implementation looks substantive at a glance" | STOP. Level 2 requires thorough reading, not glancing. Stubs designed to look real (e.g., functions with only a log statement) are the whole reason L2 exists. |
| "The artifact is exported so it must be wired" | STOP. Export without import is dead code. Trace the actual usage chain: import, call, test, pass. "Must be" is not evidence. |
| "// stubbed for now" or "// implementation pending" in production code | STOP. These are Level 2 failures. Do not proceed to Level 3. Do not fix them yourself — record as FAIL and report. |
Mechanical checks (tests, lint, types) are binary pass/fail. No tolerance.
For behavioral verification (convention adherence, style guides), accept threshold-based results: 4/5 runs = pass, 2/5 runs = fail. If a convention fails >40% of the time, the convention needs rewriting -- blame the instruction, not the executor.
| Rationalization | Reality |
|---|---|
| "Tests passed earlier, no need to re-run" | Iron Law forbids cached results. All evidence must be fresh in THIS session. |
| "File exists and has code, skip thorough read for Level 2" | Level 2 requires thorough reading. Scanning for TODO, throw Error, empty functions catches stubs that look real. |
| "Artifact is imported by a test file, so passes Level 3" | Import is necessary but not sufficient. Test must assert on behavior and not be skipped. |
| "Verification report probably looks fine from memory" | "Should", "probably", "seems to", "I believe" are forbidden. Replace with "verified: [evidence]" or "not verified: [missing]." |
| "I found a stub so I'll quickly implement it to make verification pass" | Verification identifies gaps — verification never fills them. Record the stub as a FAIL. The executor fixes it. A verifier who implements is no longer independent. |
| "The spec only mentions 3 behaviors but I'll verify 5 to be thorough" | Verify what the spec requires, not what you think it should require. Extra verification against unstated requirements conflates verification with spec review. |
Task: "Create UserService with CRUD operations."
## Verification Report
### Level 1: EXISTS
- [PASS] src/services/user-service.ts (189 lines)
- [PASS] src/services/user-service.test.ts (245 lines)
- [PASS] src/services/index.ts (updated -- exports UserService)
### Level 2: SUBSTANTIVE
- [PASS] user-service.ts -- 4 CRUD methods with validation, error handling, DB calls
- [PASS] user-service.test.ts -- 12 tests: happy paths, errors, edge cases (none skipped)
### Level 3: WIRED
- [PASS] user-service.ts -- imported by src/api/routes/users.ts, 12 tests pass
- [PASS] harness validate -- passes
- [PASS] harness check-deps -- no boundary violations
### Anti-Pattern Scan
- No matches
### Verdict: COMPLETE -- all artifacts verified at all levels
- If harness validate or harness check-deps cannot run, this is blocking. Do not skip -- fix tooling or escalate.
- gather_context -- Load session-scoped state, learnings, and validation before Level 1. The session parameter scopes reads to the session directory.
- harness validate -- Run during Level 3 (WIRED) to verify artifact integration.
- harness check-deps -- Run during Level 3 (WIRED) to verify dependency boundaries.
- harness check-docs -- Verify documentation was updated for new artifacts. Missing docs for public APIs is a gap.
- check_traceability -- Run during Level 3 (WIRED) to verify every requirement in the spec maps to at least one implemented artifact and test.
- validate_cross_check -- Run after Level 3 against the plan to verify the implementation matches the planned file map and task descriptions.
- check_phase_gate -- Run before producing the verification report to validate the current phase meets all gate criteria before marking complete.
- detect_anomalies -- Run during the Anti-Pattern Scan to identify structural inconsistencies (orphaned files, missing tests, unusual coupling) introduced during execution.
- emit_interaction -- Auto-transition to harness-code-review on PASS verdict only.
- .harness/sessions/<slug>/ contains handoff.json, state.json, and artifacts.json (spec path, plan path, file lists from execution). Do not write to global .harness/handoff.json when the session slug is known.
- All commands must be run fresh. Do not rely on results from previous sessions or runs.
After verification, append a tagged learning: YYYY-MM-DD [skill:harness-verification] [outcome:pass/fail]: Verified [feature]. [Brief note.]
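Appending the entry might look like the sketch below; the learnings file location is a hypothetical stand-in, so check your project's actual path:

```shell
# Append a tagged learning entry after verification completes.
# /tmp/demo-harness/learnings.md is a hypothetical path for illustration.
mkdir -p /tmp/demo-harness
today=$(date +%Y-%m-%d)
echo "$today [skill:harness-verification] [outcome:pass]: Verified UserService CRUD. All 3 levels passed." \
  >> /tmp/demo-harness/learnings.md
tail -n 1 /tmp/demo-harness/learnings.md
```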