Validates AI agent claims like 'tests pass' or 'fixed' against evidence trails and tool outputs. Detects stubs and unproven assertions. Auto-triggers at workflow end.
Install:

`npx claudepluginhub rune-kit/rune --plugin @rune/analytics`

This skill uses the workspace's default tool permissions.
The lie detector for agent claims. Validates that what an agent says it did actually happened — with evidence. Catches the #1 failure mode in AI coding: claiming completion without proof.
Triggers:
- Verifies task completion in /dev workflow Phase 7 by enforcing fresh automated test runs, runtime evidence review, and a spec re-read.
- Mandates fresh execution of verification commands (tests, builds, lints) and output review before any completion claim like 'done', 'tests pass', or 'fixed'.
- Enforces running verification commands (tests, linters, builds) before claiming work complete, fixed, or passing, and before commits or PRs. Ensures evidence-based assertions.
Every claim requires evidence. No evidence = UNCONFIRMED = BLOCK. "I ran the tests and they pass" without stdout = UNCONFIRMED. "I fixed the bug" without before/after diff = UNCONFIRMED. "Build succeeds" without build output = UNCONFIRMED.

Used by: cook in Phase 5d (quality gate); team before merging stream results.
Side effects: None — pure validator. Reads evidence, produces verdict.
Callers:
- cook (L1): Phase 5d — validate completion claims before commit
- team (L1): validate cook reports from parallel streams

Step 1: Parse the agent's output for completion claims.
CLAIM PATTERNS:
"tests pass" / "all tests passing" / "test suite green"
"build succeeds" / "build complete" / "compiles clean"
"no lint errors" / "lint clean"
"fixed" / "resolved" / "bug is gone"
"implemented" / "feature complete" / "done"
"no security issues" / "sentinel passed"
Extract each claim as: { claim: string, source_skill: string }
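A minimal extraction sketch in TypeScript (illustrative: the regex list condenses the patterns above, and real matching would also capture surrounding context):

```typescript
// Claim extraction sketch. Patterns condense the list above; source_skill
// records which skill's output produced the claim.
const CLAIM_PATTERNS: RegExp[] = [
  /\b(?:all )?tests?(?: suite)? (?:pass(?:ing)?|green)\b/i,
  /\bbuild (?:succeeds?|complete)\b|\bcompiles? clean\b/i,
  /\bno lint errors\b|\blint clean\b/i,
  /\b(?:fixed|resolved)\b|\bbug is gone\b/i,
  /\b(?:implemented|feature complete|done)\b/i,
  /\bno security issues\b|\bsentinel passed\b/i,
];

interface ClaimRecord {
  claim: string;        // the matched claim text
  source_skill: string; // which skill's output the claim came from
}

function extractClaims(output: string, sourceSkill: string): ClaimRecord[] {
  const claims: ClaimRecord[] = [];
  for (const pattern of CLAIM_PATTERNS) {
    const match = output.match(pattern);
    if (match) claims.push({ claim: match[0], source_skill: sourceSkill });
  }
  return claims;
}
```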
Step 1b (stub detection): Before checking claims, scan all files created/modified in this workflow for stubs:
Grep for stub patterns in new/modified files:
- "Placeholder" | "TODO" | "Not implemented" | "NotImplementedError"
- Functions whose body is only `return null` / `return {}` / `pass` / `throw`
- Components returning only a single div with no logic
If ANY stub detected → treat the corresponding "implemented"/"done" claim as CONTRADICTED (existence theater: the files exist but contain no real logic). A detection sketch follows below.
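A stub-scan sketch under the markers above (the function-body checks are crude regexes; a real scan would parse the AST):

```typescript
// Stub scan: flag any new/modified file whose content matches a stub marker.
const STUB_MARKERS: RegExp[] = [
  /\bPlaceholder\b/,
  /\bTODO\b/,
  /Not implemented/i,
  /NotImplementedError/,
  /return null;?\s*\}/, // body that only returns null (crude; real check parses the AST)
  /return \{\};?\s*\}/, // body that only returns an empty object
];

function findStubs(files: Map<string, string>): string[] {
  const stubbed: string[] = [];
  for (const [path, content] of files) {
    if (STUB_MARKERS.some((marker) => marker.test(content))) stubbed.push(path);
  }
  return stubbed;
}
```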
Step 1c: If the skill that just ran has a ## Self-Validation section, extract its checklist and treat each item as an implicit claim:
For each Self-Validation check in the skill's SKILL.md:
1. Read the check (e.g., "at least one assertion per test")
2. Look for evidence in tool output that this check was satisfied
3. If evidence found → add as CONFIRMED claim
4. If no evidence → add as UNCONFIRMED claim ("Self-Validation: [check] — no evidence")
Why: Self-Validation catches domain-specific quality issues that generic claim matching (Step 2) cannot detect. A test skill knows "no assertions = useless test" but completion-gate doesn't — unless the skill's Self-Validation tells it to check.
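A sketch of the checklist extraction, assuming the section is a standard markdown H2 with bullet items (the layout is an assumption; SKILL.md formats may vary):

```typescript
// Extract Self-Validation checklist items from a SKILL.md. Assumes the
// section is an H2 ("## Self-Validation") containing "- " bullet items.
function extractSelfValidationChecks(skillMd: string): string[] {
  const afterHeading = skillMd.split(/^## Self-Validation\s*$/m)[1];
  if (afterHeading === undefined) return []; // no Self-Validation section
  const body = afterHeading.split(/^## /m)[0]; // stop at the next H2
  return body
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.startsWith("- "))
    .map((line) => line.slice(2).trim()); // each item becomes an implicit claim
}
```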
If a skill has Self-Validation and ANY check is UNCONFIRMED or CONTRADICTED → the overall verdict cannot be CONFIRMED, even if all explicit claims pass.

Step 1d (Execution Loop Audit): Before validating claims, audit the agent's tool call pattern for execution loops that indicate the agent was stuck but didn't report it:
Classify the agent's tool calls from this workflow into two categories:
| Category | Tools | Expected in Phase 4 |
|---|---|---|
| Observation | Read, Grep, Glob, Bash(grep/ls/cat) | <40% of calls |
| Effect | Write, Edit, Bash(build/test/npm) | >60% of calls |
Loop patterns to detect:
| Pattern | Detection | Verdict Impact |
|---|---|---|
| Observation chain: 6+ consecutive observation tools in Phase 4 | Count longest observation-only streak | Add WARN: "Agent had {N}-call observation streak during implementation — possible analysis paralysis" |
| Low effect ratio: <20% effect calls during Phase 4 | effect_calls / total_calls | Add WARN: "Only {X}% of Phase 4 calls were writes — agent may have been stuck" |
| Repeating tool pattern: Same tool+args called 3+ times | Hash tool+args, count duplicates | Add WARN: "Agent called {tool}({args}) {N} times — possible loop" |
| Budget overrun: Phase 4 exceeded 50 tool calls for a single-file task | Count Phase 4 calls vs files changed | Add WARN: "50+ tool calls for {N} files changed — disproportionate effort" |
Scoring impact: Loop warnings don't change individual claim verdicts but ARE included in the Completion Gate Report under a new ### Execution Efficiency section. This gives the calling orchestrator signal about whether the agent's process was healthy, not just whether the output was correct.
Skip if: Nano/Fast rigor — not enough tool calls to meaningfully analyze.
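A loop-audit sketch implementing the three detectors above (one simplification: all Bash calls are treated as effects, whereas the classification table splits Bash by subcommand):

```typescript
// Execution loop audit. Simplification: all Bash calls count as effects,
// whereas the classification table splits Bash by subcommand.
interface ToolCall { tool: string; args: string; }

const OBSERVATION_TOOLS = new Set(["Read", "Grep", "Glob"]);

function auditLoops(calls: ToolCall[]): string[] {
  const warnings: string[] = [];

  // Effect ratio: <20% writes/executions suggests the agent was stuck.
  const effects = calls.filter((c) => !OBSERVATION_TOOLS.has(c.tool)).length;
  const ratio = calls.length > 0 ? effects / calls.length : 1;
  if (ratio < 0.2) {
    warnings.push(`Only ${Math.round(ratio * 100)}% of calls were effects`);
  }

  // Longest observation-only streak: 6+ suggests analysis paralysis.
  let streak = 0;
  let longest = 0;
  for (const c of calls) {
    streak = OBSERVATION_TOOLS.has(c.tool) ? streak + 1 : 0;
    longest = Math.max(longest, streak);
  }
  if (longest >= 6) warnings.push(`${longest}-call observation streak`);

  // Repeating tool+args: the same exact call 3+ times suggests a loop.
  const counts = new Map<string, number>();
  for (const c of calls) {
    const key = `${c.tool}(${c.args})`;
    const n = (counts.get(key) ?? 0) + 1;
    counts.set(key, n);
    if (n === 3) warnings.push(`${key} called 3+ times`);
  }

  return warnings;
}
```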
Step 2: For each claim, look for corresponding evidence in the conversation context:
| Claim Type | Required Evidence | Where to Find |
|---|---|---|
| "tests pass" | Test runner stdout with pass count | Bash output from test command |
| "build succeeds" | Build command stdout showing success | Bash output from build command |
| "lint clean" | Linter stdout (even if empty = 0 errors) | Bash output from lint command |
| "fixed" | Git diff showing the change + test proving fix | Edit/Write tool calls + test output |
| "implemented" | Files created/modified matching the plan | Write/Edit tool calls vs plan |
| "no security issues" | Sentinel report with PASS verdict | Sentinel skill output |
| "coverage ≥ X%" | Coverage tool output with actual percentage | Test runner with coverage flag |
Step 3: For each claim + evidence pair:
IF evidence exists AND evidence supports claim:
→ CONFIRMED
IF evidence exists BUT contradicts claim:
→ CONTRADICTED (most serious — agent is wrong)
IF no evidence found:
→ UNCONFIRMED (agent may be right but didn't prove it)
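The three rules as a function (deciding whether evidence actually supports a claim is the judgment step; here it arrives as a pre-computed boolean):

```typescript
// Verdict per claim. Whether evidence supports the claim is the judgment
// step (quote matching); here it arrives as a pre-computed boolean.
type Verdict = "CONFIRMED" | "UNCONFIRMED" | "CONTRADICTED";

function judgeClaim(evidence: string | null, supportsClaim: boolean): Verdict {
  if (evidence === null) return "UNCONFIRMED"; // agent may be right, but no proof
  return supportsClaim ? "CONFIRMED" : "CONTRADICTED";
}
```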
3-Axis verification — categorize each claim into one of three axes, then ensure all axes are covered:
| Axis | Question | Example Claims |
|---|---|---|
| Completeness | Were all planned tasks done? All specs implemented? | "implemented feature X", "all TODO items done", "migration created" |
| Correctness | Does output match spec intent? Do tests verify real behavior? | "tests pass", "build succeeds", "lint clean", "fixed the bug" |
| Coherence | Does it follow project patterns? Consistent with existing code? | "follows conventions", "uses existing patterns", "no new deps needed" |
If an axis has ZERO claims → flag as gap: "No [Completeness/Correctness/Coherence] evidence found — agent may have skipped this dimension."
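A coverage-check sketch (the keyword buckets are illustrative; real classification would use the example claims in the table as anchors):

```typescript
// Axis coverage check: bucket claims by keyword, flag any empty axis.
const AXIS_KEYWORDS: Record<string, RegExp> = {
  Completeness: /\b(implemented|done|created|migration)\b/i,
  Correctness: /\b(tests? pass|build succeeds|lint clean|fixed)\b/i,
  Coherence: /\b(conventions|patterns|deps)\b/i,
};

function findAxisGaps(claims: string[]): string[] {
  return Object.entries(AXIS_KEYWORDS)
    .filter(([, keywords]) => !claims.some((claim) => keywords.test(claim)))
    .map(([axis]) => `No ${axis} evidence found — agent may have skipped this dimension.`);
}
```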
Adversarial validation checklist (run AFTER initial verdicts):
- Partial completion: is mostly-done work being claimed as fully "implemented"?
- Scope mismatch: does what was built match what was actually asked for?
- Evidence-claim alignment: does each cited quote actually support the claim it is attached to?
- Zero issues found is itself a red flag: apply a skeptic sweep to the weakest 2 claims.
## Completion Gate Report
- **Status**: CONFIRMED | UNCONFIRMED | CONTRADICTED
- **Claims Checked**: [count]
- **Confirmed**: [count] | **Unconfirmed**: [count] | **Contradicted**: [count]
### Claim Validation
| # | Claim | Evidence | Verdict |
|---|---|---|---|
| 1 | "All tests pass" | Bash: `npm test` → "42 passed, 0 failed" | CONFIRMED |
| 2 | "Build succeeds" | No build command output found | UNCONFIRMED |
| 3 | "No lint errors" | Bash: `npm run lint` → "3 errors" | CONTRADICTED |
### Gaps (if any)
- Claim 2: Re-run `npm run build` and capture output
- Claim 3: Agent claimed clean but lint shows 3 errors — fix required
### Verdict
UNCONFIRMED — 1 claim lacks evidence, 1 contradicted. Cannot proceed to commit.
Step 4.5: When validating a completed phase in a multi-phase plan, check for integration gaps between phases:
Orphaned exports — files/functions created in this phase that claim to be used by future phases (see ## Cross-Phase Context → Exports) but are not yet importable:
Grep for the export name in the current codebase:
- If export exists AND is importable → CONFIRMED
- If export exists but has wrong signature vs phase file contract → CONTRADICTED
- Expected export missing entirely → UNCONFIRMED ("Phase N claims to export X but X not found")
Uncalled routes — API endpoints added in this phase but not yet wired to any frontend/consumer.
Auth gaps — new endpoints or pages without authentication/authorization. Grep for route handlers without auth middleware.
E2E flow trace — for the primary user flow this phase enables, trace entry point → logic → data layer → response; a missing link anywhere in the chain means the flow is broken even if each piece individually passes.
This step is OPTIONAL for single-phase tasks and MANDATORY for multi-phase master plans.
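A sketch of the orphaned-export check (`grepCodebase` and `signatureOf` are hypothetical stand-ins for real codebase search and signature extraction):

```typescript
// Orphaned-export check. grepCodebase and signatureOf are hypothetical
// stand-ins for real codebase search and signature extraction.
type ExportVerdict = "CONFIRMED" | "CONTRADICTED" | "UNCONFIRMED";

function checkExport(
  exportName: string,
  expectedSignature: string,
  grepCodebase: (needle: string) => string | null,
  signatureOf: (hit: string) => string,
): ExportVerdict {
  const hit = grepCodebase(exportName);
  if (hit === null) return "UNCONFIRMED"; // claimed export not found at all
  if (signatureOf(hit) !== expectedSignature) return "CONTRADICTED"; // contract broken
  return "CONFIRMED"; // exists and importable with the expected shape
}
```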
Before emitting verdict, verify evidence quality:
| Evidence Quality | Verdict |
|---|---|
| Exit code 0 only, no output read | INSUFFICIENT — re-run and read output |
| Output read but no quote matched to claim | UNCONFIRMED — cite specific evidence |
| Quote matches claim exactly | CONFIRMED |
| Quote contradicts claim | CONTRADICTED |
Step 5.5: When validating a phase within a master plan, diff actual changes against the phase plan file:
1. Glob for `.rune/plan-*-phase*.md` matching the current phase.
2. Parse its `## Files Touched` section — build a list of expected files (new/modify/delete).
3. Parse its `## Tasks` section — build a list of all `- [ ]` and `- [x]` items.
4. Compare against `git diff --name-only` (or a file system scan).

| Check | Status |
|---|---|
| Unchecked task in phase plan (- [ ] still exists) | INCOMPLETE — task was not done |
| File in plan's "Files Touched" but not in actual diff | MISSING — planned file was never touched |
| File in actual diff but NOT in plan's "Files Touched" | UNPLANNED — scope creep (warn, not block) |
| All tasks [x] AND all planned files touched | PLAN-ALIGNED |
Plan Diff: PLAN-ALIGNED | INCOMPLETE (2 unchecked tasks) | MISSING (1 file never touched)
Skip if: No active phase plan found (single-task, no master plan). MANDATORY for multi-phase master plans.
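A plan-diff sketch, assuming the plan's Files Touched paths are already parsed out of the phase file:

```typescript
// Plan diff. Inputs are assumed pre-parsed from the phase plan file:
// plannedFiles from ## Files Touched, changedFiles from `git diff --name-only`.
interface PlanDiff {
  missingFiles: string[];   // planned but never touched → MISSING
  unplannedFiles: string[]; // touched but not planned → UNPLANNED (warn, not block)
}

function diffPlan(plannedFiles: string[], changedFiles: string[]): PlanDiff {
  const changed = new Set(changedFiles);
  const planned = new Set(plannedFiles);
  return {
    missingFiles: plannedFiles.filter((f) => !changed.has(f)),
    unplannedFiles: changedFiles.filter((f) => !planned.has(f)),
  };
}
```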
ALL claims CONFIRMED → overall CONFIRMED (proceed)
ANY claim CONTRADICTED → overall CONTRADICTED (BLOCK — fix the contradiction)
ANY claim UNCONFIRMED (and none CONTRADICTED) → overall UNCONFIRMED (BLOCK — provide evidence)
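The rollup as a function (CONTRADICTED dominates, then UNCONFIRMED):

```typescript
// Overall rollup: CONTRADICTED dominates, then UNCONFIRMED, else CONFIRMED.
type OverallVerdict = "CONFIRMED" | "UNCONFIRMED" | "CONTRADICTED";

function overallVerdict(verdicts: OverallVerdict[]): OverallVerdict {
  if (verdicts.includes("CONTRADICTED")) return "CONTRADICTED"; // BLOCK: fix it
  if (verdicts.includes("UNCONFIRMED")) return "UNCONFIRMED";   // BLOCK: provide evidence
  return "CONFIRMED";                                           // proceed
}
```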
Completion Gate Report with status (CONFIRMED/UNCONFIRMED/CONTRADICTED), claim validation table, gaps, and verdict. See Step 4 Report above for full template.
| Failure Mode | Severity | Mitigation |
|---|---|---|
| Agent rephrases claim to avoid detection | MEDIUM | Pattern matching covers common phrasings — extend as new patterns emerge |
| Evidence from a DIFFERENT test run (stale) | HIGH | Check that evidence timestamp/context matches current changes |
| Agent pre-generates evidence by running commands proactively | LOW | This is actually GOOD behavior — we want agents to provide evidence |
| Completion-gate itself claims "all confirmed" without evidence | CRITICAL | Gate report MUST include the evidence table — no table = report is invalid |
| Existence Theater — agent creates files but they're stubs | HIGH | Step 1b stub detection: grep for Placeholder/TODO/NotImplementedError in new files |
| Cross-phase integration gaps — exports exist but wrong signature | HIGH | Step 4.5: verify exports match Code Contracts from phase file |
| Phase complete but E2E flow broken — missing link in the chain | MEDIUM | Step 4.5 E2E flow trace: entry → logic → data → response must all be connected |
| Rubber-stamping — all CONFIRMED without scrutiny | HIGH | Default-FAIL mindset: actively seek 3-5 issues. Zero issues = red flag, apply skeptic sweep on weakest 2 claims |
| Partial completion claimed as full — 80% done but "implemented" | HIGH | Adversarial checklist: check for partial completion, scope mismatch, evidence-claim alignment |
| Self-Validation skipped — skill has checks but gate ignores them | HIGH | Step 1c: extract Self-Validation from skill's SKILL.md, treat each as implicit claim. Missing = UNCONFIRMED |
| Plan says done but phase file has unchecked tasks | HIGH | Step 5.5: diff changed files vs phase plan's Files Touched + Tasks sections |
| Agent stuck in observation loop but claims "implemented" | HIGH | Step 1d: Execution Loop Audit detects low effect ratio and observation chains — flags in report even if claims pass |
~500-1000 tokens input, ~200-500 tokens output. Haiku for speed. Runs frequently as part of cook's quality phase.