Foundational test review methodology. Loaded by /auditing-python-tests and /auditing-typescript-tests, not invoked directly.
<objective>
THE ADVERSARIAL QUESTION:
How could these tests pass while the assertion remains unfulfilled?
If you can answer that question, the tests are REJECTED.
</objective><quick_start>
PREREQUISITE: Reference /testing for methodology (5 stages, 5 factors, 7 exceptions).
This is a foundational skill. Language-specific skills (/auditing-python-tests, /auditing-typescript-tests) load this first and add language-specific phases.
Review protocol has 4 foundational phases — stop at first rejection:
1. Spec structure validation
2. Evidentiary integrity
3. Lower-level assumptions
4. Decision record compliance
Language-specific skills add their own phases and rejection triggers (e.g. mocking patterns, type annotations, property testing requirements).
When reporting findings, cite the source skill for each finding (this foundational skill or the language-specific skill).
</quick_start>
<verdict> There is no middle ground. No "mostly good." No "acceptable with caveats." A missing comma is REJECT. A philosophical disagreement about test structure is REJECT. If it's not APPROVED, it's REJECT.
</verdict> <context> This skill protects the test suite from phantom evidence. A single evidentiary gap means CI can go green while promised assertions remain unfulfilled. The cost of false approval is infinite; the cost of false rejection is rework. </context><review_protocol> Execute these phases IN ORDER. Stop at first REJECT.
<phase name="spec_structure_validation"> For each assertion in the spec, verify:
1.1 Assertion Format
Assertions MUST use one of five typed formats. No code in specs.
| Type | Quantifier | Test strategy | Format pattern |
|---|---|---|---|
| Scenario | There exists (this case works) | Example-based | Given ... when ... then ... ([test](...)) |
| Mapping | For all over finite set | Parameterized | {input} maps to {output} ([test](...)) |
| Conformance | External oracle | Tool validation | {output} conforms to {standard} ([test](...)) |
| Property | For all over type space | Property-based | {invariant} holds for all {domain} ([test](...)) |
| Compliance | ALWAYS/NEVER rules | Review or test | ALWAYS/NEVER: {rule} ([review]/[test](...)) |
<!-- ✅ CORRECT: Typed assertions with inline test links -->
### Scenarios
- Given a parser configured for strict mode, when invalid input is provided, then a ParseError is raised ([test](tests/test_parser_unit.ext))
### Properties
- Serialization is deterministic: same input always produces the same output ([test](tests/test_serialize_unit.ext))
### Compliance
- ALWAYS: signal writes use non-blocking assignment — PDR-12 two-phase tick ([review](../../12-simulation-execution.pdr.md))
<!-- ❌ REJECT: Code in spec -->
def test_parser():
parser = Parser(strict=True)
...
If spec contains code examples: REJECT. Specs are durable; code drifts.
Assertion type must match test strategy:
| Assertion Type | Required Test Pattern | REJECT if |
|---|---|---|
| Scenario | Example-based tests | Missing concrete inputs/outputs |
| Mapping | Parameterized tests | Only example-based (not all cases covered) |
| Property | Property-based framework | Only example-based (must use property-based) |
| Conformance | Tool validation | Manual checks instead of tool |
| Compliance | [review] or [test] tag | No tag indicating verification method |
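As a hedged illustration of the Mapping row (assuming pytest; `status_name` and its value table are hypothetical stand-ins for project code):

```python
import pytest

# Hypothetical function under test -- a stand-in for real project code.
def status_name(code: int) -> str:
    return {200: "OK", 404: "Not Found", 500: "Internal Server Error"}[code]

# Mapping assertion: "for all over a finite set" demands parameterized
# coverage of EVERY member. A single example (200, "OK") would leave the
# rest of the mapping unverified: REJECT.
@pytest.mark.parametrize("code,expected", [
    (200, "OK"),
    (404, "Not Found"),
    (500, "Internal Server Error"),
])
def test_status_name_mapping(code, expected):
    assert status_name(code) == expected
```

A Property assertion would instead require a property-based framework (e.g. Hypothesis) quantifying over the whole type space, not an enumerated list.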
1.2 Test File Linkage
Inline test links are contractual. Every ([test](...)) link in an assertion must resolve to an actual file. Stale links = REJECT.
Specs may use either format:
- ([test](tests/test_parser_unit.ext)) embedded in assertions
- A test file table listing assertion-to-file links

Both are contractual — every link must resolve.
This is distinct from the Analysis section (stories only), which documents the agent's codebase examination. Analysis references may diverge from implementation as understanding deepens — do NOT reject specs for stale Analysis references.
Check:
[test](path) or [display](path)
# Verify linked files exist (extract paths from inline links or tables)
ls -la {container}/tests/{linked_file}
If link is broken or file missing: REJECT.
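One way to mechanize this check (a sketch — the regex and directory layout are assumptions, not part of the protocol):

```python
import re
from pathlib import Path

def broken_test_links(spec_text: str, container: Path) -> list[str]:
    """Return every linked test path that does not resolve to a real file."""
    # Matches inline links such as ([test](tests/test_parser_unit.ext)).
    paths = re.findall(r"\[(?:test|display)\]\(([^)]+)\)", spec_text)
    return [p for p in paths if not (container / p).is_file()]
```

Any non-empty result is a REJECT.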
1.3 Level Appropriateness
Evidence lives at specific levels. Verify each assertion is tested at the correct level:
| Evidence Type | Minimum Level | Example |
|---|---|---|
| Pure computation/algorithm | 1 | Protocol timing, math correctness |
| Component interaction | 2 | TX→RX loopback, multi-entity simulation |
| Project-specific binary | 2 | Verilator lint, external tool invocation |
| Real credentials/services | 3 | Cloud APIs, payment providers |
If assertion is tested at wrong level: REJECT.
If story-level assertion appears in feature spec: Note as structural issue (stories should be created), but continue review.
GATE 1: Before proceeding to Phase 2, verify:
- Every assertion uses one of the five typed formats and matches its test strategy
- Every test link resolves to an existing file (`ls` for each)
- Every assertion is tested at the appropriate level

If any check fails, STOP and REJECT with detailed findings.
</phase> <phase name="evidentiary_integrity"> For each test file, verify it provides genuine evidence.
2.1 The Adversarial Test
Ask: How could this test pass while the assertion remains unfulfilled?
| Scenario | Verdict |
|---|---|
| Test asserts something other than what assertion specifies | REJECT |
| Test uses hardcoded values that happen to match | REJECT |
| Test doesn't actually exercise the code path | REJECT |
| Test mocks the thing it's supposed to verify | REJECT |
| Test can pass with broken implementation | REJECT |
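The mock-the-SUT row is the most common trap. A minimal Python sketch (all names hypothetical):

```python
import types
from unittest import mock

# Hypothetical code under test -- names are illustrative.
def parse(text: str) -> dict:
    key, _, value = text.partition("=")
    return {key: value}

parser_module = types.SimpleNamespace(parse=parse)  # stands in for a real module

# ❌ REJECT: the test patches the very function it claims to verify, so
# it passes identically whether the real parse() works or is broken.
def test_parse_mocked_sut():
    with mock.patch.object(parser_module, "parse", return_value={"a": "1"}):
        assert parser_module.parse("a=1") == {"a": "1"}

# ✅ Genuine evidence: exercises the real code path with real input.
def test_parse_real():
    assert parse("a=1") == {"a": "1"}
```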
2.2 Dependency Availability
CRITICAL: Missing dependencies MUST FAIL, not skip.
Search for silent skip patterns (use language-specific grep patterns from the language skill).
Evaluate each skip:
| Pattern | Verdict |
|---|---|
| Skip on required project dependency | REJECT - Required dependency must fail |
| Skip on test infrastructure (property lib) | REJECT - Test infrastructure must be present |
| Skip on platform (sys.platform, os.type) | REVIEW - May be legitimate |
| Skip on CI environment variable | REVIEW - What is being skipped? |
The Silent Skip Problem:
Tests that silently skip on required dependencies allow CI to go green with zero verification. This is evidentiary fraud.
If tests silently skip on required dependencies: REJECT.
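One way to make "fail, not skip" concrete (a sketch; `require` is a hypothetical helper, not a prescribed API):

```python
import importlib

def require(module_name: str):
    """Import a required dependency; if missing, fail loudly -- never skip.

    The anti-pattern is pytest.importorskip() on a dependency the project
    requires: the suite silently skips and CI stays green with zero
    verification.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"required dependency {module_name!r} is missing -- "
            "tests must fail, not skip"
        ) from exc
```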
2.3 Harness Verification
If assertion specifies a harness:
- Verify the harness exists (spx/ or project test infrastructure)
- Verify the harness itself is tested

If harness is referenced but doesn't exist or isn't tested: REJECT.
GATE 2: Before proceeding to Phase 3, verify:
- No test can pass while its assertion remains unfulfilled
- No silent skips on required dependencies or test infrastructure
- Referenced harnesses exist and are tested

If any check fails, STOP and REJECT with detailed findings.
</phase> <phase name="lower_level_assumptions"> Features assume stories have tested what can be tested at story level. Capabilities assume features have done their job.
3.1 Check for Lower-Level Specs
# For a feature, check if stories exist
ls -d {feature_path}/*-*.story/ 2>/dev/null
# For a capability, check if features exist
ls -d {capability_path}/*-*.feature/ 2>/dev/null
3.2 Evaluate Assumptions
| Scenario | Action |
|---|---|
| Lower-level specs exist with tests | Verify assumptions align |
| Lower-level specs exist without tests | Note gap, continue review |
| Lower-level specs don't exist | Note structural issue, evaluate if tests are appropriately coarse |
Key principle: Specs are DURABLE. They DEMAND assertions. A spec must NEVER say "stories are pending" or "tests will be added later." If lower-level decomposition is needed, those specs should exist.
If spec contains language about missing/pending specs: REJECT. Specs are not working documents.
Atemporal voice (Durable Map Rule): Specs state product truth. They NEVER narrate code history, current state, or migration plans. Any temporal language is a REJECTION — no section gets a pass.
Temporal patterns to reject in specs:
- "module.py has..." — narrates code state
- "deprecated/old.py does not exist" — narrates filesystem state

Code that doesn't conform to a spec is discovered through code review and test coverage analysis — the spec itself never names files to delete or code to replace.
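The temporal-language check lends itself to a mechanical first pass. A sketch — the marker list borrows "currently" and "the existing" from the rejection triggers and is illustrative, not exhaustive; a human still judges each hit:

```python
import re

# Illustrative temporal markers; extend per project vocabulary.
TEMPORAL = re.compile(
    r"\b(currently|the existing|will be|used to|pending|has been)\b",
    re.IGNORECASE,
)

def temporal_findings(spec_text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that narrate time -- each is a REJECT."""
    return [
        (n, line)
        for n, line in enumerate(spec_text.splitlines(), start=1)
        if TEMPORAL.search(line)
    ]
```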
3.3 Integration Test Assumptions
For integration tests (Level 2), verify they don't duplicate story-level evidence:
| Integration Test Should | Integration Test Should NOT |
|---|---|
| Verify component contracts | Re-test algorithm correctness |
| Verify interoperation | Exhaustively test edge cases |
| Assume story tests passed | Provide coarse coverage of unit concerns |
If integration tests are doing story-level work because stories don't exist: Note as structural issue. Tests may be legitimately coarse in transitional state, but this should be flagged.
GATE 3: Before proceeding to Phase 4, verify:
- Lower-level spec assumptions evaluated (aligned, gap noted, or structural issue flagged)
- No pending/temporal language anywhere in the spec
- Integration tests do not duplicate story-level evidence

If any check fails, STOP and REJECT with detailed findings.
</phase> <phase name="decision_record_compliance"> Check test code against decision records.
4.1 Identify Applicable ADRs/PDRs
# Find decision records referenced in spec
grep -oE '\[.*?\]\(.*?\.(adr|pdr)\.md\)' {spec_file}
# Find ADRs/PDRs in ancestry
ls {capability_path}/*.adr.md {capability_path}/*.pdr.md 2>/dev/null
ls {feature_path}/*.adr.md {feature_path}/*.pdr.md 2>/dev/null
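The same extraction can be scripted (a sketch mirroring the grep pattern in 4.1):

```python
import re

# Mirrors the grep in 4.1: markdown links ending in .adr.md or .pdr.md.
DECISION_LINK = re.compile(r"\[.*?\]\((.*?\.(?:adr|pdr)\.md)\)")

def decision_records(spec_text: str) -> list[str]:
    """Every ADR/PDR path the spec references, in order of appearance."""
    return DECISION_LINK.findall(spec_text)
```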
4.2 Verify Compliance
For each decision record, extract constraints and verify test code follows them. Use grep to search test files for violation patterns.
If tests violate ADR/PDR constraints: REJECT.
GATE 4: Before proceeding to language-specific phases, verify:
- All ADRs/PDRs referenced in the spec or its ancestry identified
- Test code complies with every extracted constraint

If any check fails, STOP and REJECT with detailed findings.
</phase></review_protocol>
<failure_modes> Failures from actual usage:
Failure 1: Approved tests with silent skips
Failure 2: Missed broken test links
Run ls -la {container}/tests/{file} for EVERY linked file in Phase 1.2. Don't trust link syntax alone.
Failure 3: Approved tests that mocked the SUT
Failure 4: Missed ADR constraint violation
Failure 5: Compared coverage at wrong granularity
Failure 6: Temporal language approved as "observation"
</failure_modes>
<output_format> <approve_template>
## Test Review: {container_path}
### Verdict: APPROVED
All assertions have genuine evidentiary coverage at appropriate levels.
### Assertions Verified
| # | Assertion | Type | Level | Test File | Evidence Quality |
| - | --------- | ------ | ----- | --------- | ---------------- |
| 1 | {name} | {type} | {N} | {file} | Genuine |
### ADR/PDR Compliance
| Decision Record | Status |
| --------------- | --------- |
| {name} | Compliant |
</approve_template>
<reject_template>
## Test Review: {container_path}
### Verdict: REJECT
{One-sentence summary of primary rejection reason}
### Rejection Reasons
| # | Category | Location | Issue | Required Fix |
| - | -------- | ----------- | ------- | ------------ |
| 1 | {cat} | {file:line} | {issue} | {fix} |
### Detailed Findings
#### {Category}: {Issue Title}
**Location**: `{file}:{line}`
**Problem**: {Detailed explanation of why this is a rejection}
**Evidence**:
{Code snippet or grep output showing the issue}
**Required Fix**: {Specific action to resolve}
---
### How Tests Could Pass While Assertion Fails
{Explain the evidentiary gap — how could these tests go green while the promised assertion remains unfulfilled?}
</reject_template>
</output_format>
<rejection_triggers> Quick reference for common rejection triggers:
| Category | Trigger | Verdict |
|---|---|---|
| Spec Structure | Code examples in spec | REJECT |
| Spec Structure | Assertion type doesn't match test strategy (Property without property tests) | REJECT |
| Spec Structure | Missing or broken test file links (inline or table) | REJECT |
| Spec Structure | Language about "pending" specs | REJECT |
| Spec Structure | Temporal language ("currently", "the existing", file references) | REJECT |
| Level | Assertion tested at wrong level | REJECT |
| Dependencies | Skip on required dependency | REJECT |
| Dependencies | Harness referenced but missing | REJECT |
| Decision Record | Test violates ADR/PDR constraint | REJECT |
| Evidentiary | Test can pass with broken impl | REJECT |
Language-specific skills add additional triggers (mocking patterns, type annotations, property testing requirements).
</rejection_triggers>
<success_criteria> Task is complete when:
- Every applicable phase executed in order, or review stopped at the first REJECT
- A single verdict (APPROVED or REJECT) delivered using the output templates
- Every rejection finding cites location, issue, and required fix
</success_criteria>
<cardinal_rule> If you can explain how the tests could pass while the assertion remains unfulfilled, the tests are REJECTED.
Your job is to protect the test suite from phantom evidence. A rejected review that catches an evidentiary gap is worth infinitely more than an approval that lets one through.
</cardinal_rule>