From code-paper-test
Mentally executes code line-by-line with concrete values to find bugs, logic errors, edge cases, and AI hallucinations. Use for dry runs, code audits, and validating implementations before deployment.
Slash command: /code-paper-test:paper-test
Systematically test code by mentally executing it line-by-line with concrete values.
| Target Size | Approach | Why |
|---|---|---|
| < 50 lines | Quick trace (workflow below) | Fast, inline, sufficient for small code |
| 50–300 lines | Structured 3-phase | One agent, all 3 perspectives, sequential — thorough without coordination overhead |
| 300+ lines or security-critical | /code-paper-test:test-team (3 agents) | Context pressure justifies splitting. Cross-challenge debate catches what one agent misses. |
| Skill/command/agent files | /code-paper-test:test-team | Different lenses genuinely find different things for instruction-based testing |
If the user asks for "paper test" without specifying, read the target files, count lines, and recommend the appropriate approach. For 50–300 lines, use Structured 3-Phase mode. Only recommend /test-team for 300+ lines, explicit "test team" requests, or security-critical code.
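The routing rule in the table can be sketched as a small function. This is only an illustration: the function name, parameter names, and return labels are hypothetical, and treating exactly 300 lines as "300+" is an assumption, since the table lists 300 under both ranges.

```python
def recommend_approach(line_count: int,
                       security_critical: bool = False,
                       is_skill_file: bool = False,
                       asked_for_test_team: bool = False) -> str:
    """Map a target to a paper-test approach, following the table above.

    The return labels are illustrative names, not identifiers used by the skill.
    """
    # Skill/command/agent files, security-critical code, and explicit
    # "test team" requests go to the 3-agent team regardless of size.
    if is_skill_file or security_critical or asked_for_test_team:
        return "test-team"
    # Assumption: exactly 300 lines counts as "300+".
    if line_count >= 300:
        return "test-team"
    if line_count >= 50:
        return "structured-3-phase"
    return "quick-trace"
```

For example, `recommend_approach(120)` selects the structured 3-phase mode, while `recommend_approach(120, security_critical=True)` escalates to the team.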
Read references/structured-3-phase.md for the full methodology. It runs Phase A (happy path), Phase B (edge cases, 6 categories), Phase C (adversarial, 5 categories), and Phase D (self-review) sequentially in one agent.
For small code, skip the structured phases — just trace with concrete values following the workflow below.
Follow code logic with concrete test cases to find bugs, logic errors, edge cases, and AI hallucinations.
NOT just reading — actually run the code in your head with real values.
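As a minimal illustration of what "running the code in your head" looks like, here is a made-up function (not taken from any real target) with two hand-written traces recorded as comments. The second trace exposes a flaw that reading alone might miss; the fixed variant shows one possible remedy, chosen only for the example.

```python
def average(values):
    total = 0
    for v in values:
        total += v
    return total / len(values)

# Paper trace 1 (happy path): values = [2, 4]
#   total = 0
#   v = 2 -> total = 2
#   v = 4 -> total = 6
#   len(values) = 2 -> returns 6 / 2 = 3.0
#
# Paper trace 2 (edge case): values = []
#   total = 0
#   loop body never runs
#   len(values) = 0 -> 0 / 0 raises ZeroDivisionError  <-- flaw found


def average_fixed(values):
    """Same logic, with the empty-input case from trace 2 handled explicitly."""
    if not values:
        return None  # illustrative choice; raising a clear error is also valid
    return sum(values) / len(values)
```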
The number of scenarios traced per phase scales with the active effort level so
the skill is honest about cost in CI and smoke runs. The current effort level is
${CLAUDE_EFFORT}. Apply this floor:
| Effort | Scenarios per phase | Posture |
|---|---|---|
| low | 1 (happy path) + 1 error case | Fast smoke check, minimum honest coverage |
| medium | 2 per phase | Balanced: happy path, key edge cases, main error path |
| high / xhigh / max | 3+ per phase | Thorough: full edge and adversarial coverage |
Regardless of effort, ALWAYS trace at least one happy path and one error case,
and ALWAYS verify every external call (effort never lowers the verification
bar — it only scales scenario breadth). When ${CLAUDE_EFFORT} is unset (model
without effort support), default to medium.
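The effort floor can be sketched as a lookup with a medium default. This is a hedged illustration: the function and constant names are hypothetical, the effort value is passed in as a plain string, and falling back to medium for an unrecognized value is an assumption (the text only specifies the unset case).

```python
from typing import Optional

# Minimum scenarios per phase, from the effort table above.
SCENARIO_FLOOR = {"low": 1, "medium": 2, "high": 3, "xhigh": 3, "max": 3}


def scenarios_per_phase(effort: Optional[str]) -> int:
    """Return the minimum number of scenarios to trace in each phase."""
    key = (effort or "").strip().lower()
    # Unset, empty, or an unresolved "${...}" placeholder: default to medium.
    if not key or key.startswith("${"):
        key = "medium"
    # Assumption: unknown effort names also fall back to the medium floor.
    return SCENARIO_FLOOR.get(key, SCENARIO_FLOOR["medium"])
```

The returned number scales breadth only. The one happy path, one error case, and the verification of every external call stay mandatory at every level.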
Trace the target with concrete values. The 8-step workflow:
Behavioral verification procedures: references/behavioral-verification.md §B1 (code/library calls) and §B2 (plugin/MCP/hook/skill references). Closed-source fallback: apply taint stance; flag as behavioral gap if no validation wrapper exists. Chained-object rule: trace every property/method invoked on the return object, not just the return type.
Full step-by-step detail with templates for each step, the verification
procedures, the flaw catalog summary, the module testing strategy, and the
output template are in references/workflow.md.
For every external method, service, interface, or config value the code touches,
verify it with the Read and Grep tools — do not guess. If the source is
unavailable (closed-source package), mark it explicitly as an UNVERIFIED RISK
rather than assuming it works. This is the single highest-value discipline of
paper testing, especially for AI-generated code. Procedures: references/workflow.md
(verification section), references/dependency-verification.md,
references/contract-patterns.md.
That a method exists does not mean it returns what you assume. Existence verification is the first pass; behavioral verification is the second: locate the declared contract (type stub → OpenAPI → docs → docblock), enumerate every assumption the caller makes about the return, and diff. For closed-source targets with no contract, apply the taint stance: assume the return could be null, hostile, or malformed. See references/behavioral-verification.md.
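A validation wrapper under the taint stance might look like the following sketch. Everything here is hypothetical: `fetch_user` stands in for a closed-source call with no declared contract, and the dict-with-"email" shape is the caller's assumption being checked, not a documented return type.

```python
from typing import Any, Callable, Optional


def safe_email_domain(fetch_user: Callable[[int], Any],
                      user_id: int) -> Optional[str]:
    """Treat fetch_user's return as tainted: None, malformed, or hostile.

    Caller assumptions checked one by one:
      1. the return is a dict
      2. it has an "email" entry that is a string
      3. the string contains exactly one "@" and a non-empty domain
    """
    user = fetch_user(user_id)
    if not isinstance(user, dict):
        return None
    email = user.get("email")
    if not isinstance(email, str) or email.count("@") != 1:
        return None
    domain = email.split("@")[1].strip().lower()
    return domain or None
```

Each check corresponds to one assumption from the enumeration step, so a paper trace can feed `None`, a non-dict, or a malformed email through the wrapper and confirm that none of them reaches the chained calls.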
For CI integration, aggregation, or programmatic consumption, invoke with --json to emit a stable, versioned JSON document instead of the markdown report.
- Available in /paper-test (quick and structured-3-phase modes) and /code-paper-test:test-team (Synthesizer aggregation).
- The schema is versioned (schema_version: "1.1"). CI should pin ^1\., not an exact match.
- Severity values: CRITICAL, HIGH, MEDIUM, LOW, INFO.
- findings is always an array: [] when clean, never null or omitted.
- status is the overall gate verdict: pass / warning / fail. Use pass only when no MEDIUM-or-higher findings exist and the run completed fully.

Use JSON mode for: CI gates, dashboards, pipelines chaining into jq or monitoring. Stay in markdown for: interactive analysis and educational traces. Full schema, finding-object shape, team-report extensions, skill/config categories, optional rubric_score block, and CI gate patterns: see references/json-output-schema.md.
/paper-test --json src/Service/UserService.php
/code-paper-test:test-team --json src/Service/PaymentService.php
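A CI gate over the JSON document could be sketched as below. Only schema_version, findings, and status come from the description above. The per-finding "severity" key is an assumption, since the finding-object shape lives in references/json-output-schema.md and is not reproduced here, and the function name is hypothetical.

```python
import json
import re

# Severities that block a "pass" verdict (MEDIUM or higher).
BLOCKING = {"CRITICAL", "HIGH", "MEDIUM"}


def gate_passes(report_text: str) -> bool:
    """Return True only for a supported schema, status "pass", and no blockers."""
    report = json.loads(report_text)

    # Pin the major version (^1\.) rather than an exact match.
    version = report.get("schema_version")
    if not isinstance(version, str) or not re.match(r"^1\.", version):
        raise ValueError(f"unsupported schema_version: {version!r}")

    # findings is documented as always an array; anything else is a broken report.
    findings = report.get("findings")
    if not isinstance(findings, list):
        raise ValueError("findings must be an array")

    # Assumption: each finding is an object carrying a "severity" string.
    blockers = [f for f in findings
                if isinstance(f, dict) and f.get("severity") in BLOCKING]
    return report.get("status") == "pass" and not blockers
```

Checking both status and the findings list is deliberately redundant: a report that claims pass while carrying a HIGH finding fails the gate instead of slipping through.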
skill-quality-reviewer for Skill Testing

When paper-testing a skill, command, or agent file, run plugin-creation-tools:skill-quality-reviewer first (deterministic: stale SDK refs, dropped imperatives, frontmatter gaps), then paper-test for the semantic analysis (instruction fidelity, trigger coverage, context budget). See references/skill-and-config-testing.md §"Deterministic + Agentic pairing". In skill-mode, after verifying that tool/file/skill references exist, verify that each referenced capability PRODUCES what the calling step consumes; see references/behavioral-verification.md §B2.
Layering with native security review. The Red Team Attacker lens in /code-paper-test:test-team complements the security-guidance plugin — security-guidance catches issues in Claude's own edits in real time; the Red Team Attacker finds adversarial vulnerabilities in the target code at analysis time. They cover different moments and are not substitutes.
All detailed guides are in the references/ directory:
- references/workflow.md - full 8-step workflow, verification detail, flaw catalog, module strategy, output template
- references/core-method.md - complete paper testing method with worked examples
- references/structured-3-phase.md - the 50–300 line single-agent 3-phase methodology
- references/dependency-verification.md - how to verify external calls
- references/contract-patterns.md - all code contract types
- references/ai-code-auditing.md - testing AI-generated code
- references/fork-vs-fresh.md - decision record: why /test-team spawns fresh-context (not forked) teammates
- references/hybrid-testing.md - module-level testing strategy
- references/common-flaws.md - catalog of frequent bugs
- references/advanced-techniques.md - progressive injects, red team testing, attack surface analysis, AAR format
- references/severity-scoring.md - consistent severity rubric for flaw prioritization
- references/blind-ab-comparison.md - comparing two implementations side by side
- references/rubric-scoring.md - structured grading for code quality assessment
- references/skill-and-config-testing.md - testing skills, commands, agents, and configs
- references/json-output-schema.md - stable JSON schema for --json mode (CI integration)
- references/behavioral-verification.md - B1 (code/library behavioral contracts) and B2 (plugin/MCP/hook/skill output contracts)