Audit quality of existing tests — identify tautological tests, coverage gaming, weak assertions, and missing corner cases. Use this skill whenever the user mentions test quality, test effectiveness, tautological tests, coverage gaming, "are these tests any good", "audit tests", "review test quality", test rot, flaky tests, or wants to understand whether existing tests actually catch bugs. Also triggers on: "these tests feel useless", "coverage is high but bugs keep shipping", "would these tests catch a real bug", pointing at a test file and asking if it's worth keeping. This skill audits EXISTING test quality — not finding missing tests (use find-test-gaps), writing new tests (use test-driven-development), or debugging test failures (use debug-issue).
<skill_overview> Audit a user-specified scope of existing tests and categorize each as RED (remove or replace), YELLOW (strengthen), or GREEN (keep). Uses code-review-graph for structural context and reads production code before evaluating tests.
Core contract: every RED or YELLOW verdict must cite the specific line or pattern that makes the test weak, and explain what false confidence it creates. "This test could be better" is not a verdict — "this test passes even when the production code is deleted" is. </skill_overview>
<rigidity_level> HIGH FREEDOM — The categorization framework (RED/YELLOW/GREEN) and the requirement to read production code before tests are rigid. How deep the analysis goes within each module adapts to complexity and risk. The structural context step (code-review-graph) is rigid when a graph is available. </rigidity_level>
<when_to_use>
Use when the user questions whether existing tests actually catch bugs: test quality audits, tautological tests, coverage gaming, "are these tests any good", test rot.

Don't use for:
- Finding missing tests (use cape:find-test-gaps)
- Writing new tests (use cape:test-driven-development)
- Debugging test failures (use cape:debug-issue)
</when_to_use>
<critical_rules>
- Read the production code before judging its tests; a verdict without production context is a guess.
- Every RED or YELLOW verdict must cite the specific line or pattern that makes the test weak.
- If the baseline cape check fails, stop; audit test quality only when the suite is green.
- Never create br items without explicit user approval.
</critical_rules>
<the_process>
Run cape check to establish a baseline. If exit code is non-zero, stop — do not proceed. Read
checkResults from JSON output and report entries where passed: false. Audit test quality only
when the suite is green.
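A minimal sketch of that filtering step, assuming a JSON shape beyond the documented checkResults array and passed flag:

```ts
// Assumed shape: only the checkResults array and its passed flag are
// documented above; the name field is hypothetical.
interface CheckEntry { name: string; passed: boolean; }

function failedChecks(rawOutput: string): CheckEntry[] {
  const { checkResults } = JSON.parse(rawOutput) as { checkResults: CheckEntry[] };
  return checkResults.filter((entry) => !entry.passed);
}
// Report these entries and stop; audit only when the list is empty.
```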
If the user's message doesn't include a clear scope, ask:
What tests should I audit? Examples:
- A directory: src/auth/
- A test file: src/auth/login.test.ts
- A module: the payment processing tests
Once scope is clear, use code-review-graph to build structural context before reading any files. See
resources/graph-tools-instructions.md for the full tool catalog and fallback behavior.
Graph queries to run (in order):
1. semantic_search_nodes_tool with kind: "Test" to find test entities in scope
2. query_graph_tool with tests_for on each production file to confirm test-to-source mappings
3. query_graph_tool with callers_of on key production functions to understand their importance — heavily-called functions deserve more scrutiny on their tests
4. get_impact_radius_tool on the production files to understand blast radius — tests guarding high-impact code get more scrutiny

Present the scope summary:
Scope: src/auth/
Test files: 3 (login.test.ts, session.test.ts, permissions.test.ts)
Production files: 4 (login.ts, session.ts, permissions.ts, types.ts)
Framework: vitest
High-impact functions: authenticate() (12 callers), validateSession() (8 callers)
Production code first. You cannot judge a test without understanding what it should verify. For each production file in scope, read it and note:
- The public functions and the behavior each one promises
- Error paths and guard clauses (each one deserves a test)
- Boundary conditions: empty input, max values, concurrency
Then read the corresponding test file and evaluate each test against the production code.
RED — Remove or replace. These tests create false confidence. They pass regardless of whether production code works correctly.
| Pattern | Example |
|---|---|
| Tautological | Asserts the mock returns what you told it to return |
| Mock-dominated | Verifies mock wiring, not behavior — production code could be deleted and the test still passes |
| Coverage gaming | Calls the function but asserts nothing meaningful (e.g., expect(result).toBeDefined() on a function that always returns an object) |
| Testing the framework | Verifies that the test framework, language runtime, or library works correctly rather than testing application logic |
| Frozen snapshot | Snapshot test that gets blindly updated whenever it fails — it documents current output, not correct output |
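A minimal vitest sketch of the first two RED patterns; every name here (authenticate, the db mock) is hypothetical:

```ts
import { describe, it, expect, vi } from "vitest";

// Hypothetical production code and mock, for illustration only.
const db = { findUser: vi.fn().mockResolvedValue({ id: 1, name: "Ada" }) };

async function authenticate(email: string, _password: string) {
  return db.findUser(email); // imagine credential validation was deleted
}

describe("RED patterns", () => {
  it("tautological: asserts the mock returns what we told it to", async () => {
    const user = await authenticate("ada@example.com", "wrong-password");
    expect(user).toEqual({ id: 1, name: "Ada" }); // passes with validation gone
  });

  it("mock-dominated: verifies wiring, not behavior", async () => {
    await authenticate("ada@example.com", "wrong-password");
    expect(db.findUser).toHaveBeenCalled(); // also passes with validation gone
  });
});
```

Both tests stay green no matter what authenticate() does with the password, which is exactly the false confidence a RED verdict should name.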
YELLOW — Strengthen. These tests have value but miss important cases or use weak patterns that reduce their effectiveness.
| Pattern | Example |
|---|---|
| Happy-path only | Tests the sunny day but no error paths, despite the production code having explicit error handling |
| Weak assertions | toBeTruthy() or not.toBeNull() when a specific value should be checked |
| Brittle coupling | Depends on implementation details (string-matching error messages, asserting call order, testing private methods) |
| Missing corner cases | Tests the common case but not boundaries (empty input, max values, concurrent access) |
| Incomplete error path | Tests that an error is thrown but not the error type, message, or downstream effect |
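A hedged before/after sketch of two YELLOW fixes (weak assertion, incomplete error path); parseAmount and ValidationError are hypothetical names:

```ts
import { it, expect } from "vitest";

// Hypothetical production code, for illustration only.
class ValidationError extends Error {
  constructor(public code: string) { super(`validation failed: ${code}`); }
}
function parseAmount(input: string): number {
  if (!/^\d+$/.test(input)) throw new ValidationError("INVALID_INPUT");
  return Number(input);
}

// Weak: passes for any non-zero number, wrong ones included.
it("weak assertion", () => {
  expect(parseAmount("42")).toBeTruthy();
});

// Strengthened: pins the exact value.
it("strengthened assertion", () => {
  expect(parseAmount("42")).toBe(42);
});

// Incomplete: any thrown error passes, even the wrong one.
it("incomplete error path", () => {
  expect(() => parseAmount("abc")).toThrow();
});

// Strengthened: asserts the error type, not just that something threw.
it("complete error path", () => {
  expect(() => parseAmount("abc")).toThrow(ValidationError);
});
```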
GREEN — Keep as-is. Tests that genuinely guard behavior. They would fail if the production code broke in the way they test.
A test earns GREEN when: it asserts specific behavior, targets a real code path, and would fail if that path changed.
Start from the assumption that a test is YELLOW until proven GREEN. GREEN is the exception — most test suites contain more strengthening opportunities than a first glance suggests.
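For contrast, a GREEN-shaped sketch mirroring the bcrypt example in the findings below; authenticate and the compare stub are hypothetical:

```ts
import { it, expect, vi } from "vitest";

// Hypothetical stub standing in for bcrypt.compare.
const compare = vi.fn(async (raw: string, hash: string) => hash === `hashed:${raw}`);

async function authenticate(email: string, password: string, storedHash: string) {
  if (!(await compare(password, storedHash))) throw new Error("invalid credentials");
  return { email };
}

// Would fail if authenticate() stopped comparing against the stored hash.
it("rejects a wrong password", async () => {
  await expect(authenticate("ada@example.com", "wrong", "hashed:right"))
    .rejects.toThrow("invalid credentials");
  expect(compare).toHaveBeenCalledWith("wrong", "hashed:right");
});
```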
For each test (or tightly-related group), record:
[RED|YELLOW|GREEN] test description or name
Line: test file:line_number
Verdict: [specific reason citing the pattern from the table above]
Production code: [which function/path this test is supposed to guard]
[RED/YELLOW only] Fix: [what to do — delete, replace with X, add assertion for Y]
Before presenting findings, review your own categorizations. This catches false positives that would waste the user's time.
For each RED verdict, verify: would this test truly still pass if the production code it targets were broken or deleted?
For each GREEN verdict, challenge: would this test actually fail if the behavior it claims to guard changed?
Downgrade uncertain REDs to YELLOW. Upgrade suspicious GREENs to YELLOW. When in doubt, YELLOW.
Group findings by module. Lead with the summary, then details.
Analyzed [N] tests across [M] files.
RED: [count] — remove or replace (false confidence)
YELLOW: [count] — strengthen (partial value)
GREEN: [count] — keep as-is (genuine guards)
### src/auth/login.test.ts — 8 tests (2 RED, 4 YELLOW, 2 GREEN)
Production file: src/auth/login.ts
Key functions: authenticate() (12 callers), resetPassword()
RED tests:
1. "should call the database" (line 23)
Asserts mock.toHaveBeenCalled() after calling authenticate() which always calls the DB.
Tautological — passes even if authenticate() stops validating credentials.
Fix: Replace with test that verifies authenticate() rejects invalid credentials.
2. "should return user object" (line 45)
Asserts result is not null. authenticate() always returns an object (throws on failure).
Coverage gaming — asserts nothing about correctness.
Fix: Assert specific user fields match input credentials.
YELLOW tests:
3. "should reject wrong password" (line 67)
Good intent but asserts only that an error is thrown, not the error type.
Fix: Assert AuthenticationError with specific code.
4-5. "should handle login" / "should process valid credentials" (lines 89, 102)
Happy-path only. No test for: expired credentials, locked account, rate limiting.
Fix: Add error-path tests for each guard clause in authenticate().
GREEN tests:
6. "should hash password before comparison" (line 134)
Verifies bcrypt.compare is called with the raw password and stored hash.
Would fail if hashing logic changed.
After presenting all modules, ask before creating br items:
Found [N] RED and [M] YELLOW tests across [K] modules.
Create a br epic with improvement tasks? I can drop any findings you disagree with.
STOP here. You MUST wait for explicit user approval before creating br items. Do not call
br create until the user responds.
After user approval, create a br epic and one task per module.
Create a br epic following this template:
!cat "${CLAUDE_SKILL_DIR}/../write-plan/resources/epic-template.md"
Populate Requirements from the RED/YELLOW findings, Anti-patterns from observed test smells, and
Success criteria from the improvement targets. Use --type epic --priority 2. Run
cape br validate <epic-id> after creation.
br create "Improve tests in [module name]" \
--type task \
--parent <epic-id> \
--priority <assessed-priority> \
--labels "analyze-tests" \
--description "$(cat <<'EOF'
## Goal
Fix [R] RED and [Y] YELLOW tests in [file path].
## RED — Remove or replace
1. [test name] (line N) — [verdict]. Replace with: [specific replacement].
2. ...
## YELLOW — Strengthen
1. [test name] (line N) — [verdict]. Fix: [specific improvement].
2. ...
## GREEN — No action
[List green tests so the implementer knows what not to touch]
## Implementation
- Test file: [path]
- Production file: [path]
- Framework: [framework]
- For each replacement: write the new test, verify it fails against broken production code,
then verify it passes against correct code
## Success criteria
- [ ] [test name]: replaced with test that verifies [specific behavior]
- [ ] [test name]: assertion strengthened to check [specific value]
- [ ] All replacement tests fail when production behavior breaks
EOF
)"
cape br validate <task-id>
Present the created epic and tasks, then suggest cape:execute-plan to start implementing.
</the_process>
<agent_references>
cape:test-auditor protocol:
- Pass: production file path, corresponding test file path, graph findings (impact radius, callers).
- Expect back: per-test verdicts (RED/YELLOW/GREEN) with line references and fix descriptions.
Dispatch parallel subagents when scope contains many test files — each reads one production file + its test file and returns categorized verdicts.
</agent_references>
User asks to audit test quality in a directory
User: "Are the tests in src/billing/ any good?"
Wrong: Skim the test files and report "looks fine, they have decent coverage." Coverage says nothing about quality. A test suite can hit 95% coverage while every test is tautological.
Right: Establish the green baseline, build graph context for src/billing/, read each production file before its tests, then present per-test RED/YELLOW/GREEN verdicts with line references.
User: "We have 90% coverage in src/api/ but keep finding bugs in production. What's wrong?"
Wrong: Suggest adding more tests to get to 95%. More of the same bad tests won't help.
Right: Audit the existing tests before adding any. High coverage with recurring bugs usually means tautological or mock-dominated tests; identify which tests create false confidence and propose specific replacements.
User: "Is src/auth/session.test.ts worth keeping?"
Wrong: Count the tests and say "12 tests, seems comprehensive." Or read only the test file without the production code and guess at quality.
Right: Read src/auth/session.ts first, then evaluate each test in session.test.ts against it. Report per-test verdicts; recommend keeping the GREEN tests and replacing or strengthening the rest rather than judging the file as a whole.
<key_principles>
- Coverage measures execution, not verification: a suite can hit 95% coverage while every test is tautological.
- You cannot judge a test without reading the production code it is supposed to guard.
- Default to YELLOW; GREEN must be earned, and uncertain REDs get downgraded.
- A verdict without a cited line or pattern is an opinion, not a finding.
</key_principles>