Use when auditing test quality after implementation. Runs four phases: coverage audit, false positive detection via mutation testing, quality rubric evaluation, and letter-grade summary with action items. Invoked by /review-tests and automatically during Stage 4 (Polish) approve-gate.
From casaflownpx claudepluginhub casaperks/casaflow --plugin casaflowThis skill uses the workspace's default tool permissions.
Designs and optimizes AI agent action spaces, tool definitions, observation formats, error recovery, and context for higher task completion rates.
Enables AI agents to execute x402 payments with per-task budgets, spending controls, and non-custodial wallets via MCP tools. Use when agents pay for APIs, services, or other agents.
Compares coding agents like Claude Code and Aider on custom YAML-defined codebase tasks using git worktrees, measuring pass rate, cost, time, and consistency.
This skill evaluates whether tests actually prove what they claim to prove. A test suite that always passes is not the same as a test suite that catches real bugs. The difference is the key insight:
The best way to prove a test matters is to intentionally break the code it covers and confirm the test fails. If breaking the production code breaks the test, the test is real. If the test still passes, it's a false positive — and false positives are worse than no tests, because they provide false confidence.
/casaflow:review-testsClaude reads the test files and the feature spec side by side.
For each acceptance criterion in the spec, answer:
Produce a coverage table:
| Criterion | Happy path | Failure path | Notes |
|---|---|---|---|
| [criterion text] | ✓ / ✗ / ~ | ✓ / ✗ / ~ | [gap or concern] |
Where ✓ = covered, ✗ = missing, ~ = present but weak.
For every test marked ✓ or ~, propose a specific mutation — a small change to the production code that should break the feature but might not break the test.
Format each mutation:
Test:
[test name or describe block]File to mutate:[file path]Mutation: Change[specific line or expression]from[X]to[Y]Expected: The test should fail because[specific reason]Risk: If the test does NOT fail after this mutation, it is a false positive
Instruct the developer to make each mutation, run the affected test, and report the result. After each:
Prioritize the top 3 highest-risk mutations. Focus on:
The developer does not need to run every mutation. Prioritization is Claude's responsibility.
Evaluate each test file against five dimensions. Note violations with file and line references.
Isolation: Does this test depend on state left by another test? Tests that must run in order to pass are coupled. Coupled tests produce false failures in isolation and hide real bugs.
Specificity: Does the test assert what it claims to assert?
A test named should reject invalid input that only checks a 400 status
code is not testing the error message, format, or rejected field. Weak
test dressed as a strong one.
Realistic inputs: Does the test use realistic data?
Tests using id: 1 and name: "test" miss bugs that appear only with
real-world inputs: long strings, special characters, zero, negative numbers,
concurrent requests, unicode, max-length values.
Edge cases: Are the boring edge cases tested? Empty arrays, zero values, maximum values, missing optional fields, null responses. These are the inputs that break real apps in production.
Behavior over implementation: If an internal function is renamed, do the tests still make sense? Tests tied to internal function names break during refactoring and produce false failures. Tests should describe what the system does, not how.
Test suite grade (A / B / C / D):
| Grade | Criteria |
|---|---|
| A | All criteria covered, no false positives found, rubric clean |
| B | Minor coverage gaps, no false positives, small rubric issues |
| C | Coverage gaps or false positives found, rubric violations present |
| D | Significant coverage gaps and/or multiple false positives |
Action items: Numbered list ordered by severity. Each item includes:
What's solid: Explicitly call out tests that are working well. This is not filler — developers need to know what not to change during the fix pass. A good test is an asset and should be named as one.
Ask:
"Do you want me to fix any of the action items, or would you prefer to tackle them yourself first? If you fix them yourself, run
/casaflow:review-testsagain and I'll re-evaluate and update the grade."
Giving the developer the option to fix things themselves is important. The goal is their understanding of the test suite, not just a passing grade.