From ship-it-ops
Apply testing best practices (test design, TDD, test strategy, mocking, integration testing, CI/CD testing) when writing or reviewing tests for Python, TypeScript/JavaScript, or Java code. Invoke explicitly for test reviews or test strategy assessments. Do not invoke for shell scripts, config files, or one-off scripts.
npx claudepluginhub ship-it-ops/ship-code --plugin ship-tested-code
This skill is limited to using the following tools:
This skill applies testing best practices plus modern software engineering principles to help you write effective, maintainable, and reliable tests. It operates in two modes: writing (apply silently) and review (structured report).
Start with these 3 rules and internalize them before learning the rest:
The detailed reference files (reference.md, reference-smells.md) assume familiarity with testing frameworks and patterns — build up to those over time.
Writing mode (default when generating or modifying test code): Apply all principles proactively. Produce well-structured, behavior-focused tests by default without commentary unless asked. Do not explain the principles being applied -- just write good tests.
Review mode (when explicitly reviewing tests, using /ship-tested-code, or asked to review): Read the target test code, analyze against the rules below, and produce a structured report using the Review Output Format defined in this skill.
Trigger review mode when the user says: "review tests", "test review", "check my tests", "test quality", or invokes the skill explicitly. When in doubt, default to writing mode.
These 12 rules apply to ALL tests, ALL languages, EVERY time:
Tests should survive refactoring. Query by role/behavior, not internal state. If changing internals (without changing behavior) breaks tests, the tests are coupled to implementation. Do not test private methods, auto-generated code, or framework wiring.
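A minimal sketch of this rule, assuming a hypothetical `ShoppingCart` class invented for illustration. The test asserts only on the public `total_cents()` result, so the internal `_items` storage could change shape without breaking it.

```python
class ShoppingCart:
    """Hypothetical class used only for illustration."""

    def __init__(self):
        # Internal detail: this could become a list or a database row
        # without changing the observable behavior.
        self._items = {}

    def add(self, sku, unit_price_cents, quantity=1):
        current_qty = self._items.get(sku, (unit_price_cents, 0))[1]
        self._items[sku] = (unit_price_cents, current_qty + quantity)

    def total_cents(self):
        return sum(price * qty for price, qty in self._items.values())


def test_total_should_sum_quantities_when_same_sku_added_twice():
    cart = ShoppingCart()

    cart.add("SKU-1", unit_price_cents=250)
    cart.add("SKU-1", unit_price_cents=250, quantity=2)

    # Asserts on observable behavior, never on cart._items.
    assert cart.total_cents() == 750
```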
Each test verifies a single concept. Multiple assertions are fine if they verify the same behavior. If a test name needs "and", split it. If a test can fail for five different reasons, it is five tests pretending to be one.
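As an illustration, assuming a hypothetical `normalize_email` function: a single test named "strips and lowercases" would need "and" in its name, so it is split into one test per concept.

```python
def normalize_email(raw):
    """Hypothetical function under test."""
    return raw.strip().lower()


# Instead of test_normalize_email_strips_and_lowercases, one test per concept:
def test_normalize_email_should_strip_surrounding_whitespace():
    assert normalize_email("  ada@example.com \n") == "ada@example.com"


def test_normalize_email_should_lowercase_when_input_has_capitals():
    assert normalize_email("Ada@Example.COM") == "ada@example.com"
```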
Every test has three distinct phases: set up the scenario, perform the action, verify the result. Keep them visually separated. Do not tangle setup with assertions or interleave multiple act-assert sequences.
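The three phases might look like this in Python. The `Account` class is a hypothetical stand-in; the comments and blank lines are what keep the phases visually separate.

```python
from dataclasses import dataclass


@dataclass
class Account:
    """Hypothetical class used only for illustration."""

    balance_cents: int

    def withdraw(self, amount_cents):
        if amount_cents > self.balance_cents:
            raise ValueError("insufficient funds")
        self.balance_cents -= amount_cents


def test_withdraw_should_reduce_balance_when_funds_are_sufficient():
    # Arrange
    account = Account(balance_cents=1_000)

    # Act
    account.withdraw(300)

    # Assert
    assert account.balance_cents == 700
```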
A test that passes 99% of the time is a flaky test. No sleep(), no shared mutable state,
no reliance on execution order, no external service calls without isolation, no
Math.random() or datetime.now() without controlled injection.
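One way to get controlled injection, sketched with hypothetical functions: the code under test receives the current time and the random source as arguments instead of reading them itself, so a test can pin both.

```python
import random
from datetime import datetime, timedelta, timezone


def is_expired(expires_at, now):
    """Hypothetical function: the caller passes `now` in instead of reading the clock."""
    return now >= expires_at


def pick_winner(entrants, rng):
    """Hypothetical function: randomness comes from an injected random.Random."""
    return rng.choice(entrants)


def test_is_expired_should_flip_exactly_at_expiry_time():
    fixed_now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

    assert is_expired(expires_at=fixed_now, now=fixed_now)
    assert not is_expired(expires_at=fixed_now + timedelta(seconds=1), now=fixed_now)


def test_pick_winner_should_be_repeatable_when_rng_is_seeded():
    entrants = ["ada", "grace", "linus"]

    first = pick_winner(entrants, random.Random(42))
    second = pick_winner(entrants, random.Random(42))

    assert first == second
    assert first in entrants
```

Production code would pass `datetime.now(timezone.utc)` and a shared `random.Random()`; only the tests pin the values.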
A failing test name should tell you exactly what broke. Use should_X_when_Y or
test_X_given_Y. Never test1, testProcess, or testHappyPath. If your tests were the
only documentation, a reader should understand the system's behavior.
Happy paths get validated by usage. Error handling, edge cases, timeouts, empty inputs, nulls, maximum values, unicode, concurrent access -- test these deliberately. Production bugs live in the negative paths.
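A sketch of deliberate negative-path testing, assuming a hypothetical `parse_quantity` parser. The small `raises_value_error` helper exists only so the example runs without a test framework; in a pytest project, `pytest.raises` fills that role.

```python
def parse_quantity(raw):
    """Hypothetical parser used only for illustration."""
    if raw is None or raw.strip() == "":
        raise ValueError("quantity is required")
    value = int(raw)  # int() raises ValueError for non-numeric input
    if value <= 0:
        raise ValueError("quantity must be positive")
    return value


def raises_value_error(func, *args):
    """Return True if calling func(*args) raises ValueError."""
    try:
        func(*args)
    except ValueError:
        return True
    return False


# Missing, empty, whitespace-only, non-numeric, fractional, zero, negative.
INVALID_QUANTITIES = [None, "", "   ", "abc", "1.5", "0", "-3"]


def test_parse_quantity_should_raise_when_input_is_invalid():
    for raw in INVALID_QUANTITIES:
        assert raises_value_error(parse_quantity, raw), f"expected ValueError for {raw!r}"


def test_parse_quantity_should_accept_digits_with_surrounding_whitespace():
    assert parse_quantity(" 7 ") == 7
```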
Unit tests for business logic, integration tests at service boundaries, contract tests between services, E2E only for critical user journeys. Do not use E2E for what a unit test can catch. The right distribution depends on your architecture, not a universal shape.
Tests must be executable in any order, concurrently. Each test creates its own data and cleans up after itself. No shared mutable state between tests. If tests cannot run in parallel, there is a design problem.
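As a sketch of per-test data, assuming a hypothetical `append_line` helper: each test creates its own temporary directory, so the two tests below can run in either order or in parallel without seeing each other's files.

```python
import tempfile
from pathlib import Path


def append_line(path, line):
    """Hypothetical function under test."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(line + "\n")


def test_append_line_should_create_file_when_missing():
    # A fresh directory per test, removed automatically on exit.
    with tempfile.TemporaryDirectory() as tmp:
        log_path = Path(tmp) / "audit.log"

        append_line(log_path, "created")

        assert log_path.read_text(encoding="utf-8") == "created\n"


def test_append_line_should_keep_existing_lines_when_file_exists():
    with tempfile.TemporaryDirectory() as tmp:
        log_path = Path(tmp) / "audit.log"
        log_path.write_text("first\n", encoding="utf-8")

        append_line(log_path, "second")

        assert log_path.read_text(encoding="utf-8") == "first\nsecond\n"
```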
Mock external APIs, clocks, randomness. Use real (containerized) databases and queues when practical. Prefer fakes over mocks for state-based tests. Never mock value objects. When you mock everything, you test that your mocks are wired correctly, not that your code works.
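A minimal sketch of fakes at the boundaries, with every class and function name invented for illustration. The repository fake actually stores data, so the test checks resulting state rather than which methods were called.

```python
class InMemoryUserRepository:
    """Hand-written fake: a simplified, working stand-in for a hypothetical
    database-backed repository."""

    def __init__(self):
        self._emails_by_id = {}

    def save(self, user_id, email):
        self._emails_by_id[user_id] = email

    def find_email(self, user_id):
        return self._emails_by_id.get(user_id)


class RecordingMailer:
    """Fake for the external email API boundary: records instead of sending."""

    def __init__(self):
        self.sent = []

    def send(self, to, subject):
        self.sent.append((to, subject))


def register_user(user_id, email, repository, mailer):
    """Hypothetical code under test."""
    repository.save(user_id, email)
    mailer.send(to=email, subject="Welcome")


def test_register_user_should_store_email_when_registration_succeeds():
    repository = InMemoryUserRepository()

    register_user(1, "ada@example.test", repository, RecordingMailer())

    assert repository.find_email(1) == "ada@example.test"


def test_register_user_should_send_welcome_email_when_registration_succeeds():
    mailer = RecordingMailer()

    register_user(1, "ada@example.test", InMemoryUserRepository(), mailer)

    assert mailer.sent == [("ada@example.test", "Welcome")]
```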
If your system failed in a way tests did not predict, fill the gap. This is how you build a safety net over time. Bugs cluster -- when you find one, test adjacent logic exhaustively.
Use builder/factory patterns for test data. When a model adds a field, fix one factory,
not 400 tests. Make test data reveal intent: UserFactory.withExpiredSubscription(),
not User("John", "john@test.com", null, null, true, 0).
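A possible Python shape for such a factory. The `User` fields and the `has_active_subscription` function are assumptions made for this sketch, and the method is spelled `with_expired_subscription` to follow Python naming.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class User:
    """Hypothetical model; the fields are assumptions for this sketch."""

    name: str
    email: str
    subscription_ends: Optional[date]
    is_active: bool


class UserFactory:
    """Single place to update when User gains a field."""

    @staticmethod
    def build(**overrides):
        defaults = dict(
            name="Test User",
            email="user@example.test",
            subscription_ends=None,
            is_active=True,
        )
        defaults.update(overrides)
        return User(**defaults)

    @staticmethod
    def with_expired_subscription(today):
        # `today` is passed in so the test stays deterministic.
        return UserFactory.build(subscription_ends=today - timedelta(days=1))


def has_active_subscription(user, today):
    """Hypothetical code under test."""
    return user.subscription_ends is not None and user.subscription_ends >= today


def test_has_active_subscription_should_be_false_when_subscription_expired():
    today = date(2024, 6, 1)
    user = UserFactory.with_expired_subscription(today)

    assert not has_active_subscription(user, today)
```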
Tests have maintenance costs. Skipped tests older than 30 days: fix or delete. Tests that have not failed in 2 years on stable code: evaluate whether they provide value. Coverage-gaming tests with weak assertions: delete. A fast, reliable suite of 200 tests beats a slow, flaky suite of 2000.
When reviewing test code, report issues in this priority order:
T1 - MISSING COVERAGE: Untested critical paths, missing error/edge case tests, no regression test for known bugs, business logic without tests.
T2 - FLAKY/UNRELIABLE: Timing dependencies (sleep), shared mutable state,
non-deterministic data, external service calls without isolation, order-dependent tests.
T3 - WRONG LEVEL: E2E test for what a unit test should catch, mocking everything (testing mocks not code), testing framework internals instead of your code.
T4 - POOR DESIGN: Implementation-coupled tests, tests with multiple reasons to fail, unclear test names, AAA structure violated, testing private methods.
T5 - TEST DATA: Hardcoded magic values, shared mutable fixtures, no factories, production data in tests, missing edge case data (nulls, empty, unicode, boundaries).
T6 - MAINTAINABILITY: Duplicate test setup across files, over-abstracted helpers (5 levels deep), excessive mocking setup longer than test logic, brittle selectors/locators.
T7 - ASSERTIONS: Weak assertions (assertNotNull when you mean assertEquals),
missing assertions (test exercises code but verifies nothing), snapshot abuse on
large/changing components, no assertion messages.
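To illustrate the T7 category, here is a sketch that contrasts a weak assertion with a precise one, using a hypothetical `apply_discount` function.

```python
def apply_discount(price_cents, percent):
    """Hypothetical function under test (integer cents, rounds the discount down)."""
    return price_cents - (price_cents * percent) // 100


def test_apply_discount_should_subtract_percentage_from_price():
    result = apply_discount(2_000, 15)

    # Weak: `assert result is not None` would pass for almost any bug.
    # Precise, with a message that states the expectation:
    assert result == 1_700, f"expected 15% off 2000 to be 1700, got {result}"
```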
Rules for when NOT to be strict:
Don't demand tests for trivial code. Getters, setters, simple delegation, framework-generated code do not need tests. Test what could break.
Legacy code gets the Boy Scout Rule. When touching untested code: add a test for your change. Do not demand full coverage of the file. Leave it slightly better.
Prototype code gets a pass. If the user says "quick hack", "prototype", "spike", relax test requirements. If it survives to production, tests come first.
Coverage is a floor, not a goal. 70-80% branch coverage is a healthy baseline. Never chase 100% — it incentivizes low-value tests. Use coverage to find gaps, not to declare victory.
Speed trumps comprehensiveness. A fast, reliable suite of 200 tests beats a slow, flaky suite of 2000. Optimize for feedback speed.
Match existing test patterns. If the project uses a specific test structure, naming convention, or framework, follow it. Consistency with the codebase outweighs ideals.
Never block on T7 issues. Assertion style suggestions are just that -- suggestions.
Detect the programming language from file extensions and context. Load the appropriate language-specific reference:
- .py files -> Read ${SKILL_DIR}/lang-python.md
- .ts, .tsx, .js, .jsx files -> Read ${SKILL_DIR}/lang-typescript.md
- .java files -> Read ${SKILL_DIR}/lang-java.md

Apply universal principles first, then layer language-specific idioms on top. When the language is ambiguous or not covered, apply only universal principles.
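The skill states this mapping as an instruction to the agent, not as code. Purely as an illustration, the same dispatch can be sketched in Python; the function name `language_reference_for` is hypothetical, and `None` stands for "apply only universal principles".

```python
from pathlib import PurePosixPath

# Extension -> reference file name, as listed in this skill.
REFERENCE_BY_EXTENSION = {
    ".py": "lang-python.md",
    ".ts": "lang-typescript.md",
    ".tsx": "lang-typescript.md",
    ".js": "lang-typescript.md",
    ".jsx": "lang-typescript.md",
    ".java": "lang-java.md",
}


def language_reference_for(filename):
    """Return the reference file for a path, or None when the language is not covered."""
    suffix = PurePosixPath(filename).suffix.lower()
    return REFERENCE_BY_EXTENSION.get(suffix)
```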
When in review mode, produce this structured output:
## Test Review: [filename or scope]
### Critical (must fix before merge)
- **[T1-COVERAGE] Line XX**: [Missing test description]. -> [What to test and how].
- **[T2-FLAKY] Line XX**: [Flakiness source]. -> [Specific fix].
### Important (should fix)
- **[T3-LEVEL] Line XX**: [Problem]. -> [Better approach].
- **[T4-DESIGN] Line XX**: [Problem]. -> [Fix suggestion with code snippet].
- **[T5-DATA] Line XX**: [Problem]. -> [Fix suggestion].
### Suggestions (improve when convenient)
- **[T6-MAINT] Line XX**: [Problem]. -> [Fix suggestion].
- **[T7-ASSERT] Line XX**: [Problem]. -> [Fix suggestion].
### What's Good
- [Substantive positive observations about test design, coverage strategy, or patterns done well.]
Rules for the output:
Before applying testing rules, check if ${SKILL_DIR}/overrides.md exists. If it does,
read it and apply its overrides. Team overrides supersede defaults. Use this for: test
framework preferences, coverage thresholds, naming convention deviations, disabled rules.
If a project has a .claude/ship-tested-code-overrides.md file, read it as well —
project-level overrides take precedence over skill-level overrides.
Phased rollout recommended:
Track: T1/T2 findings per PR (should trend toward zero), flaky test rate, escaped defect rate.
For deeper analysis, load supporting reference files:
- ${SKILL_DIR}/reference.md
- ${SKILL_DIR}/reference-smells.md
- ${SKILL_DIR}/lang-{language}.md
- ${SKILL_DIR}/examples/

Load these on-demand when doing thorough reviews or when the user asks for detailed guidance on a specific testing topic.