```
npx claudepluginhub sonthanh/brain-os-plugin
```

This skill uses the workspace's default tool permissions.
Adapted from [@mattpocockuk/skills/tdd](https://github.com/mattpocock/skills/tree/main/tdd). Discipline harness: write ONE test → make it pass → repeat. Replaces `/develop`'s 7-phase ceremony with vertical-slice loops sized to brain-os artifacts.
Orchestrates RED/GREEN/REFACTOR TDD cycles using context-isolated agents for test-first feature implementation.
Tests Claude Code skills using TDD RED-GREEN-REFACTOR: baseline failures without skill, write fixes, iterate loopholes with pressure scenarios before deployment.
This skill should be used when the user asks to 'implement using TDD', 'test-driven development', 'RED-GREEN-REFACTOR', or 'write failing test first'.
`/tdd <artifact>` — start a TDD cycle on the artifact (skill name, file path, GitHub child issue number, etc.).

`/tdd --prior-attempt-failed-because "<hint>" <artifact>` — same as above, but BEFORE writing the first test (during Planning), inject `Previous attempt failed because: <hint>. Try a different angle.` into the planning context. The hint is passed verbatim by `/impl`'s child-level ralph wrapper on retry iterations 2 and 3 (carrying forward the previous iteration's failure summary). `/tdd` itself does NOT decide when to use the arg — it is consumed only when present.

Core principle: tests should verify behavior through public interfaces, not implementation details. Code can change entirely; tests shouldn't.
Good tests are integration-style: they exercise real code paths through public APIs. They describe what the system does, not how it does it. A good test reads like a specification — "skill closes its GH issue when the artifact ships" tells you exactly what capability exists. These tests survive refactors because they don't care about internal structure.
Bad tests are coupled to implementation. They mock internal collaborators, test private helpers, or verify through external means (querying state directly instead of using the interface). Warning sign: your test breaks when you refactor but behavior hasn't changed. If you rename an internal function and tests fail, those tests were testing implementation, not behavior.
Mocking rule: only mock at system boundaries — Gmail / Anthropic / NotebookLM APIs, external HTTP, time, randomness, file-system state outside the working tree. Never mock your own modules, scripts, hooks, or vault contents. When a system boundary is unavoidable, design the interface for mockability: pass dependencies in (don't construct them inside), prefer SDK-style methods (api.getUser(id)) over generic fetchers (api.fetch(endpoint)).
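A minimal sketch of that design-for-mockability rule (the `UserApi` interface and names below are hypothetical, not from this repo): the boundary is passed in as a dependency, so the test substitutes a fake at the boundary without mocking any internal module.

```typescript
// Hypothetical SDK-style boundary interface. Feature code receives it
// as a parameter rather than constructing a client internally.
interface UserApi {
  getUser(id: string): { id: string; email: string };
}

// Unit under test: depends only on the injected boundary.
function welcomeLine(api: UserApi, id: string): string {
  const user = api.getUser(id);
  return `Welcome, ${user.email}`;
}

// The test supplies a fake at the system boundary — no internal mocks.
const fakeApi: UserApi = {
  getUser: (id) => ({ id, email: "test@example.com" }),
};

console.log(welcomeLine(fakeApi, "u1")); // Welcome, test@example.com
```

Note the SDK-style `getUser(id)` method: a generic `fetch(endpoint)` would force the fake to reimplement URL routing, coupling the test to transport details.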
DO NOT write all tests first, then all implementation. This is "horizontal slicing" — treating RED as "write all tests" and GREEN as "write all code."
This produces crap tests — written blind, before the real interface or behavior exists, they guess at both and end up coupled to assumptions instead of behavior.
Correct approach: Vertical slices via tracer bullets. One test → one implementation → repeat. Each test responds to what you learned from the previous cycle. Because you just wrote the code, you know exactly what behavior matters and how to verify it.
```text
WRONG (horizontal):
  RED:   test1, test2, test3, test4, test5
  GREEN: impl1, impl2, impl3, impl4, impl5

RIGHT (vertical):
  RED→GREEN: test1→impl1
  RED→GREEN: test2→impl2
  RED→GREEN: test3→impl3
  ...
```
Pick the row matching what you're changing. The test command runs in every RED and every GREEN — that's how you observe the transition.
| Artifact | Test command | Notes |
|---|---|---|
| Bun / TypeScript code | bun test path/to/file.test.ts | Default for new code |
| Bash script | bash -n script.sh && bash script.sh --dry-run | Syntax check + dry-run code path |
| Hook (PreToolUse / SessionStart / etc.) | echo '{json-payload}' | bash hook.sh; echo "exit=$?" | Exit code is the contract; capture stdout/stderr |
| Plist (LaunchAgent) | plutil -lint file.plist | Schema validity only — runtime behavior needs a separate launchd test |
| Python (legacy or migration target) | python3 -m py_compile file.py && pytest path/to/test_file.py | Compile gate + pytest |
| Skill with evals/ | /eval <skill-name> | Invokes the skill's own eval suite |
| Skill without evals | Manual: invoke the skill on a real input, observe output vs expected | Add evals/ if the change is non-trivial |
| Vault doc / markdown | grep for required sections, lint structure | E.g., a handover must have "What's Next" + "Commands to Verify State" |
If your artifact isn't listed, name the verifier explicitly in the Planning phase before writing the first test.
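For the hook row, the exit-code contract can be exercised from a test harness roughly like this (the inline bash body is a made-up stand-in, not a real brain-os hook; a real test would spawn the actual hook.sh):

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical inline hook standing in for hook.sh: blocks any payload
// mentioning "rm -rf" by exiting 2, allows everything else with exit 0.
const hook = `
read -r payload
case "$payload" in
  *"rm -rf"*) echo "blocked" >&2; exit 2 ;;
  *)          exit 0 ;;
esac
`;

// Feed a JSON payload on stdin and capture the exit code — the contract.
function runHook(payload: object): number {
  const res = spawnSync("bash", ["-c", hook], {
    input: JSON.stringify(payload) + "\n",
    encoding: "utf8",
  });
  return res.status ?? -1;
}

console.log(runHook({ tool: "Bash", command: "ls" }));       // 0
console.log(runHook({ tool: "Bash", command: "rm -rf /" })); // 2
```

This mirrors the table's `echo '{json-payload}' | bash hook.sh; echo "exit=$?"` pattern: assert on the exit code, and capture stdout/stderr for diagnostics.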
Default behavior across artifact types: a runner report containing SKIPPED cases counts as GREEN provided no test failed. The exception below tightens that contract; it applies ONLY when /tdd is invoked on a GitHub child issue whose body carries ## Parent and a ## Covers AC H2 section (the structured shape filed by /slice). Generic-artifact invocations — direct /tdd on a bash script, plist, vault doc, or skill with no enclosing GitHub issue — are unaffected.
Detection (single source of truth — do NOT re-derive parser shape, regex, or the live-AC marker list here; cite by section number):
1. Parse the child's `## Covers AC` per references/ac-coverage-spec.md § 1 (Covers AC parser regex). Collect the integer ID set. Empty set → pure-component child → standard GREEN, rule does not fire.
2. Fetch the parent issue (named in the `## Parent` section) and parse its `## Acceptance` per spec § 2.
3. Apply references/ac-coverage-spec.md § 6 to each parent AC bullet — this yields the parent's live-AC ID set.

Rule: when the child is a live-AC child AND the test runner output reports any SKIPPED test case, `/tdd` refuses GREEN and emits RED with the exact message:
Live-AC child has SKIPPED tests — make them run or remove their skip gate.
The child must either remove the skip gate (env-var, .skip() modifier, conditional describe.skip) or actually run the previously-skipped path. Pure-component children (empty ## Covers AC ID set, or a set whose every ID maps to a non-live parent AC) keep the previous behavior — SKIPPED cases pass through GREEN unchanged.
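The gate above can be sketched as follows (simplified: real detection parses the runner's report and derives live-AC status from the ac-coverage-spec parse, not from a boolean flag):

```typescript
type Verdict = { color: "GREEN" | "RED"; message?: string };

// Sketch of the SKIPPED gate: failures are always RED; SKIPPED cases
// are RED only for live-AC children, and pass through otherwise.
function gateGreen(
  runnerSummary: { failed: number; skipped: number },
  isLiveAcChild: boolean,
): Verdict {
  if (runnerSummary.failed > 0) return { color: "RED", message: "tests failed" };
  if (isLiveAcChild && runnerSummary.skipped > 0) {
    return {
      color: "RED",
      message:
        "Live-AC child has SKIPPED tests — make them run or remove their skip gate.",
    };
  }
  // Generic artifacts / pure-component children: SKIPPED counts as GREEN.
  return { color: "GREEN" };
}

console.log(gateGreen({ failed: 0, skipped: 2 }, false).color); // GREEN
console.log(gateGreen({ failed: 0, skipped: 2 }, true).color);  // RED
```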
Before writing any code:
- If `--prior-attempt-failed-because "<hint>"` was passed: inject `Previous attempt failed because: <hint>. Try a different angle.` into the planning context as the FIRST line, before any other planning step. This anchors the rest of the plan against the prior failure.
- Ask: "What should the public interface look like? Which behaviors matter most?"
You can't test everything. Confirm with the user exactly which behaviors matter. Focus testing effort on critical paths and complex logic, not every possible edge case.
Write ONE test that confirms ONE thing about the system end-to-end:
RED: Write test for first behavior → run test command → test fails
GREEN: Write minimal code to pass → run test command → test passes
This is your tracer bullet — proves the path works end-to-end. Skip stubs, mocks of internal pieces, and scaffolding. Just the smallest real thing.
For each remaining behavior:
RED: Write next test → fails
GREEN: Minimal code to pass → passes
Rules: write only the minimal code each GREEN needs (no speculative features), and actually run the test command on every RED and every GREEN.
After all tests pass, look for duplication, unclear names, and structure worth simplifying — refactor with the green suite as your safety net.
Never refactor while RED. Get to GREEN first.
[ ] Test describes behavior, not implementation
[ ] Test uses public interface only
[ ] Test would survive an internal refactor
[ ] Code is minimal for this test (no speculative features)
[ ] Test command was actually run (not just "looks right")
If the target is still fuzzy or exploratory, use `/improve` or `/audit` instead; switch to `/tdd` once the target is clear.

Follow skill-spec.md § 11. Append to `{vault}/daily/skill-outcomes/tdd.log`:
{date} | tdd | {action} | ~/work/brain-os-plugin | {artifact-relative-path} | commit:{hash} | {result}
- action: `feature` (new behavior), `bugfix` (fix broken behavior), `refactor` (no behavior change, covered by tests), or `port` (adapt from external source)
- result: `pass` if all tests green AND no reactive `/improve` (encoded feedback: a commit landing against this artifact within 48h); `partial` if the user manually corrected mid-cycle but tests still passed; `fail` if the cycle was abandoned or a reactive patch was needed
- extra fields: `args="{artifact-name}"`, `cycles=N` (red-green count), `corrections=N`

The 48h-no-reactive-patch criterion is the discipline test from the Phase A handover (daily/handovers/2026-04-25-skill-process-phase-a-tdd.md). If pass-rate stays high, `/tdd`'s smaller slices are doing what `/develop`'s checklist couldn't. If reactive patches keep landing, the deferred reviewer hook (Q6 in the grill) comes back.
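The log line above can be assembled mechanically — a sketch, with illustrative field values (the function name and shape are mine, not part of skill-spec.md):

```typescript
// Builds one tdd.log line in the pipe-delimited format documented above.
function tddLogLine(o: {
  date: string;
  action: "feature" | "bugfix" | "refactor" | "port";
  artifact: string;
  commit: string;
  result: "pass" | "partial" | "fail";
}): string {
  return [
    o.date, "tdd", o.action, "~/work/brain-os-plugin",
    o.artifact, `commit:${o.commit}`, o.result,
  ].join(" | ");
}

console.log(tddLogLine({
  date: "2026-04-26",
  action: "feature",
  artifact: "skills/tdd/SKILL.md",
  commit: "abc1234",
  result: "pass",
}));
// 2026-04-26 | tdd | feature | ~/work/brain-os-plugin | skills/tdd/SKILL.md | commit:abc1234 | pass
```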