From agentheim
Enforces red-green-refactor TDD discipline for any task with observable behavior. Writes failing tests before production code, then refactors. Triggers automatically during implementation or on explicit invocation.
npx claudepluginhub heimeshoff/agentheim --plugin agentheim
How this skill is triggered (by the user, by Claude, or both): Slash command
/agentheim:test-driven-development
The summary Claude sees in its skill listing, used to decide when to auto-load this skill:
TDD is the worker's default discipline. The worker writes a failing test that encodes one acceptance criterion, makes it pass with the minimum code, then refactors. Repeat per criterion until the task is done. This skill is doctrine — what TDD means in this repo, when it applies, when it doesn't, and what evidence the worker must report.
The agentheim workflow already has a strong gate after the worker returns (verification-before-completion). The verifier reads the diff against acceptance criteria with fresh context. Without TDD, the worker can produce code that looks like it satisfies the criteria but doesn't actually do so under any executed assertion — and the verifier has to re-derive the test space from scratch. With TDD, the verifier can confirm "the tests exist, they assert the criteria, they pass" and spend its energy on the harder question of "are these the right tests".
TDD also fixes the most common failure mode of LLM-generated code on domain-rich projects: plausibly-shaped but behaviorally wrong. A failing test before a single line of production code anchors the worker to observable behavior, not to "code that looks reasonable".
The third law (write no more production code than is sufficient to pass the one failing test) is the one workers most often violate, typically by writing six methods when the test demanded one. Resist.
For each acceptance criterion in the task file:
Red — write a test that asserts the criterion. Run it. Confirm it fails for the right reason (the assertion fails, not that the test file doesn't compile or the module isn't found). A "red" that's actually a setup error is worthless — it doesn't prove the test will detect the absence of the behavior.
Green — write the minimum production code that makes the test pass. Not the right code, not the elegant code — the minimum. If hardcoding the expected return value passes the test, that's a sign the test under-specifies; write a second test that forces real logic, then implement.
Refactor — with the test green, improve the code's structure without changing behavior. Run the test after each refactor step. If refactoring breaks the test, revert immediately — the refactor was incorrect or the test was specifying implementation rather than behavior.
Then move to the next criterion. Repeat until every checkbox in the Acceptance criteria section corresponds to at least one passing test that would fail without the implementation.
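The cycle above can be sketched in Python for one criterion. Everything here is hypothetical and for illustration only: the `Reservation` and `Calendar` types, the `StubCalendar` used to show the red step, and the `run` helper that separates a genuine red (a failed assertion) from a setup error. The second test is the kind of follow-up that stops a hardcoded answer from passing.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable


@dataclass(frozen=True)
class Reservation:
    start: date  # first day, inclusive
    end: date    # last day, exclusive


class OverlappingReservation(Exception):
    """Raised when a new reservation collides with an existing one."""


class StubCalendar:
    """Red-step stub (hypothetical): the API exists, the behavior does not yet."""

    def reserve(self, new: Reservation) -> None:
        return None


class Calendar:
    """Green-step implementation: only what the tests below demand."""

    def __init__(self) -> None:
        self._reservations: list[Reservation] = []

    def reserve(self, new: Reservation) -> None:
        for existing in self._reservations:
            if new.start < existing.end and existing.start < new.end:
                raise OverlappingReservation(new)
        self._reservations.append(new)


def it_rejects_a_reservation_that_overlaps_an_existing_one(calendar) -> None:
    calendar.reserve(Reservation(date(2024, 5, 1), date(2024, 5, 5)))
    try:
        calendar.reserve(Reservation(date(2024, 5, 4), date(2024, 5, 8)))
    except OverlappingReservation:
        return
    raise AssertionError("overlapping reservation was accepted")


def it_accepts_a_reservation_that_starts_when_another_ends(calendar) -> None:
    # Guards against a hardcoded "reject every second reservation" passing.
    calendar.reserve(Reservation(date(2024, 5, 1), date(2024, 5, 5)))
    calendar.reserve(Reservation(date(2024, 5, 5), date(2024, 5, 8)))


def run(test: Callable, calendar) -> str:
    """Classify one run as 'green', 'red' (assertion failed), or 'setup error'."""
    try:
        test(calendar)
    except AssertionError:
        return "red"
    except Exception as error:
        return f"setup error: {type(error).__name__}"
    return "green"


print(run(it_rejects_a_reservation_that_overlaps_an_existing_one, StubCalendar()))
print(run(it_rejects_a_reservation_that_overlaps_an_existing_one, Calendar()))
print(run(it_accepts_a_reservation_that_starts_when_another_ends, Calendar()))
```

In a real repository the tests would run under the project's own test runner rather than a hand-rolled `run`; the helper only makes the red-versus-setup-error distinction visible in a single file.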
What does NOT count:
A small set of tasks legitimately skip TDD. The worker must explicitly note the reason in its return when it does.
- type: decision tasks. The deliverable is an ADR, not code. No tests.
- type: spike tasks. Exploration with explicit throwaway intent. The worker should still write a smoke test for the walking-skeleton spike, but feature spikes can skip.
If the worker thinks TDD doesn't apply for any other reason, that's a signal to bounce the task back as under-refined: the acceptance criteria probably aren't testable as written.
Tests are also a place where ubiquitous language lives. A test named it_rejects_a_reservation_that_overlaps_an_existing_one is worth ten tests named test_reservation_validation_3. The worker should name tests using the BC README's terms — and if the right term isn't in the README, that's evidence the README needs an update (do it before writing the test, not after).
When TDD applies and the worker returns RESULT: SUCCESS, the strict return format includes:
- TESTS_ADDED: <integer> (count of new tests written for this task)
- TESTS_PASSING: yes | no (whether the full test suite passes after the change)
- TDD_SKIPPED: <reason or "no"> (when TDD legitimately did not apply, per the list above, which reason; otherwise no)
If TESTS_PASSING: no, the worker must not return SUCCESS. That's either a FAIL (the worker couldn't get tests green) or a BOUNCE (the task as specified can't be satisfied). Returning SUCCESS with failing tests is a protocol violation.
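As a rough illustration of the rule above, the return fields could be checked like this. The `WorkerReturn` type and `protocol_violations` function are hypothetical, not part of agentheim. The second check (TDD applied but zero tests added) is an assumption layered on top of the text, not a rule the skill states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerReturn:
    result: str          # RESULT: "SUCCESS", "FAIL", or "BOUNCE"
    tests_added: int     # TESTS_ADDED
    tests_passing: bool  # TESTS_PASSING: yes -> True, no -> False
    tdd_skipped: str     # TDD_SKIPPED: a reason, or "no"


def protocol_violations(ret: WorkerReturn) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems: list[str] = []
    if ret.result == "SUCCESS" and not ret.tests_passing:
        # Stated in the skill: SUCCESS with failing tests is a violation.
        problems.append("SUCCESS returned with TESTS_PASSING: no")
    if ret.result == "SUCCESS" and ret.tdd_skipped == "no" and ret.tests_added == 0:
        # Assumption for this sketch only: TDD applied, yet no tests were added.
        problems.append("TDD applied but TESTS_ADDED is 0")
    return problems


ok = WorkerReturn("SUCCESS", tests_added=3, tests_passing=True, tdd_skipped="no")
bad = WorkerReturn("SUCCESS", tests_added=3, tests_passing=False, tdd_skipped="no")
print(protocol_violations(ok))
print(protocol_violations(bad))
```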
This skill is upstream of the verification gate. A worker that follows TDD produces a diff the verifier can sign off on cheaply. A worker that skips TDD forces the verifier to derive the entire test space — slower, less reliable, and often results in verification failure that re-dispatches the task back to a worker who then has to write the tests anyway. Do TDD first.