Skill

test-driven-development

Enforces red-green-refactor TDD discipline for any task with observable behavior. Writes failing tests before production code, then refactors. Triggers automatically during implementation or on explicit invocation.

testing

developer-tools

npx claudepluginhub heimeshoff/agentheim --plugin agentheim

Popularity

Stars

Forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/agentheim:test-driven-development

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

TDD is the worker's default discipline. The worker writes a failing test that encodes one acceptance criterion, makes it pass with the minimum code, then refactors. Repeat per criterion until the task is done. This skill is doctrine — what TDD means in this repo, when it applies, when it doesn't, and what evidence the worker must report.

SKILL.md

85 lines · ~1.9k tokens

Similar Skills

spectre-tdd

132

Executes strict TDD for tasks: RED (happy/failure failing tests), GREEN (minimal impl), REFACTOR (tidy code). Enforces no prod code without tests first.

spectre

test-driven-development

281

Enforces test-first discipline: write a failing test, then minimal code, then refactor. Routes complex tasks to planning before TDD.

1 file

aegis

test-driven-development

100

Enforces Test-Driven Development: write failing tests first, then minimal code to pass, refactor. For implementation tasks, bug fixes needing regression tests, behavior changes.

1 file

superpowers-optimized

Stats

LanguageHTML

Stars18

Forks4

MaintenanceExcellent

Last CommitMay 14, 2026

Actions

View Source View Plugin View on GitHub View README

Help us improve

Share bugs, ideas, or general feedback.

Stats

Actions

Help us improve

Share bugs, ideas, or general feedback.

Test-Driven Development — Red, Green, Refactor

Why TDD here

The agentheim workflow already has a strong gate after the worker returns (verification-before-completion). The verifier reads the diff against acceptance criteria with fresh context. Without TDD, the worker can produce code that looks like it satisfies the criteria but doesn't actually do so under any executed assertion — and the verifier has to re-derive the test space from scratch. With TDD, the verifier can confirm "the tests exist, they assert the criteria, they pass" and spend its energy on the harder question of "are these the right tests".

TDD also fixes the most common failure mode of LLM-generated code on domain-rich projects: plausibly-shaped but behaviorally wrong. A failing test before a single line of production code anchors the worker to observable behavior, not to "code that looks reasonable".

The three laws

Do not write production code until a failing test demands it.
Do not write more of a failing test than is sufficient to fail. (Compilation failure counts.)
Do not write more production code than is sufficient to pass the currently failing test.

The third law is the one workers most often violate by writing six methods when the test demanded one. Resist.

Red, Green, Refactor — the loop per acceptance criterion

For each acceptance criterion in the task file:

Red — write a test that asserts the criterion. Run it. Confirm it fails for the right reason (the assertion fails, not that the test file doesn't compile or the module isn't found). A "red" that's actually a setup error is worthless — it doesn't prove the test will detect the absence of the behavior.
Green — write the minimum production code that makes the test pass. Not the right code, not the elegant code — the minimum. If hardcoding the expected return value passes the test, that's a sign the test under-specifies; write a second test that forces real logic, then implement.
Refactor — with the test green, improve the code's structure without changing behavior. Run the test after each refactor step. If refactoring breaks the test, revert immediately — the refactor was incorrect or the test was specifying implementation rather than behavior.

Then move to the next criterion. Repeat until every checkbox in the Acceptance criteria section corresponds to at least one passing test that would fail without the implementation.

What counts as a test

An executable assertion in the project's test framework (xUnit, Jest, pytest, RSpec, Go test, etc.)
An integration test that exercises a real boundary (database, API, message queue) when the criterion is about that boundary
A property-based test when the criterion describes an invariant over a range of inputs

What does NOT count:

A type signature (types are not assertions about runtime behavior)
A docstring example that isn't executed
"I ran the code manually and it looked right"
A test that asserts the implementation's internal calls (e.g., "method X was called with Y") when the criterion is about externally observable behavior

When TDD does not apply

A small set of tasks legitimately skip TDD. The worker must explicitly note the reason in its return when it does.

type: decision tasks — the deliverable is an ADR, not code. No tests.
type: spike tasks — exploration with explicit throwaway intent. The worker should still write a smoke test for the walking-skeleton spike, but feature spikes can skip.
Pure config / data migration tasks where the assertion is "the config validates and the app boots" — a single integration check covers it; no per-criterion unit tests needed.
Pure documentation tasks — no executable behavior to test.
UI tasks where the project has no UI testing infrastructure — the worker should call this out as a backlog item (add UI test infra) rather than silently skip, and exercise the change manually in the browser as a substitute.

If the worker thinks TDD doesn't apply for any other reason, that's a signal to bounce the task back as under-refined — the acceptance criteria probably aren't testable as written.

TDD against a bounded context's ubiquitous language

Tests are also a place where ubiquitous language lives. A test named it_rejects_a_reservation_that_overlaps_an_existing_one is worth ten tests named test_reservation_validation_3. The worker should name tests using the BC README's terms — and if the right term isn't in the README, that's evidence the README needs an update (do it before writing the test, not after).

What the worker must report

When TDD applies and the worker returns RESULT: SUCCESS, the strict return format includes:

TESTS_ADDED: <integer> — count of new tests written for this task
TESTS_PASSING: yes | no — whether the full test suite passes after the change
TDD_SKIPPED: <reason or "no"> — when TDD legitimately did not apply (per the list above), which reason; otherwise no

If TESTS_PASSING: no, the worker must not return SUCCESS — that's either a FAIL (the worker couldn't get tests green) or a BOUNCE (the task as specified can't be satisfied). Returning SUCCESS with failing tests is a protocol violation.

Anti-patterns

Writing tests after the code. The whole point is that the test drives the design. After-the-fact tests tend to assert what the code happens to do, not what the spec requires.
Asserting implementation details. "The repository was called with these arguments" couples the test to the code's structure. Test the externally observable behavior — what the caller sees.
Over-mocking. Mocks that stub out half the system produce tests that pass while the real integration is broken. Prefer real collaborators (in-memory database, real value objects) for unit tests; reserve mocks for genuinely external systems.
One mega-test per task. Splitting acceptance criteria into separate tests keeps failures localized and the suite diagnostic.
Skipping refactor. The third step is the one that compounds — without it, the codebase degrades even with good test coverage.

Interaction with verification-before-completion

This skill is upstream of the verification gate. A worker that follows TDD produces a diff the verifier can sign off on cheaply. A worker that skips TDD forces the verifier to derive the entire test space — slower, less reliable, and often results in verification failure that re-dispatches the task back to a worker who then has to write the tests anyway. Do TDD first.

test-driven-development

Popularity

Invocation

Context Preview

SKILL.md

Similar Skills

Help us improve

Help us improve

Find plugins for your project

test-driven-development

Popularity

Invocation

Context Preview

SKILL.md

Test-Driven Development — Red, Green, Refactor

Why TDD here

The three laws

Red, Green, Refactor — the loop per acceptance criterion

What counts as a test

When TDD does not apply

TDD against a bounded context's ubiquitous language

What the worker must report

Anti-patterns

Interaction with verification-before-completion

Similar Skills

Help us improve

Test-Driven Development — Red, Green, Refactor

Why TDD here

The three laws

Red, Green, Refactor — the loop per acceptance criterion

What counts as a test

When TDD does not apply

TDD against a bounded context's ubiquitous language

What the worker must report

Anti-patterns

Interaction with verification-before-completion