This skill uses the workspace's default tool permissions.
<HARD-GATE>
Creates and manages unit and integration tests by analyzing the codebase, auto-detecting test frameworks, and generating tests that follow project conventions.
Generates unit, integration, component, and e2e test suites with mocking strategies, edge case coverage, descriptive naming, and CI integration patterns. Activates on 'write tests', 'unit tests', 'mocking' requests.
Guides strict TDD workflow: write minimal failing test first, verify failure, add passing code, refactor. Applies to features, bugfixes, and refactors — tests come before production code.
THE IRON LAW: Write code before test? DELETE IT. Start over.
Use git checkout -- <file> or remove the changes entirely.
This is not negotiable. This is not optional. "But I already wrote it" is a sunk cost fallacy.
ROLE BOUNDARY: Test writes TEST FILES only. NEVER modify source/implementation files.
Source and implementation changes are delegated to rune:fix. Test's job ends at the test file.
This separation ensures test never writes code biased toward passing its own tests.
VERTICAL SLICING (Iron Law extension): one test → GREEN → one test → GREEN. Never bulk.
Each cycle produces paired commits — test(scope): <behavior> + feat(scope): <behavior>. completion-gate checks for the pairs against git log --oneline — no paired commits = REJECTED. Emits tdd.horizontal.violation when triggered; preflight blocks merge until cycles are unwound.
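A minimal sketch of what the pairing check can look like — the commit-subject regex and the scope-matching heuristic are assumptions of this sketch, not the actual completion-gate implementation:

```python
import re
import subprocess

# Assumed convention: each cycle leaves a test(scope): ... commit paired with feat(scope): ...
PAIR_RE = re.compile(r"^(test|feat)\(([^)]+)\):")

def unpaired_scopes() -> set[str]:
    """Scopes that have a test(...) commit without a feat(...) commit, or vice versa."""
    log = subprocess.run(
        ["git", "log", "--oneline"], capture_output=True, text=True, check=True
    ).stdout.splitlines()
    test_scopes: set[str] = set()
    feat_scopes: set[str] = set()
    for line in log:
        subject = line.split(" ", 1)[1] if " " in line else line  # drop abbreviated hash
        match = PAIR_RE.match(subject)
        if match:
            (test_scopes if match.group(1) == "test" else feat_scopes).add(match.group(2))
    return test_scopes ^ feat_scopes  # symmetric difference = scopes missing their pair

if __name__ == "__main__":
    missing = unpaired_scopes()
    if missing:
        print("REJECTED — unpaired scopes:", sorted(missing))
    else:
        print("paired commits OK")
```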
Exceptions (narrow, must be documented in test header): retrofitting characterization tests for legacy untested code; spec-driven scaffolding where the contract is external (OpenAPI, wire protocol).
Use Glob to find existing test files: **/*.test.*, **/*.spec.*, **/test_*. Use Read on 2-3 existing test files to understand:
- Naming convention (e.g., foo.test.ts mirrors foo.ts)
- Test directory layout (__tests__/ vs tests/)
Use Glob to find the source file(s) being tested.
TodoWrite: [
{ content: "Understand scope and find existing test patterns", status: "in_progress" },
{ content: "Detect test framework and conventions", status: "pending" },
{ content: "Write failing tests (RED phase)", status: "pending" },
{ content: "Run tests — verify they FAIL", status: "pending" },
{ content: "After implementation: verify tests PASS (GREEN phase)", status: "pending" }
]
Use Glob to find config files and identify the framework:
- jest.config.* or "jest" key in package.json → Jest
- vitest.config.* or "vitest" key in package.json → Vitest
- pytest.ini, or [tool.pytest.ini_options] in pyproject.toml → pytest
If the Python code under test contains async def functions, also verify:
- pytest-asyncio is in dependencies (pyproject.toml [project.dependencies] or [project.optional-dependencies])
- asyncio_mode is set in [tool.pytest.ini_options] (values: auto, strict, or absent)
- If asyncio_mode is not configured → WARN: "pytest-asyncio not configured. Async tests may silently pass without executing async code. Recommend adding asyncio_mode = \"auto\" to [tool.pytest.ini_options] in pyproject.toml."
Remaining detection rules:
- Cargo.toml with #[cfg(test)] pattern → built-in cargo test
- *_test.go files present → built-in go test
- cypress.config.* → Cypress (E2E)
- playwright.config.* → Playwright (E2E)
Verification gate: Framework identified before writing any test code.
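A simplified sketch of the detection logic under the rules above — the returned names are illustrative, and the Cargo check is reduced to file presence rather than scanning for #[cfg(test)]:

```python
import json
from pathlib import Path
from typing import Optional

def detect_framework(root: str = ".") -> Optional[str]:
    """Map config files to a test framework, following the detection rules above (simplified)."""
    r = Path(root)
    pkg = json.loads((r / "package.json").read_text()) if (r / "package.json").exists() else {}
    if list(r.glob("jest.config.*")) or "jest" in pkg:
        return "Jest"
    if list(r.glob("vitest.config.*")) or "vitest" in pkg:
        return "Vitest"
    pyproject = (r / "pyproject.toml").read_text() if (r / "pyproject.toml").exists() else ""
    if (r / "pytest.ini").exists() or "[tool.pytest.ini_options]" in pyproject:
        return "pytest"
    if (r / "Cargo.toml").exists():          # full rule also checks for #[cfg(test)] usage
        return "cargo test"
    if list(r.glob("**/*_test.go")):
        return "go test"
    if list(r.glob("cypress.config.*")):
        return "Cypress (E2E)"
    if list(r.glob("playwright.config.*")):
        return "Playwright (E2E)"
    return None  # verification gate: write no test code until a framework is identified
```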
Use Write to create test files following the detected conventions:
Mirror source file location: if source is src/auth/login.ts, test is src/auth/login.test.ts
Structure tests with clear describe / it blocks (or language equivalent):
describe('Feature name')
it('should [expected behavior] when [condition]')
Cover all three categories:
Use proper assertions. Do NOT use implementation details — test behavior:
- expect(result).toBe(expected) (Jest/Vitest)
- assert result == expected (pytest)
- assert_eq!(result, expected) (Rust)
- if result != expected { t.Errorf(...) } (Go)
For async code: use async/await or pytest @pytest.mark.asyncio.
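A minimal pytest sketch of this structure — the imported validate_email module is hypothetical and used only for illustration:

```python
import pytest

# Hypothetical module under test — the import path and behavior are illustrative only.
from myapp.validation import validate_email


class TestValidateEmail:  # rough pytest equivalent of a describe('validateEmail') block
    def test_rejects_empty_string(self):
        # Assert on observable behavior, not on implementation details.
        assert validate_email("") is False

    def test_accepts_address_with_at_sign(self):
        assert validate_email("user@example.com") is True

    def test_raises_on_non_string_input(self):
        with pytest.raises(TypeError):
            validate_email(None)
```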
When writing tests for async Python code:
Verify setup before writing tests:
- pytest-asyncio is in project dependencies
- asyncio_mode is set in pyproject.toml [tool.pytest.ini_options] (recommend "auto")
Writing async test functions:
asyncio_mode = "auto": just write async def test_something(): — no decorator neededasyncio_mode = "strict": every async test needs @pytest.mark.asyncio@pytest.mark.asyncio decorator explicitlyAsync fixtures:
- Use @pytest_asyncio.fixture (NOT @pytest.fixture) for async setup/teardown
- Fixtures default to function scope — use scope="session" carefully with async
Common pitfalls:
- Tests that pass without await — they run but don't execute the async path
- Missing pytest-asyncio makes async def test_* silently pass as empty coroutines
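A short sketch, assuming asyncio_mode = "auto" and a hypothetical fetch_user coroutine with an equally hypothetical open_test_connection helper:

```python
import pytest_asyncio

# Hypothetical async code under test — import path and helper names are illustrative only.
from myapp.users import fetch_user, open_test_connection


@pytest_asyncio.fixture            # NOT @pytest.fixture — async setup/teardown
async def db():
    conn = await open_test_connection()
    yield conn
    await conn.close()


# With asyncio_mode = "auto" no decorator is needed; under "strict", each async
# test additionally needs @pytest.mark.asyncio. Without pytest-asyncio configured,
# this test body never actually executes — the silent-pass pitfall listed above.
async def test_fetch_user_returns_profile_for_known_id(db):
    user = await fetch_user(db, user_id=42)
    assert user.email == "known@example.com"
```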
Before running the new test in Phase 4, count what's about to be added to the working tree:
bulk_test_count = (test files staged + test files unstaged but new) since the last GREEN commit
Gate: bulk_test_count MUST be exactly 1.
| State | Action |
|---|---|
| `bulk_test_count == 1` | Proceed to Phase 4 (run RED). |
| `bulk_test_count >= 2` AND no prior GREEN this session | HORIZONTAL VIOLATION — pause, keep one test, defer the rest. |
| `bulk_test_count >= 2` AND last GREEN exists in git log | HORIZONTAL VIOLATION — same. |
| `bulk_test_count == 0` | No test to run; this phase is a no-op. |
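One way to approximate the count — the test-file patterns and the reliance on git status --porcelain are assumptions of this sketch, and the "since the last GREEN commit" boundary is left to the caller:

```python
import re
import subprocess

# Assumed test-file patterns, matching the Glob patterns used during discovery.
TEST_FILE_RE = re.compile(r"(\.test\.|\.spec\.|(^|/)test_)")

def bulk_test_count() -> int:
    """Count test files that are staged, or new but not yet staged, in the working tree."""
    status = subprocess.run(
        ["git", "status", "--porcelain"], capture_output=True, text=True, check=True
    ).stdout.splitlines()
    count = 0
    for line in status:
        code, path = line[:2], line[3:]
        staged = code[0] not in (" ", "?")   # anything already in the index
        new_unstaged = code == "??"          # untracked file (new, not yet staged)
        if (staged or new_unstaged) and TEST_FILE_RE.search(path):
            count += 1
    return count

# Gate: proceed to Phase 4 only when this is exactly 1.
```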
Cycle audit log — append to TEST.md (or create) one line per cycle:
cycle 1 — RED: test_validates_email_rejects_empty | GREEN: validateEmail handles empty | commit: 4f3a1c
cycle 2 — RED: test_validates_email_requires_at_sign | GREEN: validateEmail checks @ | commit: a92e0d
The audit log is the receipt. completion-gate reads it.
When the violation fires, do NOT delete tests automatically — surface to the calling agent: "horizontal slicing detected (N tests before first GREEN). Recommend keeping test K, deferring N-1 to subsequent cycles."
Use Bash to run ONLY the newly created test files (not full suite):
- npx jest path/to/test.ts --no-coverage
- npx vitest run path/to/test.ts
- pytest path/to/test_file.py -v (if async tests and no asyncio_mode in config: add --asyncio-mode=auto)
- cargo test test_module_name
- go test ./path/to/package/... -run TestFunctionName
Hard gate: ALL new tests MUST fail at this point.
After rune:fix writes implementation code, run the same test command again:
Then use Bash to run the full suite and check for regressions:
- npm test, pytest, cargo test, or go test ./...
If existing tests regress, hand off to rune:debug.
Verification gate: 100% of new tests pass AND 0 regressions in existing tests.
After GREEN phase, call verification to check the coverage threshold (80% minimum).
When invoked with mode: "diff-aware" or by cook after implementation:
Run git diff main --name-only to get changed files.
This mode is valuable for large codebases where running the full suite is slow. It answers: "what could this diff have broken?"
Input: git diff main --name-only
Output: Prioritized test plan targeting only affected paths
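A sketch of mapping changed source files to candidate test files, assuming the mirror and tests/ conventions detected in Phase 1 — real projects may need additional mappings:

```python
import subprocess
from pathlib import Path

def affected_tests(base: str = "main") -> list[str]:
    """Turn `git diff <base> --name-only` output into a targeted list of existing test files."""
    changed = subprocess.run(
        ["git", "diff", base, "--name-only"], capture_output=True, text=True, check=True
    ).stdout.splitlines()
    candidates = []
    for path in changed:
        p = Path(path)
        if not p.suffix or ".test." in p.name or ".spec." in p.name or p.name.startswith("test_"):
            continue  # skip files without an extension and files that are already tests
        # Mirror convention: src/auth/login.ts -> src/auth/login.test.ts
        candidates.append(str(p.with_suffix(f".test{p.suffix}")))
        # Python convention: pkg/module.py -> tests/test_module.py
        candidates.append(str(Path("tests") / f"test_{p.stem}.py"))
    return [c for c in candidates if Path(c).exists()]
```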
Tests are organized in 4 layers. Each layer catches a different failure class. Higher layers are slower but catch integration issues lower layers miss.
| Layer | Type | What It Catches | Framework | Speed |
|---|---|---|---|---|
| L1 | Unit | Logic bugs, boundary violations, pure function errors | jest/vitest/pytest/cargo test | Fast |
| L2 | Integration | API contract breaks, DB query errors, service interaction failures | supertest/httpx/reqwest | Medium |
| L3 | True Backend | Real tool/service output correctness (not just exit 0) | Same + real software invocation | Medium-Slow |
| L4 | E2E / Subprocess | Full workflow from user/agent perspective, installed app works | Playwright/Cypress/subprocess | Slow |
Layer rules:
"No graceful degradation" rule (L3/L4): Hard dependencies MUST be installed. Tests MUST NOT skip or produce fake results when the dependency is missing. A silently skipping test is worse than a loudly failing test.
Additional modes:
| Type | When | Speed |
|---|---|---|
| Regression | After bug fixes | Fast |
| Diff-aware | After implementation, large codebases (Phase 6.5) | Fast (targeted) |
For non-trivial features (3+ test files or 20+ test cases), create a TEST.md in the test directory. This is BOTH a planning doc (written BEFORE tests) and results doc (appended AFTER tests pass).
# Test Plan: [Feature Name]
## Test Inventory
- `test_core.py`: ~XX unit tests planned (L1)
- `test_integration.py`: ~XX integration tests planned (L2)
- `test_e2e.py`: ~XX E2E tests planned (L3/L4)
## Unit Test Plan (L1)
| Module | Functions | Edge Cases | Est. Tests | Req IDs |
|--------|-----------|------------|------------|---------|
| `core/auth.py` | login, register, refresh | expired token, invalid creds, rate limit | 12 | REQ-001, REQ-003 |
## E2E Scenarios (L3/L4)
| Workflow | Simulates | Operations | Verified | Req IDs |
|----------|-----------|------------|----------|---------|
| User signup | New user onboarding | register → verify → login | Token valid, profile created | REQ-005 |
## Realistic Workflow Scenarios
- **[Name]**: [Step 1] → [Step 2] → verify [output properties]
## Test Results
[Paste full `pytest -v --tb=no` or `npm test` output]
## Summary
- Total: XX | Passed: XX | Failed: 0
- Execution time: X.Xs | Coverage: XX%
## Requirement Coverage
| Req ID | Test File(s) | Status |
|--------|-------------|--------|
| REQ-001 | `test_auth.py::test_login` | ✅ Covered |
| REQ-002 | — | ❌ Not covered |
## Gaps
- [Areas not covered and why]
Why TEST.md: Planning tests before code catches missing edge cases early. Appending results creates permanent evidence. One document = complete testing story.
For testing SKILL.md behavior (not code), use Eval Scenarios — unit tests for skill files, not code files.
## Eval: E[NN] — [scenario name]
### Prompt
[The exact situation/message an agent receives]
### Expected Reasoning
[Step-by-step reasoning the agent SHOULD follow]
### Must Include
- [Assertion 1: what the output MUST contain or do]
- [Assertion 2]
### Must NOT
- [Anti-pattern 1: what the output MUST NOT do]
- [Anti-pattern 2]
### Category
happy-path | adversarial | edge-case | jailbreak | credential-leak
A skill is behavior-tested when it has evals covering:
| Category | Min Evals | Purpose |
|---|---|---|
| Happy path | 1 | Core workflow executes correctly |
| Edge case | 1 | Empty input, missing context, unusual state |
| Adversarial | 1 | Time pressure, sunk cost, authority pressure |
| Jailbreak / injection | 1 | Prompt injection attempt, "ignore instructions" |
Minimum: 4 evals per skill (1 per category). Security-critical skills (sentinel, safeguard): 8+ evals.
Save eval files as skills/<name>/evals.md. Each eval is a numbered scenario (E01–E24 range). skill-forge Phase 7 checks for evals presence before ship.
Troubleshooting:
- Framework not detected: check package.json devDependencies
- Write to test file fails: check if directory exists, create it first with Bash mkdir -p
- Bash test runner hangs beyond 120 seconds: kill and report as TIMEOUT

Invoked by:
- cook (L1): Phase 3 TEST — write tests first
- fix (L2): verify fix passes tests
- review (L2): untested edge case found → write test for it
- deploy (L2): pre-deployment full test suite
- preflight (L2): run targeted regression tests on affected code
- surgeon (L2): verify refactored code
- launch (L1): pre-deployment test suite
- safeguard (L2): writing characterization tests for legacy code
- review-intake (L2): write tests for issues identified during review intake
- scaffold (L1): generate initial test suite for new project
- graft (L2): write integration tests for grafted code
- skill-forge (L2): write tests for new skill functionality
- mcp-builder (L2): write tests for MCP server tools
- debug (L2): write regression test capturing the bug
- plan (L2): reference test requirements in implementation plan

Invokes:
- verification (L3): Phase 6 — coverage check (80% minimum threshold)
- browser-pilot (L3): Phase 4 — e2e and visual testing for UI flows
- debug (L2): Phase 5 — when existing test regresses unexpectedly

Outputs consumed by:
- cook (L1): test results (pass/fail/coverage) → cook's Phase 5 quality gate evidence
- completion-gate (L3): test runner stdout → evidence for "tests pass" claims
- fix (L2): failing test output → fix's target (what to make green)

Inputs consumed from:
- plan (L2): phase file test tasks → test's RED phase targets (what to test)
- review (L2): untested edge cases found during review → new test targets
- fix (L2): implemented code → test's GREEN phase verification target

Skill loops:
- test ↔ fix: test writes failing tests (RED) → fix implements to pass → test verifies (GREEN) → if new failures emerge, loop continues
- test ↔ debug: test discovers regression → debug diagnoses root cause → test writes regression test to prevent recurrence

| Excuse | Reality |
|---|---|
| "Too simple to need tests first" | Simple code breaks. Test takes 30 seconds. Write it first. |
| "I'll write tests after — same result" | Tests-after = "what does this do?" Tests-first = "what SHOULD this do?" Completely different. |
| "I already wrote the code, let me just add tests" | Iron Law: delete the code. Start over with tests. Sunk cost is not an argument. |
| "Tests after achieve the same goals" | They don't. Tests-after are biased by the implementation you just wrote. |
| "It's about spirit not ritual" | Violating the letter IS violating the spirit. Write the test first. |
| "I mentally tested it" | Mental testing is not testing. Run the command, show the output. |
| "This is different because..." | It's not. Write the test first. |
| "I'll batch the tests since they're related" | Batched tests = tests of imagination. Each cycle reacts to the prior. Write one, GREEN it, then the next. |
| "All five tests are already written, let me just review them with you" | Same fallacy as code-before-test. Keep the first one, defer the others to subsequent cycles. |
For data pipelines, AI workflows, and multi-stage processing where comparing full output structures is impractical, use oracle injection:
Inject a unique marker (const oracle = crypto.randomUUID()) into the pipeline input, then assert it survives to the output:
// Example: testing a document processing pipeline
const oracle = "ORACLE-" + crypto.randomUUID();
const testDoc = `Meeting notes: discussed ${oracle} integration timeline`;
const result = await pipeline.process(testDoc);
assert(result.output.includes(oracle), "Oracle not found — pipeline lost data");
When to use: E2E tests for pipelines with 3+ stages, LLM-based processing, ETL workflows, or any system where output structure is complex/non-deterministic but data preservation is critical.
When NOT to use: Unit tests, simple CRUD, or when exact output comparison is feasible.
When a plan with acceptance criteria exists (.rune/features/<name>/plan.md or phase file), every criterion MUST map to at least one test case.
Plan Acceptance Criteria → Test Case → Implementation
AC-1: "User can reset password via email" → test_password_reset_sends_email()
AC-2: "Rate limit: max 3 reset attempts/hour" → test_password_reset_rate_limit()
AC-3: "Expired tokens rejected" → test_expired_reset_token_rejected()
Validation step (after writing tests): Cross-check the plan's acceptance criteria against test names. For each criterion, confirm at least one test verifies it; if none does, write the missing test or record it under Gaps in TEST.md.
Why this is stronger than coverage: Coverage checks that lines were EXECUTED. Traceability checks that INTENT was VERIFIED. You can have 100% coverage but miss a requirement if the test doesn't assert the right behavior.
Skip if: No plan exists (ad-hoc fix), or plan has no acceptance criteria section.
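A rough cross-check sketch — it assumes acceptance criteria appear as `AC-n` IDs in plan.md and that test files mention the criterion ID in a comment or docstring; adapt to the project's actual convention:

```python
import re
from pathlib import Path

def uncovered_criteria(plan_path: str, test_dir: str) -> list[str]:
    """Return acceptance-criterion IDs from the plan with no mention in any test file."""
    plan = Path(plan_path).read_text()
    criteria = set(re.findall(r"\bAC-\d+\b", plan))
    covered = set()
    for test_file in Path(test_dir).rglob("test_*.py"):
        text = test_file.read_text()
        covered |= {ac for ac in criteria if ac in text}
    return sorted(criteria - covered)  # each entry needs a new test or a Gaps entry
```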
Define capability evals and regression evals BEFORE writing implementation code. Evals go beyond unit tests — they verify that the agent/system can handle the feature's intent, not just its mechanics.
| Type | Purpose | Pass Criteria | When |
|---|---|---|---|
| Capability eval | Can the system do this new thing? | pass@k: ≥1 success in k attempts (k=3-5) | Before implementation |
| Regression eval | Did we break existing behavior? | pass^k: ALL k attempts must pass | After implementation |
pass@k (capability): At least 1 of k runs succeeds. Used for new features where some variance is acceptable. Threshold: ≥90% pass@3 for standard features, ≥95% pass@5 for critical paths.
pass^k (regression): ALL k runs must pass. Used for existing behavior that must never break. If ANY run fails, it's a regression. Threshold: 100% pass^3.
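The two criteria reduce to any-vs-all over k runs. A small sketch, with `runs` as a list of per-attempt booleans:

```python
def pass_at_k(runs: list[bool]) -> bool:
    """Capability eval (pass@k): at least one of the k attempts succeeded."""
    return any(runs)

def pass_power_k(runs: list[bool]) -> bool:
    """Regression eval (pass^k): every one of the k attempts succeeded."""
    return all(runs)

# Example: 3 attempts at the same eval.
attempts = [False, True, False]
assert pass_at_k(attempts) is True      # new capability: some variance is tolerated
assert pass_power_k(attempts) is False  # existing behavior: this counts as a regression
```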
Store evals in .rune/evals/<feature>.md:
# Eval: <feature name>
## Capability Evals (pass@k)
| ID | Description | k | Threshold | Status |
|----|-------------|---|-----------|--------|
| CAP-1 | [what the system should be able to do] | 3 | 90% | pending |
## Regression Evals (pass^k)
| ID | Description | k | Status |
|----|-------------|---|--------|
| REG-1 | [existing behavior that must not break] | 3 | pending |
Do NOT overfit evals to specific prompts or known examples. Evals should test the capability, not the exact input.
"When user says 'hello', respond with 'Hi there!'" — tests exact string match"When user greets, respond with a greeting" — tests capability.rune/evals/<feature>.mdIf you catch yourself with ANY of these, delete implementation code and restart with tests:
All of these mean: Delete code. Start over with TDD.
Commit in pairs (test: then feat:) — git log is the audit trail for "I did TDD" claims.
| Gate | Requires | If Missing |
|---|---|---|
| RED Gate | All new tests FAIL before implementation | If any pass, rewrite stricter tests |
| GREEN Gate | All tests PASS after implementation | Fix code, not tests |
| Coverage Gate | 80%+ coverage verified via verification | Write additional tests for gaps |
## Test Report
- **Framework**: [detected]
- **Files Created**: [list of new test file paths]
- **Tests Written**: [count]
- **Status**: RED (failing as expected) | GREEN (all passing)
### Test Cases
| Test | Status | Description |
|------|--------|-------------|
| `test_name` | FAIL/PASS | [what it tests] |
### Coverage
- Lines: [X]% | Branches: [Y]%
- Gaps: `path/to/file.ts:42-58` — uncovered branch (error handling)
### Regressions (if any)
- [existing test that broke, with error details]
Before writing tests, check yourself against these anti-patterns. Each has a gate function — a question you MUST answer before proceeding.
Asserting that a mock exists (e.g., testId="sidebar-mock") instead of testing real component behavior. You're proving the mock works, not the code.
Gate: "Am I testing real component behavior or just mock existence?" → If mock existence: STOP. Rewrite to test real behavior.
Adding destroy(), reset(), or __testSetup() methods to production classes that are ONLY called from test files. Production code should not know tests exist.
Gate: "Is this method only called by tests?" → If yes: STOP. Move to test utilities or test helper file, not production class.
Mocking a function without first understanding ALL its side effects. The real function may write config files, update caches, or emit events that downstream code depends on. Gate: Before mocking, STOP and answer: "What side effects does the REAL function have? Does this test depend on any of those?" → Run with real implementation first, observe what happens, THEN add minimal mocking.
Partial mock missing fields that downstream code consumes. Your test passes because it only checks the fields you mocked, but production code reads fields your mock doesn't have → runtime crash. Iron Rule: Mock the COMPLETE data structure as it exists in reality, not just fields your immediate test uses. Examine actual API response / real data shape before writing mock.
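A sketch of the difference using unittest.mock — the get_profile patch target, send_welcome_email, and the response fields are hypothetical names for illustration:

```python
from unittest.mock import patch

# Shape of the REAL API response — examine it before writing the mock.
REAL_PROFILE = {
    "id": 42,
    "email": "user@example.com",
    "roles": ["admin"],          # downstream permission check reads this
    "created_at": "2024-01-01",  # downstream audit log reads this
}

# BAD: partial mock. A test asserting only on email still passes, but production
# code that reads profile["roles"] crashes at runtime with a KeyError.
INCOMPLETE_PROFILE = {"email": "user@example.com"}


def test_welcome_email_uses_profile_address():
    # Patch target and function under test are hypothetical.
    with patch("myapp.api.get_profile", return_value=REAL_PROFILE):
        from myapp.onboarding import send_welcome_email
        assert send_welcome_email(user_id=42) == "sent to user@example.com"
```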
If mock setup is 30 lines and the actual test assertion is 3 lines, the test is testing infrastructure, not behavior. This is a code smell that indicates wrong abstraction level. Gate: "Is my mock setup longer than my test logic?" → If yes: test at a higher level (integration) or extract mock factories.
Tests that verify the framework works rather than YOUR code works. If the test would still pass with an empty component/function, it's testing infrastructure. Gate: "Would this test pass if I deleted my business logic?" → If yes: STOP. Rewrite to test behavior that YOUR code introduces.
Examples of test slop:
- Asserting only on shape (e.g., typeof result === 'object') when you should test the actual value
Red flags — any of these means STOP and rethink:
- *-mock test IDs in assertions
| Artifact | Format | Location |
|---|---|---|
| Test files | Source files | Co-located or __tests__/ per project convention |
| Test plan + results | Markdown | TEST.md in test directory (non-trivial features only) |
| Eval scenarios | Markdown | skills/<name>/evals.md (for skill behavior testing) |
| Coverage report | Inline stdout | Shown in Test Report |
| Test Report | Markdown (inline) | Emitted to calling skill (cook, fix, review) |
Append to Test Report when invoked standalone. Suppress when called as sub-skill inside an L1 orchestrator (cook, team, etc.) — the orchestrator emits a consolidated block. See docs/references/chain-metadata.md.
chain_metadata:
skill: "rune:test"
version: "1.2.0"
status: "[DONE]"
domain: "[area tested]"
files_changed:
- "[test files created/modified]"
exports:
test_results: { passed: [N], failed: [N], coverage: [N] }
test_files: ["[paths to test files]"]
status: "[RED | GREEN]" # RED = TDD failing (expected), GREEN = all pass
suggested_next: # status-aware — pick based on RED or GREEN
# When GREEN:
- skill: "rune:preflight"
reason: "[grounded in results — e.g., 'All 15 tests GREEN, check edge case completeness']"
consumes: ["test_results", "test_files"]
# When RED (TDD expected):
- skill: "rune:fix"
reason: "[grounded in failures — e.g., '3 tests RED as expected, implement to make them pass']"
consumes: ["test_results", "test_files"]
Known failure modes for this skill. Check these before declaring done.
| Failure Mode | Severity | Mitigation |
|---|---|---|
| Tests passing before implementation exists | CRITICAL | RED Gate: rewrite stricter tests — passing without code = not testing real behavior |
| Skipping the RED phase (not confirming FAIL) | HIGH | Run tests, confirm FAIL output before calling cook/fix to implement |
| Testing mock behavior instead of real code | HIGH | Anti-Pattern 1 gate: "Am I testing real behavior or mock existence?" |
| Mocking without understanding side effects | HIGH | Anti-Pattern 3 gate: run with real impl first, observe side effects, THEN mock minimally |
| Incomplete mocks missing downstream fields | HIGH | Anti-Pattern 4 iron rule: mock COMPLETE data structure, not just fields your test checks |
| Coverage below 80% without filling gaps | MEDIUM | Coverage Gate: identify uncovered lines and write additional tests |
| Introducing a new test framework instead of using existing one | MEDIUM | Constraint 6: detect framework first, use project's existing one always |
| Modifying source files to make tests work | HIGH | Role boundary: test writes test files ONLY — source changes go to rune:fix |
| Test-only methods leaking into production code | MEDIUM | Anti-Pattern 2 gate: if method only called by tests → move to test utilities |
| Bulk test writing before first GREEN (horizontal slicing) | CRITICAL | Phase 3.5 gate — bulk_test_count <= 1, defer additional tests to subsequent cycles |
| Test names describe shape (returns boolean, has property) instead of behavior | MEDIUM | Scan names for shape-words; rewrite to behavior verbs (accepts/rejects/produces). See references/test-quality.md |
| Mocking internal collaborators in same module | HIGH | System-boundary rule from references/mocking-policy.md — mock only what you don't own |
| Layering old shallow-module tests on top of new deepened-interface tests | MEDIUM | After improve-architecture deepens a module, REPLACE tests don't ADD them; one interface = one test surface |
SELF-VALIDATION (run before emitting Test Report):
- [ ] Every test file has at least one assertion — no empty test bodies
- [ ] RED phase output shows actual failures (not "0 tests") — tests were real, not stubs
- [ ] No test modifies source code — test files only, source changes belong to fix
- [ ] Test names describe behavior, not implementation ("should reject expired token" not "test function X")
- [ ] No mocks of the thing being tested — only mock external dependencies
- [ ] If BA requirements exist (REQ-xxx), every requirement has at least one test — check plan's Traceability Matrix
- [ ] Paired test/feat commits appear in git log, no orphan tests pre-first-GREEN
Cost: ~$0.03-0.08 per invocation. Sonnet for writing tests, Bash for running them. Frequent invocation in TDD workflow.
Scope guardrail: Do not modify source or implementation files to make tests pass unless explicitly delegated by the parent agent.