Teaches TDD for SKILL.md files: baseline agent failure tests, minimal docs to fix violations, verify compliance, refactor cycles. Use when creating, editing, or reviewing skills.
Install:

```shell
npx claudepluginhub lgbarn/shipyard --plugin shipyard
```

This skill uses the workspace's default tool permissions.
<!-- TOKEN BUDGET: 380 lines / ~1140 tokens -->
Writing skills IS Test-Driven Development applied to process documentation.
Personal skills live in agent-specific directories (`~/.claude/skills` for Claude Code, `~/.codex/skills` for Codex).
You write test cases (pressure scenarios with subagents), watch them fail (baseline behavior), write the skill (documentation), watch tests pass (agents comply), and refactor (close loopholes).
Core principle: If you didn't watch an agent fail without the skill, you don't know if the skill teaches the right thing.
REQUIRED BACKGROUND: You MUST understand shipyard:shipyard-tdd before using this skill. That skill defines the fundamental RED-GREEN-REFACTOR cycle. This skill adapts TDD to documentation.
A skill is a reference guide for proven techniques, patterns, or tools. Skills help future Claude instances find and apply effective approaches.
Skills are: Reusable techniques, patterns, tools, reference guides
Skills are NOT: Narratives about how you solved a problem once
| TDD Concept | Skill Creation |
|---|---|
| Test case | Pressure scenario with subagent |
| Production code | Skill document (SKILL.md) |
| Test fails (RED) | Agent violates rule without skill (baseline) |
| Test passes (GREEN) | Agent complies with skill present |
| Refactor | Close loopholes while maintaining compliance |
| Write test first | Run baseline scenario BEFORE writing skill |
| Watch it fail | Document exact rationalizations agent uses |
| Minimal code | Write skill addressing those specific violations |
| Watch it pass | Verify agent now complies |
| Refactor cycle | Find new rationalizations -> plug -> re-verify |
The entire skill creation process follows RED-GREEN-REFACTOR.
Create when you have:
- A concrete method with steps to follow (condition-based-waiting, root-cause-tracing)
- A way of thinking about problems (flatten-with-flags, test-invariants)
- API docs, syntax guides, or tool documentation worth packaging (office docs)

Don't create for:
- Narratives about how you solved a problem once
```
skills/
  skill-name/
    SKILL.md            # Main reference (required)
    supporting-file.*   # Only if needed
```

Flat namespace: all skills live in one searchable namespace.
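Under the layout above, scaffolding a new personal skill can be sketched as follows (the skill name and description are illustrative placeholders, not part of shipyard itself):

```shell
# Minimal scaffold for a new personal skill.
SKILLS_DIR="${SKILLS_DIR:-$HOME/.claude/skills}"
NAME="condition-based-waiting"

mkdir -p "$SKILLS_DIR/$NAME"
cat > "$SKILLS_DIR/$NAME/SKILL.md" <<'EOF'
---
name: condition-based-waiting
description: Use when tests hang or flake waiting for async operations
---

# Condition-Based Waiting
EOF
```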
Separate files for:
- Heavy reference material too large for inline
- Reusable tools and scripts

Keep inline:
- Everything else
Frontmatter (YAML) has exactly two fields, `name` and `description`:
- `name`: letters, numbers, and hyphens only (no parentheses or special characters)
- `description`: third person, describes ONLY when to use the skill (NOT what it does)
```yaml
---
name: Skill-Name-With-Hyphens
description: Use when [specific triggering conditions and symptoms]
---
```
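These rules are mechanical enough to lint. A minimal sketch in Python (the `validate_frontmatter` helper is hypothetical, and reading the 1024-char frontmatter budget as covering name plus description combined is an assumption):

```python
import re

def validate_frontmatter(name: str, description: str) -> list[str]:
    """Return a list of rule violations for SKILL.md frontmatter (empty = OK)."""
    problems = []
    # name: letters, numbers, and hyphens only
    if not re.fullmatch(r"[A-Za-z0-9-]+", name):
        problems.append("name: letters, numbers, and hyphens only")
    # description: triggering conditions, phrased as "Use when ..."
    if not description.startswith("Use when"):
        problems.append('description: should start with "Use when ..."')
    # assumed reading of the 1024-char frontmatter budget
    if len(name) + len(description) > 1024:
        problems.append("frontmatter: name + description over 1024 chars")
    return problems

print(validate_frontmatter("condition-based-waiting",
                           "Use when tests hang waiting for async ops"))  # → []
```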
The SKILL.md body follows this template:

```markdown
# Skill Name

## Overview
What is this? Core principle in 1-2 sentences.

## When to Use
Bullet list with SYMPTOMS and use cases.
Include when NOT to use.

## Core Pattern (for techniques/patterns)
Before/after code comparison.

## Quick Reference
Table or bullets for scanning common operations.

## Implementation
Inline code for simple patterns.
```
| Element | Rule | Example |
|---|---|---|
| Description | Triggering conditions only, no workflow | "Use when tests hang waiting for async ops" |
| Keywords | Error messages, symptoms, tool names | "ENOTEMPTY", "flaky", "bats-core" |
| Name | Verb-first, hyphenated | condition-based-waiting not async-helpers |
| Token budget | <150 words getting-started, <500 words others | Compress, cross-reference |
| Cross-refs | Skill name + requirement marker, no @ links | **REQUIRED SUB-SKILL:** Use shipyard:foo |
Why no @ links: @ syntax force-loads files immediately, consuming context tokens you may not need yet.
Critical CSO rule — Description = When to Use, NOT What the Skill Does: When a description summarizes the skill's workflow, Claude follows the description instead of reading the full skill. A description saying "code review between tasks" caused Claude to do ONE review even though the skill showed TWO stages. Keep descriptions as triggering conditions only.
Use flowcharts ONLY for non-obvious decision points with real branching.
Never use flowcharts for reference material, code examples, linear instructions, or labels without semantic meaning.
- One excellent example beats many mediocre ones.
- Choose the most relevant language (testing → TypeScript/JS, system debugging → Shell/Python, data → Python).
- A good example is complete, runnable, commented to explain WHY, drawn from a real scenario, and ready to adapt.
- Don't: implement in 5+ languages, create fill-in-the-blank templates, or write contrived examples.
| Pattern | Structure | When |
|---|---|---|
| Self-contained | SKILL.md only | All content fits |
| With reusable tool | SKILL.md + example.ts | Tool is reusable code |
| With heavy reference | SKILL.md + ref.md + scripts/ | Reference too large for inline |
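The table above is a simple decision rule; as a sketch (the `packaging_pattern` helper is illustrative, not a real API):

```python
def packaging_pattern(has_reusable_tool: bool, heavy_reference: bool) -> str:
    """Map the structure-pattern table to a layout choice."""
    if heavy_reference:
        # Reference too large for inline
        return "SKILL.md + ref.md + scripts/"
    if has_reusable_tool:
        # Tool is reusable code
        return "SKILL.md + example.ts"
    # All content fits
    return "SKILL.md only"

print(packaging_pattern(has_reusable_tool=True, heavy_reference=False))
```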
**Bad example (what NOT to write):**

<examples>
<example>
Last week I was working on a project where tests were flaky. I found that using sleep(5) was unreliable. After some experimentation, I discovered that polling for conditions works better...
</example>
</examples>

Why it fails: narrative style, no activation triggers, no XML structure, not reusable.
<rules>
## The Iron Law (Same as TDD)
NO SKILL WITHOUT A FAILING TEST FIRST
This applies to NEW skills AND EDITS to existing skills.
Write skill before testing? Delete it. Start over.
Edit skill without testing? Same violation.
**No exceptions:**
- Not for "simple additions"
- Not for "just adding a section"
- Not for "documentation updates"
- Don't keep untested changes as "reference"
- Don't "adapt" while running tests
- Delete means delete
**Spirit vs. Letter:** Technically having a test scenario counts as "testing" only if you watched it fail first. Running a test after writing the skill is not RED-GREEN — it's just GREEN. Always establish a failing baseline first.
**REQUIRED BACKGROUND:** The shipyard:shipyard-tdd skill explains why this matters.
## Testing All Skill Types
| Skill Type | Test Approach | Pass Condition |
|-----------|--------------|----------------|
| Discipline-Enforcing | Pressure scenarios (time + sunk cost + exhaustion combined) | Agent follows rule under maximum pressure |
| Technique | Application scenarios + edge-case variations | Agent applies technique correctly to new scenarios |
| Pattern | Recognition + application + counter-examples | Agent identifies when/how AND when NOT to apply |
| Reference | Retrieval tasks + coverage gaps | Agent finds and correctly applies reference info |
## Common Rationalizations for Skipping Testing
| Excuse | Reality |
|--------|---------|
| "Skill is obviously clear" | Clear to you ≠ clear to other agents. Test it. |
| "Testing is overkill" | Untested skills have issues. Always. 15 min saves hours. |
| "Academic review is enough" | Reading ≠ using. Test application scenarios. |
| "No time to test" | Deploying untested wastes more time fixing later. |
## Bulletproofing Skills Against Rationalization
Discipline skills must resist rationalization:
1. **Close every loophole explicitly** -- forbid specific workarounds with "No exceptions" lists
2. **Address "Spirit vs Letter"** -- add early: violating the letter violates the spirit
3. **Build rationalization tables** from baseline testing -- every excuse gets a counter
4. **Create red flags lists** for agent self-check (e.g., "Code before test", "This is different because...")
5. **Update CSO descriptions** with violation symptoms as triggers
## Anti-Patterns
| Anti-Pattern | Symptom | Fix |
|-------------|---------|-----|
| Narrative example | "Last week I found that..." | Rewrite as reusable pattern |
| Multi-language dilution | Same example in 5 languages | Keep only most relevant language |
| Code in flowcharts | Can't copy-paste, hard to read | Use markdown code blocks |
| Generic labels | `helper1`, `step3` have no meaning | Use descriptive names |
</rules>
## RED-GREEN-REFACTOR for Skills
### RED: Write Failing Test (Baseline)
Run pressure scenario with subagent WITHOUT the skill. Document exact behavior:
- What choices did they make?
- What rationalizations did they use (verbatim)?
- Which pressures triggered violations?
This is "watch the test fail" - you must see what agents naturally do before writing the skill.
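Capturing verbatim rationalizations can be as simple as scanning the baseline transcript for known excuse phrases. A sketch, with a marker list drawn from the rationalization and red-flag material in this skill (the `extract_rationalizations` helper is illustrative):

```python
# Excuse phrases to look for in a baseline transcript.
RATIONALIZATION_MARKERS = [
    "just this once",
    "this is different because",
    "no time to test",
    "obviously clear",
]

def extract_rationalizations(transcript: str) -> list[str]:
    """Return each marker phrase the agent used, for the baseline log."""
    lowered = transcript.lower()
    return [m for m in RATIONALIZATION_MARKERS if m in lowered]

baseline = ("I'll skip the failing test just this once -- "
            "this is different because the change is tiny.")
print(extract_rationalizations(baseline))
```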
### GREEN: Write Minimal Skill
Write skill that addresses those specific rationalizations. Don't add extra content for hypothetical cases.
Run same scenarios WITH skill. Agent should now comply.
### REFACTOR: Close Loopholes
Agent found new rationalization? Add explicit counter. Re-test until bulletproof.
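The refactor step can be tracked mechanically: compare the baseline excuses against the skill text and list the ones not yet countered. A sketch, assuming the skill counters an excuse by quoting it in its rationalization table (the helper name is hypothetical):

```python
def uncovered_rationalizations(baseline: list[str], skill_text: str) -> list[str]:
    """Return baseline excuses the skill does not yet quote and counter."""
    text = skill_text.lower()
    return [r for r in baseline if r.lower() not in text]

baseline = ["no time to test", "testing is overkill", "skill is obviously clear"]
skill = '| "No time to test" | Deploying untested wastes more time fixing later. |'
print(uncovered_rationalizations(baseline, skill))
```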
## Skill Creation Checklist (TDD Adapted)
**Use TaskCreate to create tasks for EACH item below.**
**RED:** Create pressure scenarios (3+ for discipline skills) -- run WITHOUT skill -- document baseline rationalizations
**GREEN:**
- [ ] Name: letters/numbers/hyphens only; YAML frontmatter (name + description, max 1024 chars)
- [ ] Description: "Use when..." + triggers/symptoms, third person, no workflow summary
- [ ] Keywords for search, clear overview, address baseline failures from RED
- [ ] Code inline or linked; one excellent example (not multi-language)
- [ ] Run WITH skill -- verify compliance
**REFACTOR:** Find new rationalizations -- add counters -- build rationalization table + red flags -- re-test until bulletproof
**Quality:** Flowcharts only for non-obvious decisions; quick reference table; common mistakes; no narrative; supporting files only for tools/heavy reference
**Deploy:** Commit and push; consider contributing via PR if broadly useful
## Discovery Workflow
How future Claude finds your skill:
1. **Encounters problem** ("tests are flaky")
2. **Finds SKILL** (description matches)
3. **Scans overview** (is this relevant?)
4. **Reads patterns** (quick reference table)
5. **Loads example** (only when implementing)
**Optimize for this flow** - put searchable terms early and often.
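Why "early and often" matters can be illustrated with a crude keyword-overlap score; real skill discovery is semantic, so this sketch only shows the mechanic (the `discovery_score` helper is illustrative):

```python
def discovery_score(problem: str, description: str) -> int:
    """Count problem words that appear (as substrings) in a skill description."""
    words = {w.strip('.,"').lower() for w in problem.split()}
    desc = description.lower()
    return sum(1 for w in words if w and w in desc)

desc = "Use when tests hang waiting for async ops or runs are flaky"
print(discovery_score("tests are flaky", desc))
```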
## STOP: Before Moving to Next Skill
**After writing ANY skill, you MUST STOP and complete the deployment process.**
**Do NOT:**
- Create multiple skills in batch without testing each
- Move to next skill before current one is verified
- Skip testing because "batching is more efficient"
**The deployment checklist above is MANDATORY for EACH skill.**
Deploying untested skills = deploying untested code.
## Integration
**Called by:** shipyard:shipyard-brainstorming — when new skill patterns are identified during brainstorming
**Pairs with:** shipyard:shipyard-tdd — RED-GREEN-REFACTOR cycle applies to skill creation
**Leads to:** shipyard:shipyard-verification — verify skill triggers and works before deploying