Creates, edits, and verifies Rune skills using TDD: pressure tests first, baseline agent failures without the skill, then iterate until bulletproof. For Claude Code.
npx claudepluginhub rune-kit/rune --plugin @rune/analytics

This skill uses the workspace's default tool permissions.
The skill that builds skills. Applies Test-Driven Development to skill authoring: write a pressure test first, watch agents fail without the skill, write the skill to fix those failures, then close loopholes until bulletproof. Ensures every Rune skill is battle-tested before it enters the mesh.
Triggers:
- /rune skill-forge — manual invocation to create or edit a skill
- skills/*/SKILL.md files

Calls:
- scout (L3): scan existing skills for patterns and naming conventions
- plan (L2): structure complex skills with multiple phases
- hallucination-guard (L3): verify referenced skills/tools actually exist
- verification (L3): validate SKILL.md format compliance
- journal (L3): record skill creation decisions in ADR

Called By:
- cook (L1): when the feature being built IS a new skill
- scaffold (L1): when scaffolded project includes custom skills

References:
- references/claude-skill-reference.md — Claude Code skill system: frontmatter fields, variables, shell injection, invocation control matrix, skill type patterns (task/research/knowledge/dynamic), file structure, and quality checklist. Load when creating or editing any skill.

Before writing anything, understand the landscape:
- scout — is this already covered?

Write the test BEFORE writing the skill.
Create a pressure scenario that exposes the problem the skill solves:
## Pressure Scenario: [skill-name]
### Setup
[Describe the situation an agent faces]
### Pressures (combine 2-3)
- Time pressure: "This is urgent, just do it"
- Sunk cost: "I already wrote 200 lines, can't restart"
- Complexity: "Too many moving parts to follow process"
- Authority: "Senior dev says skip testing"
- Exhaustion: "We're 50 tool calls deep"
### Expected Failure (without skill)
[What the agent will probably do wrong]
### Success Criteria (with skill)
[What the agent should do instead]
Run the scenario with a subagent WITHOUT the skill. Document:
Write the SKILL.md addressing ONLY the failures observed in Phase 2.
Follow docs/SKILL-TEMPLATE.md format. Required sections:
| Section | Required | Purpose |
|---|---|---|
| Frontmatter | YES | Name, description, metadata |
| Purpose | YES | One paragraph, ecosystem role |
| Triggers | YES | When to invoke |
| Calls / Called By | YES | Mesh connections (control flow) |
| Data Flow | YES | Feeds Into / Fed By / Feedback Loops (data flow) |
| Workflow | YES | Step-by-step execution |
| Output Format | YES | Structured, parseable output |
| Constraints | YES | 3-7 MUST/MUST NOT rules |
| Sharp Edges | YES | Known failure modes |
| Self-Validation | YES | Domain-specific QA checklist (per-skill, not centralized) |
| Done When | YES | Verifiable completion criteria |
| Cost Profile | YES | Token estimate |
| Mesh Gates | L1/L2 only | Progression guards |
A skill file answers WHY and WHEN — not HOW. Code examples, syntax references, and implementation patterns belong in separate files:
skills/[name]/
├── SKILL.md ← WHY: purpose, triggers, constraints, sharp edges (~150-300 lines)
├── references/ ← HOW: code patterns, syntax tables, API examples
│ ├── patterns.md ← Implementation patterns with code blocks
│ └── gotchas.md ← Language/framework-specific pitfalls
└── scripts/ ← WHAT: deterministic operations (shell, node)
Rules:
- Code examples, syntax tables, and implementation patterns go in references/, not SKILL.md; when reviewing an existing skill, find oversized code blocks and extract them.

Why this matters: Code blocks in SKILL.md inflate context tokens on EVERY invocation. References are loaded only when needed. A 500-line SKILL.md with 200 lines of code examples should be a 300-line SKILL.md + a 200-line references file.

Code blocks in SKILL.md > 10 lines = review failure. Extract to references/ or scripts/. No exceptions.

---
name: kebab-case-max-64-chars # letters, numbers, hyphens only
description: Use when [specific triggers]. [Symptoms that signal this skill applies].
metadata:
  layer: L1|L2|L3
  model: haiku|sonnet|opus # haiku=scan, sonnet=code, opus=architecture
  group: [see template]
---
Description rules (CSO Discipline):
Bad: "Analyzes code quality through 6-step process: scan files, check patterns, run linters, compare metrics, generate report, suggest fixes" Good: "Use when code changes need quality review before commit. Symptoms: PR ready, refactor complete, pre-release check."
# BAD: Summarizes workflow — agent reads description, skips full content
description: TDD workflow that writes tests first, then code, then refactors
# GOOD: Only triggers — agent must read full content to know workflow
description: Use when implementing any feature or bugfix, before writing code
Why this matters: When description summarizes the workflow, agents take the shortcut — they follow the description and skip the full SKILL.md. Tested and confirmed.
Every constraint MUST block a specific failure mode observed in Phase 2:
# BAD: Generic rule
1. MUST write good code
# GOOD: Blocks specific failure with consequence
1. MUST run tests after each fix — batch-and-pray causes cascading regressions
Capture every excuse from Phase 2 baseline testing:
| Excuse | Reality |
|--------|---------|
| "[verbatim excuse from test]" | [why it's wrong + what to do instead] |
Run the SAME pressure scenario from Phase 2, now WITH the skill loaded.
Check:
Run additional pressure scenarios with varied pressures. For each new failure:
Repeat until no new failures emerge in 2 consecutive test runs.
Best tests combine 3+ pressures simultaneously:
| Pressure | Example Scenario |
|---|---|
| Time | "Emergency deployment, deadline in 30 min" |
| Sunk cost | "Already wrote 200 lines, can't restart" |
| Authority | "Senior dev says skip testing" |
| Economic | "Customer churning, ship now or lose $50k MRR" |
| Exhaustion | "50 tool calls deep, context filling up" |
| Social | "Looking dogmatic by insisting on process" |
| Pragmatic | "Being practical vs being pedantic" |
Make pressure scenarios concrete: /tmp/payment-system, not "a project".

If the agent keeps failing even WITH the skill loaded, ask: "How could that skill have been written differently to make the correct option crystal clear?"
Three possible responses:
A skill is bulletproof when:
Research (Meincke et al., 2025, 28,000 conversations) shows 33% → 72% compliance with these techniques:
| Principle | Application | Use For |
|---|---|---|
| Authority | "YOU MUST", imperative language | Eliminates decision fatigue, safety-critical rules |
| Commitment | Explicit announcements + tracked choices | Creates accountability trail |
| Scarcity | Time-bound requirements, "before proceeding" | Triggers immediate action |
| Social Proof | "Every time", universal statements | Documents what prevents failures |
| Unity | "We're building quality" language | Shared identity, quality goals |
Prohibited in skills:
Ethical test: Would this serve the user's genuine interests if they fully understood the technique?
If the skill bundles executable scripts in its scripts/ directory, those scripts MUST follow the Rune script output contract. This is a testable contract — orchestrators (cook, team, marketing) rely on it for piping and retry logic.
Every helper script supports three output modes:
| Mode | Stdout | Stderr | File Artifacts |
|---|---|---|---|
| default | One artifact path per line | Diagnostics + warnings | Artifacts in declared out-dir |
| --json | Structured JSON summary | Diagnostics (unchanged) | Artifacts (unchanged) |
| --debug | Default stdout (paths) | Verbose trace + diagnostics | Default + JSONL redacted trace at <out-dir>/<slug>.jsonl |
Why: default-mode stdout-as-paths is the Unix way. Downstream skills pipe directly without log-parsing. --json is opt-in for callers that need metadata.
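For illustration, here is a minimal sketch of how a downstream caller might consume default-mode output. The helper name scripts/render_chart.mjs is hypothetical, not a shipped script:

```js
// Sketch: consuming default-mode output from a contract-following helper.
import { execFile } from "node:child_process";

execFile(
  "node",
  ["scripts/render_chart.mjs", "--out-dir", ".rune/charts"],
  (err, stdout, stderr) => {
    if (err) {
      // Diagnostics live on stderr; the exit code drives retry logic.
      console.error(`helper failed (exit ${err.code}):\n${stderr}`);
      return;
    }
    // Default mode: every stdout line is an artifact path. No log parsing.
    const artifacts = stdout.split("\n").filter(Boolean);
    for (const p of artifacts) console.log(`next skill receives: ${p}`);
  }
);
```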
Every helper script MUST accept at least these flags:
--help Print usage + exit 0
--version Print version + exit 0
--json Structured JSON on stdout
--debug Write JSONL redacted trace
--dry-run Report plan, make no changes, exit 0
--smoke Pre-flight check (validate deps, exit 0 if healthy)
--out-dir <path> Override default artifact directory
And SHOULD accept when applicable:
--prompt-file <path> Read long text input from file (avoids shell-quoting hell on Windows)
--confirm Skip confirmation gate for expensive/destructive ops
--timeout-ms <n> Operation timeout (with semantic exit codes below)
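A minimal sketch of this flag surface in a Node helper, using node:util's parseArgs. The flag names come from the contract above; VERSION and the placeholder work are assumptions:

```js
#!/usr/bin/env node
// Sketch of the required flag surface. Real scripts add their own flags.
import { parseArgs } from "node:util";

const VERSION = "0.1.0"; // placeholder
let values;
try {
  ({ values } = parseArgs({
    options: {
      help: { type: "boolean" },
      version: { type: "boolean" },
      json: { type: "boolean" },
      debug: { type: "boolean" },
      "dry-run": { type: "boolean" },
      smoke: { type: "boolean" },
      "out-dir": { type: "string" },
    },
    allowPositionals: true,
  }));
} catch (err) {
  console.error(err.message);
  process.exit(2); // usage error (see exit-code vocabulary below)
}

if (values.help) { console.log("usage: script [--json] [--dry-run] [--out-dir <path>] ..."); process.exit(0); }
if (values.version) { console.log(VERSION); process.exit(0); }
if (values.smoke) { console.error("deps OK"); process.exit(0); } // pre-flight only
if (values["dry-run"]) { console.error("plan: would write 1 artifact"); process.exit(0); }

const artifacts = ["./.rune/example/out.txt"]; // placeholder result
if (values.json) console.log(JSON.stringify({ ok: true, artifacts }));
else artifacts.forEach((p) => console.log(p)); // default mode: paths only
```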
Adopt the standard Rune exit-code vocabulary:
| Code | Meaning | Orchestrator Response |
|---|---|---|
| 0 | Success | Accept + chain to next |
| 1 | Execution failed (retryable) | Log + retry with alternate config |
| 2 | Usage error (bug) | Abort — don't retry |
| 3 | Data-integrity error | Halt — don't retry |
| 4 | Timeout with partial results | Accept partial + continue |
| 124 | Timeout with zero results | Retry with longer timeout or alternate provider |
Codes 5-63 are skill-specific. Document every code used in references/<skill>/exit-codes.md.
Why 4 vs 124 matters: Standard Unix collapses "timeout-with-2-of-3-images" and "timeout-with-0-images" into 124. They are fundamentally different outcomes. Split them.
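A sketch of how a helper might pick between the two timeout codes. The deadline mechanism is assumed; only the exit-code split is the point:

```js
// Sketch: timeout handling that preserves partial results.
function exitAfterTimeout(artifacts) {
  if (artifacts.length > 0) {
    // Emit what completed before the deadline, then signal "partial".
    artifacts.forEach((p) => console.log(p));
    process.exit(4); // timeout WITH partial results: orchestrator continues
  }
  process.exit(124); // timeout with ZERO results: orchestrator retries
}
```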
Resolve --out-dir in this fallback order:
1. --out-dir <path> explicit flag
2. <SKILL>_OUT_DIR env var (skill-specific)
3. OPENCLAW_OUTPUT_DIR (OpenClaw platform convention)
4. OPENCLAW_AGENT_DIR/artifacts/<skill> (OpenClaw default)
5. OPENCLAW_STATE_DIR/artifacts/<skill> (OpenClaw state fallback)
6. ./.rune/<skill>/ (project-local default)

Why: OpenClaw is one of Rune's adapter targets. Scripts that honor this convention work across adapters without modification.
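A sketch of that chain, assuming a hypothetical skill slug image-gen; the env var names follow the list above:

```js
// Sketch: resolve the artifact directory per the fallback order above.
import path from "node:path";

function resolveOutDir(flagValue, skill = "image-gen") {
  const env = process.env;
  const skillVar = `${skill.toUpperCase().replace(/-/g, "_")}_OUT_DIR`; // IMAGE_GEN_OUT_DIR
  return (
    flagValue ??               // 1. explicit --out-dir flag
    env[skillVar] ??           // 2. skill-specific env var
    env.OPENCLAW_OUTPUT_DIR ?? // 3. platform convention
    (env.OPENCLAW_AGENT_DIR && path.join(env.OPENCLAW_AGENT_DIR, "artifacts", skill)) ?? // 4.
    (env.OPENCLAW_STATE_DIR && path.join(env.OPENCLAW_STATE_DIR, "artifacts", skill)) ?? // 5.
    path.join(".rune", skill)  // 6. project-local default
  );
}
```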
--debug trace MUST redact sensitive fields before write:
- Key names matching /authorization|bearer|token|api[_-]?key|secret|cookie|session[_-]?id|chatgpt[_-]?account/i
- Oversized values truncated to <first-500>...

Before shipping a helper script, verify:
# Contract smoke test:
node scripts/<script>.mjs --help # exit 0
node scripts/<script>.mjs --version # exit 0, prints version only
node scripts/<script>.mjs --smoke # exit 0 or 1, human-readable stderr
node scripts/<script>.mjs --dry-run ... # exit 0, no side effects
node scripts/<script>.mjs ... --json # stdout is parseable JSON
node scripts/<script>.mjs ... | head -1 # stdout default mode = path
Scripts that don't honor the contract cannot be shipped.
Specifically:
- Mixing paths and progress on stdout = BLOCK
- Silent failure (no install guidance on miss) = BLOCK
- Logging credentials in trace = CRITICAL-BLOCK
- Binary exit code (0/1 only) when timeout semantics apply = BLOCK
Reference implementations:
- @rune-pro/media/scripts/codex_imagen_bridge.mjs — full 9-tier binary detection + contract
- @rune-pro/media/scripts/provider_probe.mjs — --smoke convention exemplar
- @rune-pro/media/scripts/image_optimizer.py — Python contract implementation

Reference docs:
- references/image-generator/script-contract.md (pack-level contract)
- references/image-generator/exit-codes.md (exit-code vocabulary)
- references/image-generator/binary-detection.md (9-tier lookup)

Every skill that touches external systems, user data, or destructive operations MUST define an explicit Security Model section. This is a contract — not aspirational, but testable.
Add to SKILL.md after Sharp Edges:
## Security Model
### Trust Boundaries
- [What this skill reads] — e.g., "Reads .env files, user source code, git history"
- [What this skill writes] — e.g., "Writes to .rune/ only, never modifies source code"
- [What this skill executes] — e.g., "Runs npm test, never runs arbitrary shell commands"
### This Skill Will NEVER
- [Explicit denial 1] — e.g., "Execute user-provided strings as shell commands"
- [Explicit denial 2] — e.g., "Read or log credential files (.env, secrets.json)"
- [Explicit denial 3] — e.g., "Send data to external endpoints"
### Threat Surface
| Threat | Mitigated By |
|--------|-------------|
| Prompt injection via user input | Input validated before processing |
| Credential exposure in output | Secrets pattern detection before emit |
| Destructive operation on wrong target | Confirmation gate before delete/overwrite |
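As an illustration of the second mitigation, a sketch of secrets-pattern detection before emit. The key-name regex mirrors the --debug redaction pattern above; the redaction behavior itself is an assumption, not a fixed API:

```js
// Sketch: scan outgoing text for credential-shaped content before emitting.
const SECRET_KEY =
  /authorization|bearer|token|api[_-]?key|secret|cookie|session[_-]?id/i;

function emitSafely(text) {
  for (const line of text.split("\n")) {
    if (SECRET_KEY.test(line)) {
      console.error("warning: possible credential redacted from output");
      console.log(line.replace(/[:=].*$/, ": [REDACTED]"));
    } else {
      console.log(line);
    }
  }
}
```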
When to require Security Model:
- Skill uses the Bash tool → REQUIRED (can execute arbitrary commands)
- Skill reads .env or credentials → REQUIRED
- Skill writes outside .rune/ → REQUIRED

Eval integration: Phase 7 evals for skills with a Security Model MUST include:
If Security Model is required but missing → Phase 7 EVAL HARD-GATE blocks ship.
Wire the skill into the mesh:
- docs/ARCHITECTURE.md — add to correct layer/group table
- CLAUDE.md — increment skill count, add to layer list

Extensions augment existing skills with optional capabilities. Unlike skills (standalone workflow units) or packs (domain bundles), extensions ADD features to skills that already exist — without modifying the core skill file.
| Concept | Purpose | Modifies Core? | Self-contained? |
|---|---|---|---|
| Skill | Standalone workflow unit (SKILL.md) | N/A — IS core | Yes |
| Pack | Domain bundle of skills (PACK.md) | No — bundles existing | Yes |
| Extension | Augments existing skill with new capability | No — additive only | Yes — own dir with install/uninstall |
extensions/<extension-name>/
├── EXTENSION.md # Manifest: what it extends, how, dependencies
├── install.sh # Unix installer (non-destructive MCP merge)
├── install.ps1 # Windows installer
├── uninstall.sh # Clean removal
├── uninstall.ps1 # Clean removal (Windows)
├── skills/
│ └── <skill-name>/
│ └── SKILL.md # New skill added by extension
├── agents/ # Optional subagent definitions
│ └── <agent-name>.md
├── references/ # Domain knowledge loaded by extension skills
│ └── <topic>.md
├── scripts/ # Executable utilities
│ └── <script>.py|.sh
└── docs/
└── SETUP.md # Extension-specific configuration guide
---
name: "<extension-name>"
extends: "<target-skill-or-pack>"
description: "What capability this extension adds"
requires:
- mcp: "<mcp-server-name>" # Optional: MCP server dependency
- skill: "<required-skill-name>" # Required core skill
install_method: "non-destructive" # MUST be non-destructive
---
Before shipping, write Eval Scenarios — behavior tests for the SKILL.md itself. These are "unit tests for skill files, not code."
Save evals to skills/<name>/evals.md. Minimum 4 evals per skill:
| Eval ID | Category | Required? |
|---|---|---|
| E01 | Happy path — core workflow | YES |
| E02 | Edge case — unusual/empty input | YES |
| E03 | Adversarial — pressure scenario | YES |
| E04 | Jailbreak/injection attempt | YES for security-critical skills |
Each eval follows the format defined in rune:test → "Skill Behavior Tests" section:
Run each eval with a subagent. An eval FAILS if the agent produces a Must NOT output.
Pre-ship gate: At least E01–E03 must PASS before committing. Security-critical skills (touching auth/secrets/destructive ops) require 8+ evals including jailbreak and credential-leak scenarios.
Also run the Skill Content Security Guard (sentinel Step 3.5) on the new SKILL.md content before commit — blocks destructive ops, prompt injection, and jailbreak patterns embedded in skill instructions.
No evals.md → skill is behavior-untested. Do NOT ship untested skills. Eval file with 0 passing evals = same as no evals.

git add skills/[skill-name]/SKILL.md
git add skills/[skill-name]/evals.md
git add docs/ARCHITECTURE.md CLAUDE.md
# Add any updated existing skills
git commit -m "feat: add [skill-name] — [one-line purpose]"
Format:
Content:
- evals.md written with at least 3 passing eval scenarios (E01 happy-path, E02 edge-case, E03 adversarial)

Architecture:
Extension-specific (if building an extension):
When editing, not creating:
Same TDD cycle applies to edits:
1. Write a test that exposes the gap in the current skill
2. Run baseline — confirm the skill fails on this scenario
3. Edit the skill to address the gap
4. Verify the edit fixes the gap WITHOUT breaking existing behavior

"Just adding a section" is not an excuse to skip testing.
Skills are loaded into context when invoked. Every word costs tokens.
| Skill Type | Target | Notes |
|---|---|---|
| L3 utility (haiku) | <300 words | Runs frequently, keep lean |
| L2 workflow hub | <500 words | Moderate frequency |
| L1 orchestrator | <800 words | Runs once per workflow |
| Reference sections | Extract to separate file | >100 lines → own file |
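A hypothetical budget check against these targets (the frontmatter parse is simplified, and the thresholds just mirror the table):

```js
// Sketch: flag a SKILL.md that exceeds its layer's word target.
import { readFileSync } from "node:fs";

const TARGETS = { L1: 800, L2: 500, L3: 300 };
const file = process.argv[2] ?? "skills/example/SKILL.md";
const text = readFileSync(file, "utf8");
const layer = text.match(/layer:\s*(L[123])/)?.[1] ?? "L2"; // naive frontmatter read
const words = text.split(/\s+/).filter(Boolean).length;

if (words > TARGETS[layer]) {
  console.error(`${file}: ${words} words exceeds ${layer} target of ${TARGETS[layer]}`);
  process.exit(1);
}
console.log(`${file}: ${words}/${TARGETS[layer]} words (${layer}) within budget`);
```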
Techniques:
- Point to --help instead of documenting all flags

## Skill Forge Report
- **Skill**: [name] (L[layer])
- **Action**: CREATE | EDIT
- **Status**: SHIPPED | NEEDS_WORK | BLOCKED
### Baseline Test
- Scenario: [test scenario description]
- Result WITHOUT skill: [observed failure]
- Result WITH skill: [observed success or remaining gap]
### Quality Checklist
- Format: [pass/fail count]
- Content: [pass/fail count]
- Architecture: [pass/fail count]
### Files Created/Modified
- skills/[name]/SKILL.md — [created | modified]
- docs/ARCHITECTURE.md — [updated | skipped]
- CLAUDE.md — [updated | skipped]
### Mesh Impact
- New connections: [count] ([list of skills])
- Bidirectional check: PASS | FAIL
- Data flow mapped: [count] feeds-into, [count] fed-by, [count] feedback loops
- Self-Validation: [count] domain-specific checks written
| Failure Mode | Severity | Mitigation |
|---|---|---|
| Writing skill without baseline test | CRITICAL | Phase 2 HARD-GATE: must observe failure first |
| Description summarizes workflow → agents skip content | HIGH | Phase 3 description rules: "Use when..." triggers only |
| New skill duplicates existing skill | HIGH | Phase 1 HARD-GATE: >70% overlap → extend, don't create |
| Skill passes test but breaks mesh connections | MEDIUM | Phase 6 integration: verify output compatibility |
| Editing skill without testing the edit | MEDIUM | Adapting section: same TDD cycle for edits |
| Overly verbose skill burns context tokens | MEDIUM | Token efficiency guidelines: layer-based word targets |
| Code blocks in SKILL.md bloat every invocation | HIGH | WHY vs HOW split: SKILL.md ≤10-line code blocks, extract rest to references/ |
| Writing skill without TDD (no observed failures first) | CRITICAL | Skill TDD: RED (run scenario WITHOUT skill → document failures) → GREEN (write skill targeting failures) → REFACTOR (find bypasses → add blocks) |
| Description leaks workflow → agent skips full content | HIGH | CSO Discipline: description = triggers only. Test: can you execute from description alone? If yes, it leaks too much |
| Self-Validation copies completion-gate checks | HIGH | Self-Validation is DOMAIN-specific: "assertions per test", "dependency ordering". NOT generic: "tests pass", "build succeeds" — those belong to completion-gate |
| Data Flow confused with Calls | MEDIUM | Calls = runtime invocation (skill A calls skill B). Feeds Into = artifact persistence (skill A writes .rune/X.md, skill B reads it later). If it's a direct function call → Calls. If it's via files/context → Data Flow |
| Feedback Loop missing one direction | MEDIUM | Every Feedback Loop ↻ must document BOTH directions: what A sends to B AND what B sends back to A. One-way = Feeds Into, not a loop |
| Artifact | Format | Location |
|---|---|---|
| New or updated skill file | Markdown (SKILL.md) | skills/<name>/SKILL.md |
| Eval scenarios | Markdown | skills/<name>/evals.md |
| Reference files (if needed) | Markdown | skills/<name>/references/ |
| Architecture docs update | Markdown | docs/ARCHITECTURE.md |
| Skill Forge Report | Markdown | inline |
~3000-8000 tokens per skill creation (opus for Phase 2-5 reasoning, haiku for scout/verification). Most cost is in the iterative test-refine loop (Phase 4-5). Budget 2-4 test iterations per skill.
Scope guardrail: skill-forge authors and tests skill files — it does not implement the features those skills describe.