Creates, edits, and verifies Rune skills using TDD: pressure tests first, baseline agent failures without the skill, then iterate until bulletproof. For Claude Code.
npx claudepluginhub rune-kit/rune --plugin @rune/analytics

This skill uses the workspace's default tool permissions.
The skill that builds skills. Applies Test-Driven Development to skill authoring: write a pressure test first, watch agents fail without the skill, write the skill to fix those failures, then close loopholes until bulletproof. Ensures every Rune skill is battle-tested before it enters the mesh.
Triggers:
- /rune skill-forge — manual invocation to create or edit a skill
- skills/*/SKILL.md files

Calls:
- scout (L3): scan existing skills for patterns and naming conventions
- plan (L2): structure complex skills with multiple phases
- hallucination-guard (L3): verify referenced skills/tools actually exist
- verification (L3): validate SKILL.md format compliance
- journal (L3): record skill creation decisions in ADR

Called By:
- cook (L1): when the feature being built IS a new skill
- scaffold (L1): when scaffolded project includes custom skills

References:
- references/claude-skill-reference.md — Claude Code skill system: frontmatter fields, variables, shell injection, invocation control matrix, skill type patterns (task/research/knowledge/dynamic), file structure, and quality checklist. Load when creating or editing any skill.

Before writing anything, understand the landscape:
- scout — is this already covered?

Write the test BEFORE writing the skill.
Create a pressure scenario that exposes the problem the skill solves:
## Pressure Scenario: [skill-name]
### Setup
[Describe the situation an agent faces]
### Pressures (combine 2-3)
- Time pressure: "This is urgent, just do it"
- Sunk cost: "I already wrote 200 lines, can't restart"
- Complexity: "Too many moving parts to follow process"
- Authority: "Senior dev says skip testing"
- Exhaustion: "We're 50 tool calls deep"
### Expected Failure (without skill)
[What the agent will probably do wrong]
### Success Criteria (with skill)
[What the agent should do instead]
Run the scenario with a subagent WITHOUT the skill. Document:
Write the SKILL.md addressing ONLY the failures observed in Phase 2.
Follow docs/SKILL-TEMPLATE.md format. Required sections:
| Section | Required | Purpose |
|---|---|---|
| Frontmatter | YES | Name, description, metadata |
| Purpose | YES | One paragraph, ecosystem role |
| Triggers | YES | When to invoke |
| Calls / Called By | YES | Mesh connections (control flow) |
| Data Flow | YES | Feeds Into / Fed By / Feedback Loops (data flow) |
| Workflow | YES | Step-by-step execution |
| Output Format | YES | Structured, parseable output |
| Constraints | YES | 3-7 MUST/MUST NOT rules |
| Sharp Edges | YES | Known failure modes |
| Self-Validation | YES | Domain-specific QA checklist (per-skill, not centralized) |
| Done When | YES | Verifiable completion criteria |
| Cost Profile | YES | Token estimate |
| Mesh Gates | L1/L2 only | Progression guards |
A skill file answers WHY and WHEN — not HOW. Code examples, syntax references, and implementation patterns belong in separate files:
skills/[name]/
├── SKILL.md ← WHY: purpose, triggers, constraints, sharp edges (~150-300 lines)
├── references/ ← HOW: code patterns, syntax tables, API examples
│ ├── patterns.md ← Implementation patterns with code blocks
│ └── gotchas.md ← Language/framework-specific pitfalls
└── scripts/ ← WHAT: deterministic operations (shell, node)
Rules:
- Code examples, syntax tables, and implementation patterns go in references/, not SKILL.md; when reviewing an existing skill, find oversized code blocks and extract them.

Why this matters: Code blocks in SKILL.md inflate context tokens on EVERY invocation. References are loaded only when needed. A 500-line SKILL.md with 200 lines of code examples should be a 300-line SKILL.md + a 200-line references file.

Code blocks in SKILL.md > 10 lines = review failure. Extract to references/ or scripts/. No exceptions.

---
name: kebab-case-max-64-chars # letters, numbers, hyphens only
description: Use when [specific triggers]. [Symptoms that signal this skill applies].
metadata:
  layer: L1|L2|L3
  model: haiku|sonnet|opus # haiku=scan, sonnet=code, opus=architecture
  group: [see template]
---
Description rules (CSO Discipline):
Bad: "Analyzes code quality through 6-step process: scan files, check patterns, run linters, compare metrics, generate report, suggest fixes" Good: "Use when code changes need quality review before commit. Symptoms: PR ready, refactor complete, pre-release check."
# BAD: Summarizes workflow — agent reads description, skips full content
description: TDD workflow that writes tests first, then code, then refactors
# GOOD: Only triggers — agent must read full content to know workflow
description: Use when implementing any feature or bugfix, before writing code
Why this matters: When description summarizes the workflow, agents take the shortcut — they follow the description and skip the full SKILL.md. Tested and confirmed.
Every constraint MUST block a specific failure mode observed in Phase 2:
# BAD: Generic rule
1. MUST write good code
# GOOD: Blocks specific failure with consequence
1. MUST run tests after each fix — batch-and-pray causes cascading regressions
Capture every excuse from Phase 2 baseline testing:
| Excuse | Reality |
|--------|---------|
| "[verbatim excuse from test]" | [why it's wrong + what to do instead] |
Run the SAME pressure scenario from Phase 2, now WITH the skill loaded.
Check:
Run additional pressure scenarios with varied pressures. For each new failure:
Repeat until no new failures emerge in 2 consecutive test runs.
Best tests combine 3+ pressures simultaneously:
| Pressure | Example Scenario |
|---|---|
| Time | "Emergency deployment, deadline in 30 min" |
| Sunk cost | "Already wrote 200 lines, can't restart" |
| Authority | "Senior dev says skip testing" |
| Economic | "Customer churning, ship now or lose $50k MRR" |
| Exhaustion | "50 tool calls deep, context filling up" |
| Social | "Looking dogmatic by insisting on process" |
| Pragmatic | "Being practical vs being pedantic" |
Make pressure scenarios concrete: /tmp/payment-system, not "a project".

If the agent keeps failing even WITH the skill loaded, ask: "How could that skill have been written differently to make the correct option crystal clear?"
Three possible responses:
A skill is bulletproof when:
Research (Meincke et al., 2025, 28,000 conversations) shows 33% → 72% compliance with these techniques:
| Principle | Application | Use For |
|---|---|---|
| Authority | "YOU MUST", imperative language | Eliminates decision fatigue, safety-critical rules |
| Commitment | Explicit announcements + tracked choices | Creates accountability trail |
| Scarcity | Time-bound requirements, "before proceeding" | Triggers immediate action |
| Social Proof | "Every time", universal statements | Documents what prevents failures |
| Unity | "We're building quality" language | Shared identity, quality goals |
Prohibited in skills:
Ethical test: Would this serve the user's genuine interests if they fully understood the technique?
If the skill bundles executable scripts in its scripts/ directory, those scripts MUST follow the Rune script output contract. This is a testable contract — orchestrators (cook, team, marketing) rely on it for piping and retry logic.
Every helper script supports three output modes:
| Mode | Stdout | Stderr | File Artifacts |
|---|---|---|---|
| default | One artifact path per line | Diagnostics + warnings | Artifacts in declared out-dir |
| --json | Structured JSON summary | Diagnostics (unchanged) | Artifacts (unchanged) |
| --debug | Default stdout (paths) | Verbose trace + diagnostics | Default + JSONL redacted trace at <out-dir>/<slug>.jsonl |
Why: default-mode stdout-as-paths is the Unix way. Downstream skills pipe directly without log-parsing. --json is opt-in for callers that need metadata.
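For illustration, here is a minimal sketch of how a downstream caller might consume default-mode output. The helper name scripts/render_chart.mjs is hypothetical, not a shipped script:

```js
// Sketch: consuming default-mode output from a contract-following helper.
import { execFile } from "node:child_process";

execFile(
  "node",
  ["scripts/render_chart.mjs", "--out-dir", ".rune/charts"],
  (err, stdout, stderr) => {
    if (err) {
      // Diagnostics live on stderr; the exit code drives retry logic.
      console.error(`helper failed (exit ${err.code}):\n${stderr}`);
      return;
    }
    // Default mode: every stdout line is an artifact path. No log parsing.
    const artifacts = stdout.split("\n").filter(Boolean);
    for (const p of artifacts) console.log(`next skill receives: ${p}`);
  }
);
```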
Every helper script MUST accept at least these flags:
--help Print usage + exit 0
--version Print version + exit 0
--json Structured JSON on stdout
--debug Write JSONL redacted trace
--dry-run Report plan, make no changes, exit 0
--smoke Pre-flight check (validate deps, exit 0 if healthy)
--out-dir <path> Override default artifact directory
And SHOULD accept when applicable:
--prompt-file <path> Read long text input from file (avoids shell-quoting hell on Windows)
--confirm Skip confirmation gate for expensive/destructive ops
--timeout-ms <n> Operation timeout (with semantic exit codes below)
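A minimal sketch of this flag surface in a Node helper, using node:util's parseArgs. The flag names come from the contract above; VERSION and the placeholder work are assumptions:

```js
#!/usr/bin/env node
// Sketch of the required flag surface. Real scripts add their own flags.
import { parseArgs } from "node:util";

const VERSION = "0.1.0"; // placeholder
let values;
try {
  ({ values } = parseArgs({
    options: {
      help: { type: "boolean" },
      version: { type: "boolean" },
      json: { type: "boolean" },
      debug: { type: "boolean" },
      "dry-run": { type: "boolean" },
      smoke: { type: "boolean" },
      "out-dir": { type: "string" },
    },
    allowPositionals: true,
  }));
} catch (err) {
  console.error(err.message);
  process.exit(2); // usage error (see exit-code vocabulary below)
}

if (values.help) { console.log("usage: script [--json] [--dry-run] [--out-dir <path>] ..."); process.exit(0); }
if (values.version) { console.log(VERSION); process.exit(0); }
if (values.smoke) { console.error("deps OK"); process.exit(0); } // pre-flight only
if (values["dry-run"]) { console.error("plan: would write 1 artifact"); process.exit(0); }

const artifacts = ["./.rune/example/out.txt"]; // placeholder result
if (values.json) console.log(JSON.stringify({ ok: true, artifacts }));
else artifacts.forEach((p) => console.log(p)); // default mode: paths only
```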
Adopt the standard Rune exit-code vocabulary:
| Code | Meaning | Orchestrator Response |
|---|---|---|
| 0 | Success | Accept + chain to next |
| 1 | Execution failed (retryable) | Log + retry with alternate config |
| 2 | Usage error (bug) | Abort — don't retry |
| 3 | Data-integrity error | Halt — don't retry |
| 4 | Timeout with partial results | Accept partial + continue |
| 124 | Timeout with zero results | Retry with longer timeout or alternate provider |
Codes 5-63 are skill-specific. Document every code used in references/<skill>/exit-codes.md.
Why 4 vs 124 matters: Standard Unix collapses "timeout-with-2-of-3-images" and "timeout-with-0-images" into 124. They are fundamentally different outcomes. Split them.
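A sketch of how a helper might pick between the two timeout codes. The deadline mechanism is assumed; only the exit-code split is the point:

```js
// Sketch: timeout handling that preserves partial results.
function exitAfterTimeout(artifacts) {
  if (artifacts.length > 0) {
    // Emit what completed before the deadline, then signal "partial".
    artifacts.forEach((p) => console.log(p));
    process.exit(4); // timeout WITH partial results: orchestrator continues
  }
  process.exit(124); // timeout with ZERO results: orchestrator retries
}
```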
Resolve --out-dir in this fallback order:
1. --out-dir <path> explicit flag
2. <SKILL>_OUT_DIR env var (skill-specific)
3. OPENCLAW_OUTPUT_DIR (OpenClaw platform convention)
4. OPENCLAW_AGENT_DIR/artifacts/<skill> (OpenClaw default)
5. OPENCLAW_STATE_DIR/artifacts/<skill> (OpenClaw state fallback)
6. ./.rune/<skill>/ (project-local default)

Why: OpenClaw is one of Rune's adapter targets. Scripts that honor this convention work across adapters without modification.
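A sketch of that chain, assuming a hypothetical skill slug image-gen; the env var names follow the list above:

```js
// Sketch: resolve the artifact directory per the fallback order above.
import path from "node:path";

function resolveOutDir(flagValue, skill = "image-gen") {
  const env = process.env;
  const skillVar = `${skill.toUpperCase().replace(/-/g, "_")}_OUT_DIR`; // IMAGE_GEN_OUT_DIR
  return (
    flagValue ??               // 1. explicit --out-dir flag
    env[skillVar] ??           // 2. skill-specific env var
    env.OPENCLAW_OUTPUT_DIR ?? // 3. platform convention
    (env.OPENCLAW_AGENT_DIR && path.join(env.OPENCLAW_AGENT_DIR, "artifacts", skill)) ?? // 4.
    (env.OPENCLAW_STATE_DIR && path.join(env.OPENCLAW_STATE_DIR, "artifacts", skill)) ?? // 5.
    path.join(".rune", skill)  // 6. project-local default
  );
}
```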
--debug trace MUST redact sensitive fields before write:
- Key names matching /authorization|bearer|token|api[_-]?key|secret|cookie|session[_-]?id|chatgpt[_-]?account/i
- Oversized values truncated to <first-500>...

Before shipping a helper script, verify:
# Contract smoke test:
node scripts/<script>.mjs --help # exit 0
node scripts/<script>.mjs --version # exit 0, prints version only
node scripts/<script>.mjs --smoke # exit 0 or 1, human-readable stderr
node scripts/<script>.mjs --dry-run ... # exit 0, no side effects
node scripts/<script>.mjs ... --json # stdout is parseable JSON
node scripts/<script>.mjs ... | head -1 # stdout default mode = path
Scripts that don't honor the contract cannot be shipped.
Specifically:
- Mixing paths and progress on stdout = BLOCK
- Silent failure (no install guidance on miss) = BLOCK
- Logging credentials in trace = CRITICAL-BLOCK
- Binary exit code (0/1 only) when timeout semantics apply = BLOCK
Reference implementations:
- @rune-pro/media/scripts/codex_imagen_bridge.mjs — full 9-tier binary detection + contract
- @rune-pro/media/scripts/provider_probe.mjs — --smoke convention exemplar
- @rune-pro/media/scripts/image_optimizer.py — Python contract implementation

Reference docs:
- references/image-generator/script-contract.md (pack-level contract)
- references/image-generator/exit-codes.md (exit-code vocabulary)
- references/image-generator/binary-detection.md (9-tier lookup)

Every skill that touches external systems, user data, or destructive operations MUST define an explicit Security Model section. This is a contract — not aspirational, but testable.
Add to SKILL.md after Sharp Edges:
## Security Model
### Trust Boundaries
- [What this skill reads] — e.g., "Reads .env files, user source code, git history"
- [What this skill writes] — e.g., "Writes to .rune/ only, never modifies source code"
- [What this skill executes] — e.g., "Runs npm test, never runs arbitrary shell commands"
### This Skill Will NEVER
- [Explicit denial 1] — e.g., "Execute user-provided strings as shell commands"
- [Explicit denial 2] — e.g., "Read or log credential files (.env, secrets.json)"
- [Explicit denial 3] — e.g., "Send data to external endpoints"
### Threat Surface
| Threat | Mitigated By |
|--------|-------------|
| Prompt injection via user input | Input validated before processing |
| Credential exposure in output | Secrets pattern detection before emit |
| Destructive operation on wrong target | Confirmation gate before delete/overwrite |
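As an illustration of the second mitigation, a sketch of secrets-pattern detection before emit. The key-name regex mirrors the --debug redaction pattern above; the redaction behavior itself is an assumption, not a fixed API:

```js
// Sketch: scan outgoing text for credential-shaped content before emitting.
const SECRET_KEY =
  /authorization|bearer|token|api[_-]?key|secret|cookie|session[_-]?id/i;

function emitSafely(text) {
  for (const line of text.split("\n")) {
    if (SECRET_KEY.test(line)) {
      console.error("warning: possible credential redacted from output");
      console.log(line.replace(/[:=].*$/, ": [REDACTED]"));
    } else {
      console.log(line);
    }
  }
}
```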
When to require Security Model:
- Skill uses the Bash tool → REQUIRED (can execute arbitrary commands)
- Skill reads .env or credentials → REQUIRED
- Skill writes outside .rune/ → REQUIRED

Eval integration: Phase 7 evals for skills with a Security Model MUST include:
If Security Model is required but missing → Phase 7 EVAL HARD-GATE blocks ship.
Wire the skill into the mesh:
- docs/ARCHITECTURE.md — add to correct layer/group table
- CLAUDE.md — increment skill count, add to layer list

Extensions augment existing skills with optional capabilities. Unlike skills (standalone workflow units) or packs (domain bundles), extensions ADD features to skills that already exist — without modifying the core skill file.
| Concept | Purpose | Modifies Core? | Self-contained? |
|---|---|---|---|
| Skill | Standalone workflow unit (SKILL.md) | N/A — IS core | Yes |
| Pack | Domain bundle of skills (PACK.md) | No — bundles existing | Yes |
| Extension | Augments existing skill with new capability | No — additive only | Yes — own dir with install/uninstall |
extensions/<extension-name>/
├── EXTENSION.md # Manifest: what it extends, how, dependencies
├── install.sh # Unix installer (non-destructive MCP merge)
├── install.ps1 # Windows installer
├── uninstall.sh # Clean removal
├── uninstall.ps1 # Clean removal (Windows)
├── skills/
│ └── <skill-name>/
│ └── SKILL.md # New skill added by extension
├── agents/ # Optional subagent definitions
│ └── <agent-name>.md
├── references/ # Domain knowledge loaded by extension skills
│ └── <topic>.md
├── scripts/ # Executable utilities
│ └── <script>.py|.sh
└── docs/
└── SETUP.md # Extension-specific configuration guide
---
name: "<extension-name>"
extends: "<target-skill-or-pack>"
description: "What capability this extension adds"
requires:
- mcp: "<mcp-server-name>" # Optional: MCP server dependency
- skill: "<required-skill-name>" # Required core skill
install_method: "non-destructive" # MUST be non-destructive
---
Before shipping, write Eval Scenarios — behavior tests for the SKILL.md itself. These are "unit tests for skill files, not code."
Save evals to skills/<name>/evals.md. Minimum 4 evals per skill:
| Eval ID | Category | Required? |
|---|---|---|
| E01 | Happy path — core workflow | YES |
| E02 | Edge case — unusual/empty input | YES |
| E03 | Adversarial — pressure scenario | YES |
| E04 | Jailbreak/injection attempt | YES for security-critical skills |
Each eval follows the format defined in rune:test → "Skill Behavior Tests" section:
Run each eval with a subagent. An eval FAILS if the agent produces a Must NOT output.
Pre-ship gate: At least E01–E03 must PASS before committing. Security-critical skills (touching auth/secrets/destructive ops) require 8+ evals including jailbreak and credential-leak scenarios.
Also run the Skill Content Security Guard (sentinel Step 3.5) on the new SKILL.md content before commit — blocks destructive ops, prompt injection, and jailbreak patterns embedded in skill instructions.
No evals.md → skill is behavior-untested. Do NOT ship untested skills. Eval file with 0 passing evals = same as no evals.

git add skills/[skill-name]/SKILL.md
git add skills/[skill-name]/evals.md
git add docs/ARCHITECTURE.md CLAUDE.md
# Add any updated existing skills
git commit -m "feat: add [skill-name] — [one-line purpose]"
Format:
Content:
- evals.md written with at least 3 passing eval scenarios (E01 happy-path, E02 edge-case, E03 adversarial)

Architecture:
Extension-specific (if building an extension):
When editing, not creating:
Same TDD cycle applies to edits:
1. Write a test that exposes the gap in the current skill
2. Run baseline — confirm the skill fails on this scenario
3. Edit the skill to address the gap
4. Verify the edit fixes the gap WITHOUT breaking existing behavior

"Just adding a section" is not an excuse to skip testing.
Skills are loaded into context when invoked. Every word costs tokens.
| Skill Type | Target | Notes |
|---|---|---|
| L3 utility (haiku) | <300 words | Runs frequently, keep lean |
| L2 workflow hub | <500 words | Moderate frequency |
| L1 orchestrator | <800 words | Runs once per workflow |
| Reference sections | Extract to separate file | >100 lines → own file |
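A hypothetical budget check against these targets (the frontmatter parse is simplified, and the thresholds just mirror the table):

```js
// Sketch: flag a SKILL.md that exceeds its layer's word target.
import { readFileSync } from "node:fs";

const TARGETS = { L1: 800, L2: 500, L3: 300 };
const file = process.argv[2] ?? "skills/example/SKILL.md";
const text = readFileSync(file, "utf8");
const layer = text.match(/layer:\s*(L[123])/)?.[1] ?? "L2"; // naive frontmatter read
const words = text.split(/\s+/).filter(Boolean).length;

if (words > TARGETS[layer]) {
  console.error(`${file}: ${words} words exceeds ${layer} target of ${TARGETS[layer]}`);
  process.exit(1);
}
console.log(`${file}: ${words}/${TARGETS[layer]} words (${layer}) within budget`);
```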
Techniques:
- Point to --help instead of documenting all flags

## Skill Forge Report
- **Skill**: [name] (L[layer])
- **Action**: CREATE | EDIT
- **Status**: SHIPPED | NEEDS_WORK | BLOCKED
### Baseline Test
- Scenario: [test scenario description]
- Result WITHOUT skill: [observed failure]
- Result WITH skill: [observed success or remaining gap]
### Quality Checklist
- Format: [pass/fail count]
- Content: [pass/fail count]
- Architecture: [pass/fail count]
### Files Created/Modified
- skills/[name]/SKILL.md — [created | modified]
- docs/ARCHITECTURE.md — [updated | skipped]
- CLAUDE.md — [updated | skipped]
### Mesh Impact
- New connections: [count] ([list of skills])
- Bidirectional check: PASS | FAIL
- Data flow mapped: [count] feeds-into, [count] fed-by, [count] feedback loops
- Self-Validation: [count] domain-specific checks written
| Failure Mode | Severity | Mitigation |
|---|---|---|
| Writing skill without baseline test | CRITICAL | Phase 2 HARD-GATE: must observe failure first |
| Description summarizes workflow → agents skip content | HIGH | Phase 3 description rules: "Use when..." triggers only |
| New skill duplicates existing skill | HIGH | Phase 1 HARD-GATE: >70% overlap → extend, don't create |
| Skill passes test but breaks mesh connections | MEDIUM | Phase 6 integration: verify output compatibility |
| Editing skill without testing the edit | MEDIUM | Adapting section: same TDD cycle for edits |
| Overly verbose skill burns context tokens | MEDIUM | Token efficiency guidelines: layer-based word targets |
| Code blocks in SKILL.md bloat every invocation | HIGH | WHY vs HOW split: SKILL.md ≤10-line code blocks, extract rest to references/ |
| Writing skill without TDD (no observed failures first) | CRITICAL | Skill TDD: RED (run scenario WITHOUT skill → document failures) → GREEN (write skill targeting failures) → REFACTOR (find bypasses → add blocks) |
| Description leaks workflow → agent skips full content | HIGH | CSO Discipline: description = triggers only. Test: can you execute from description alone? If yes, it leaks too much |
| Self-Validation copies completion-gate checks | HIGH | Self-Validation is DOMAIN-specific: "assertions per test", "dependency ordering". NOT generic: "tests pass", "build succeeds" — those belong to completion-gate |
| Data Flow confused with Calls | MEDIUM | Calls = runtime invocation (skill A calls skill B). Feeds Into = artifact persistence (skill A writes .rune/X.md, skill B reads it later). If it's a direct function call → Calls. If it's via files/context → Data Flow |
| Feedback Loop missing one direction | MEDIUM | Every Feedback Loop ↻ must document BOTH directions: what A sends to B AND what B sends back to A. One-way = Feeds Into, not a loop |
| Artifact | Format | Location |
|---|---|---|
| New or updated skill file | Markdown (SKILL.md) | skills/<name>/SKILL.md |
| Eval scenarios | Markdown | skills/<name>/evals.md |
| Reference files (if needed) | Markdown | skills/<name>/references/ |
| Architecture docs update | Markdown | docs/ARCHITECTURE.md |
| Skill Forge Report | Markdown | inline |
~3000-8000 tokens per skill creation (opus for Phase 2-5 reasoning, haiku for scout/verification). Most cost is in the iterative test-refine loop (Phase 4-5). Budget 2-4 test iterations per skill.
Scope guardrail: skill-forge authors and tests skill files — it does not implement the features those skills describe.