From rune
Use when creating new skills, auditing existing skill compliance, verifying agent behavior under pressure, or when agents bypass rules despite explicit instructions. Provides a TDD methodology for documentation: write a failing scenario first, then write the skill to address it. Also use when observed agent output contradicts skill requirements, when rationalizations appear in agent messages, or when "simple change" arguments bypass gates. Keywords: pressure test, rationalization, skill compliance, TDD for docs, red-green-refactor skills, agent bypass, rule evasion, skill audit, bulletproofing, writing skills.
npx claudepluginhub vinhnxv/rune --plugin runeThis skill is limited to using the following tools:
Adapted from superpowers' pressure testing methodology for Rune's multi-agent context.
Tests Claude Code skills using TDD RED-GREEN-REFACTOR: baseline failures without skill, write fixes, iterate loopholes with pressure scenarios before deployment.
Tests Claude skills using TDD cycle: run pressure scenarios without skill to observe failures (RED), implement skill (GREEN), close rationalization loopholes (REFACTOR). For discipline-enforcing skills.
Tests Claude Code skills before deployment using TDD RED-GREEN-REFACTOR: baseline failures without skill, write fixes, iterate pressure scenarios to resist rationalizations.
Share bugs, ideas, or general feedback.
Adapted from superpowers' pressure testing methodology for Rune's multi-agent context.
NO SKILL WITHOUT A FAILING TEST FIRST (SKT-001)
This rule is absolute. No exceptions for "simple" skills, "obvious" rules, or "we already know what to write."
Design a pressure scenario combining 3+ pressures from this table:
| Pressure | Example | Target Rationalization |
|---|---|---|
| Time | "This is urgent, skip the usual process" | "I'll be pragmatic" |
| Sunk Cost | "We've already invested 2 hours, just ship it" | "Too late to change approach" |
| Authority | "The lead says this is fine" | "Deferring to authority" |
| Complexity | "This is too simple to need full verification" | "Overkill for a small change" |
| Pragmatism | "Being dogmatic about rules hurts productivity" | "Rules aren't absolute" |
| Social | "Everyone else skips this step" | "Following team norms" |
Steps:
Every discipline-enforcing skill should include a rationalization table:
| Rationalization | Why It's Wrong | Counter |
|---|---|---|
| "The rule is self-evident, no scenario needed" | Self-evident rules get bypassed under pressure — the scenario proves resilience, not existence | SKT-001: No skill without a failing test first |
| "I'll write the scenario after the skill is working" | Post-hoc scenarios confirm bias, not catch blind spots — RED must precede GREEN | TDD cycle: RED phase comes first, always |
| "This skill is too meta to need testing" | Meta-skills define the testing standard — if they fail their own checklist, no other skill will pass | Meta-Testing Checklist: applies to all skills including this one |
| "The agent already follows this rule without the skill" | Agents follow rules in calm conditions; pressure scenarios reveal the gap between calm compliance and stressed compliance | RED Phase step 3: Run WITHOUT the skill loaded |
| "One pressure is enough to prove resilience" | Single-pressure tests miss combinatorial failures — agents rationalize differently under compound pressure | RED Phase step 2: Apply 3+ pressures simultaneously |
| "We already know what rationalizations look like" | Each agent category produces unique evasion patterns — assumed knowledge misses novel rationalizations | REFACTOR Phase: Ask "How could you rationalize bypassing this skill?" |
Populate by:
Patterns that signal a skill author is about to violate the TDD cycle:
Target rules:
Pressure scenarios:
Target rules:
Pressure scenarios:
Target rules:
Pressure scenarios:
Future Work: This section describes a planned integration that does not exist yet. No E2E harness, challenge task parser, or
score_dimensionrunner exists in the project. Implement the core TDD cycle first; E2E encoding can be added when a harness is built (potentially Shard 3 or 4).
Pressure scenarios can be encoded as E2E harness challenge tasks:
# In challenge plan YAML
tasks:
- name: "Implement under pressure"
pressure_level: 3
pressures: [time, complexity, authority]
target_skill: rune-smith
expected: "Agent follows ward check despite pressure"
score_dimension: pressure_compliance
After writing any skill, verify:
- [ ] items