Test Skill

Run TDD-style pressure test scenarios against marketplace skills.

Contextd Integration (Optional)

If contextd MCP is available:

memory_record for test results persistence
memory_search for past test patterns

If contextd is NOT available:

Test results output to terminal only
Results can be saved to .claude/test-results/<skill>-<date>.md

Process

1. Load Scenarios

Read scenarios from skills/<skill-name>/tests/scenarios.md.

Parse into structured format:

scenario name
pressure types
context (the prompt)
correct behavior
rationalization red flags

2. Run Each Scenario

For each scenario (or specific --scenario if provided):

If --baseline flag (RED phase):

Dispatch subagent WITHOUT loading the skill.
This tests what agents naturally do without guidance.
Document exact choices and rationalizations verbatim.

If normal mode (GREEN phase):

Dispatch subagent WITH skill in context.
Verify agent follows the skill's guidance.
Check response against correct behavior.
Flag any rationalization red flags found.

3. Dispatch Subagent

Use Task tool with prompt:

${scenario.context}

[If not --baseline, include:]
You have access to the following skill:
@skills/<skill-name>/SKILL.md

4. Analyze Response

Check subagent response:

Did they choose the correct option (A, B, or C)?
Did they cite the skill appropriately?
Did they use any rationalization red flags?

5. Record Results

On failure (wrong choice or rationalization):

mcp__contextd__memory_record(
  project_id: "marketplace",
  title: "Rationalization: <scenario-name>",
  content: "Skill: <skill>
            Scenario: <scenario>
            Expected: <correct-choice>
            Actual: <agent-choice>
            Rationalization: '<verbatim phrase>'
            Pressures: <types>",
  outcome: "failure",
  tags: ["pressure-test", "rationalization", "<skill>", "<scenario>"]
)

On pass:

mcp__contextd__memory_record(
  project_id: "marketplace",
  title: "Passed: <scenario-name>",
  content: "Skill: <skill>
            Scenario: <scenario>
            Correctly chose: <choice>
            Cited skill section: <yes/no>",
  outcome: "success",
  tags: ["pressure-test", "compliance", "<skill>"]
)

6. Report Summary

Output table:

## Test Results: <skill-name>

| Scenario | Result | Choice | Red Flags |
|----------|--------|--------|-----------|
| skip-consensus-simple-fix | PASS | A | - |
| ignore-security-veto | FAIL | B | "CEO approved" |
...

**Summary:** X/Y passed
**Failures recorded to contextd**

Usage Examples

# Test all git-workflows scenarios
/test-skill git-workflows

# Baseline test (RED phase - no skill)
/test-skill git-workflows --baseline

# Test specific scenario
/test-skill git-workflows --scenario=skip-consensus-simple-fix

# Test all skills
/test-skill git-workflows
/test-skill git-repo-standards
/test-skill project-onboarding

Creating Todos on Failure

When scenarios fail, create TodoWrite items:

- [ ] Fix skill: <skill-name> - scenario <scenario> triggered rationalization: "<phrase>"

This ensures failures become actionable work items.

/test-skill