npx claudepluginhub xoai/sage
This skill uses the workspace's default tool permissions.
Run test prompts and record what agents get wrong.
Core Principle: Packs are evidence-based. Every pattern and anti-pattern should be traceable to an observed agent failure, and this phase creates that evidence. Skip this phase for project overlays, which are based on team rules rather than agent observations.
Using the agent failures identified in Phase 1 (pack-discover) and the processed sources from Phase 2 (pack-source-process), design 3-5 test prompts.
Each prompt should:
## Test Prompt 1: [descriptive name]
Target: [which failure this should expose]
Prompt: "[exact text to give the agent]"
Expected failure: [what the agent will probably get wrong]
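The template above can also be kept as structured data, which makes it easy to check the 3-5 prompt guideline before running anything. This is a minimal sketch, not part of the skill itself: the names `PromptSpec`, `render_prompt`, and `check_prompt_count` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptSpec:
    """One test prompt, mirroring the four fields of the template (hypothetical type)."""
    name: str              # descriptive name
    target: str            # which failure this should expose
    prompt: str            # exact text to give the agent
    expected_failure: str  # what the agent will probably get wrong


def render_prompt(index: int, spec: PromptSpec) -> str:
    """Render a prompt in the same layout as the template."""
    return "\n".join([
        f"## Test Prompt {index}: {spec.name}",
        f"Target: {spec.target}",
        f'Prompt: "{spec.prompt}"',
        f"Expected failure: {spec.expected_failure}",
    ])


def check_prompt_count(prompts: list[PromptSpec]) -> None:
    """Enforce the 3-5 prompt guideline for this phase."""
    if not 3 <= len(prompts) <= 5:
        raise ValueError(f"expected 3-5 test prompts, got {len(prompts)}")
```

Keeping the expected failure next to the prompt text makes the later comparison against the actual output straightforward.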
This is the critical step. Run each test prompt and record the ACTUAL output. The agent should work on these prompts in a clean context — no pack loaded, no special instructions about the framework.
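One way to keep the baseline honest is to route every prompt through a single function that sends only the exact prompt text and stores the output verbatim. The sketch below assumes a hypothetical `run_agent` callable that starts a fresh session on each call, with no pack loaded; how that callable talks to a real agent is outside the scope of this example.

```python
from typing import Callable


def run_baseline(
    prompts: list[str],
    run_agent: Callable[[str], str],
) -> list[dict[str, str]]:
    """Run each prompt once and keep the agent's actual output verbatim.

    run_agent is a hypothetical callable supplied by the caller. It is assumed
    to open a clean context per call: no pack, no framework-specific hints.
    """
    records = []
    for text in prompts:
        output = run_agent(text)  # only the exact prompt text is sent
        records.append({"prompt": text, "actual_output": output})
    return records
```

Because nothing but the prompt text reaches the agent, any failure recorded here reflects the agent's unaided behavior.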
For each prompt, record:
Compare actual failures against expected failures:
The last case is critical. If the agent already handles something well, a pattern for it is noise. Cut it aggressively.
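The comparison can be sketched as a small classification step. The three outcome labels below are an assumption made for illustration, and matching failures by exact string is a simplification: in practice a person reads the output and judges whether the expected failure occurred.

```python
def classify(expected_failure: str, observed_failures: list[str]) -> str:
    """Classify one prompt's result (labels are illustrative assumptions)."""
    if not observed_failures:
        # The agent handled the prompt correctly, so a pattern would be noise.
        return "drop"
    if expected_failure in observed_failures:
        # The predicted failure was actually observed.
        return "confirmed"
    # The agent failed, but not in the way that was predicted.
    return "unexpected"
```

Only results that are not classified as `"drop"` should move forward as pattern candidates.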
Record findings in a structured format that directly feeds Phase 4 (drafting).
Save to .sage/pack-build/observations.md:
# Observation Report
## Test Results
### Prompt 1: [name]
Failure observed: YES/NO
What happened: [description]
Severity: [critical/major/minor/none]
Pattern candidate: [yes/no — drop if no failure]
### Prompt 2: [name]
...
## Confirmed Failures (will become anti-patterns)
1. [failure] — observed in prompt [N]
2. ...
## Confirmed Needs (will become patterns)
1. [what the agent needs to be told] — addresses failure [N]
2. ...
## Dropped Candidates (agent handles correctly)
- [candidate] — reason dropped
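The Test Results section of the report can be generated from recorded observations. This is a rough sketch under stated assumptions: `Observation`, `render_report`, and `write_report` are hypothetical names, the pattern-candidate flag is derived directly from whether a failure was observed, and the summary sections (confirmed failures, confirmed needs, dropped candidates) are left to be written by hand.

```python
from dataclasses import dataclass
from pathlib import Path

SEVERITIES = {"critical", "major", "minor", "none"}


@dataclass
class Observation:
    """Result of one test prompt (hypothetical type)."""
    name: str
    failure_observed: bool
    what_happened: str
    severity: str  # one of SEVERITIES


def render_report(observations: list[Observation]) -> str:
    """Render the Test Results part of the report in the template's layout."""
    lines = ["# Observation Report", "## Test Results"]
    for i, ob in enumerate(observations, start=1):
        if ob.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {ob.severity!r}")
        lines += [
            f"### Prompt {i}: {ob.name}",
            f"Failure observed: {'YES' if ob.failure_observed else 'NO'}",
            f"What happened: {ob.what_happened}",
            f"Severity: {ob.severity}",
            # Assumption: no observed failure means the candidate is dropped.
            f"Pattern candidate: {'yes' if ob.failure_observed else 'no'}",
        ]
    return "\n".join(lines) + "\n"


def write_report(observations: list[Observation], root: Path) -> Path:
    """Write the report to .sage/pack-build/observations.md under root."""
    path = root / ".sage" / "pack-build" / "observations.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(render_report(observations), encoding="utf-8")
    return path
```

Calling `write_report(observations, Path("."))` from the project root would place the file at the path named above.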