Generate structured test assertions and failure diagnostics for skill packages through information-isolated verification. The verifier operates without access to the skill generator's reasoning — it sees only the skill definition, a task prompt, and the output artifacts. This isolation prevents confirmation bias and is the single largest contributor to skill quality in co-evolutionary generation (+30pp per EvoSkills).
| File | Contents | Load When |
|---|---|---|
| references/assertion-patterns.md | Assertion catalog by skill category with weight guidance | Always |
| references/diagnostic-templates.md | Failure diagnostic templates with root-cause categories | When producing failure reports |
This is the most critical constraint. Violating isolation degrades verification quality.
The verifier MUST NOT access:
- The skill generator's reasoning or conversation history
- Any shared session context from the invoking agent
The verifier receives ONLY:
- SKILL.md content (the definition file)
- scripts/eval_assertions.py (when diagnosing)

Implementation: When invoked by the test-engineer agent, this skill MUST be loaded into a separate Agent spawn using `isolation: "worktree"`, or at minimum a fresh session with no shared context. The invoking agent passes artifacts as explicit text, not as conversation references.
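As a sketch of what "artifacts as explicit text" means in practice, the helper below packages everything the verifier is allowed to see into one inlined payload. The function name, the `isolation` key, and the payload shape are illustrative assumptions, not a real agent-spawning API:

```python
def build_verifier_invocation(skill_md: str, task_prompt: str, artifacts: dict) -> dict:
    """Package the verifier's permitted inputs as explicit text.

    Hypothetical sketch: the dict shape illustrates the isolation
    constraint, not an actual spawn API.
    """
    return {
        "isolation": "worktree",           # fresh spawn, no shared context
        "shared_context": None,            # never pass conversation references
        "payload": {
            "skill_definition": skill_md,  # SKILL.md content, inlined verbatim
            "task_prompt": task_prompt,
            # Every artifact is copied in as text, never referenced
            "artifacts": {name: text for name, text in artifacts.items()},
        },
    }
```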
Mode 1: Generate assertions for a skill given its definition and task prompts.
Read the SKILL.md definition and extract:
For each task prompt, generate 5-10 assertions covering these dimensions:
| Dimension | Assertion Types to Use | Purpose |
|---|---|---|
| Output completeness | contains, matches_regex | All claimed sections/components present |
| Format compliance | output_format, contains | Output matches declared structure |
| Factual signals | contains, not_contains | Key domain terms present, hallmarks absent |
| Tool usage | calls_tool | Expected tools were invoked |
| Negative constraints | not_contains | Forbidden patterns absent |
Weight assignment:
See references/assertion-patterns.md for category-specific assertion catalogs.
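Assertion weights might combine into a single case score as in the sketch below. The weight-normalized aggregation into [0, 1] is an assumption about how the oracle scores a case, not the actual logic of scripts/run_evals.py:

```python
def weighted_score(results: list[tuple[bool, float]]) -> float:
    """Combine (passed, weight) pairs into a weight-normalized score in [0, 1].

    Assumption: the oracle treats the score as the fraction of total
    assertion weight that passed.
    """
    total = sum(weight for _, weight in results)
    if total == 0:
        return 0.0  # no assertions: treat as zero rather than divide by zero
    return sum(weight for passed, weight in results if passed) / total
```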
Produce assertions in the evals/cases.yaml schema format:

```yaml
assertions:
  - type: contains
    target: "## Scalability"
    weight: 1.0
  - type: output_format
    target: markdown_table
    weight: 0.8
  - type: not_contains
    target: "TODO"
    weight: 0.3
  - type: calls_tool
    target: Read
    weight: 0.5
```
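A minimal evaluator for these assertion types might look like the following. This is an illustrative sketch, not the actual scripts/eval_assertions.py; in particular, the `markdown_table` check is an assumed heuristic:

```python
import re

def check_assertion(assertion: dict, output: str, tools_called: list[str]) -> bool:
    """Evaluate one cases.yaml-style assertion against an output artifact."""
    kind, target = assertion["type"], assertion["target"]
    if kind == "contains":
        return target in output
    if kind == "not_contains":
        return target not in output
    if kind == "matches_regex":
        return re.search(target, output) is not None
    if kind == "calls_tool":
        return target in tools_called
    if kind == "output_format":
        # Assumption: only markdown_table is sketched; a pipe-delimited
        # line is taken as evidence of a table.
        if target == "markdown_table":
            return bool(re.search(r"^\|.+\|\s*$", output, re.MULTILINE))
        return True
    raise ValueError(f"unknown assertion type: {kind}")
```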
Context cap: Do not consume more than 70% of the available context window. If the skill definition is very long, focus assertion generation on the workflow phases and output format sections. Summarize rather than quote verbatim.
Mode 2: When an oracle returns fail, produce a structured diagnostic explaining why.
Inputs: SKILL.md (same as Mode 1).

Categorize each failed assertion into a root-cause category:
| Category | Signal | Severity |
|---|---|---|
| Missing capability | contains assertion failed for a claimed feature | HIGH |
| Format mismatch | output_format assertion failed | HIGH |
| Incomplete output | Multiple contains assertions failed in the same section | MEDIUM |
| Hallucinated content | not_contains assertion failed (forbidden pattern present) | HIGH |
| Wrong tool usage | calls_tool assertion failed | MEDIUM |
| Partial success | Some assertions in a group pass, others fail | LOW |
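The categorization table above can be sketched as a dispatch function. The argument names (`same_section_failures`, `group_partial`) are illustrative assumptions about how the surrounding failure context is passed in:

```python
def categorize_failure(assertion_type: str,
                       same_section_failures: int = 1,
                       group_partial: bool = False) -> tuple[str, str]:
    """Map a failed assertion to (root-cause category, severity) per the table."""
    if group_partial:
        # Some assertions in the group passed, others failed
        return ("Partial success", "LOW")
    if assertion_type == "output_format":
        return ("Format mismatch", "HIGH")
    if assertion_type == "not_contains":
        # A forbidden pattern was present in the output
        return ("Hallucinated content", "HIGH")
    if assertion_type == "calls_tool":
        return ("Wrong tool usage", "MEDIUM")
    if assertion_type == "contains" and same_section_failures > 1:
        # Multiple contains failures clustered in one section
        return ("Incomplete output", "MEDIUM")
    return ("Missing capability", "HIGH")
```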
For each failed assertion, identify the section of SKILL.md that promises the missing capability.

For each root cause, produce a concrete, actionable fix.
Produce a structured diagnostic string:

```
DIAGNOSTIC: [skill-name] failed on [task-prompt-summary]

FAILED ASSERTIONS (N/M):
1. [SEVERITY] type=contains target="..." — Missing capability: [explanation]
2. [SEVERITY] type=output_format target="..." — Format mismatch: [explanation]

ROOT CAUSES:
- [category]: [specific explanation with SKILL.md section reference]

REMEDIATION:
1. [Concrete change to SKILL.md with exact section and wording]
2. [Concrete change to workflow with step numbers]
```
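Rendering that template could be sketched as below; the tuple shapes for `failures` and `root_causes` are illustrative assumptions:

```python
def render_diagnostic(skill_name: str, task_summary: str,
                      failures: list[tuple[str, str, str, str, str]],
                      total: int,
                      root_causes: list[tuple[str, str]],
                      fixes: list[str]) -> str:
    """Render the diagnostic template.

    failures: (severity, type, target, category, explanation) tuples.
    root_causes: (category, explanation) pairs; fixes: remediation strings.
    """
    lines = [f"DIAGNOSTIC: {skill_name} failed on {task_summary}",
             f"FAILED ASSERTIONS ({len(failures)}/{total}):"]
    for i, (sev, typ, tgt, cat, why) in enumerate(failures, 1):
        lines.append(f'{i}. [{sev}] type={typ} target="{tgt}" — {cat}: {why}')
    lines.append("ROOT CAUSES:")
    lines.extend(f"- {cat}: {why}" for cat, why in root_causes)
    lines.append("REMEDIATION:")
    lines.extend(f"{i}. {fix}" for i, fix in enumerate(fixes, 1))
    return "\n".join(lines)
```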
See references/diagnostic-templates.md for worked examples per root-cause category.
Per EvoSkills Algorithm 1, the verifier does not track its own budget; the test-engineer agent manages iteration limits.
- scripts/run_evals.py (the oracle)
- scripts/eval_assertions.py

| Error | Resolution |
|---|---|
| Skill definition too large | Summarize to workflow phases + output format sections only |
| No assertions generatable | Return empty assertions list with warning; skill may be too vague |
| Ambiguous output format | Default to contains assertions; avoid output_format checks |
| Context cap exceeded | Truncate diagnostic detail; preserve failed assertion list |