Install with `npx claudepluginhub mathews-tom/armory --plugin armory`.

This skill uses the workspace's default tool permissions.
Replaces trial-and-error prompt engineering with structured methodology: objective definition, current prompt analysis, variant generation (instruction clarity, example strategies, output format specification), evaluation rubric design, test case creation, and failure mode identification.
| File | Contents | Load When |
|---|---|---|
references/prompt-patterns.md | Prompt structure catalog: zero-shot, few-shot, CoT, persona, structured output | Always |
references/evaluation-metrics.md | Quality metrics (accuracy, format compliance, completeness), rubric design | Evaluation needed |
references/failure-modes.md | Common prompt failure taxonomy, detection strategies, mitigations | Failure analysis requested |
references/output-constraints.md | Techniques for constraining LLM output format, JSON mode, schema enforcement | Format control needed |
If an existing prompt is provided, analyze it before generating variants:
- Identify its strengths, weaknesses, and missing specifications
- Determine which failure modes (see references/failure-modes.md) apply to this prompt

Create 2-4 prompt variants, each testing a different hypothesis:
| Variant Type | Hypothesis | When to Use |
|---|---|---|
| Direct instruction | Clear instruction is sufficient | Simple tasks, capable models |
| Few-shot | Examples improve output consistency | Pattern-following tasks |
| Chain-of-thought | Reasoning improves accuracy | Multi-step logic, math, analysis |
| Persona/role | Role framing improves tone/expertise | Domain-specific tasks |
| Structured output | Format specification prevents errors | JSON, CSV, specific templates |
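As a sketch of how three of these variant types might differ in practice, here is a hypothetical "summarize a support ticket" task; all task and prompt text below is illustrative, not prescribed by this skill:

```python
# Three variants for a hypothetical "summarize a support ticket" task.
# Each tests one hypothesis from the table above.

TASK_INPUT = "{ticket_text}"  # placeholder filled in at run time

# Direct instruction: hypothesis is that a clear instruction is sufficient.
VARIANT_A_DIRECT = f"Summarize the following support ticket in one sentence:\n\n{TASK_INPUT}"

# Few-shot: hypothesis is that examples improve output consistency.
VARIANT_B_FEW_SHOT = (
    "Summarize each support ticket in one sentence.\n\n"
    "Ticket: Printer shows error E02 after firmware update.\n"
    "Summary: Firmware update caused printer error E02.\n\n"
    "Ticket: Login fails with 2FA code rejected on mobile app.\n"
    "Summary: Mobile 2FA codes are being rejected at login.\n\n"
    f"Ticket: {TASK_INPUT}\nSummary:"
)

# Chain-of-thought: hypothesis is that intermediate reasoning improves accuracy.
VARIANT_C_COT = (
    "Read the support ticket below. First list the key facts, "
    "then write a one-sentence summary based on those facts.\n\n"
    f"{TASK_INPUT}"
)
```

Each variant isolates one change, so a scoring difference can be attributed to that change rather than to several at once.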
For each variant, write the complete prompt text, state the hypothesis it tests, and note the risk if that hypothesis is wrong.
Rubric — Define weighted criteria:
| Criterion | What It Measures | Typical Weight |
|---|---|---|
| Correctness | Output matches expected answer | 30-50% |
| Format compliance | Follows specified structure | 15-25% |
| Completeness | All required elements present | 15-25% |
| Conciseness | No unnecessary content | 5-15% |
| Tone/style | Matches requested voice | 5-10% |
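One way to combine weighted criteria into a single score is a normalized weighted sum; the sketch below assumes a 0-3 scale per criterion and uses midpoints of the typical weight ranges above:

```python
def rubric_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-criterion scores (0-3 scale) into a weighted total in [0, 1].

    `weights` must sum to 1.0; each score is normalized by the scale maximum (3).
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("criterion weights must sum to 1.0")
    return sum(weights[c] * (scores[c] / 3.0) for c in weights)

# Example weights taken from the midpoints of the ranges in the table above.
weights = {"correctness": 0.40, "format": 0.20, "completeness": 0.20,
           "conciseness": 0.10, "tone": 0.10}
scores = {"correctness": 3, "format": 2, "completeness": 3,
          "conciseness": 2, "tone": 3}
```

Normalizing to [0, 1] keeps scores comparable even if a criterion's scale changes later.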
Test cases — Minimum 5 cases covering standard inputs, edge cases, and adversarial inputs.
Present variants, rubric, and test cases in a structured format ready for execution.
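One possible "structured format ready for execution" is a plain dictionary a harness can iterate over; the field names below are one illustrative convention, not mandated by this skill:

```python
# Illustrative packaging of variants, rubric, and test cases for a harness.
prompt_lab = {
    "objective": "Summarize a support ticket in one sentence",
    "variants": {
        "A_direct": "Summarize the following support ticket in one sentence:\n\n{input}",
        "B_few_shot": "...",  # full few-shot prompt text goes here
    },
    "rubric": {
        "correctness":  {"weight": 0.40, "scale": "0-3"},
        "format":       {"weight": 0.20, "scale": "0-3"},
        "completeness": {"weight": 0.20, "scale": "0-3"},
        "conciseness":  {"weight": 0.10, "scale": "0-3"},
        "tone":         {"weight": 0.10, "scale": "0-3"},
    },
    "test_cases": [
        {"id": 1, "input": "Printer shows error E02 after firmware update",
         "expected": "Firmware update caused printer error E02",
         "tests": ["correctness", "format"]},
    ],
}
```

Keeping everything in one structure means the same artifact drives both execution and scoring.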
## Prompt Lab: {Task Name}
### Objective
{What the prompt should achieve — specific and measurable}
### Success Criteria
- [ ] {Criterion 1 — measurable}
- [ ] {Criterion 2 — measurable}
### Current Prompt Analysis
{If existing prompt provided}
- **Strengths:** {what works}
- **Weaknesses:** {what fails or is ambiguous}
- **Missing:** {what's not specified}
### Variants
#### Variant A: {Strategy Name}
{Complete prompt text}
**Hypothesis:** {Why this approach might work}
**Risk:** {What could go wrong}
#### Variant B: {Strategy Name}
{Complete prompt text}
**Hypothesis:** {Why this approach might work}
**Risk:** {What could go wrong}
#### Variant C: {Strategy Name}
{Complete prompt text}
**Hypothesis:** {Why this approach might work}
**Risk:** {What could go wrong}
### Evaluation Rubric
| Criterion | Weight | Scoring |
|-----------|--------|---------|
| {criterion} | {%} | {how to score: 0-3 scale or pass/fail} |
### Test Cases
| # | Input | Expected Output | Tests Criteria |
|---|-------|-----------------|---------------|
| 1 | {standard input} | {expected} | Correctness, Format |
| 2 | {edge case} | {expected} | Completeness |
| 3 | {adversarial} | {expected} | Robustness |
### Failure Modes to Monitor
- {Failure mode 1}: {detection method}
- {Failure mode 2}: {detection method}
### Recommended Next Steps
1. Run all variants against the test suite
2. Score using the rubric
3. Select the highest-scoring variant
4. Iterate on the winner with targeted improvements
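Steps 1-3 above can be sketched as a small harness; `call_model` and `score_output` are stand-ins for whatever LLM client and rubric scorer are actually in use:

```python
from statistics import mean
from typing import Callable

def run_lab(variants: dict[str, str], cases: list[dict],
            call_model: Callable[[str], str],
            score_output: Callable[[str, dict], float]) -> str:
    """Run every variant on every test case, score each output with the
    rubric, and return the name of the highest-scoring variant."""
    means: dict[str, float] = {}
    for name, template in variants.items():
        scores = []
        for case in cases:
            output = call_model(template.replace("{input}", case["input"]))
            scores.append(score_output(output, case))
        means[name] = mean(scores)
    return max(means, key=means.get)
```

Step 4 then repeats the loop with targeted edits to the winner, keeping the same test suite so scores stay comparable across iterations.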
| Problem | Resolution |
|---|---|
| No clear objective | Ask the user to define what "good output" looks like with 2-3 examples. |
| Prompt is for a task LLMs are bad at (math, counting) | Flag the limitation. Suggest tool-augmented approaches or pre/post-processing. |
| Too many variables to test | Focus on the highest-impact variable first. Iterative refinement beats combinatorial testing. |
| No existing prompt to analyze | Start with the simplest possible prompt. The first variant IS the baseline. |
| Output format requirements are strict | Use structured output mode (JSON mode, function calling) instead of prompt-only constraints. |
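For the strict-format case in the last row, even structured output mode benefits from a validation gate on the response. A minimal sketch, assuming a hypothetical schema with required keys `label` and `confidence`:

```python
import json

REQUIRED_KEYS = {"label", "confidence"}

def validate(output: str) -> dict:
    """Reject any output that is not a JSON object with the required keys.

    Catches the common failure where the model wraps JSON in prose or
    markdown fences instead of returning bare JSON.
    """
    data = json.loads(output)  # raises ValueError on non-JSON (including fenced JSON)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data
```

A gate like this doubles as a detection method for the format-drift failure mode: a rising rejection rate signals the prompt's format constraints are slipping.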
Push back if: