Slash command
/fuse-prompt-engineer:prompt-testing
Skill for testing, comparing, and measuring prompt performance.
1. DEFINE
└── Test objective
└── Metrics to measure
└── Success criteria
2. PREPARE
└── Variants A and B
└── Test dataset
└── Baseline (if existing)
3. EXECUTE
└── Run on dataset
└── Collect results
└── Document observations
4. ANALYZE
└── Calculate metrics
└── Compare variants
└── Identify patterns
5. DECIDE
└── Recommendation
└── Statistical confidence
└── Next iterations
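The EXECUTE step above can be sketched as a small loop that runs both variants over the same cases. This is only an illustration: `run_ab` and `call_model` are hypothetical names, the case fields follow the dataset format used in this skill, and exact string equality is assumed as the correctness check, which real tests often replace with a looser comparison.

```python
from typing import Callable

def run_ab(cases, prompt_a, prompt_b, call_model: Callable[[str, str], str]):
    """Run variants A and B on every case and record whether each matched.

    call_model(prompt, case_input) is a hypothetical stand-in for whatever
    client actually sends the prompt to the model.
    """
    results = []
    for case in cases:
        out_a = call_model(prompt_a, case["input"])
        out_b = call_model(prompt_b, case["input"])
        results.append({
            "id": case["id"],
            # Assumption: a response is correct when it equals "expected".
            "a_ok": out_a == case["expected"],
            "b_ok": out_b == case["expected"],
        })
    return results
```

Passing the model call in as a function keeps the loop testable with a fake model before any real requests are made.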
| Metric | Description | Calculation |
|---|---|---|
| Accuracy | Correct responses | Correct / Total |
| Compliance | Format adherence | Compliant / Total |
| Consistency | Response stability across repeated runs | 1 - Variance |
| Relevance | How well the response meets the need | Average score (1-5) |
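As a rough sketch, the calculations in the table above could be written as follows. The function names are illustrative, and the consistency formula assumes per-run scores normalized to the range 0 to 1 and uses population variance, neither of which the table specifies.

```python
from statistics import mean, pvariance

def ratio(flags):
    """Share of True values: Correct / Total or Compliant / Total."""
    return sum(flags) / len(flags)

def consistency(scores):
    """1 - Variance over scores from repeated runs of the same input.

    Assumption: scores are normalized to [0, 1], so the result stays
    between 0.75 and 1.0 (population variance is at most 0.25 there).
    """
    return 1 - pvariance(scores)

def relevance(ratings):
    """Average of reviewer ratings on the 1-5 scale."""
    return mean(ratings)
```

For example, `ratio([True, True, False, True])` gives 0.75, and identical scores across runs give a consistency of 1.0.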

| Metric | Description | Calculation |
|---|---|---|
| Input Tokens | Prompt size | Token count |
| Output Tokens | Response size | Token count |
| Latency | Response time | Milliseconds (ms) |
| Cost | Price per request | Tokens × Price per token |
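The cost and latency rows above can be approximated with two small helpers. This sketch assumes separate input and output prices quoted per million tokens, which is a common billing convention but not something this skill states; `request_cost` and `timed` are hypothetical names.

```python
import time

def request_cost(input_tokens, output_tokens,
                 price_in_per_mtok, price_out_per_mtok):
    """Tokens × Price, with prices given per million tokens (assumption)."""
    return (input_tokens * price_in_per_mtok
            + output_tokens * price_out_per_mtok) / 1_000_000

def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000
```

Wrapping the model call in `timed` gives a per-request latency that can be averaged over the dataset alongside the token counts.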

| Metric | Description | Calculation |
|---|---|---|
| Edge Cases | Handling of unusual or boundary inputs | Passed / Total |
| Jailbreak Resistance | Resistance to bypass attempts | Blocked / Attempts |
| Error Recovery | Recovery after an error | Recovered / Errors |
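Each ratio in the table above is a pass rate over one group of cases. A minimal sketch, assuming outcomes are recorded as `(case_type, passed)` pairs (a hypothetical shape, not one defined by this skill):

```python
from collections import defaultdict

def pass_rate_by_type(outcomes):
    """Return {case_type: passed / total} for (case_type, passed) pairs."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for case_type, ok in outcomes:
        totals[case_type] += 1
        passed[case_type] += int(ok)
    return {t: passed[t] / totals[t] for t in totals}
```

The same grouping works for jailbreak attempts (blocked or not) and injected errors (recovered or not) if they are tagged as their own case types.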
{
  "name": "Test Dataset v1",
  "description": "Dataset for testing prompt XYZ",
  "cases": [
    {
      "id": "case_001",
      "type": "standard",
      "input": "Test input",
      "expected": "Expected output",
      "tags": ["basic", "format"]
    },
    {
      "id": "case_002",
      "type": "edge_case",
      "input": "Edge input",
      "expected": "Expected behavior",
      "tags": ["edge", "error"]
    }
  ]
}
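Before a run, it can help to check that a dataset file matches the structure above. The following is a minimal sketch: `load_dataset` is a hypothetical helper, and treating all five case fields as required is an assumption drawn from the example rather than a documented schema.

```python
import json

# Assumption: every case carries the five fields shown in the example.
REQUIRED_KEYS = {"id", "type", "input", "expected", "tags"}

def load_dataset(text):
    """Parse dataset JSON and reject incomplete or duplicate cases."""
    data = json.loads(text)
    seen = set()
    for case in data["cases"]:
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            raise ValueError(f"{case.get('id', '?')}: missing {sorted(missing)}")
        if case["id"] in seen:
            raise ValueError(f"duplicate id {case['id']}")
        seen.add(case["id"])
    return data
```

Failing early on a malformed case avoids discovering the problem halfway through a run that has already consumed tokens.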
# A/B Test Report: {{TEST_NAME}}
## Configuration
| Parameter | Value |
|-----------|-------|
| Date | {{DATE}} |
| Dataset | {{DATASET}} |
| Cases tested | {{N_CASES}} |
| Model | {{MODEL}} |
## Tested Variants
### Variant A (Baseline)
[Description or link to prompt A]
### Variant B (Challenger)
[Description or link to prompt B]
## Results
### Overall Scores
| Metric | A | B | Delta | Winner |
|--------|---|---|-------|--------|
| Accuracy | X% | Y% | +/-Z% | A/B |
| Compliance | X% | Y% | +/-Z% | A/B |
| Tokens | X | Y | +/-Z | A/B |
| Latency | Xms | Yms | +/-Zms | A/B |
### Detail by Case Type
| Type | A | B | Notes |
|------|---|---|-------|
| Standard | X% | Y% | |
| Edge cases | X% | Y% | |
| Error cases | X% | Y% | |
### Problematic Cases
| Case ID | Expected | A | B | Analysis |
|---------|----------|---|---|----------|
| case_XXX | ... | ❌ | ✅ | [Explanation] |
## Analysis
### B's Strengths
- [Improvement 1]
- [Improvement 2]
### B's Weaknesses
- [Regression 1]
### Observations
[Qualitative insights]
## Recommendation
**Verdict**: ✅ Adopt B / ⚠️ Iterate / ❌ Keep A
**Confidence**: High / Medium / Low
**Justification**:
[Explanation of recommendation]
## Next Steps
1. [Action 1]
2. [Action 2]
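The rows of the Overall Scores table in the template above can be generated rather than filled in by hand. In this sketch, `score_row` is a hypothetical helper, and the "Tie" label is an addition for equal values that the template itself does not define.

```python
def score_row(metric, a, b, higher_is_better=True):
    """Format one '| Metric | A | B | Delta | Winner |' table row.

    Set higher_is_better=False for metrics such as tokens or latency,
    where the smaller value wins.
    """
    delta = b - a
    if delta == 0:
        winner = "Tie"  # assumption: not part of the original template
    elif (delta > 0) == higher_is_better:
        winner = "B"
    else:
        winner = "A"
    return f"| {metric} | {a} | {b} | {delta:+g} | {winner} |"
```

For instance, `score_row("Tokens", 1200, 1000, higher_is_better=False)` marks B as the winner because it uses fewer tokens.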
# Create a test
/prompt test create --name "Test v1" --dataset tests.json
# Run an A/B test
/prompt test run --a prompt_a.md --b prompt_b.md --dataset tests.json
# View results
/prompt test results --id test_001
# Compare two tests
/prompt test compare --tests test_001,test_002
IF:
- Accuracy B >= Accuracy A
AND (Tokens B <= Tokens A * 1.1 OR accuracy improvement > 5%)
AND no regression on edge cases
THEN:
→ Adopt B
ELSE IF:
- Accuracy improvement > 10%
AND token regression < 20%
THEN:
→ Consider B (acceptable trade-off)
ELSE:
→ Keep A or iterate
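The decision rule above translates fairly directly into code. This sketch assumes accuracy is given in percent and that "improvement" means a difference in percentage points, which the rule leaves open; `decide` and its parameters are illustrative names.

```python
def decide(acc_a, acc_b, tokens_a, tokens_b, edge_regression):
    """Apply the adoption rule to two variants.

    acc_a and acc_b are accuracies in percent; gains are measured in
    percentage points (assumption). edge_regression is True when B does
    worse than A on edge cases.
    """
    gain = acc_b - acc_a
    token_growth = (tokens_b - tokens_a) / tokens_a
    if (acc_b >= acc_a
            and (tokens_b <= tokens_a * 1.1 or gain > 5)
            and not edge_regression):
        return "Adopt B"
    if gain > 10 and token_growth < 0.20:
        return "Consider B"
    return "Keep A or iterate"
```

Read this way, the second branch only fires when B gains more than 10 points but regresses on edge cases, which is the trade-off the rule flags for review.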
npx claudepluginhub fusengine/agents --plugin fuse-prompt-engineer