Skill

prompt-testing

From fuse-prompt-engineer

A/B testing and performance metrics for prompts

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/fuse-prompt-engineer:prompt-testing

User invocable

Model invocable

Inline context

Default effort

Tool Access

This skill is limited to the following tools:

Read, Write, Bash

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Skill for testing, comparing, and measuring prompt performance.

Supporting Files

docs/methodology.md, docs/metrics.md

SKILL.md

218 lines · ~1.1k tokens

Stats

Language: Python

Parent stars: 12

Parent forks: 1

Maintenance: Excellent

Last Commit: Apr 5, 2026

Prompt Testing

Skill for testing, comparing, and measuring prompt performance.

Documentation

metrics.md - Performance metrics definition
methodology.md - A/B testing protocol

Testing Workflow

1. DEFINE
   └── Test objective
   └── Metrics to measure
   └── Success criteria

2. PREPARE
   └── Variants A and B
   └── Test dataset
   └── Baseline (if existing)

3. EXECUTE
   └── Run on dataset
   └── Collect results
   └── Document observations

4. ANALYZE
   └── Calculate metrics
   └── Compare variants
   └── Identify patterns

5. DECIDE
   └── Recommendation
   └── Statistical confidence
   └── Next iterations

Performance Metrics

Quality

Metric	Description	Calculation
Accuracy	Correct responses	Correct / Total
Compliance	Format adherence	Compliant / Total
Consistency	Response stability across repeated runs	1 - Variance
Relevance	Meeting the need	Average score (1-5)
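
The ratios in this table can be computed directly from per-case results. The following is a minimal sketch, not the skill's own implementation: the `correct` and `compliant` fields are hypothetical, and Consistency is interpreted here as 1 minus the population variance of scores in [0, 1] collected over repeated runs, since the table does not say what the variance is taken over.

```python
from statistics import pvariance

def quality_metrics(results, run_scores):
    """Compute Accuracy, Compliance and Consistency for one variant.

    results: list of dicts with hypothetical boolean keys "correct" and "compliant".
    run_scores: scores in [0, 1] from repeated runs of the same case (assumed input).
    """
    total = len(results)
    if total == 0:
        raise ValueError("results must not be empty")
    accuracy = sum(r["correct"] for r in results) / total
    compliance = sum(r["compliant"] for r in results) / total
    # Assumption: Consistency = 1 - population variance of repeated-run scores.
    consistency = 1 - pvariance(run_scores) if len(run_scores) > 1 else 1.0
    return {"accuracy": accuracy, "compliance": compliance, "consistency": consistency}

results = [
    {"correct": True, "compliant": True},
    {"correct": True, "compliant": False},
    {"correct": False, "compliant": True},
    {"correct": True, "compliant": True},
]
metrics = quality_metrics(results, run_scores=[1.0, 1.0, 0.0, 1.0])
print(metrics)
```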

Efficiency

Metric	Description	Calculation
Input Tokens	Prompt size	Token count
Output Tokens	Response size	Token count
Latency	Response time	Elapsed time (ms)
Cost	Price per request	Tokens × Price per token
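
As a rough illustration of the Latency and Cost rows, the sketch below times an arbitrary callable and prices a request from its token counts. The helper names and the per-million-token prices are assumptions chosen for the example. Pricing input and output tokens separately is also an assumption, since the table only says Tokens × Price.

```python
import time

def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, latency in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    latency_ms = (time.perf_counter() - start) * 1000
    return result, latency_ms

def request_cost(input_tokens, output_tokens, input_price_per_mtok, output_price_per_mtok):
    """Cost of one request; prices are per million tokens (hypothetical values)."""
    total = input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    return total / 1_000_000

# Placeholder workload standing in for a real model call.
result, latency_ms = timed_call(lambda: "response text")
cost = request_cost(1200, 300, input_price_per_mtok=3.0, output_price_per_mtok=15.0)
print(f"latency={latency_ms:.2f}ms cost={cost:.4f}")
```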

Robustness

Metric	Description	Calculation
Edge Cases	Edge case handling	Passed / Total
Jailbreak Resistance	Bypass resistance	Blocked / Attempts
Error Recovery	Recovery after a failure	Recovered / Errors
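
Each robustness metric is a pass ratio over a subset of cases. One possible way to compute such ratios is to group cases by the `type` field used in the test dataset format. The function name and the `passed_ids` input are hypothetical.

```python
from collections import defaultdict

def pass_rate_by_type(cases, passed_ids):
    """Return {case type: passed / total} for the given cases.

    passed_ids: set of case ids that the variant handled correctly (assumed input).
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for case in cases:
        totals[case["type"]] += 1
        if case["id"] in passed_ids:
            passed[case["type"]] += 1
    return {case_type: passed[case_type] / totals[case_type] for case_type in totals}

cases = [
    {"id": "case_001", "type": "standard"},
    {"id": "case_002", "type": "edge_case"},
    {"id": "case_003", "type": "edge_case"},
]
rates = pass_rate_by_type(cases, passed_ids={"case_001", "case_002"})
print(rates)
```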

Test Format

Test Dataset

{
  "name": "Test Dataset v1",
  "description": "Dataset for testing prompt XYZ",
  "cases": [
    {
      "id": "case_001",
      "type": "standard",
      "input": "Test input",
      "expected": "Expected output",
      "tags": ["basic", "format"]
    },
    {
      "id": "case_002",
      "type": "edge_case",
      "input": "Edge input",
      "expected": "Expected behavior",
      "tags": ["edge", "error"]
    }
  ]
}
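
Before running a test, it may help to check that a dataset file follows this shape. The sketch below assumes all five keys shown above are required and that ids must be unique. Neither rule is stated by the skill, so adjust them to match the actual protocol in methodology.md.

```python
import json

# Assumption: every case needs the five keys shown in the example dataset.
REQUIRED_KEYS = {"id", "type", "input", "expected", "tags"}

def validate_dataset(dataset):
    """Return a list of problems; an empty list means the dataset looks well-formed."""
    problems = []
    cases = dataset.get("cases", [])
    if not cases:
        problems.append("dataset has no cases")
    seen = set()
    for index, case in enumerate(cases):
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            problems.append(f"case {index}: missing keys {sorted(missing)}")
        case_id = case.get("id")
        if case_id in seen:
            problems.append(f"duplicate id {case_id!r}")
        seen.add(case_id)
    return problems

raw = '''{"name": "Test Dataset v1", "cases": [
  {"id": "case_001", "type": "standard", "input": "Test input",
   "expected": "Expected output", "tags": ["basic", "format"]},
  {"id": "case_001", "type": "edge_case", "input": "Edge input"}
]}'''
problems = validate_dataset(json.loads(raw))
print(problems)
```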

Test Report

# A/B Test Report: {{TEST_NAME}}

## Configuration

| Parameter | Value |
|-----------|-------|
| Date | {{DATE}} |
| Dataset | {{DATASET}} |
| Cases tested | {{N_CASES}} |
| Model | {{MODEL}} |

## Tested Variants

### Variant A (Baseline)
[Description or link to prompt A]

### Variant B (Challenger)
[Description or link to prompt B]

## Results

### Overall Scores

| Metric | A | B | Delta | Winner |
|--------|---|---|-------|--------|
| Accuracy | X% | Y% | +/-Z% | A/B |
| Compliance | X% | Y% | +/-Z% | A/B |
| Tokens | X | Y | +/-Z | A/B |
| Latency | Xms | Yms | +/-Zms | A/B |

### Detail by Case Type

| Type | A | B | Notes |
|------|---|---|-------|
| Standard | X% | Y% | |
| Edge cases | X% | Y% | |
| Error cases | X% | Y% | |

### Problematic Cases

| Case ID | Expected | A | B | Analysis |
|---------|----------|---|---|----------|
| case_XXX | ... | ❌ | ✅ | [Explanation] |

## Analysis

### B's Strengths
- [Improvement 1]
- [Improvement 2]

### B's Weaknesses
- [Regression 1]

### Observations
[Qualitative insights]

## Recommendation

**Verdict**: ✅ Adopt B / ⚠️ Iterate / ❌ Keep A

**Confidence**: High / Medium / Low

**Justification**:
[Explanation of recommendation]

## Next Steps
1. [Action 1]
2. [Action 2]
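
The rows of the Overall Scores table can be generated rather than filled in by hand. This is a small sketch with a hypothetical `score_row` helper. The `higher_is_better` flag and the "tie" label are assumptions, since the template only lists A/B as winners.

```python
def score_row(metric, a, b, higher_is_better=True, unit=""):
    """Format one row of the Overall Scores table with delta and winner."""
    delta = b - a
    if delta == 0:
        winner = "tie"  # assumption: the template does not define a tie label
    elif (delta > 0) == higher_is_better:
        winner = "B"
    else:
        winner = "A"
    return f"| {metric} | {a}{unit} | {b}{unit} | {delta:+g}{unit} | {winner} |"

accuracy_row = score_row("Accuracy", 82, 88, unit="%")
latency_row = score_row("Latency", 950, 1010, higher_is_better=False, unit="ms")
print(accuracy_row)
print(latency_row)
```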

Commands

# Create a test
/prompt test create --name "Test v1" --dataset tests.json

# Run an A/B test
/prompt test run --a prompt_a.md --b prompt_b.md --dataset tests.json

# View results
/prompt test results --id test_001

# Compare two tests
/prompt test compare --tests test_001,test_002

Decision Criteria

When to adopt variant B?

IF:
  - Accuracy B >= Accuracy A
  AND (Tokens B <= Tokens A * 1.1 OR accuracy improvement > 5%)
  AND no regression on edge cases
THEN:
  → Adopt B

ELSE IF:
  - Accuracy improvement > 10%
  AND token regression < 20%
THEN:
  → Consider B (acceptable trade-off)

ELSE:
  → Keep A or iterate
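
The rule above can be expressed as a small function. This sketch assumes accuracy is a fraction in [0, 1], that "improvement" means a difference in percentage points, and that "token regression" means relative growth in token count. The rule itself leaves these units unspecified, and the function and parameter names are illustrative.

```python
def decide(acc_a, acc_b, tokens_a, tokens_b, edge_regression):
    """Apply the adoption rule to two variants.

    acc_a, acc_b: accuracy as fractions in [0, 1].
    tokens_a, tokens_b: average token counts per request.
    edge_regression: True if B does worse than A on edge cases.
    """
    acc_gain = acc_b - acc_a                          # percentage points (assumption)
    token_growth = (tokens_b - tokens_a) / tokens_a   # relative growth (assumption)

    within_budget = tokens_b <= tokens_a * 1.1 or acc_gain > 0.05
    if acc_b >= acc_a and within_budget and not edge_regression:
        return "adopt B"
    if acc_gain > 0.10 and token_growth < 0.20:
        return "consider B"
    return "keep A or iterate"

print(decide(0.80, 0.82, 1000, 1050, edge_regression=False))
print(decide(0.80, 0.92, 1000, 1150, edge_regression=True))
print(decide(0.80, 0.79, 1000, 900, edge_regression=False))
```

Under this reading, the second branch can only fire when B regresses on edge cases, because an accuracy improvement above 10% already satisfies the first two conditions of the first rule.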

Best Practices

Use at least 20 test cases; smaller datasets rarely support a statistically meaningful comparison
Include edge cases (15-20% of the dataset)
Run each test several times to check consistency
Document hypotheses before testing
Version the prompts being tested
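
To put a number on statistical confidence, one option is an exact sign test on the cases where A and B disagree. Both variants run on the same dataset, so the results are paired. This technique is swapped in here for illustration. The skill does not name a specific test, and methodology.md may prescribe a different one.

```python
from math import comb

def paired_sign_test(a_pass, b_pass):
    """Two-sided exact sign test on paired pass/fail results.

    a_pass, b_pass: lists of booleans, one entry per test case, in the same order.
    Returns the p-value under the null hypothesis that A and B are equally good.
    """
    b_only = sum(1 for a, b in zip(a_pass, b_pass) if b and not a)
    a_only = sum(1 for a, b in zip(a_pass, b_pass) if a and not b)
    n = a_only + b_only
    if n == 0:
        return 1.0  # no disagreements, so no evidence either way
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# 20 cases: both pass 11, only B passes 7, only A passes 1, both fail 1.
a_pass = [True] * 11 + [False] * 7 + [True] + [False]
b_pass = [True] * 11 + [True] * 7 + [False] + [False]
p_value = paired_sign_test(a_pass, b_pass)
print(f"p = {p_value:.4f}")
```

In this example B wins 7 of the 8 disagreements, yet the p-value is about 0.07. That suggests 20 cases is better treated as a floor than as a guarantee of significance.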
