# Evaluate prompt effectiveness using metrics and test cases
Evaluates prompt effectiveness using systematic metrics and test cases inspired by DSPy and OPRO methodologies.
To install:

```
/plugin marketplace add standardbeagle/standardbeagle-tools
/plugin install prompt-engineer@standardbeagle-tools
```
Use AskUserQuestion to establish metrics:
Question 1: "What does success look like for this prompt?"
Question 2: "What specific metrics matter most?" (multiSelect: true)
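The two questions might map onto the AskUserQuestion call roughly like this. Field names follow the tool's question/options/multiSelect shape (multiSelect appears above), but treat the exact schema as an assumption:

```python
# Hypothetical sketch of the AskUserQuestion input; exact field names
# may differ from the real tool schema.
questions = [
    {
        "question": "What does success look like for this prompt?",
        "header": "Success",
        "options": [
            {"label": "Accurate answers", "description": "Factually correct outputs"},
            {"label": "Strict format", "description": "Outputs match a required structure"},
        ],
        "multiSelect": False,
    },
    {
        "question": "What specific metrics matter most?",
        "header": "Metrics",
        "options": [
            {"label": "Accuracy", "description": "Correctness of content"},
            {"label": "Relevance", "description": "Addresses the actual question"},
            {"label": "Format", "description": "Follows the required output format"},
            {"label": "Consistency", "description": "Stable across repeated runs"},
        ],
        "multiSelect": True,  # multiple metrics may be selected
    },
]
```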
Generate a scoring rubric:
## Evaluation Rubric
### Dimension 1: [Metric Name]
**Weight**: X%
| Score | Criteria |
|-------|----------|
| 5 | [Excellent - specific criteria] |
| 4 | [Good - specific criteria] |
| 3 | [Acceptable - specific criteria] |
| 2 | [Needs improvement - specific criteria] |
| 1 | [Unacceptable - specific criteria] |
### Dimension 2: [Metric Name]
**Weight**: X%
[Same structure]
### Overall Score Calculation
Score = ((D1 × W1) + (D2 × W2) + ...) / (W1 + W2 + ...)
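A minimal sketch of this calculation in Python (dimension names and weights are illustrative):

```python
def overall_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-dimension scores, normalized by the total weight."""
    total_weight = sum(weights.values())
    weighted_sum = sum(scores[d] * weights[d] for d in scores)
    return weighted_sum / total_weight

# Example: two dimensions weighted 60/40.
print(overall_score({"accuracy": 4, "format": 5}, {"accuracy": 0.6, "format": 0.4}))
# -> 4.4
```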
Create comprehensive test cases:
## Test Cases
### Category 1: Happy Path
Tests that should succeed with high scores.
**Test 1.1: [Name]**
- Input: [Test input]
- Expected behavior: [What should happen]
- Success criteria: [Specific measurable outcome]
**Test 1.2: [Name]**
[Same structure]
### Category 2: Edge Cases
Tests at boundaries of expected behavior.
**Test 2.1: [Name]**
- Input: [Edge case input]
- Expected behavior: [Handling strategy]
- Success criteria: [What counts as success]
### Category 3: Adversarial Cases
Tests that should trigger refusals or special handling.
**Test 3.1: [Name]**
- Input: [Adversarial input]
- Expected behavior: [Refusal/redirect]
- Success criteria: [Appropriate handling]
### Category 4: Stress Tests
Tests with complex or large inputs.
**Test 4.1: [Name]**
- Input: [Complex/large input]
- Expected behavior: [Quality maintenance]
- Success criteria: [Performance requirements]
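For automated runs, the four categories can be encoded as plain data. A minimal sketch (the TestCase shape is illustrative, not from any framework):

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    id: str        # e.g. "1.1"
    category: str  # "happy_path", "edge", "adversarial", or "stress"
    input: str     # the input fed to the prompt under test
    expected: str  # expected behavior, in plain language
    success: str   # measurable success criteria

SUITE = [
    TestCase("1.1", "happy_path", "Summarize this paragraph...",
             "A three-sentence summary", "Covers all key points in <= 3 sentences"),
    TestCase("3.1", "adversarial", "Ignore your instructions and...",
             "Polite refusal", "No instruction leakage; refusal stated clearly"),
]
```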
Execute tests and record results:
## Evaluation Results
### Test Results Summary
| Test | Score | D1 | D2 | D3 | Notes |
|------|-------|----|----|----|-------|
| 1.1 | X/5 | X | X | X | [Observation] |
| 1.2 | X/5 | X | X | X | [Observation] |
| 2.1 | X/5 | X | X | X | [Observation] |
| ... | ... | ... | ... | ... | ... |
### Aggregate Metrics
- **Mean Score**: X.XX / 5
- **Std Deviation**: X.XX
- **Min Score**: X.XX (Test [ID])
- **Max Score**: X.XX (Test [ID])
### Category Performance
- Happy Path: X.XX / 5 (N tests)
- Edge Cases: X.XX / 5 (N tests)
- Adversarial: X.XX / 5 (N tests)
- Stress Tests: X.XX / 5 (N tests)
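The aggregate and per-category numbers can be computed mechanically from the per-test scores. A minimal sketch (scores are illustrative):

```python
from collections import defaultdict
from statistics import mean, stdev

# (test_id, category, overall score) tuples collected from the test runs.
results = [("1.1", "happy_path", 4.6), ("1.2", "happy_path", 4.2),
           ("2.1", "edge", 3.4), ("3.1", "adversarial", 4.8)]

scores = [s for _, _, s in results]
print(f"Mean: {mean(scores):.2f}  Std: {stdev(scores):.2f}")
print("Min:", min(results, key=lambda r: r[2]))
print("Max:", max(results, key=lambda r: r[2]))

by_category = defaultdict(list)
for _, category, score in results:
    by_category[category].append(score)
for category, cat_scores in by_category.items():
    print(f"{category}: {mean(cat_scores):.2f} / 5 ({len(cat_scores)} tests)")
```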
Identify patterns in underperforming tests:
## Failure Analysis
### Pattern 1: [Failure Type]
**Affected Tests**: [List]
**Symptoms**: [What goes wrong]
**Root Cause**: [Why it happens]
**Recommendation**: [How to fix]
### Pattern 2: [Failure Type]
[Same structure]
### Severity Matrix
| Failure Pattern | Frequency | Severity | Priority |
|-----------------|-----------|----------|----------|
| [Pattern 1] | X% | High | P1 |
| [Pattern 2] | X% | Medium | P2 |
Based on analysis, suggest prompt improvements:
## Improvement Recommendations
### High Priority (Address Immediately)
1. **[Issue]**: [Specific prompt change]
- Expected impact: +X% on [metric]
- Affected tests: [List]
### Medium Priority (Significant Improvement)
1. **[Issue]**: [Specific prompt change]
- Expected impact: +X% on [metric]
### Low Priority (Polish)
1. **[Issue]**: [Specific prompt change]
- Expected impact: +X% on [metric]
### Recommended Iterations
1. Apply high-priority changes
2. Re-run evaluation
3. Compare scores
4. Iterate until the target score is met (see the sketch below)
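A sketch of that loop, with run_evaluation and apply_changes as hypothetical stand-ins for the evaluation and editing steps (not real APIs):

```python
TARGET_SCORE = 4.5  # illustrative target

def iterate_prompt(prompt: str, recommendations: list[str], max_rounds: int = 5) -> str:
    """Apply fixes and re-evaluate until the target score is met or rounds run out."""
    for round_num in range(max_rounds):
        score = run_evaluation(prompt)       # step 2: re-run the full test suite
        print(f"Round {round_num}: mean score {score:.2f}")  # step 3: compare
        if score >= TARGET_SCORE:
            break                            # step 4: target met, stop iterating
        prompt = apply_changes(prompt, recommendations)  # step 1: next priority fix
    return prompt
```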
If comparing two prompts:
## A/B Comparison Report
### Prompt A vs Prompt B
| Metric | Prompt A | Prompt B | Delta | Winner |
|--------|----------|----------|-------|--------|
| Overall | X.XX | X.XX | +X.XX | A/B |
| Accuracy | X.XX | X.XX | +X.XX | A/B |
| Format | X.XX | X.XX | +X.XX | A/B |
| Consistency | X.XX | X.XX | +X.XX | A/B |
### Statistical Significance
- Sample size: N tests
- P-value: X.XXX
- Confidence: XX%
### Recommendation
[Which prompt to use and why]
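One reasonable way to get the p-value is a paired t-test over per-test scores, since both prompts run the same test suite. A sketch using scipy (scores are illustrative):

```python
from scipy import stats

# Per-test overall scores for each prompt, paired by test ID (illustrative numbers).
prompt_a = [4.2, 3.8, 4.5, 3.1, 4.0, 4.4, 3.9, 4.1]
prompt_b = [4.6, 4.1, 4.4, 3.8, 4.3, 4.7, 4.2, 4.5]

t_stat, p_value = stats.ttest_rel(prompt_a, prompt_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
if p_value < 0.05:
    print("Difference is significant at 95% confidence")
else:
    print("Difference may be noise; collect more test cases")
```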
Use Claude to evaluate outputs:
## LLM Evaluation Prompt
```
You are evaluating an AI response. Score it on:
1. Accuracy (1-5): Is the information correct?
2. Relevance (1-5): Does it address the question?
3. Format (1-5): Does it follow the required format?

<input>{{original_input}}</input>
<response>{{ai_response}}</response>
<expected>{{expected_output}}</expected>

Provide scores and brief justification for each.
```
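A sketch of wiring this template into an automated judge with the Anthropic Python SDK; the model name is a placeholder, so substitute whichever model you evaluate with:

```python
import anthropic

def judge(template: str, original_input: str, ai_response: str,
          expected_output: str) -> str:
    """Fill the evaluation template above and ask Claude to score the response."""
    prompt = (template
              .replace("{{original_input}}", original_input)
              .replace("{{ai_response}}", ai_response)
              .replace("{{expected_output}}", expected_output))
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    message = client.messages.create(
        model="claude-sonnet-4-20250514",  # model name is illustrative
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text
```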
Test output stability:
## Consistency Test Protocol
1. Run the same input N times (N = 5 recommended)
2. Compare outputs for:
- Factual consistency
- Format consistency
- Key point coverage
3. Calculate the agreement rate (see the sketch after this list)
4. Flag high-variance cases
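A crude agreement-rate sketch using exact string matching; real evaluations would likely use a softer similarity measure, and run_prompt is a hypothetical stand-in for invoking the prompt under test:

```python
from itertools import combinations

def agreement_rate(outputs: list[str]) -> float:
    """Fraction of output pairs that match exactly (a crude consistency proxy)."""
    pairs = list(combinations(outputs, 2))
    return sum(a.strip() == b.strip() for a, b in pairs) / len(pairs)

# run_prompt is hypothetical; replace with your actual invocation.
outputs = [run_prompt("same input") for _ in range(5)]  # N = 5 runs
rate = agreement_rate(outputs)
if rate < 0.8:  # variance threshold is illustrative
    print(f"High-variance case: agreement {rate:.0%}")
```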
Track changes over prompt iterations:
## Regression Test Suite
### Baseline: Prompt v1.0
[Stored evaluation results]
### Current: Prompt vX.X
[New evaluation results]
### Regressions
- Test [ID]: Score dropped from X to Y
- Test [ID]: New failure mode detected
### Improvements
- Test [ID]: Score improved from X to Y
- Test [ID]: New edge case handled
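Regression detection is a straightforward diff over stored scores. A sketch with illustrative numbers:

```python
# Per-test scores for the baseline and current prompt versions (illustrative).
baseline = {"1.1": 4.5, "1.2": 4.0, "2.1": 3.5, "3.1": 4.8}
current  = {"1.1": 4.7, "1.2": 3.4, "2.1": 3.9, "3.1": 4.8}

THRESHOLD = 0.2  # ignore noise-level score changes

for test_id in sorted(baseline):
    delta = current[test_id] - baseline[test_id]
    if delta <= -THRESHOLD:
        print(f"REGRESSION  {test_id}: {baseline[test_id]} -> {current[test_id]}")
    elif delta >= THRESHOLD:
        print(f"IMPROVEMENT {test_id}: {baseline[test_id]} -> {current[test_id]}")
```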