Prompt testing, metrics, and A/B testing frameworks
Runs comprehensive tests on prompts using accuracy, consistency, and robustness metrics. Triggers when you need to validate prompt performance with structured test cases and A/B comparisons.
/plugin marketplace add pluginagentmarketplace/custom-plugin-prompt-engineering/plugin install prompt-engineering-assistant@pluginagentmarketplace-prompt-engineeringThis skill inherits all available tools. When active, it can use any tool Claude has access to.
assets/config.yamlreferences/GUIDE.mdscripts/helper.pyBonded to: evaluation-testing-agent
Skill("custom-plugin-prompt-engineering:prompt-evaluation")
parameters:
evaluation_type:
type: enum
values: [accuracy, consistency, robustness, efficiency, ab_test]
required: true
test_cases:
type: array
items:
input: string
expected: string|object
category: string
min_items: 5
metrics:
type: array
items: string
default: [accuracy, consistency]
| Metric | Formula | Target |
|---|---|---|
| Accuracy | correct / total | > 0.90 |
| Consistency | identical_runs / total_runs | > 0.85 |
| Robustness | edge_cases_passed / edge_cases_total | > 0.80 |
| Format Compliance | valid_format / total | > 0.95 |
| Token Efficiency | quality_score / tokens_used | Maximize |
test_case:
id: "TC001"
category: "happy_path|edge_case|adversarial|regression"
description: "What this test verifies"
priority: "critical|high|medium|low"
input:
user_message: "Test input text"
context: {} # Optional additional context
expected:
contains: ["required", "phrases"]
not_contains: ["forbidden", "content"]
format: "json|text|markdown"
schema: {} # Optional JSON schema
evaluation:
metrics: ["accuracy", "format_compliance"]
threshold: 0.9
timeout: 30
categories:
happy_path:
description: "Standard expected inputs"
coverage_target: 40%
examples:
- typical_user_query
- common_use_case
- expected_format
edge_cases:
description: "Boundary conditions"
coverage_target: 25%
examples:
- empty_input
- very_long_input
- special_characters
- unicode_text
- minimal_input
adversarial:
description: "Attempts to break prompt"
coverage_target: 20%
examples:
- injection_attempts
- conflicting_instructions
- ambiguous_requests
- malformed_input
regression:
description: "Previously failed cases"
coverage_target: 15%
examples:
- fixed_bugs
- known_edge_cases
accuracy_rubric:
1.0: "Exact match to expected output"
0.8: "Semantically equivalent, minor differences"
0.5: "Partially correct, major elements present"
0.2: "Related but incorrect"
0.0: "Completely wrong or off-topic"
consistency_rubric:
1.0: "Identical across all runs"
0.8: "Minor variations (punctuation, formatting)"
0.5: "Significant variations, same meaning"
0.2: "Different approaches, inconsistent"
0.0: "Completely different each time"
robustness_rubric:
1.0: "Handles all edge cases correctly"
0.8: "Fails gracefully with clear error messages"
0.5: "Some edge cases cause issues"
0.2: "Many edge cases fail"
0.0: "Breaks on most edge cases"
ab_test_config:
name: "Prompt Variant Comparison"
hypothesis: "New prompt improves accuracy by 10%"
variants:
control:
prompt: "{original_prompt}"
allocation: 50%
treatment:
prompt: "{new_prompt}"
allocation: 50%
metrics:
primary:
- name: accuracy
min_improvement: 0.05
significance: 0.05
secondary:
- token_count
- response_time
- user_satisfaction
sample:
minimum: 100
power: 0.8
stopping_rules:
- condition: significant_regression
action: stop_treatment
- condition: clear_winner
threshold: 0.99
action: early_stop
report:
metadata:
prompt_name: "string"
version: "string"
date: "ISO8601"
evaluator: "string"
summary:
overall_score: 0.87
status: "PASS|FAIL|REVIEW"
recommendation: "string"
metrics:
accuracy: 0.92
consistency: 0.88
robustness: 0.79
format_compliance: 0.95
test_results:
total: 50
passed: 43
failed: 7
pass_rate: 0.86
failures:
- id: "TC023"
category: "edge_case"
issue: "Description of failure"
severity: "high|medium|low"
input: "..."
expected: "..."
actual: "..."
recommendations:
- priority: high
action: "Specific improvement"
- priority: medium
action: "Another improvement"
workflow:
1_prepare:
- Load prompt under test
- Load test cases
- Configure metrics
2_execute:
- Run all test cases
- Record outputs
- Measure latency
3_score:
- Compare against expected
- Calculate metrics
- Identify patterns
4_analyze:
- Categorize failures
- Find root causes
- Prioritize issues
5_report:
- Generate report
- Make recommendations
- Archive results
| Issue | Cause | Solution |
|---|---|---|
| All tests fail | Wrong expected outputs | Review test design |
| Flaky tests | Non-determinism | Lower temperature |
| False positives | Lenient matching | Tighten criteria |
| False negatives | Strict matching | Use semantic similarity |
| Slow evaluation | Too many tests | Sample strategically |
integrates_with:
- prompt-optimization: Improvement feedback loop
- all skills: Quality gate before deployment
automation:
ci_cd: "Run on prompt changes"
scheduled: "Weekly regression"
triggered: "On demand"
See references/GUIDE.md for evaluation methodology.
See scripts/helper.py for automation utilities.
Creating algorithmic art using p5.js with seeded randomness and interactive parameter exploration. Use this when users request creating art using code, generative art, algorithmic art, flow fields, or particle systems. Create original algorithmic art rather than copying existing artists' work to avoid copyright violations.
Applies Anthropic's official brand colors and typography to any sort of artifact that may benefit from having Anthropic's look-and-feel. Use it when brand colors or style guidelines, visual formatting, or company design standards apply.
Create beautiful visual art in .png and .pdf documents using design philosophy. You should use this skill when the user asks to create a poster, piece of art, design, or other static piece. Create original visual designs, never copying existing artists' work to avoid copyright violations.