Evaluate skills by executing them across haiku, sonnet, and opus models using sub-agents, scoring expected behaviors 0-100 with weighted calculations to differentiate model capabilities. Use when testing whether a skill works correctly, comparing model performance, or finding the cheapest compatible model.
/plugin marketplace add taisukeoe/agentic-ai-skills-creator
/plugin install skills-helper@agentic-skills-creator

This skill inherits all available tools. When active, it can use any tool Claude has access to.
Supporting files: references/evaluation-structure.md, tests/scenarios.md

Evaluate skills across multiple Claude models using sub-agents with quality-based scoring.
Requirement: Claude Code CLI only. Not available in Claude.ai.
Binary pass/fail ("did it do X?") fails to differentiate models: all models can "do the steps." The difference is how well they do them. This skill uses weighted scoring to reveal those capability differences.
Check for tests/scenarios.md in the target skill directory.
Default to difficult scenarios: When multiple scenarios exist, prioritize Hard or Medium difficulty scenarios for evaluation. Easy scenarios often don't show meaningful differences between models and aren't realistic for production use.
Required scenario format:
## Scenario: [Name]
**Difficulty:** Easy | Medium | Hard | Edge-case
**Query:** User request that triggers this skill
**Expected behaviors:**
1. [Action description]
- **Minimum:** What counts as "did it"
- **Quality criteria:** What "did it well" looks like
- **Haiku pitfall:** Common failure mode
- **Weight:** 1-5
**Output validation:** (optional)
- Pattern: `regex`
- Line count: `< N`
If scenarios.md is missing or uses the old format, ask the user to update it following references/evaluation-structure.md.
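For illustration, a minimal format check could look like the following Python sketch. The regexes mirror the fields above; the helper name and the exact checks are assumptions, not part of the skill.

```python
import re
from pathlib import Path

# Fields that the new scenario format requires (see the template above).
REQUIRED_FIELDS = [
    r"^## Scenario:",
    r"^\*\*Difficulty:\*\*",
    r"^\*\*Query:\*\*",
    r"^\*\*Expected behaviors:\*\*",
    r"\*\*Weight:\*\*",
]

def scenarios_file_looks_valid(skill_dir: str) -> bool:
    """True if tests/scenarios.md exists and contains every required field."""
    path = Path(skill_dir) / "tests" / "scenarios.md"
    if not path.exists():
        return False
    text = path.read_text(encoding="utf-8")
    return all(re.search(p, text, flags=re.MULTILINE) for p in REQUIRED_FIELDS)
```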
Spawn Task sub-agents for each model in parallel.
Prompt template:
Execute the skill at {skill_path} with this query:
{evaluation_query}
IMPORTANT:
- Actually execute the skill, don't just describe what you would do.
- Create output directory under Claude Code's working directory ($PWD):
$PWD/.ai_text/{yyyyMMdd}/tmp/{skill_name}-{model}-{hhmmss}/
(Example: If $PWD=/path/to/project, create /path/to/project/.ai_text/20250101/tmp/formatting-tables-haiku-143052/)
- Create all output files under that directory.
- If the skill asks questions, record the exact questions, then assume reasonable answers and proceed.
Return ONLY (keep it brief to minimize tokens):
- Questions skill asked: [list exact questions the skill asked you, or "none"]
- Assumed answers: [your assumed answers to those questions, or "n/a"]
- Key decisions: [1-2 sentences on freedom level, structure choices]
- Files created: [paths only, no content]
- Errors: [any errors, or "none"]
Do NOT include file contents or detailed explanations.
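For reference, the directory naming in the prompt above corresponds roughly to this Python sketch. It is illustrative only; sub-agents can create the directory however they like.

```python
import os
from datetime import datetime
from pathlib import Path

def make_output_dir(skill_name: str, model: str) -> Path:
    """Build $PWD/.ai_text/{yyyyMMdd}/tmp/{skill_name}-{model}-{hhmmss}/ and create it."""
    now = datetime.now()
    out_dir = (
        Path(os.getcwd())
        / ".ai_text"
        / now.strftime("%Y%m%d")                              # e.g. 20250101
        / "tmp"
        / f"{skill_name}-{model}-{now.strftime('%H%M%S')}"    # e.g. formatting-tables-haiku-143052
    )
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir
```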
Use Task tool with model parameter: haiku, sonnet, opus
After the sub-agents complete, read the created files directly using Glob + Read to evaluate file quality (naming, structure, content). The minimal report supplies process information (questions asked, decisions made) that cannot be inferred from the files.
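A rough Python equivalent of that Glob + Read pass, shown only to make the pattern concrete (the helper and the glob pattern are assumptions based on the naming convention above):

```python
from pathlib import Path

def collect_outputs(skill_name: str, model: str, root: str = ".") -> dict[str, str]:
    """Gather the files a sub-agent created so their naming, structure, and content can be scored."""
    pattern = f".ai_text/*/tmp/{skill_name}-{model}-*/**/*"
    files = {}
    for path in Path(root).glob(pattern):
        if path.is_file():
            files[str(path)] = path.read_text(encoding="utf-8", errors="replace")
    return files
```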
For each expected behavior, score 0-100:
| Score | Meaning |
|---|---|
| 0 | Not attempted or completely wrong |
| 25 | Attempted but below minimum |
| 50 | Meets minimum criteria |
| 75 | Meets most quality criteria |
| 100 | Meets all quality criteria |
Scoring per behavior:
Behavior score = base score after applying deductions for known pitfalls (e.g., Haiku pitfalls)
Total = Σ(behavior_score × weight) / Σ(weights)
Rating thresholds:
| Score | Rating | Meaning |
|---|---|---|
| 90-100 | ✅ Excellent | Production ready |
| 75-89 | ✅ Good | Acceptable |
| 50-74 | ⚠️ Partial | Quality issues |
| 25-49 | ⚠️ Marginal | Significant problems |
| 0-24 | ❌ Fail | Does not work |
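A minimal sketch of the weighted-total calculation and rating lookup, using the scoring rules and thresholds above:

```python
def weighted_total(behaviors: list[tuple[int, int]]) -> float:
    """behaviors: (score_0_to_100, weight_1_to_5) pairs for one model."""
    total_weight = sum(weight for _, weight in behaviors)
    return sum(score * weight for score, weight in behaviors) / total_weight

def rating(score: float) -> str:
    if score >= 90:
        return "✅ Excellent"   # production ready
    if score >= 75:
        return "✅ Good"        # acceptable
    if score >= 50:
        return "⚠️ Partial"     # quality issues
    if score >= 25:
        return "⚠️ Marginal"    # significant problems
    return "❌ Fail"            # does not work

# Haiku in the report template below: scores 25, 50, 50 with weights 4, 3, 5
# -> (25*4 + 50*3 + 50*5) / 12 = 500 / 12 ≈ 41.7, reported as 42 (⚠️ Marginal)
```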
After evaluation, add a table to the skill's README documenting the results:
README section format:
## Evaluation Results
| Date | Scenario | Difficulty | Model | Score | Rating |
|------|----------|------------|-------|-------|--------|
| 2025-01-15 | Standard workflow | Hard | claude-haiku-4-5-20250101 | 42 | ⚠️ Marginal |
| 2025-01-15 | Standard workflow | Hard | claude-sonnet-4-5-20250929 | 85 | ✅ Good |
| 2025-01-15 | Standard workflow | Hard | claude-opus-4-5-20251101 | 100 | ✅ Excellent |
Table requirements:
- Use full model identifiers (e.g., `claude-sonnet-4-5-20250929`), not just short names

This creates a historical record of how the skill performs across models and of how it improves over time.
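If you script the update, appending a row could look like this sketch. It assumes the Evaluation Results table sits at the end of the README; all names are illustrative.

```python
from datetime import date
from pathlib import Path

def append_result_row(readme_path: str, scenario: str, difficulty: str,
                      model_id: str, score: int, rating_label: str) -> None:
    """Append one row to the Evaluation Results table (assumed to end the README)."""
    row = (f"| {date.today().isoformat()} | {scenario} | {difficulty} "
           f"| {model_id} | {score} | {rating_label} |\n")
    with Path(readme_path).open("a", encoding="utf-8") as f:
        f.write(row)
```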
## Model Evaluation Results
**Skill:** {skill_path}
**Scenario:** {scenario_name} ({difficulty})
**Date:** {YYYY-MM-DD}
### Scores by Behavior
| Behavior | Weight | claude-haiku-4-5-20250101 | claude-sonnet-4-5-20250929 | claude-opus-4-5-20251101 |
|----------|--------|---------------------------|----------------------------|--------------------------|
| Asks clarifying questions | 4 | 25 | 75 | 100 |
| Determines freedom level | 3 | 50 | 75 | 100 |
| Creates proper SKILL.md | 5 | 50 | 100 | 100 |
### Total Scores
| Model | Score | Rating |
|-------|-------|--------|
| claude-haiku-4-5-20250101 | 42 | ⚠️ Marginal |
| claude-sonnet-4-5-20250929 | 85 | ✅ Good |
| claude-opus-4-5-20251101 | 100 | ✅ Excellent |
### Observations
- Haiku: Skipped justification for freedom level (pitfall)
- Haiku: Asked only 1 generic question vs 3 specific
- Sonnet: Met all quality criteria except verbose output
### Next Steps
- Add these results to the skill's README (see Step 5)
- Consider model selection based on your quality requirements and budget
Known pitfalls by model:
| Model | Pitfall | Detection |
|---|---|---|
| haiku | Shallow questions | Count specificity |
| haiku | Skip justification | Check reasoning present |
| haiku | Miss references | Check files read |
| sonnet | Over-engineering | Check scope creep |
| sonnet | Verbose reporting | High token count vs output |
| opus | Over-verbose output | Token count |
Note: Token usage includes both skill execution AND reporting overhead. Sonnet tends to produce detailed reports, which inflates its token count. Compare tool-use counts for execution efficiency.
Load scenarios (prioritize Hard) → Execute (parallel) → Score behaviors → Calculate totals → Add to README → Output summary
For detailed scoring guidelines, see references/evaluation-structure.md.