From bopen-tools
Creates evals, assertions, and evals.json files for skills; runs benchmark harness to measure effectiveness against baseline. Use when testing skill quality.
npx claudepluginhub b-open-io/claude-plugins --plugin bopen-tools
Slash command: /bopen-tools:benchmark-skills
Write evals for skills and run the benchmark harness to measure whether a skill actually helps compared to baseline (no skill).
Only two types of skills produce measurable benchmark delta:
What does NOT produce delta (don't waste time benchmarking these):
Before writing evals for a skill, verify ALL of these:
If any box fails, the skill is not a good benchmark candidate.
Every skill that wants benchmarking needs an evals/evals.json file:
skills/
  my-skill/
    SKILL.md
    evals/
      evals.json
{
  "skill_name": "my-skill",
  "evals": [
    {
      "id": 1,
      "prompt": "The exact prompt to send to the model",
      "expected_output": "Description of what a good response looks like",
      "files": [],
      "assertions": [
        {
          "id": "unique-assertion-id",
          "text": "Specific, verifiable claim about the output",
          "type": "qualitative"
        }
      ]
    }
  ]
}
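Before running the harness, it can help to sanity-check the file's shape. The following is a minimal sketch, not part of the harness: the interfaces mirror only the fields in the sample above, `validateEvalsFile` is a hypothetical helper, and treating assertion ids as unique across the whole file is an assumption.

```typescript
// Types mirror the sample evals.json; anything beyond the sample is an assumption.
interface Assertion {
  id: string;
  text: string;
  type: string; // the sample only shows "qualitative"; other values are not documented here
}

interface EvalCase {
  id: number;
  prompt: string;
  expected_output: string;
  files: string[];
  assertions: Assertion[];
}

interface EvalsFile {
  skill_name: string;
  evals: EvalCase[];
}

// Hypothetical helper: returns a list of problems; an empty list means the shape looks right.
function validateEvalsFile(data: unknown): string[] {
  const problems: string[] = [];
  const file = data as Partial<EvalsFile> | null;
  if (!file || typeof file !== "object") return ["root must be an object"];
  if (typeof file.skill_name !== "string" || file.skill_name === "") {
    problems.push("skill_name must be a non-empty string");
  }
  if (!Array.isArray(file.evals) || file.evals.length === 0) {
    problems.push("evals must be a non-empty array");
    return problems;
  }
  // Assumption: assertion ids should be unique across the whole file.
  const seen = new Set<string>();
  for (const ev of file.evals) {
    if (typeof ev.prompt !== "string" || ev.prompt.trim() === "") {
      problems.push(`eval ${ev.id}: prompt is empty`);
    }
    if (!Array.isArray(ev.assertions) || ev.assertions.length === 0) {
      problems.push(`eval ${ev.id}: needs at least one assertion`);
      continue;
    }
    for (const a of ev.assertions) {
      if (seen.has(a.id)) problems.push(`duplicate assertion id: ${a.id}`);
      seen.add(a.id);
    }
  }
  return problems;
}
```

Under bun, the parsed contents of evals/evals.json could be passed straight to this function before a benchmark run.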
Every eval prompt must be a trap — a prompt that reliably elicits the bad behavior the skill suppresses. If the baseline model passes your assertions without the skill, your test case is useless.
| Skill | Trap prompt | What baseline does wrong |
|---|---|---|
| humanize | "Write 4 company values with descriptions" | Produces tricolons, binary contrasts, punchline endings |
| humanize | "Explain the pros and cons of X" | Uses "not X — it's Y" pattern |
| geo-optimizer | "Generate an AgentFacts schema following NANDA" | Doesn't know NANDA protocol, hallucinates |
| geo-optimizer | "Audit this site for AI search visibility" | Doesn't know hedge density, 1MB threshold |
A proper eval checks BOTH directions:
If baseline passes an assertion, that assertion is not measuring delta.
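The rule above can be applied mechanically once both runs are graded. This is a rough sketch under assumptions: `AssertionResult` is a hypothetical record shape (the real benchmark.json layout is not shown here), and the category names are illustrative.

```typescript
// Hypothetical shape: one record per assertion, with the judge's verdict for each run.
interface AssertionResult {
  assertionId: string;
  baselinePassed: boolean;
  skillPassed: boolean;
}

type Signal =
  | "measures-delta" // baseline fails, skill passes: the trap works
  | "baseline-already-passes" // no delta possible from this assertion
  | "skill-fails-too" // the skill does not fix the behavior
  | "skill-regresses"; // the skill makes things worse

function classifyAssertion(r: AssertionResult): Signal {
  if (r.baselinePassed && r.skillPassed) return "baseline-already-passes";
  if (r.baselinePassed && !r.skillPassed) return "skill-regresses";
  if (!r.baselinePassed && r.skillPassed) return "measures-delta";
  return "skill-fails-too";
}

// Ids of assertions the baseline passes; these are candidates for rewriting or removal.
function nonDifferentiating(results: AssertionResult[]): string[] {
  return results.filter((r) => r.baselinePassed).map((r) => r.assertionId);
}
```

Only assertions classified as `measures-delta` contribute positive delta; everything `nonDifferentiating` returns is worth redesigning as a sharper trap.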
| Type | Reliability | Cost | Best for |
|---|---|---|---|
| not-contains / regex | Highest | Free | Banned phrases, specific patterns |
| Binary LLM judge | High | 1 API call | Presence/absence of behavior |
| G-Eval rubric (CoT) | Medium | 1 API call | Multi-dimensional quality |
Default to negative assertions for suppression skills. "Output does NOT contain tricolons" is more reliable than "output sounds natural."
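A deterministic negative assertion might look like the sketch below. It is illustrative only: `NegativeCheck` and `runNegativeCheck` are hypothetical names, the banned phrase is a placeholder, and the regex is a rough approximation of the "not X, it's Y" contrast from the humanize row, not the pattern the harness uses.

```typescript
// Hypothetical check format; the harness's real assertion types are not shown in this document.
interface NegativeCheck {
  id: string;
  bannedPhrases: string[]; // matched as case-insensitive substrings
  bannedPatterns: RegExp[];
}

interface CheckResult {
  id: string;
  passed: boolean;
  hits: string[]; // what was found, for debugging a failing case
}

function runNegativeCheck(check: NegativeCheck, output: string): CheckResult {
  const lower = output.toLowerCase();
  const hits: string[] = [];
  for (const phrase of check.bannedPhrases) {
    if (lower.includes(phrase.toLowerCase())) hits.push(phrase);
  }
  for (const pattern of check.bannedPatterns) {
    const match = output.match(pattern);
    if (match) hits.push(match[0]);
  }
  // The assertion passes only when nothing banned was found.
  return { id: check.id, passed: hits.length === 0, hits };
}

// Illustrative check: "not ... <dash or comma> it's ..." within one sentence.
const contrastCheck: NegativeCheck = {
  id: "no-binary-contrast",
  bannedPhrases: ["game-changer"], // placeholder phrase, not from the skill
  bannedPatterns: [/\bnot\s+[^.!?]{1,60}?[\u2014,-]\s*it['\u2019]?s\b/i],
};
```

Because checks like this cost nothing and give the same answer on every run, they are a natural first choice before reaching for an LLM judge.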
Bad assertions (will show 0% delta):
Good assertions (will show real delta):
If you're unsure what assertions to write for a new skill:
This prevents guessing at assertions that don't actually differentiate.
bun run benchmark # All skills with evals
bun run benchmark --skill geo-optimizer # Single skill
bun run benchmark --model claude-sonnet-4-6 # Override model (default: haiku)
bun run benchmark --concurrency 4 # Parallel workers
From within Claude Code, prefix the command with CLAUDECODE= (for example, CLAUDECODE= bun run benchmark) to avoid nested session errors.
The harness runs each eval prompt twice: once with the skill injected via --append-system-prompt, once without. Both outputs are graded by LLM-as-judge.
Results go to benchmarks/latest.json and per-skill evals/benchmark.json:
| Delta | Meaning | Action |
|---|---|---|
| > +20% | Strong skill | Publish |
| +1% to +20% | Weak signal | Improve evals or skill |
| 0% | No effect | Skill is redundant OR evals test wrong thing |
| Negative | Skill hurts | Skill confuses model or evals are bad |
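The bands in the table can be sketched as a small function. This assumes delta is the skill pass rate minus the baseline pass rate in percentage points, rounded to a whole number; the document does not define the formula, and `passRate` and `interpretDelta` are hypothetical names.

```typescript
type Verdict = "strong" | "weak" | "no-effect" | "hurts";

// Pass rate as a percentage of assertions passed.
function passRate(passed: number, total: number): number {
  if (total <= 0) throw new Error("total must be positive");
  return (passed / total) * 100;
}

// Assumption: delta = skill pass rate - baseline pass rate, in whole percentage points.
function interpretDelta(
  skillRate: number,
  baselineRate: number,
): { delta: number; verdict: Verdict } {
  const delta = Math.round(skillRate - baselineRate);
  if (delta > 20) return { delta, verdict: "strong" };
  if (delta >= 1) return { delta, verdict: "weak" };
  if (delta === 0) return { delta, verdict: "no-effect" };
  return { delta, verdict: "hurts" };
}
```

For example, a skill that passes 9 of 10 assertions against a baseline of 5 of 10 lands in the "strong" band under these assumptions.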
latest.json merges per-skill results when using the --skill flag.
The LLM-as-judge has known failure modes. When results seem wrong:
| Symptom | Likely cause | Fix |
|---|---|---|
| Everything passes | Assertions too vague | Make assertions more specific and binary |
| Inconsistent across runs | Judge non-deterministic | Set temperature=0 and require chain-of-thought reasoning before the verdict |
| Skill and baseline score the same | Testing knowledge model already has | Redesign as behavioral suppression test |
| Skill scores lower than baseline | Skill constraining model too much | Check if skill instructions conflict with prompt |
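Two of the fixes above, binary assertions and reasoning before the verdict, can be built into the judge prompt itself. The sketch below is an assumption about how that might look: the prompt wording and the `VERDICT:` line format are hypothetical, and the harness's actual judge prompt is not shown in this document. Setting temperature=0 would happen in the API call, which is omitted here.

```typescript
// Hypothetical prompt wording: asks for reasoning first, then one binary verdict line.
function buildJudgePrompt(assertion: string, output: string): string {
  return [
    "You are grading one assertion about a model response.",
    `Assertion: ${assertion}`,
    "Response:",
    output,
    "First explain your reasoning in a few sentences.",
    "Then end with exactly one final line: VERDICT: PASS or VERDICT: FAIL.",
  ].join("\n");
}

// Scans from the end so the last verdict line wins over anything quoted in the reasoning.
function parseVerdict(judgeReply: string): boolean | null {
  const lines = judgeReply.trim().split("\n").reverse();
  for (const line of lines) {
    const match = line.trim().match(/^VERDICT:\s*(PASS|FAIL)$/i);
    if (match) return match[1]?.toUpperCase() === "PASS";
  }
  return null; // unparseable reply: retry or flag it rather than counting it as a pass
}
```

Returning `null` for a reply with no verdict line keeps malformed judge output from silently inflating either pass rate.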
These patterns have been confirmed through multiple benchmark runs: