Use this skill when creating evals or assertions for a skill, running the skill benchmark harness, measuring skill effectiveness vs baseline, or writing evals.json files alongside skills. Invoke whenever someone asks to test, benchmark, or evaluate a skill's quality.
From bopen-tools. Install: `npx claudepluginhub b-open-io/claude-plugins --plugin bopen-tools`. This skill uses the workspace's default tool permissions.
Write evals for skills and run the benchmark harness to measure whether a skill actually helps compared to baseline (no skill).
Only two types of skills produce measurable benchmark delta:
What does NOT produce delta (don't waste time benchmarking these):
Before writing evals for a skill, verify ALL of these:
If any box fails, the skill is not a good benchmark candidate.
Every skill that wants benchmarking needs an `evals/evals.json` file:

```
skills/
  my-skill/
    SKILL.md
    evals/
      evals.json
```
```json
{
  "skill_name": "my-skill",
  "evals": [
    {
      "id": 1,
      "prompt": "The exact prompt to send to the model",
      "expected_output": "Description of what a good response looks like",
      "files": [],
      "assertions": [
        {
          "id": "unique-assertion-id",
          "text": "Specific, verifiable claim about the output",
          "type": "qualitative"
        }
      ]
    }
  ]
}
```
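The required shape can be checked before a run with a short validator. This is a sketch: the harness's real loader may differ, and the key names are taken only from the example above.

```python
import json

REQUIRED_EVAL_KEYS = {"id", "prompt", "expected_output", "files", "assertions"}
REQUIRED_ASSERTION_KEYS = {"id", "text", "type"}

def validate_evals(doc: dict) -> list[str]:
    """Return a list of problems; an empty list means the file matches the shape above."""
    problems = []
    if "skill_name" not in doc:
        problems.append("missing skill_name")
    for ev in doc.get("evals", []):
        missing = REQUIRED_EVAL_KEYS - ev.keys()
        if missing:
            problems.append(f"eval {ev.get('id')}: missing {sorted(missing)}")
        for a in ev.get("assertions", []):
            missing = REQUIRED_ASSERTION_KEYS - a.keys()
            if missing:
                problems.append(f"assertion {a.get('id')}: missing {sorted(missing)}")
    return problems

doc = json.loads("""
{
  "skill_name": "my-skill",
  "evals": [
    {"id": 1, "prompt": "p", "expected_output": "e", "files": [],
     "assertions": [{"id": "a-1", "text": "t", "type": "qualitative"}]}
  ]
}
""")
print(validate_evals(doc))  # []
```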
Every eval prompt must be a trap — a prompt that reliably elicits the bad behavior the skill suppresses. If the baseline model passes your assertions without the skill, your test case is useless.
| Skill | Trap prompt | What baseline does wrong |
|---|---|---|
| humanize | "Write 4 company values with descriptions" | Produces tricolons, binary contrasts, punchline endings |
| humanize | "Explain the pros and cons of X" | Uses "not X — it's Y" pattern |
| geo-optimizer | "Generate an AgentFacts schema following NANDA" | Doesn't know NANDA protocol, hallucinates |
| geo-optimizer | "Audit this site for AI search visibility" | Doesn't know hedge density, 1MB threshold |
A proper eval checks BOTH directions:
If baseline passes an assertion, that assertion is not measuring delta.
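The two-direction requirement reduces to a simple predicate: an assertion only contributes delta when the skill flips a baseline failure into a pass. A minimal sketch:

```python
def measures_delta(baseline_passed: bool, skill_passed: bool) -> bool:
    """An assertion measures delta only when the skill flips a baseline failure."""
    return skill_passed and not baseline_passed

assert measures_delta(False, True)       # real delta: trap worked, skill fixed it
assert not measures_delta(True, True)    # baseline already passes: useless assertion
assert not measures_delta(False, False)  # skill didn't help either
```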
| Type | Reliability | Cost | Best for |
|---|---|---|---|
| not-contains / regex | Highest | Free | Banned phrases, specific patterns |
| Binary LLM judge | High | 1 API call | Presence/absence of behavior |
| G-Eval rubric (CoT) | Medium | 1 API call | Multi-dimensional quality |
Default to negative assertions for suppression skills. "Output does NOT contain tricolons" is more reliable than "output sounds natural."
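A negative assertion can be implemented directly. This is an illustrative not-contains-style check for the "not X — it's Y" contrast pattern from the trap table; the regex is a sketch, not the harness's exact rule:

```python
import re

# Illustrative pattern for "not X — it's Y" binary contrasts (also catches "not X, it's Y").
CONTRAST = re.compile(r"\bnot\s+\w+[^.!?]*?(?:—|--|,)\s*it's\s+\w+", re.IGNORECASE)

def no_binary_contrast(output: str) -> bool:
    """Negative assertion: passes when the banned pattern is absent."""
    return CONTRAST.search(output) is None

assert not no_binary_contrast("It's not a feature — it's a philosophy.")
assert no_binary_contrast("The design favors clarity over cleverness.")
```

A check like this is free, deterministic, and never disagrees with itself across runs, which is why it sits at the top of the reliability table.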
Bad assertions (will show 0% delta):
Good assertions (will show real delta):
If you're unsure what assertions to write for a new skill:
This prevents guessing at assertions that don't actually differentiate.
```sh
bun run benchmark                             # All skills with evals
bun run benchmark --skill geo-optimizer       # Single skill
bun run benchmark --model claude-sonnet-4-6   # Override model (default: haiku)
bun run benchmark --concurrency 4             # Parallel workers
```
From within Claude Code, prefix the command with `CLAUDECODE=` to avoid nested-session errors.
The harness runs each eval prompt twice: once with the skill injected via `--append-system-prompt`, once without. Both outputs are graded by LLM-as-judge.
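The with/without pairing can be sketched as below; `complete` stands in for the real model call and flag plumbing (hypothetical names, not the harness's actual API):

```python
def run_pair(prompt, skill_text, complete):
    """Run one eval prompt twice: baseline (no skill) and with the skill text
    appended to the system prompt, mirroring --append-system-prompt."""
    return {
        "baseline": complete(system="", prompt=prompt),
        "with_skill": complete(system=skill_text, prompt=prompt),
    }

# Stub model that just records whether it saw a system prompt, to show the pairing.
def stub(system, prompt):
    return f"[skill]{prompt}" if system else prompt

result = run_pair("Write 4 company values", "Avoid tricolons.", stub)
assert result["baseline"] == "Write 4 company values"
assert result["with_skill"] == "[skill]Write 4 company values"
```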
Results go to `benchmarks/latest.json` and per-skill `evals/benchmark.json`:
| Delta | Meaning | Action |
|---|---|---|
| > +20% | Strong skill | Publish |
| +1% to +20% | Weak signal | Improve evals or skill |
| 0% | No effect | Skill is redundant OR evals test wrong thing |
| Negative | Skill hurts | Skill confuses model or evals are bad |
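The table above can be applied mechanically. The bands below are approximated from it (the harness may round or band differently):

```python
def interpret_delta(delta_pct: float) -> str:
    """Map skill-minus-baseline pass-rate delta (percentage points) to an action."""
    if delta_pct > 20:
        return "publish"
    if delta_pct > 0:
        return "improve evals or skill"
    if delta_pct == 0:
        return "skill redundant or evals test the wrong thing"
    return "skill hurts: check for conflicting instructions"

assert interpret_delta(35) == "publish"
assert interpret_delta(5) == "improve evals or skill"
assert interpret_delta(-3).startswith("skill hurts")
```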
`latest.json` merges per-skill results when using the `--skill` flag.

The LLM-as-judge has known failure modes. When results seem wrong:
| Symptom | Likely cause | Fix |
|---|---|---|
| Everything passes | Assertions too vague | Make assertions more specific and binary |
| Inconsistent across runs | Judge non-deterministic | Need temperature=0, CoT before verdict |
| Skill and baseline score the same | Testing knowledge model already has | Redesign as behavioral suppression test |
| Skill scores lower than baseline | Skill constraining model too much | Check if skill instructions conflict with prompt |
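The judge fixes above (temperature=0, CoT before verdict) can be sketched as a prompt template plus a strict parser; the template wording is illustrative, not the harness's exact judge prompt:

```python
def build_judge_prompt(output: str, assertion: str) -> str:
    """CoT-before-verdict template: reasoning first, a single parseable verdict last.
    Run the judge at temperature=0 for repeatability."""
    return (
        "Evaluate the output against the assertion.\n"
        f"Assertion: {assertion}\n"
        f"Output:\n{output}\n\n"
        "Reason step by step, then end with exactly one line:\n"
        "VERDICT: PASS or VERDICT: FAIL"
    )

def parse_verdict(judge_response: str) -> bool:
    """Read only the final line, so mid-reasoning mentions of PASS don't count."""
    return judge_response.strip().splitlines()[-1].strip() == "VERDICT: PASS"

assert parse_verdict("The output avoids tricolons.\nVERDICT: PASS")
assert not parse_verdict("It mentions PASS early but fails.\nVERDICT: FAIL")
```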
These patterns have been confirmed through multiple benchmark runs: