npx claudepluginhub raddue/crucible

This skill uses the workspace's default tool permissions.
This is not an executable skill. It contains evaluation data for measuring the accuracy of skill selection (routing) decisions.
Runs evaluation pipelines on Claude Code skills to test triggering accuracy, workflow correctness, and output quality. Spawns sub-agents for parallel execution and generates JSON reports.
Evaluates a skill's effectiveness by running behavioral test cases and grading results against assertions. Use to validate improvements, benchmark against baselines, or create eval cases.
Tests and benchmarks Claude Code skills empirically via evaluation-driven development. Compares skill vs baseline performance using pass rates, timing, and token metrics, in either a quick workflow or a 7-phase full pipeline.
Crucible's 49 execution evals measure quality once a skill is invoked. Selection evals measure whether the right skill gets invoked in the first place.
Each eval is rated easy/medium/hard based on routing ambiguity. This enables stratified baseline measurement — distinguishing between improvements that lift hard cases (high value) vs confirming easy cases already work (low signal).
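The stratified measurement described above can be sketched in a few lines of Python. The record fields (`difficulty`, `passed`) are hypothetical, chosen for illustration, and are not claimed to match Crucible's actual eval schema:

```python
from collections import defaultdict

def stratified_pass_rates(evals):
    """Group eval results by difficulty tier and compute a pass rate per tier.

    Each record is assumed to carry a 'difficulty' label
    ('easy' | 'medium' | 'hard') and a boolean 'passed' flag --
    hypothetical fields, not Crucible's actual schema.
    """
    tiers = defaultdict(list)
    for record in evals:
        tiers[record["difficulty"]].append(record["passed"])
    return {
        tier: sum(results) / len(results)
        for tier, results in tiers.items()
    }

# Toy results: easy cases all pass (low signal); the hard tier
# is where an improvement would actually show up.
results = [
    {"difficulty": "easy", "passed": True},
    {"difficulty": "easy", "passed": True},
    {"difficulty": "hard", "passed": True},
    {"difficulty": "hard", "passed": False},
]
rates = stratified_pass_rates(results)
# rates == {"easy": 1.0, "hard": 0.5}
```

Comparing a skill run against a baseline run tier by tier, rather than with a single aggregate pass rate, is what makes a lift on hard cases distinguishable from noise on easy ones.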
evals/evals.json — the eval data
GRADING.md — grading criteria and baseline measurement protocol