npx claudepluginhub raddue/crucible

This skill uses the workspace's default tool permissions.
This is not an executable skill. It contains evaluation data for measuring the accuracy of skill selection (routing) decisions.
Runs evaluation pipelines on Claude Code skills to test triggering accuracy, workflow correctness, and output quality. Spawns sub-agents for parallel execution and generates JSON reports.
Evaluates a skill's effectiveness by running behavioral test cases and grading results against assertions. Use to validate improvements, benchmark against baselines, or create eval cases.
Tests and benchmarks Claude Code skills empirically via evaluation-driven development. Compares skill vs baseline performance using pass rates, timing, and token metrics, in either a quick workflow or a 7-phase full pipeline.
Crucible's 49 execution evals measure quality once a skill is invoked. Selection evals measure whether the right skill gets invoked in the first place.
Each eval is rated easy/medium/hard based on routing ambiguity. This enables stratified baseline measurement — distinguishing between improvements that lift hard cases (high value) vs confirming easy cases already work (low signal).
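The stratified measurement described above can be sketched in a few lines of Python. The record fields (`difficulty`, `passed`) are hypothetical, chosen for illustration, and are not claimed to match Crucible's actual eval schema:

```python
from collections import defaultdict

def stratified_pass_rates(evals):
    """Group eval results by difficulty tier and compute a pass rate per tier.

    Each record is assumed to carry a 'difficulty' label
    ('easy' | 'medium' | 'hard') and a boolean 'passed' flag --
    hypothetical fields, not Crucible's actual schema.
    """
    tiers = defaultdict(list)
    for record in evals:
        tiers[record["difficulty"]].append(record["passed"])
    return {
        tier: sum(results) / len(results)
        for tier, results in tiers.items()
    }

# Toy results: easy cases all pass (low signal); the hard tier
# is where an improvement would actually show up.
results = [
    {"difficulty": "easy", "passed": True},
    {"difficulty": "easy", "passed": True},
    {"difficulty": "hard", "passed": True},
    {"difficulty": "hard", "passed": False},
]
rates = stratified_pass_rates(results)
# rates == {"easy": 1.0, "hard": 0.5}
```

Comparing a skill run against a baseline run tier by tier, rather than with a single aggregate pass rate, is what makes a lift on hard cases distinguishable from noise on easy ones.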
evals/evals.json — the eval data
GRADING.md — grading criteria and baseline measurement protocol