npx claudepluginhub yonatangross/orchestkit --plugin ork

This skill uses the workspace's default tool permissions.
Run `claude -p --bare` for fast, clean eval/grading without plugin overhead.
Evaluates Claude Code skill output quality via assertion-based grading, blind before/after version comparisons, and variance analysis across multiple runs for benchmarking.
Runs evaluation pipelines on Claude Code skills to test triggering accuracy, workflow correctness, and output quality. Spawns sub-agents for parallel execution and generates JSON reports.
Tests and benchmarks Claude Code skills empirically via evaluation-driven development. Compares skill vs baseline performance using pass rates, timing, token metrics in quick workflow or 7-phase full pipeline.
CC 2.1.81 required. The --bare flag skips hooks, LSP, plugin sync, and skill directory walks.
# Use --bare for any -p call that doesn't need plugins (no --plugin-dir)
# --bare requires ANTHROPIC_API_KEY (OAuth/keychain disabled)
export ANTHROPIC_API_KEY="sk-ant-..."
# Verify CC version
claude --version # Must be >= 2.1.81
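Both prerequisites can be checked in one place before enabling bare mode. A minimal sketch, assuming a helper named `bare_mode_ok` that you define yourself (it is not part of the skill's scripts) and that GNU `sort -V` is available:

```shell
# Hypothetical guard: bare mode needs an API key and CC >= 2.1.81.
# bare_mode_ok is illustrative, not part of the skill.
bare_mode_ok() {
  local version="$1" api_key="$2" min="2.1.81"
  # --bare disables OAuth/keychain, so an explicit key is mandatory
  [ -n "$api_key" ] || return 1
  # Dotted-version compare: min must sort first (or equal) under sort -V
  [ "$(printf '%s\n%s\n' "$min" "$version" | sort -V | head -n1)" = "$min" ]
}
```

If `sort -V` is unavailable (e.g. BSD sort), compare the three numeric fields directly instead.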
| Call Type | Command Pattern |
|---|---|
| Grading | `claude -p "$prompt" --bare --max-turns 1 --output-format text` |
| Trigger | `claude -p "$prompt" --bare --json-schema "$schema" --output-format json` |
| Optimize | `echo "$prompt" \| claude -p --bare --max-turns 1 --output-format text` |
| Force-skill | `claude -p "$prompt" --bare --print --append-system-prompt "$content"` |
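The table rows can be centralized in one small dispatcher so every script assembles flags the same way. A sketch, where `eval_flags` is a hypothetical helper (the flag sets themselves come from the table above):

```shell
# Hypothetical dispatcher: map a call type to its flag set from the table.
eval_flags() {
  case "$1" in
    grading)     echo '--bare --max-turns 1 --output-format text' ;;
    trigger)     echo '--bare --output-format json' ;;            # plus --json-schema "$schema"
    optimize)    echo '--bare --max-turns 1 --output-format text' ;;  # prompt piped via stdin
    force-skill) echo '--bare --print' ;;                         # plus --append-system-prompt "$content"
    *)           return 1 ;;
  esac
}

# Example usage: claude -p "$prompt" $(eval_flags grading)
```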
Load detailed patterns and examples:
Read("${CLAUDE_SKILL_DIR}/references/invocation-patterns.md")
JSON schemas for structured eval output:
Read("${CLAUDE_SKILL_DIR}/references/grading-schemas.md")
OrchestKit's eval scripts (`npm run eval:skill`) auto-detect bare mode:
# eval-common.sh detects ANTHROPIC_API_KEY → sets BARE_MODE=true
# Scripts add --bare to all non-plugin calls automatically
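The detection described above can be sketched as a tiny function; `detect_bare_flag` is illustrative, not the actual `eval-common.sh` code:

```shell
# Sketch: a non-empty ANTHROPIC_API_KEY enables bare mode.
detect_bare_flag() {
  if [ -n "${ANTHROPIC_API_KEY:-}" ]; then
    echo "--bare"   # appended to every non-plugin claude call
  fi
}
```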
- Bare calls: trigger classification, force-skill, baseline, all grading.
- Never bare: `run_with_skill` (needs plugin context for routing tests).
| Scenario | Without --bare | With --bare | Savings |
|---|---|---|---|
| Single grading call | ~3-5s startup | ~0.5-1s | 2-4x |
| Trigger (per prompt) | ~3-5s | ~0.5-1s | 2-4x |
| Full eval (50 calls) | ~150-250s overhead | ~25-50s | 3-5x |
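The full-eval row is just the per-call row multiplied out; a quick sanity check of the arithmetic (the helper is illustrative):

```shell
# 50 calls at 3-5s startup each = 150-250s of avoidable overhead.
overhead_seconds() {
  echo $(( $1 * $2 ))   # per-call seconds * number of calls
}
```

For example, `overhead_seconds 3 50` prints 150 and `overhead_seconds 5 50` prints 250, matching the table.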
Additional references:
Read("${CLAUDE_SKILL_DIR}/rules/_sections.md")
Read("${CLAUDE_SKILL_DIR}/references/troubleshooting.md")
- `eval:skill` npm script — unified skill evaluation runner
- `eval:trigger` — trigger accuracy testing
- `eval:quality` — A/B quality comparison
- `optimize-description.sh` — iterative description improvement
- `doctor/references/version-compatibility.md`