Runs evaluations on skills via `evals/runner.js` with input validation, stdout/stderr capture, JSON result persistence, and timeout handling. Lists evaluable skills by category.
> v2.1.11 Sprint β FR-β2. Wraps `evals/runner.js` with input validation, result persistence, and structured reporting. Replaces the bare `node evals/runner.js <skill>` invocation, which required users to remember the argv structure and ignored timeout and sandbox concerns.
| Argument | Description | Example |
|---|---|---|
| `run <skill>` | Execute the eval suite for one skill | `/bkit-evals run gap-detector` |
| `list` | List all skills that have an `eval.yaml` definition | `/bkit-evals list` |
If no argument is provided, render the same output as `list`.
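The argument handling above can be sketched as a small dispatcher. This is hypothetical glue code, not the shipped implementation; the function name and return shape are illustrative:

```javascript
// Hypothetical dispatch for /bkit-evals arguments — illustrative only.
function dispatch(args) {
  const [cmd, skill] = args;
  if (!cmd || cmd === 'list') {
    return { action: 'list' }; // no argument renders the same output as `list`
  }
  if (cmd === 'run' && skill) {
    return { action: 'run', skill };
  }
  return { action: 'error', reason: `unrecognized arguments: ${args.join(' ')}` };
}
```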
`run <skill>`

1. Validate `skill` against `/^[a-z][a-z0-9-]{0,63}$/`. Reject anything else (no shell metacharacters, no slashes, no spaces) — see Security below.
2. Invoke `node evals/runner.js --skill <skill>` via `child_process.spawnSync` (argv form, no shell). Default timeout 30 s, max 120 s. The `--skill` flag form is mandated by the runner CLI and locked by the L3 contract test.
3. Parse stdout as JSON. If `parsed === null` and stdout includes `Usage:`, return `reason: 'argv_format_mismatch'`; if `parsed === null` otherwise, return `reason: 'parsed_null'`. Exit code 0 alone NEVER implies success — the parsed JSON must be present.
4. Persist results to `.bkit/runtime/evals-{skill}-{ISO timestamp}.json` with stdout/stderr tails (2000 chars each), the parsed payload, and the `reason` field.

`list`

1. Read `evals/config.json` to enumerate skill classifications (`workflow`, `capability`, `hybrid`).
2. List skills that have `evals/{classification}/{skill}/eval.yaml` (with the `description` field if present).
3. Any name not matching `[a-z][a-z0-9-]{0,63}` is rejected with `reason: invalid_skill_name`.

| Module | Function | Usage |
|---|---|---|
| `lib/evals/runner-wrapper.js` | `invokeEvals(skill, opts)` | Validate + spawn + persist |
| `lib/evals/runner-wrapper.js` | `isValidSkillName(name)` | Regex pre-check shared with `list` |
| `evals/runner.js` | (subprocess) | Existing eval execution engine |
`.bkit/runtime/evals-{skill}-{timestamp}.json`:
```jsonc
{
  "skill": "gap-detector",
  "invokedAt": "<ISO 8601>",
  "exitCode": 0,
  "timedOut": false,
  "stdoutTail": "...",
  "stderrTail": "...",
  "parsed": { /* whatever runner.js prints as JSON, or null */ }
}
```
```shell
# Single eval
/bkit-evals run gap-detector

# Discovery
/bkit-evals list
```
- `/control trust` — eval results contribute to trust score
- `/code-review` — uses eval data when assessing skills
- `/bkit explore` (FR-β1) — explore evals as a category