Runs evaluations on skills via `evals/runner.js` with input validation, stdout/stderr capture, JSON result persistence, and timeout handling. Lists evaluable skills by category.
> v2.1.11 Sprint β FR-β2. Wraps `evals/runner.js` with input validation, result persistence, and structured reporting. Replaces the bare `node evals/runner.js <skill>` invocation, which required users to remember the argv structure and ignored timeout and sandbox concerns.
| Argument | Description | Example |
|---|---|---|
| `run <skill>` | Execute the eval suite for one skill | `/bkit-evals run gap-detector` |
| `list` | List all skills that have an `eval.yaml` definition | `/bkit-evals list` |
If no argument is provided, render the same output as `list`.
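The argument handling above can be sketched as a small dispatcher. This is hypothetical glue code, not the shipped implementation; the function name and return shape are illustrative:

```javascript
// Hypothetical dispatch for /bkit-evals arguments — illustrative only.
function dispatch(args) {
  const [cmd, skill] = args;
  if (!cmd || cmd === 'list') {
    return { action: 'list' }; // no argument renders the same output as `list`
  }
  if (cmd === 'run' && skill) {
    return { action: 'run', skill };
  }
  return { action: 'error', reason: `unrecognized arguments: ${args.join(' ')}` };
}
```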
`run <skill>`

1. Validate `skill` against `/^[a-z][a-z0-9-]{0,63}$/`. Reject anything else (no shell metacharacters, no slashes, no spaces) — see Security below.
2. Invoke `node evals/runner.js --skill <skill>` via `child_process.spawnSync` (argv form, no shell). Default timeout 30 s, max 120 s. The `--skill` flag form is mandated by the runner CLI and locked by the L3 contract test.
3. Parse stdout as JSON. If `parsed === null` and stdout includes `Usage:`, return `reason: 'argv_format_mismatch'`; if `parsed === null` otherwise, return `reason: 'parsed_null'`. Exit code 0 alone NEVER implies success — the parsed JSON must be present.
4. Persist results to `.bkit/runtime/evals-{skill}-{ISO timestamp}.json` with stdout/stderr tails (2000 chars each), the parsed payload, and the `reason` field.

`list`

1. Read `evals/config.json` to enumerate skill classifications (`workflow`, `capability`, `hybrid`).
2. List skills that have `evals/{classification}/{skill}/eval.yaml` (with the `description` field if present).
3. Any name not matching `[a-z][a-z0-9-]{0,63}` is rejected with `reason: invalid_skill_name`.

| Module | Function | Usage |
|---|---|---|
| `lib/evals/runner-wrapper.js` | `invokeEvals(skill, opts)` | Validate + spawn + persist |
| `lib/evals/runner-wrapper.js` | `isValidSkillName(name)` | Regex pre-check shared with `list` |
| `evals/runner.js` | (subprocess) | Existing eval execution engine |
`.bkit/runtime/evals-{skill}-{timestamp}.json`:
```jsonc
{
  "skill": "gap-detector",
  "invokedAt": "<ISO 8601>",
  "exitCode": 0,
  "timedOut": false,
  "stdoutTail": "...",
  "stderrTail": "...",
  "parsed": { /* whatever runner.js prints as JSON, or null */ }
}
```
```shell
# Single eval
/bkit-evals run gap-detector

# Discovery
/bkit-evals list
```
- `/control trust` — eval results contribute to trust score
- `/code-review` — uses eval data when assessing skills
- `/bkit explore` (FR-β1) — explore evals as a category