Runs unit-test style eval checks on skills via `evals/evals.json` to verify that `SKILL.md` matches expectations. Use after editing a skill, or to check all skills at once.
Install with `npx claudepluginhub sonthanh/brain-os-plugin`. This skill uses the workspace's default tool permissions.
Run eval checks against brain-os skills to catch regressions. Uses the standard `evals/evals.json` format from the skill-creator framework.
```
/eval              # List which skills have evals
/eval self-learn   # Run eval on a specific skill
/eval --all        # Run eval on all skills with evals/evals.json
```
The runner reads `evals/evals.json` in the target skill directory, takes each eval's prompt and expectations, then reads `SKILL.md` and checks each expectation against its content. Example output:

```
self-learn (7 evals)
  ✓ #1 Phase 2 validation architecture
  ✓ #2 NotebookLM CLI commands
  ✗ #3 Question distribution — FAILED: "States 30% blind/cross-cutting questions"
  ✓ #4 Script references
  ✓ #5 Quality threshold
  ✓ #6 Note template format
  ✓ #7 Post-completion pipeline

6/7 passed. 1 FAILED.
```
Standard schema from skill-creator, located at `evals/evals.json` inside each skill directory:

```json
{
  "skill_name": "self-learn",
  "evals": [
    {
      "id": 1,
      "prompt": "Explain the Phase 2 validation architecture",
      "expected_output": "Lists 4 roles with information barriers",
      "expectations": [
        "Mentions Question Generator that creates questions without seeing answers",
        "Mentions Knowledge Agent that answers ONLY from Obsidian notes",
        "Describes information barriers preventing knowledge leaking"
      ]
    }
  ]
}
```
When `/eval <skill-name>` is invoked, the runner reads `${CLAUDE_PLUGIN_ROOT}/skills/<skill-name>/evals/evals.json` and checks each expectation against `${CLAUDE_PLUGIN_ROOT}/skills/<skill-name>/SKILL.md`. When `/eval --all` is invoked, it does the same for every skill that has an `evals/evals.json`.

To add evals, create `evals/evals.json` in your skill directory following the schema above. Each eval should test a critical aspect of the skill that would break if removed.
Follow `skill-spec.md` § 11. Append to `{vault}/daily/skill-outcomes/eval.log`:

```
{date} | eval | {action} | ~/work/brain-os-plugin | N/A | commit:N/A | {result}
```

- `action`: `eval` for a single skill, or `eval-all` for all skills.
- `result`: `pass` if all evals pass, `partial` if some fail, `fail` if errors prevent running. Include `args="{skill-name}"` and `score={passed}/{total}`.
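As a sketch, the log line above could be assembled like this. The field order follows the template; `formatLogLine` and the exact placement of `args`/`score` inside the result field are assumptions, not the skill's confirmed format:

```javascript
// Sketch of assembling the eval.log line. Field order follows the
// template above; the placement of args/score inside the result field
// is an assumption.
function formatLogLine({ date, action, result, skill, passed, total }) {
  const detail = `${result} args="${skill}", score=${passed}/${total}`;
  return `${date} | eval | ${action} | ~/work/brain-os-plugin | N/A | commit:N/A | ${detail}`;
}
```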