From vigiles
Writes and runs tests for Claude Code harness components (hooks, skills, settings, CLAUDE.md) using vigilestiers — starts at the cheapest tier that can answer the question.
How this skill is triggered — by the user, by Claude, or both
Slash command
/vigiles:test-harnessThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Test the Claude Code **harness** — the hooks, skills, settings, and CLAUDE.md that
Test the Claude Code harness — the hooks, skills, settings, and CLAUDE.md that steer an agent — as the assembled machine it ships as. vigiles gives three tiers, cheapest first; this skill picks the right one, writes the test, and runs it.
The guiding rule: start at the cheapest tier that can answer the question, and climb only when it genuinely can't. Two of the three tiers need no model and no API key, so they run on every commit for free — reach for the paid real-model tier only when the question actually requires a real model.
Match what you're testing to the cheapest tier that can answer it:
| What you're testing | Tier | Cost | API |
|---|---|---|---|
| "Does this hook block/allow event X?" — pure hook logic, every event type (incl. Edit/Write, PreCompact, SessionEnd, SubagentStop) | Unit | free, milliseconds, no claude | runHook |
| "Is the hook actually wired into the assembled plugin and does it fire in a real session?" | Deterministic | free, no API key (real claude + scripted mock) | runHarnessTest + scriptModel |
"Did the injected context (a SessionStart hook, a /command) actually reach the model?" | Deterministic | free, no API key | runHarnessTest → trace.modelRequests / assertRequestContains |
| "Does this skill's description trigger when it should (recall) and stay quiet when it shouldn't (precision)?" | Eval | paid (real model) | measureTriggerRate (+ irrelevantPrompts) → assertTriggerRate({ min, maxFalsePositive }) |
| "Is this exact skill's output any good?" — absolute quality, no on/off baseline (the default for testing one skill) | Eval | paid (real model) | measure({ checks: [judged(rubric)] }) → assertRates({ min }) |
| "Does this harness change move what the agent does, relative to off?" — A/B lift, regression, signal vs noise | Eval | paid (real model) | runEval (arms) + assertSignificant |
Most harness questions — block/allow, wired-in, context-landed — never need a model. Only "does the model trigger / behave differently" needs the eval tier.
If the unit and deterministic tiers can both answer it, prefer unit: it's faster and reaches events the deterministic mock can't drive.
Be explicit with the user about which bucket each surface falls into — never let "we'll test it" hide whether that's free, sub-priced, or needs a container. Every surface sorts into one of three buckets:
runHook), a tool-contract / "did NOT call the forbidden
tool" check, structural facts (vigiles scan), and record-replay of any tool
a skill shells out to (record the real result once, replay it via a PATH stub).measureTriggerRate, recall + precision) and
does its guidance actually produce good output (score it directly:
measure({ checks: [judged(rubric)] }) + assertRates — the absolute oracle;
use a runEval A/B on-vs-off only when you need the relative lift). This is
the half a prose / guidance skill lives in —
its worth is behavioral, so only a model can judge it. That is not "uncovered"
and not free: it's fully testable on the sub. State it that way.So a prose-skill library is roughly ~100% testable (some free, most on your sub), ~0% needs-a-container — not "poorly covered." An accessibility/browser plugin is the worst case, with a large bucket C. When you report coverage, give two numbers: "% testable at all (free + sub)" vs "% that needs a container", and say which surfaces are free vs sub-priced. The model-gated half is the point of the eval pillar (affordable on the sub), not a gap — and testing a prose skill's behavior requires a real model for everyone (promptfoo, the SDKs, all of it); vigiles just does it on your subscription instead of metered API.
Check whether vigiles is a dependency (package.json), and install it as a
dev dependency if not:
npm i -D vigiles # or: pnpm add -D vigiles / yarn add -D vigiles
The deterministic tier additionally needs the claude CLI on PATH (no API key):
npm i -g @anthropic-ai/claude-code. The eval tier needs model auth. If the
claude CLI is missing, you can still write and run unit-tier tests.
Find what the project actually ships, in this order:
.claude/settings.json / .claude/settings.local.json — inline hooks..claude-plugin/plugin.json — a plugin manifest (hooks, skills, agents, mcpServers).hooks/hooks.json — the plugin hooks convention (e.g. obra/superpowers).skills/<name>/SKILL.md, agents/<name>.md, commands/<name>.md.Pick one concrete thing to pin down — a specific PreToolUse hook, a specific
SessionStart injection, a specific skill.
Unit (runHook) — hand a hook a synthesized event, assert the decision:
import { runHook, assertHookBlocked } from "vigiles/testing";
const r = runHook(hookCommand, {
hook_event_name: "PreToolUse",
tool_name: "Bash",
tool_input: { command: "git commit --no-verify" },
});
assertHookBlocked(r); // exit 2 / decision:"block" / permissionDecision:"deny"
Testing a hook you didn't write (a vendored third-party script)? Mark it
{ trusted: false } and it runs confined under bubblewrap by default (read-only
host, cleared env, no network egress). Add { recordEgress: true } to also
record what it tries to reach — r.egress plus assertNoEgress(r) /
assertEgressOnly(r, [...]) — the supply-chain check for "what does this skill
phone home to / install from?". When the hook's setup needs a real install,
{ egress: { allow: ["registry.npmjs.org"] } } lets it reach only that
allowlist (a packet-layer nft wall, so a raw socket off-list is dropped too) →
r.egress (allowed hosts) + r.egressDropped. Be precise about the boundaries:
see
docs/sandboxing.md (it blocks destruction and
egress, but does NOT isolate reads of host files, and only under bwrap).
Deterministic (runHarnessTest) — load the real plugin, drive a scripted
mock model, assert the hook fired (or the context landed):
import {
runHarnessTest,
scriptModel,
assertHookFired,
assertRequestContains,
} from "vigiles/testing";
const r = await runHarnessTest({
pluginDir: "./", // or { settings: { hooks: {...} } }
transcript: true,
model: scriptModel([{ text: "ok" }]),
});
assertHookFired(r, "SessionStart");
assertRequestContains(r, "expected injected text"); // did it actually land?
Eval — absolute (measure + judged) — testing one skill, the usual case:
score its output directly against a rubric. No on/off baseline — this is the
"is it any good?" oracle (what promptfoo/DeepEval lead with), and the right
default when there's nothing to compare against:
import { measure, judged, skill, assertRates } from "vigiles/testing";
const report = await measure({
pluginDir: "./",
task: "…a task the skill should handle…",
checks: [
skill("my-plugin:my-skill"), // it fired
judged("the answer correctly does X and avoids Y"), // …and the output is good
],
trials: 6,
});
assertRates(report, { min: 0.8 }); // each check passes ≥ 80% of trials
Eval — relative (runEval + assertSignificant) — when the question is
lift over no-skill (regression, or proving a change isn't noise): A/B the
change on vs off and gate on significance, not eyeballing:
import { runEval, assertSignificant } from "vigiles/testing";
const report = await runEval({
arms: { off: {}, on: { pluginDir: "./" } },
task: "…a task the harness change should affect…",
measure: (ctx) => ({ ok: /* a bare predicate over the trace */ true }),
trials: 6,
cache: "readwrite",
});
assertSignificant(report, { baseline: "off", arm: "on", metric: "ok" });
In a runner (node:test / vitest / jest) the tests are plain async functions. Or use the zero-setup CLI, which discovers and runs the files:
npx vigiles test # *.harness.{mjs,ts} — unit + deterministic, no API key
npx vigiles eval --trials=6 # *.eval.{mjs,ts} — real model (local / nightly, not CI)
Unit-tier runHook tests need no claude and always run — write and run them
even with no claude installed. A tier that genuinely can't run reports a loud
⊘ SKIPPED (tallied separately, never a fake ✓); a standalone script emits one
via skip(reason) from vigiles/testing. A skip passes by default, but in a CI
job that asserts the capability is present, run vigiles test --no-skip so a
skipped tier fails — a green-with-skips is untested surface. Keep unit +
deterministic tests in CI (free); run evals locally or on a schedule with auth.
Don't ask them to specify — pick something real and demonstrate. Scan the harness surface (Step 2), choose the cheapest meaningful test, write it, run it, and show the result. Good default picks, in order:
PreToolUse hook → unit-test that it blocks the thing it's meant to block (and allows a safe sibling).SessionStart hook that injects context → deterministic test that the text actually reaches the model (assertRequestContains).pluginDir, then offer the paid measureTriggerRate eval as a follow-up.Then say which tier you used and why, and offer to climb a tier if the cheaper test can't fully answer their question.
The full guide — every tier, testing skills for real, "fired ≠ landed", the
safe-by-default sandbox, the coverage matrix, and how it compares to promptfoo —
is in docs/harness-testing.md.
npx claudepluginhub zernie/vigilesDiagnoses Claude Code harness health (hooks, skills, agents, rules, MCP, eval) across 8 dimensions, scores 0-24 with S-D grades, and provides improvement suggestions. Scans ~/.claude/. Triggers: harness audit, 하네스 진단.
Scaffolds pytest smoke tests and runs behavioral tests for Claude Code skills in Docker harness. Generates golden files, runs pytest, reports LLM verdicts and costs.
Tests ClaudeTracker's hooks integration end-to-end by sending 21 hook events via PowerShell script through HookBridge and verifying delivery in logs and notifications.