Measure recall / precision / false-proof rate of the pipeline against a ground-truth manifest. Scores either the bundled planted-vulnerability corpus (regression) or a live run's findings.json against a manifest you supply. Deterministic — no agent, no network. Use to prove a change to the producers helps rather than hurts.
How this skill is triggered — by the user, by Claude, or both
Slash command
/kuzushi-security-plugin:benchmarkThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
You can't call bug-finding "world-class" — or catch a regression in it — without a
You can't call bug-finding "world-class" — or catch a regression in it — without a
number. /benchmark scores findings against ground truth and reports the three metrics
that matter: recall (are we missing bugs?), precision (do we cry wolf?), and
falseProofRate (did we prove a non-bug? — the soundness failure differential
testing guards).
node "${CLAUDE_PLUGIN_ROOT}/scripts/cmd/benchmark.mjs"
scores every case under bench/cases/ using its recorded findings.json. Add
--case <name> for one case.node "${CLAUDE_PLUGIN_ROOT}/scripts/cmd/benchmark.mjs" --target "<repo>" --ground-truth "<manifest.json>"
scores <repo>/.kuzushi/findings.json after you've run the pipeline.Flags: --strict (an active finding matching no expectation counts as a false positive —
only fair when the manifest is exhaustive), --line-tolerance N (default 5),
--no-match-cwe (match on file+line only).
{ "expectations": [ { "id", "kind": "vuln" | "safe", "cwe", "filePath", "line" } ] }.
A vuln is a real bug the tool should find; a safe is a decoy that looks like one
and must not be flagged. A decoy that gets an active finding is a false positive; a
decoy that gets a proven finding is a false proof. Author manifests from confirmed bugs
(and their guarded siblings) so the corpus encodes both recall and precision pressure.
corpus aggregates across cases; cases[].perExpectation shows each hit/miss. A drop in
recall means a producer started missing a bug class; a drop in precision or any
falseProofs means it started over-promoting. Wire the corpus run into CI so either
regresses loudly.
/benchmark measures a run that already happened; it runs no analysis.npx claudepluginhub allsmog/kuzushi-security-plugin --plugin kuzushi-security-pluginBlocks Edit/Write/Bash actions until Claude investigates importers, data schemas, and user instructions. Improves output quality by forcing concrete facts before edits.