From purlin
Evaluates proof quality with STRONG/WEAK/HOLLOW assessments. Features configurable criteria, caching, and parallel subagent evaluation.
npx claudepluginhub rlabarca/purlin --plugin purlin
How this skill is triggered (by the user, by Claude, or both):
Slash command
/purlin:audit
The summary Claude sees in its skill listing, used to decide when to auto-load this skill:
Audit all proofs (or a specific feature) against configurable criteria. Read-only — never modifies code or test files.
purlin:audit                      Audit all features with receipts
purlin:audit <feature>            Audit a specific feature
purlin:audit --criteria <path>    Use a specific criteria file
Load combined criteria via the single-source function:
python3 ${CLAUDE_PLUGIN_ROOT}/scripts/audit/static_checks.py --load-criteria --project-root <project_root>
If --criteria <path> was passed by the user, add --extra <path> to append that file too.
This returns the built-in criteria, plus any configured team criteria, plus any extra file. Built-in criteria always apply: additional criteria are appended and never replace them.
Display: Using audit criteria: built-in (Criteria-Version: N)
If additional criteria are present, also display: + team criteria from <source> (pinned: <sha>)
Read .purlin/cache/audit_cache.json via:
python3 ${CLAUDE_PLUGIN_ROOT}/scripts/audit/static_checks.py --read-cache
The cache maps proof hashes to previous assessments:
{
"a1b2c3d4e5f6a7b8": {
"assessment": "STRONG",
"criterion": "matches rule intent",
"why": "test exercises the rule correctly",
"fix": "none",
"cached_at": "2026-04-03T..."
}
}
For each proof that reaches Pass 2, compute the proof hash from (rule text + proof description + test function code). If the hash exists in the cache, use the cached assessment — skip the LLM call. Report cached results with a (cached) label:
PROOF-1 (RULE-1): STRONG ✓ (cached)
After the audit completes, write every assessment from this run to the cache, both cache hits and fresh LLM results. The cache therefore grows over time, and subsequent runs are faster.
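The cache lookup described above can be sketched as follows. This is a minimal illustration, not the plugin's implementation: the source does not state which hash algorithm or key length static_checks.py uses, so the SHA-256 digest truncated to 16 hex characters, and the helper names `proof_hash` and `lookup`, are assumptions chosen only to match the shape of the example cache key.

```python
import hashlib

def proof_hash(rule_text: str, proof_description: str, test_code: str) -> str:
    """Hypothetical key derivation: SHA-256 over the three inputs, truncated to 16 hex chars."""
    h = hashlib.sha256()
    for part in (rule_text, proof_description, test_code):
        data = part.encode("utf-8")
        # Length-prefix each part so ("ab", "c") and ("a", "bc") produce different keys.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()[:16]

def lookup(cache: dict, key: str):
    """Return (assessment, label) on a cache hit, or None when an LLM call is still needed."""
    entry = cache.get(key)
    if entry is None:
        return None
    return entry["assessment"], "(cached)"

cache = {}
key = proof_hash(
    "RULE-1: reject invalid passwords",
    "PROOF-1: invalid password returns 401",
    "def test_invalid_password(): ...",
)
first = lookup(cache, key)  # miss: nothing stored yet
cache[key] = {
    "assessment": "STRONG",
    "criterion": "matches rule intent",
    "why": "test exercises the rule correctly",
    "fix": "none",
}
second = lookup(cache, key)  # hit: reuse the stored assessment
```

Because the key covers the rule text, the proof description, and the test code, editing any one of them changes the key and forces a fresh evaluation.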
After loading the cache, categorize features for parallel execution:
… assert True). If all proofs still pass Pass 1 and have cache hits, use cached assessments; no LLM call is needed.
For features in the "Needs LLM" category, launch up to 3 parallel evaluations using the Agent tool:
Agent(subagent_type="purlin-auditor", prompt="Audit feature <name>: ...")
Agent(subagent_type="purlin-auditor", prompt="Audit feature <name>: ...")
Agent(subagent_type="purlin-auditor", prompt="Audit feature <name>: ...")
Each subagent receives:
When all subagents complete, merge their results into the final report and update the cache with all new assessments.
For "Cache-only" features, evaluate them in the main context (no subagent needed — they're fast).
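One way to picture the split between "Cache-only" and "Needs LLM" features is the sketch below. The input shape (feature name mapped to a list of proof hashes) and the function names are hypothetical, and the sketch assumes Pass 1 has already been run on every proof.

```python
def categorize(features: dict, cache: dict) -> tuple:
    """Split features into (cache_only, needs_llm).

    `features` maps a feature name to its list of proof hashes (assumed shape).
    A feature is cache-only when every one of its proofs has a cache hit.
    """
    cache_only, needs_llm = [], []
    for name, hashes in features.items():
        if all(h in cache for h in hashes):
            cache_only.append(name)
        else:
            needs_llm.append(name)
    return cache_only, needs_llm

def batches(names: list, size: int = 3) -> list:
    """Group feature names into batches of at most `size` for parallel evaluation."""
    return [names[i:i + size] for i in range(0, len(names), size)]

features = {
    "login": ["h1", "h2"],
    "checkout": ["h3"],
    "search": ["h4", "h5"],
    "profile": ["h6"],
    "export": ["h7"],
}
cache = {"h1": {}, "h2": {}}  # only the login proofs are cached
cache_only, needs_llm = categorize(features, cache)
rounds = batches(needs_llm)
```

With four features needing the LLM, this yields one round of three subagents followed by a round of one.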
Before reading any source code, run structural checks on the proof JSON files. These operate on JSON regardless of what language produced the proofs:
python3 ${CLAUDE_PLUGIN_ROOT}/scripts/audit/static_checks.py --check-proof-file --proof-path <proof_json> --spec-path <spec_path>
Checks:
Report findings inline with the feature's audit output. Proof ID collisions indicate confused proof tracking; orphans indicate stale markers.
Run the deterministic static checker on all specs with proofs:
python3 ${CLAUDE_PLUGIN_ROOT}/scripts/audit/static_checks.py <test_file> <feature_name> --spec-path <spec_path>
Any proof that fails a structural check is immediately rated HOLLOW — no LLM override possible:
PROOF-3 (RULE-3): HOLLOW ✗ (deterministic)
Check: logic_mirroring
Why: expected value computed by hash_func() — same function being tested. If hash_func has a bug, test confirms the bug.
Fix: replace expected = hash_func(input) with a precomputed literal: assert result == "5e884898..."
The (deterministic) label tells the user this was caught by static analysis, not LLM judgment.
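To make the logic_mirroring finding concrete, here is a small contrast between a mirrored test and one that uses a precomputed literal. The `hash_func` below is a stand-in written for this example (the source does not show the real function); it is assumed to be SHA-256 of a UTF-8 string, which is consistent with the "5e884898..." prefix quoted above for the input "password".

```python
import hashlib

def hash_func(value: str) -> str:
    """Stand-in for the code under test (assumed: SHA-256 hex digest)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def test_hash_mirrored():
    # HOLLOW: the expected value is produced by the function being tested,
    # so a bug in hash_func would be confirmed rather than caught.
    expected = hash_func("password")
    assert hash_func("password") == expected

def test_hash_literal():
    # The expected value is a literal computed independently of hash_func.
    assert hash_func("password") == (
        "5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8"
    )
```

Both tests pass today, but only the second one would fail if `hash_func` were changed to return the wrong digest.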
Proofs that passed Pass 1 go to the LLM for classification and semantic evaluation. The LLM first classifies each proof as structural or behavioral, then evaluates behavioral proofs.
Batch all proofs for a feature into a single LLM evaluation. Do NOT evaluate proofs one-at-a-time — this wastes LLM calls. Construct one prompt per feature containing ALL surviving proofs (those that passed Pass 1 and are not cache hits).
For each feature being audited:
… ## Proof section: get every proof description.
… .proofs-*.json entries.
… self parameter or @pytest.fixture(scope="class")), include the fixture code in the prompt. This is critical for e2e tests where the "act" step is in the fixture.
@manual proofs: check staleness only, assess as MANUAL; exclude from LLM batch.
… / prefix from a spec in specs/_anchors/). If so:
For Claude (default auditor):
You are classifying and evaluating proofs against spec rules.
Structural issues (assert True, no assertions, logic mirroring) have already been checked and passed.
STEP 1 — CLASSIFY each proof as STRUCTURAL or BEHAVIORAL:
Examine the proof description, test code, AND fixture/setup code together.
STRUCTURAL — the content being checked exists independently of the test.
The test reads pre-existing files or static content that no code in the
test's setup chain produced. Examples: checking a config template has
certain fields, grepping source code for forbidden patterns, verifying a
markdown doc has correct sections.
BEHAVIORAL — the test verifies output produced by running code. Includes:
- Direct function calls whose return value is asserted
- E2E tests where a fixture runs the system (subprocess, API call,
function invocation) and assertions check the artifacts it created
- Tests that check files/strings CREATED by the test's setup chain
- Tests where the "act" step is in a class-scoped fixture
Key signal: if the fixture or setup runs code that produces the artifact
being checked, the test is BEHAVIORAL — even if the assertions use
string-matching or regex on file contents. The question is not "what do
the assertions look like?" but "did code run to produce what's being
asserted on?"
STRUCTURAL proofs → EXCLUDED (not scored)
STEP 2 — EVALUATE each BEHAVIORAL proof:
For each behavioral proof, answer ONLY these questions:
1. Does the test set up a scenario that exercises the rule's constraint?
2. Does the test check the specific outcome the proof description claims?
3. Is anything described in the proof missing from the test?
4. Does the assertion contain a tautological escape hatch (OR branch that always passes)?
5. Does the assertion validate test setup data instead of code-under-test output?
6. Does the test function name contradict the actual assertion values?
Rate each: STRONG (test matches rule intent), WEAK (test partially matches — something is missing or too loose), or EXCLUDED (structural presence check, not behavioral).
Do NOT check for structural issues — those were already handled.
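As an illustration of evaluation question 4 in the prompt above, the sketch below shows a tautological escape hatch next to a tight assertion. `validate_password` and its 8-character rule are hypothetical and exist only to give the tests something to exercise.

```python
def validate_password(password: str) -> bool:
    """Hypothetical code under test: accept passwords of 8 or more characters."""
    return len(password) >= 8

def test_rejects_short_password_escape_hatch():
    result = validate_password("abc")
    # WEAK: the right-hand branch is true for any boolean result,
    # so this assertion passes even if short passwords are accepted.
    assert result is False or result in (True, False)

def test_rejects_short_password_tight():
    # The assertion fails as soon as a short password is accepted.
    assert validate_password("abc") is False
```

The first test passes for every possible return value, which is the kind of OR branch that question 4 asks the evaluator to look for.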
For external LLM (audit_llm configured):
Same prompt, but wrapped in the structured response format:
For each proof, respond in EXACTLY this format:
PROOF-ID: PROOF-N
RULE-ID: RULE-N
ASSESSMENT: STRONG|WEAK|EXCLUDED
CRITERION: <what semantic aspect is missing, "matches rule intent" if STRONG, or "structural presence check" if EXCLUDED>
WHY: <what behavior would slip through, "test exercises the rule correctly" if STRONG, or "test verifies document content, not system behavior" if EXCLUDED>
FIX: <specific change to align test with rule, "none" if STRONG, or "none — exclude from audit" if EXCLUDED>
---
Note: the LLM can return STRONG, WEAK, or EXCLUDED in Pass 2. HOLLOW is exclusively determined by Pass 1 (deterministic). EXCLUDED proofs are structural — the pipeline excludes them from scoring.
Use the bordered output format with findings grouped by value tier (see references/audit_criteria.md § Finding Priority):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
PROOF AUDIT: <feature> (<N> proofs)
Criteria: <source> (Criteria-Version: N)
Auditor: Pass 1 — static_checks.py | Pass 2 — Claude (or external LLM name)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
CRITICAL (fix first — tests prove nothing):
PROOF-4 (RULE-4): HOLLOW ✗ — no assertions
Why: test function has zero assert/expect statements
Fix: add assertions checking the response status and body
HIGH VALUE (real coverage gaps):
PROOF-2 (RULE-2): WEAK ~ — missing negative test
Why: rule says "reject invalid passwords" but test only checks valid login
Fix: add test with invalid password, assert 401 response
MEDIUM VALUE (self-confirming tests):
PROOF-6 (RULE-6): HOLLOW ✗ — logic mirroring
Why: expected = compute_hash(input) — same function as code under test
Fix: replace with precomputed literal: assert result == "5e884898..."
STRONG (no action needed):
PROOF-1 (RULE-1): STRONG ✓
PROOF-3 (RULE-3): STRONG ✓
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
AUDIT SUMMARY:
CRITICAL: N HIGH: N MEDIUM: N LOW: N STRONG: N MANUAL: N
Audited: N proofs (M cached, K fresh) | J structural excluded
Fix priority: N critical, then N high-value, then N medium
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Group findings by value tier (see references/audit_criteria.md § Finding Priority for the complete tier mapping). Within each tier, list HOLLOW before WEAK. Present tiers in this order: CRITICAL, HIGH, MEDIUM, LOW, STRONG.
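The ordering rule above (tiers in a fixed order, HOLLOW before WEAK within a tier) can be expressed as a sort key. This sketch assumes each finding already carries its tier; the tier mapping itself lives in references/audit_criteria.md and is not reproduced here, and the dictionary shape is hypothetical.

```python
TIER_ORDER = ["CRITICAL", "HIGH", "MEDIUM", "LOW", "STRONG"]
ASSESSMENT_ORDER = {"HOLLOW": 0, "WEAK": 1}

def sort_findings(findings: list) -> list:
    """Order findings by tier, then HOLLOW before WEAK; other assessments sort last."""
    return sorted(
        findings,
        key=lambda f: (
            TIER_ORDER.index(f["tier"]),
            ASSESSMENT_ORDER.get(f["assessment"], 2),
        ),
    )

findings = [
    {"proof": "PROOF-1", "tier": "STRONG", "assessment": "STRONG"},
    {"proof": "PROOF-2", "tier": "HIGH", "assessment": "WEAK"},
    {"proof": "PROOF-6", "tier": "MEDIUM", "assessment": "HOLLOW"},
    {"proof": "PROOF-4", "tier": "CRITICAL", "assessment": "HOLLOW"},
    {"proof": "PROOF-5", "tier": "MEDIUM", "assessment": "WEAK"},
]
ordered = [f["proof"] for f in sort_findings(findings)]
```

Python's `sorted` is stable, so findings with the same tier and assessment keep their original relative order.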
When spawning the builder to fix findings, pass them in priority order: CRITICAL first, then HIGH, then MEDIUM. The builder fixes in that order, so if the 3-round limit is reached, the highest-value findings have already been addressed.
If HOLLOW or WEAK proofs are found, append directives:
Fix proof quality in the build loop, then re-verify:
→ Run: test <feature> (fix PROOF-N: <what to fix>)
→ Run: purlin:verify
If any anchor rules have clarity issues, append a separate section:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
RECOMMENDATIONS FOR ANCHOR AUTHORS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
security_no_eval (Source: git@github.com:acme/security-policies.git)
RULE-1: "No eval() calls in source code"
→ Suggest: clarify scope — does this include test files? Current wording is ambiguous.
prodbrief_checkout (Source: git@github.com:acme/product-briefs.git)
RULE-3: "Order confirmation email arrives within 60 seconds"
→ Suggest: specify what "arrives" means — delivered to SMTP server, or in user's inbox?
This section only appears when anchor rules have clarity issues. It's advisory — the anchor author decides whether to act.
When spawned by purlin:verify or another agent:
… --load-criteria (see Step 1)
When a HOLLOW or WEAK proof is for an anchor rule:
When .purlin/config.json has audit_llm set, the audit still runs Pass 1 (deterministic) first. Proofs that pass Pass 1 go to the external LLM for Pass 2 (classification + semantic evaluation).
… --load-criteria; respects additional team criteria and --extra).
You are evaluating semantic alignment between spec rules and test code.
Structural issues (assert True, no assertions, logic mirroring) have already been checked and passed.
SPEC PROOF DESCRIPTIONS:
<paste the ## Proof section from the spec — only proofs that passed Pass 1>
TEST CODE:
<paste the actual test function code for each proof>
For each proof, respond in EXACTLY this format:
PROOF-ID: PROOF-N
RULE-ID: RULE-N
ASSESSMENT: STRONG|WEAK
CRITERION: <what semantic aspect is missing, or "matches rule intent" if STRONG>
WHY: <what behavior would slip through, or "test exercises the rule correctly" if STRONG>
FIX: <specific change to align test with rule, or "none" if STRONG>
---
… {prompt} in the configured command with the constructed prompt. Capture stdout.
… PROOF-ID:, ASSESSMENT:, CRITERION:, WHY:, FIX: lines. Be flexible: different LLMs format slightly differently. Look for the keywords, not exact whitespace.
… UNKNOWN — external LLM response could not be parsed and include the raw response excerpt.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
PROOF AUDIT: <feature> (<N> proofs)
Criteria: references/audit_criteria.md (Criteria-Version: N)
Auditor: Pass 1 — static_checks.py | Pass 2 — Gemini Pro (external — cross-model)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
The report header shows both audit passes.
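The keyword-based parsing of the external LLM's response, described earlier in this section, might look like the sketch below. The regular expression, the record shape, and the 200-character excerpt length are assumptions made for illustration; the source only says to look for the keywords rather than exact whitespace and to fall back to UNKNOWN with a raw excerpt.

```python
import re

# Tolerate leading markdown decoration, mixed case, and loose spacing around the colon.
FIELD_RE = re.compile(
    r"^[\s*_>-]*(PROOF-ID|RULE-ID|ASSESSMENT|CRITERION|WHY|FIX)[\s*_]*:[\s*_]*(.*?)\s*$",
    re.IGNORECASE,
)
VALID = ("STRONG", "WEAK", "EXCLUDED")

def parse_response(text: str) -> list:
    """Parse an external LLM response into one dict per proof."""
    records, current = [], None
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if not match:
            continue  # separators and chatter are ignored
        key, value = match.group(1).upper(), match.group(2)
        if key == "PROOF-ID":
            current = {}  # each PROOF-ID line starts a new record
            records.append(current)
        if current is not None:
            current[key] = value
    if not records:
        # Nothing recognizable: report UNKNOWN with a raw excerpt.
        return [{"ASSESSMENT": "UNKNOWN", "RAW": text[:200]}]
    for record in records:
        verdict = record.get("ASSESSMENT", "").upper()
        record["ASSESSMENT"] = verdict if verdict in VALID else "UNKNOWN"
    return records

sample = (
    "PROOF-ID: PROOF-1\nRULE-ID: RULE-1\nASSESSMENT: STRONG\nFIX: none\n---\n"
    "**Proof-ID:**  PROOF-2\n  assessment :weak\nWHY: only checks valid login\n"
)
parsed = parse_response(sample)
garbage = parse_response("Sorry, I cannot help with that.")
```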
When an external LLM is configured, the lead relays findings:
[Gemini Pro audit] HOLLOW: login PROOF-3
Criterion: mocks the function being tested
Why: test passes even if bcrypt is misconfigured
Fix: remove mock, use real bcrypt call
The builder never calls the external LLM. The lead relays.
After writing all assessments to the cache, if this is a full audit (no specific feature argument), prune orphaned entries from deleted or renamed features. Collect all proof hashes that were computed during this audit (cache hits + fresh evaluations) into a temp file, one key per line:
# Write live keys to temp file
echo "<hash1>" > /tmp/purlin_live_keys.txt
echo "<hash2>" >> /tmp/purlin_live_keys.txt
# ... one line per proof hash computed during this audit
python3 ${CLAUDE_PLUGIN_ROOT}/scripts/audit/static_checks.py --prune-cache --live-keys-file /tmp/purlin_live_keys.txt
This removes cache entries for features that no longer exist while preserving all entries from the current audit. For single-feature audits, skip this step — they don't know which other features are live.
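The effect of pruning can be sketched in a few lines. This illustrates the intended outcome of --prune-cache, not how static_checks.py implements it; the function names are hypothetical.

```python
def read_live_keys(file_text: str) -> set:
    """Parse the live-keys file content: one proof hash per line, blank lines ignored."""
    return {line.strip() for line in file_text.splitlines() if line.strip()}

def prune_cache(cache: dict, live_keys: set) -> dict:
    """Keep only entries whose proof hash was computed during the current full audit."""
    return {key: entry for key, entry in cache.items() if key in live_keys}

cache = {
    "a1b2c3d4e5f6a7b8": {"assessment": "STRONG"},
    "0000deadbeef0000": {"assessment": "WEAK"},  # proof from a deleted feature
}
live = read_live_keys("a1b2c3d4e5f6a7b8\n\n")
pruned = prune_cache(cache, live)
```

A single-feature audit would supply only that feature's hashes, which is why pruning with them would wrongly discard every other feature's entries.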
After the audit report is complete and the cache has been written, call sync_status to compute the integrity score and refresh the dashboard:
sync_status()
The sync_status output includes the integrity percentage (computed by _compute_integrity() in purlin_server.py). Do not compute the integrity percentage yourself — always read it from the sync_status output. This ensures the audit CLI and the dashboard always show the same value from the same computation.
After sync_status completes, report the integrity score it returned:
INTEGRITY SCORE: <N>% (from sync_status, computed by _compute_integrity())
Formula: (STRONG + MANUAL) / (STRONG + WEAK + HOLLOW + MANUAL) — proof quality only.
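For readers who want to sanity-check a reported score, the formula works out as in the sketch below. It is only an illustration of the arithmetic: the value shown to users must still come from sync_status, and both the rounding and the handling of an empty denominator in `_compute_integrity()` are not described in the source, so the choices here are assumptions.

```python
def integrity_percent(strong: int, weak: int, hollow: int, manual: int) -> float:
    """(STRONG + MANUAL) / (STRONG + WEAK + HOLLOW + MANUAL), as a percentage."""
    total = strong + weak + hollow + manual
    if total == 0:
        return 0.0  # assumption: the source does not say what an empty audit reports
    return 100.0 * (strong + manual) / total

# 6 STRONG, 2 WEAK, 1 HOLLOW, 1 MANUAL: (6 + 1) / 10
example = integrity_percent(strong=6, weak=2, hollow=1, manual=1)
```

EXCLUDED (structural) proofs do not appear in the formula, so they affect neither the numerator nor the denominator.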
Read-only. Never modify code or test files.
Independent. When spawned as a subagent, has fresh context — no memory of writing the tests.
Criteria-driven. All judgments reference the criteria document, not ad hoc opinions.
Transparent. The report shows the criteria version and source so anyone can verify the assessment was made against known standards.
Actionable recommendations. Every HOLLOW or WEAK finding includes three parts:
… expected = hash_func(input) with expected = '5e884898da28...'")
Bad fix recommendation: "Make the test stronger"
Good fix recommendation: "Remove the bcrypt.checkpw mock. Store a password via create_user('alice', 'secret'), retrieve the stored hash, assert bcrypt.checkpw(b'secret', stored_hash) returns True"