From epistemic-skills
Runs adversarial falsification on numerical and comparison claims before they leave `smokes/` or enter `RESULTS.md`, requiring ≥2 independent adversary models.
`npx claudepluginhub atomicstrata/epistemic --plugin epistemic-skills`

Slash command: `/epistemic-skills:falsification-review`
> **Related skills:** `/skill:baseline-reproduction` for reproduced comparison targets, `/skill:experiment-execution` for preregistered runs, `/skill:surprise-triage` when a reviewed claim still behaves strangely, `/skill:verification-before-publication` before anything leaves the repo.
This is the phase where the claim stops being yours and starts being evidence. Your job is not to help the claim survive; your job is to find the cheapest honest way to kill it before anyone else does.
A claim that only survives friendly reading is not a result. A claim that survives reproduced baselines, independent adversaries, and explicit disconfirming checks might be a result.
In this phase, you must:
- `RESULTS.md` or from the researcher's explicit statement
- `experiments/{id}/prereg.md`
- `experiments/repro_{name}/prereg.md`
- `runFalsificationAdversary()` from `src/adversary/dispatch.ts` against the exact claim
- `experiments/{id}/falsifiers/{model}.md`
- `ALLOW`, `BLOCK`, or `ALLOW WITH CAVEAT`
- `experiments/{id}/smokes/` until falsification passes
- `smokes/` to `RESULTS.md` only after this phase clears the claim

The gates in this repo help, but they do not replace judgment. If a gate missed something, that does not grant permission.
Your claim is guilty until proven defensible by ≥2 models
One model is not enough. Self-critique is not independence. Missing audit is not support.
Use this skill when:
- `experiments/{id}/smokes/`
- `experiments/{id}/RESULTS.md`
- `RESULTS.md`
- `beats`, `outperforms`, `matches`, `wins`, `better than`, or any comparison claim
- "Can we say this?"

Do not use this skill when:

- `/skill:research-question`
- `/skill:preregistration`
- `/skill:baseline-reproduction`
- `/skill:experiment-execution`
- `experiments/{id}/smokes/`
- `/skill:verification-before-publication`

These files and functions are the working surface for this phase. Read them. Use them. Do not guess.
| Surface | Why it matters |
|---|---|
| `HYPOTHESES.md` | Registered claim, falsifier, baseline reference, judge reference |
| `experiments/{id}/prereg.md` | Proof that the experiment existed before results |
| `experiments/{id}/judge.lock` | Proof that the judge did not drift |
| `experiments/repro_{name}/prereg.md` | Mandatory proof that the baseline was actually reproduced |
| `experiments/{id}/smokes/` | Provisional outputs only |
| `experiments/{id}/RESULTS.md` | Confirmed experiment-level conclusions |
| `RESULTS.md` | Optional repo-level mirror of confirmed conclusions |
| `experiments/{id}/falsifiers/{model}.md` | Per-model falsification record |
| `.epistemic/cost-ledger.jsonl` | Cost accounting for audit reruns |
| `OVERRIDES.md` | Explicit exceptions with reasons |
| `src/state/repo.ts` | Canonical state helpers and types |
| `src/adversary/dispatch.ts` | Adversary dispatch and verdict parsing |
State helpers you will actually use here:
- `loadRepoState(cwd)` for a quick snapshot
- `loadHypotheses(cwd)` and `getActiveHypothesis(entries)` to recover the active hypothesis
- `loadBaselines(cwd)` and `getBaselineAgeDays(entry)` to inspect baseline references and freshness
- `fileExists(path)` to prove required artifacts exist
- `getJudgeLock(cwd, id)` and `computeJudgeHash(judgeRef, id)` to verify judge continuity
- `getHypothesisSpend(cwd, id)` or `getAllHypothesisSpends(cwd)` to inspect accumulated audit cost
- `appendCostRecord(cwd, record)` if manual reruns created unlogged cost
- `updateHypothesisStatus(cwd, id, status)` when the claim is actually dead

Other helpers exist in `src/state/repo.ts`. Do not use `saveHypotheses(...)`, `hypothesisToMarkdown(...)`, or `writeJudgeLock(...)` here to paper over missing earlier work. This phase audits reality. It does not rewrite history.
src/adversary/dispatch.ts exposes runFalsificationAdversary({ claim, context, cwd }). Keep the hypothesis id in surrounding workflow even though the function itself does not currently take it; you still need it for file paths and status decisions.
Start with the sentence that is asking for authority. Never falsify a vibe. Never falsify a softer paraphrase than the one about to ship.
Steps:
- `HYPOTHESES.md` with `loadHypotheses(cwd)`.
- `getActiveHypothesis(...)`.
- `RESULTS.md`
- `experiments/{id}/RESULTS.md`

If the sentence does not say what improved, against what, on which task, and under which conditions, stop and rewrite it into a falsifiable claim before review.
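The four-part test above (what improved, against what, on which task, under which conditions) can be sketched as a small completeness check. This is a minimal illustration, not repo code: `ClaimParts`, `missingClaimParts`, and `isFalsifiable` are hypothetical names, and splitting a sentence into these parts is assumed to be done by the reviewer.

```typescript
// Hypothetical shape: the four things a claim must state before review.
interface ClaimParts {
  improved?: string;   // what improved (metric or behavior)
  against?: string;    // the comparison target
  task?: string;       // the task or dataset
  conditions?: string; // judge, settings, or other conditions
}

const REQUIRED_PARTS = ["improved", "against", "task", "conditions"] as const;

// Returns the names of the parts that are absent or blank.
function missingClaimParts(parts: ClaimParts): string[] {
  return REQUIRED_PARTS.filter((key) => {
    const value = parts[key];
    return value === undefined || value.trim() === "";
  });
}

// A claim is ready for review only when nothing is missing.
function isFalsifiable(parts: ClaimParts): boolean {
  return missingClaimParts(parts).length === 0;
}
```

A sentence such as "the model seems better on Dataset D" would come back with `against` and `conditions` missing, which is the signal to rewrite it before sending anything to an adversary.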
Falsification review is not a substitute for missing protocol. If the experiment itself is unanchored, every later decision is fake confidence.
Steps:
- `experiments/{id}/prereg.md` with `fileExists(...)`.
- `loadRepoState(cwd)` if you need a fast snapshot of `HYPOTHESES.md`, `BASELINES.md`, and root `RESULTS.md`.
- `HypothesisEntry` fields from `src/state/repo.ts`.

A sentence can be numerically true on one metric and still invalid because it is no longer the preregistered claim.
Comparison claims are meaningless if the judge drifted halfway through the experiment. This repo has a judge-lock mechanism. Use it.
Steps:
- `HypothesisEntry.judgeRef`.
- `getJudgeLock(cwd, id)` to check whether `experiments/{id}/judge.lock` exists.
- `computeJudgeHash(judgeRef, id)`.
- `writeJudgeLock(...)` here.

Late repair is not integrity. It is paperwork after the fact.
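The judge check reduces to comparing a recorded lock against a freshly computed hash. The sketch below assumes the lock can be represented as a hash string, or `null` when `experiments/{id}/judge.lock` is absent; the real return types of `getJudgeLock` and `computeJudgeHash` are not shown here, so `JudgeCheck` and `checkJudgeContinuity` are hypothetical.

```typescript
// Hypothetical result type for the judge continuity check.
type JudgeCheck =
  | { ok: true }
  | { ok: false; reason: "missing-lock" | "drifted" };

// recordedHash: stand-in for what getJudgeLock(cwd, id) yields
//               (assumed: a hash string, or null when judge.lock is missing).
// currentHash:  stand-in for the result of computeJudgeHash(judgeRef, id).
function checkJudgeContinuity(recordedHash: string | null, currentHash: string): JudgeCheck {
  if (recordedHash === null) {
    return { ok: false, reason: "missing-lock" };
  }
  if (recordedHash !== currentHash) {
    return { ok: false, reason: "drifted" };
  }
  return { ok: true };
}
```

Note that the function only reports. It never writes a lock, which matches the rule that this phase audits rather than repairs.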
This is where comparison claims usually die. A cited baseline is not a reproduced baseline. BASELINES.md is context, not proof.
Steps:
- `HypothesisEntry.baselineRef` to identify the intended comparison target.
- `loadBaselines(cwd)` to locate the named baseline entry if one exists.
- `{name}` token.
- `experiments/repro_{name}/prereg.md`.
- `fileExists("experiments/repro_{name}/prereg.md")` is false, `BLOCK`.
- `Baseline not reproduced under your protocol. Missing experiments/repro_{name}/prereg.md.`
- `BASELINES.md`, use `getBaselineAgeDays(...)` to check freshness.

The distinction is non-negotiable:

- `BASELINES.md` proves the number was recorded
- `experiments/repro_{name}/prereg.md` proves the baseline was reproduced under your protocol

If you cannot prove the second, you do not get to say `beats`.
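The reproduction gate can be sketched as a path check that blocks with the message quoted above. This is an illustration under assumptions: `GateResult`, `reproPreregPath`, and `checkBaselineReproduced` are hypothetical names, the `name` argument is assumed to be already normalized, and the `exists` predicate is injected so the example runs without the repo (in practice it would be `fileExists`).

```typescript
// Hypothetical gate result; "CONTINUE" only means this one gate did not block.
interface GateResult {
  outcome: "BLOCK" | "CONTINUE";
  message?: string;
}

// Builds the reproduction prereg path for an already-normalized baseline name.
function reproPreregPath(name: string): string {
  return `experiments/repro_${name}/prereg.md`;
}

// exists: injected file check, standing in for fileExists(path).
function checkBaselineReproduced(name: string, exists: (path: string) => boolean): GateResult {
  const path = reproPreregPath(name);
  if (!exists(path)) {
    return {
      outcome: "BLOCK",
      message: `Baseline not reproduced under your protocol. Missing ${path}.`,
    };
  }
  return { outcome: "CONTINUE" };
}
```

The gate deliberately never consults `BASELINES.md`: a recorded number cannot turn a `BLOCK` into a pass.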
runFalsificationAdversary() is only as good as the packet you send it. Friendly framing produces friendly nonsense.
Steps:
- `src/adversary/dispatch.ts`: `runFalsificationAdversary({ claim, context, cwd })`.
- `context` from repo facts, not hype.
- `experiments/{id}/prereg.md`
- `experiments/{id}/smokes/`
- `experiments/repro_{name}/prereg.md` when the claim is comparative
- `HypothesisEntry.judgeRef`
- `HypothesisEntry.judgeRef` or from the active drafting context if it is more specific.

Each model must return the core payload: `{experiment, costEstimate, verdict, reasoning}`.
The canonical repo-side form is AdversaryVerdict in src/state/repo.ts: { provider, model, name, experiment, costEstimate, verdict, reasoning }.
The goal of the run is the strongest disconfirming experiment, not reassurance.
Steps:
- `runFalsificationAdversary({ claim, context, cwd })`.
- `experiment`, `costEstimate`, `verdict`, and `reasoning`.
- `cannot-audit` from the set of passing audits.

Read `src/adversary/dispatch.ts` carefully before trusting the response shape blindly. The happy path returns canonical fields, but some fallback paths still emit legacy keys like `modelId`, `providerId`, and `disconfirmingExperiment`. If the response is malformed, normalize it before writing files and do not count the malformed response as a clean pass.
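Normalizing a response before it is written or counted might look like the sketch below. Several things here are assumptions rather than repo facts: the mapping of legacy keys (`modelId` to `model`, `providerId` to `provider`, `disconfirmingExperiment` to `experiment`), the numeric type of `costEstimate`, the fallback of `name` to the model, and the helper names `asString` and `normalizeVerdict`. The field list itself follows the `AdversaryVerdict` shape described above.

```typescript
type Verdict = "defensible" | "caveat-required" | "falsified-or-unreproducible" | "cannot-audit";

// Field names follow the AdversaryVerdict shape; field types are assumed.
interface AdversaryVerdict {
  provider: string;
  model: string;
  name: string;
  experiment: string;
  costEstimate: number;
  verdict: Verdict;
  reasoning: string;
}

const VERDICTS: string[] = ["defensible", "caveat-required", "falsified-or-unreproducible", "cannot-audit"];

// Returns the value only if it is a non-blank string.
function asString(value: unknown): string | null {
  return typeof value === "string" && value.trim() !== "" ? value : null;
}

// Returns a canonical verdict, or null when the response cannot be trusted.
// A null result must not be counted toward the two-audit minimum.
function normalizeVerdict(raw: Record<string, unknown>): AdversaryVerdict | null {
  // Assumed legacy-to-canonical key mapping.
  const provider = asString(raw["provider"]) ?? asString(raw["providerId"]);
  const model = asString(raw["model"]) ?? asString(raw["modelId"]);
  const experiment = asString(raw["experiment"]) ?? asString(raw["disconfirmingExperiment"]);
  const verdict = asString(raw["verdict"]);
  const reasoning = asString(raw["reasoning"]);
  const cost = raw["costEstimate"];

  if (provider === null || model === null || experiment === null) return null;
  if (verdict === null || reasoning === null) return null;
  if (VERDICTS.indexOf(verdict) === -1) return null;
  if (typeof cost !== "number" || !isFinite(cost)) return null;

  return {
    provider,
    model,
    name: asString(raw["name"]) ?? model, // assumption: fall back to the model id
    experiment,
    costEstimate: cost,
    verdict: verdict as Verdict,
    reasoning,
  };
}
```

Returning `null` instead of guessing keeps a malformed response from quietly becoming a pass.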
Never make the verdict from scrollback memory. Write the evidence first.
Steps:
- `experiments/{id}/falsifiers/`.
- `experiments/{id}/falsifiers/{model}.md`.
- `already been run` from repo artifacts, not from confidence.
- `experiments/{id}/smokes/`, `experiments/{id}/RESULTS.md`, and existing falsifier files.
- `appendCostRecord(cwd, record)`.
- `getHypothesisSpend(cwd, id)` or `getAllHypothesisSpends(cwd)`.

Planned is not run. Mentioned is not run. Half-run is not run.
Once the evidence exists on disk, decide the outcome from the matrix below and nothing softer.
| Condition | Outcome | Required action |
|---|---|---|
| All independent audited verdicts are `defensible` | `ALLOW` | Promotion may proceed if no cheap unrun killer remains |
| Any independent audited verdict is `falsified-or-unreproducible` | `BLOCK` | Stop promotion and record the blocking reasoning |
| Mixed audited verdicts with at least one `caveat-required` and none `falsified-or-unreproducible` | `ALLOW WITH CAVEAT` | Tag the claim and ship the caveat with it |
| Fewer than two independent audited verdicts remain after excluding `cannot-audit` | `BLOCK` | Fix the adversary roster or configuration first |
| Required baseline reproduction is missing | `BLOCK` | Reproduce the baseline first |
| Cheapest disconfirming experiment is `<$1` and unrun | `BLOCK` | Run the cheap killer before promoting |
Short form:
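- all `defensible` -> `ALLOW`
- any `falsified-or-unreproducible` -> `BLOCK`
- mixed, with `caveat-required` and no `falsified-or-unreproducible` -> `ALLOW WITH CAVEAT`
- missing baseline reproduction -> `BLOCK`
- `<2` independent audited models -> `BLOCK`
- cheapest disconfirming experiment `<$1` and unrun -> `BLOCK`

`ALLOW WITH CAVEAT` is not decorative. Tag the claim. Attach the limitation where the claim appears. Do not bury it in another file.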
If the claim is actually dead, use updateHypothesisStatus(cwd, id, "FALSIFIED").
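The matrix can be expressed as one ordered function, with every `BLOCK` condition checked before any `ALLOW`. The sketch is a reading of the table under stated assumptions, not the repo's implementation: `Audit`, `ReviewFacts`, and `decideOutcome` are hypothetical, independence is approximated as "a model other than the drafter", and the minimum of two is counted over distinct models.

```typescript
type AuditVerdict = "defensible" | "caveat-required" | "falsified-or-unreproducible" | "cannot-audit";
type Outcome = "ALLOW" | "BLOCK" | "ALLOW WITH CAVEAT";

interface Audit {
  model: string;
  verdict: AuditVerdict;
}

// Hypothetical bundle of facts established earlier in the review.
interface ReviewFacts {
  drafterModel: string;        // audits by this model never count
  baselineRequired: boolean;   // true for comparison claims
  baselineReproduced: boolean; // the reproduction prereg exists
  cheapKillerUnrun: boolean;   // cheapest disconfirming experiment is <$1 and unrun
}

function decideOutcome(audits: Audit[], facts: ReviewFacts): Outcome {
  // Drop self-audits and cannot-audit responses before counting anything.
  const counted = audits.filter(
    (a) => a.model !== facts.drafterModel && a.verdict !== "cannot-audit"
  );
  // Count distinct models, so one model answering twice is still one audit.
  const models = counted
    .map((a) => a.model)
    .filter((m, i, all) => all.indexOf(m) === i);

  if (models.length < 2) return "BLOCK";
  if (facts.baselineRequired && !facts.baselineReproduced) return "BLOCK";
  if (counted.some((a) => a.verdict === "falsified-or-unreproducible")) return "BLOCK";
  if (facts.cheapKillerUnrun) return "BLOCK";
  if (counted.some((a) => a.verdict === "caveat-required")) return "ALLOW WITH CAVEAT";
  return "ALLOW"; // only defensible verdicts remain at this point
}
```

Ordering the blocks first means a caveat can never soften a missing baseline or an unrun cheap killer.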
This is the part people try to skip because the number looks small. Do not help them do that.
Steps:
- `experiment` and `costEstimate` from the adversary set.
- If the cheapest disconfirming experiment costs less than `$1` and is unrun, `BLOCK`.
- `Claim blocked pending cheapest disconfirming experiment (<$1) requested by falsification review.`

If the strongest cheap killer is still sitting there unrun, the claim has not earned authority.
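The cheap-killer rule can be sketched as a search over the proposed experiments. `Proposal`, `cheapestUnrun`, and `cheapKillerBlock` are hypothetical names, `costEstimate` is assumed to be a number in US dollars, and `run` is assumed to be set only from repo artifacts. The sketch looks at the cheapest experiment that has not been run, a slightly stricter reading than "the cheapest experiment overall".

```typescript
// Hypothetical view of one adversary's proposed disconfirming experiment.
interface Proposal {
  experiment: string;   // what the adversary asked for
  costEstimate: number; // assumed to be US dollars
  run: boolean;         // true only if repo artifacts prove it was run
}

const CHEAP_KILLER_MESSAGE =
  "Claim blocked pending cheapest disconfirming experiment (<$1) requested by falsification review.";

// Returns the cheapest proposal that has not been run, or null if none remain.
function cheapestUnrun(proposals: Proposal[]): Proposal | null {
  const unrun = proposals.filter((p) => !p.run);
  if (unrun.length === 0) return null;
  return unrun.reduce((best, p) => (p.costEstimate < best.costEstimate ? p : best));
}

// Returns the blocking message, or null when no sub-$1 killer is outstanding.
function cheapKillerBlock(proposals: Proposal[]): string | null {
  const cheapest = cheapestUnrun(proposals);
  if (cheapest !== null && cheapest.costEstimate < 1) {
    return CHEAP_KILLER_MESSAGE;
  }
  return null;
}
```

With proposals costing $0.42 and $2.10 and neither run, the function returns the blocking message; once the $0.42 check has been run, it returns `null` and the decision moves on to the rest of the matrix.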
Promotion is the point where a provisional observation becomes repo memory.
Steps:
- `BLOCK`, keep the evidence in `experiments/{id}/smokes/` and keep the claim out of every `RESULTS.md`.
- `ALLOW`, promote the reviewed result from `experiments/{id}/smokes/` to `experiments/{id}/RESULTS.md`.
- `RESULTS.md` only after the experiment-level file is correct.
- `ALLOW WITH CAVEAT`, promote only the reviewed sentence and attach the caveat directly beside it.
- `/skill:kill-or-ship`.

Promotion rule: only after falsification passes may anything move from `smokes/` to `RESULTS.md`.

Laundering rule: "promote now, rerun the falsifier later" is not a workflow.
A falsification review is not complete until the record is internally consistent.
Steps:
- `experiments/{id}/falsifiers/`.
- `/skill:surprise-triage`.

| Excuse | Reality |
|---|---|
| `BASELINES.md` already has the competitor number. | A cited number is not reproduced evidence. Proof is `experiments/repro_{name}/prereg.md`. |
| One strong model audit is enough. | The Iron Law requires ≥2 independent model reviews. |
| The drafter model challenging itself should count. | Self-critique is not independence. Same model does not count. |
| The draft sentence is basically the same as the reviewed one. | If the sentence changed, the claim changed. Re-review the shipped sentence. |
| The baseline paper is recent, so reproduction can wait. | Fresh citation is still not reproduction under your judge. |
| `cannot-audit` is close enough to neutral. | Missing audit does not become supportive evidence. |
| The cheap killer is probably noise. | Then run it. If it costs `<$1`, guessing is more expensive than checking. |
| Mixed verdicts mostly mean yes. | Mixed means `ALLOW WITH CAVEAT`, not clean `ALLOW`. |
| We already spent enough on this hypothesis. | Then escalate to kill-or-ship. Do not lower the falsification bar. |
| `smokes/` looks stable, so promotion is safe. | Stable-looking provisional data is still provisional data. |
| We can promote now and clean up the caveat later. | If the caveat does not ship with the claim, the claim is misrepresented. |
| The function returned something shaped like a verdict, so it counts. | Malformed or legacy fallback output must be normalized and may still fail the audit minimum. |
Stop immediately if any of these are true:
- `experiments/{id}/prereg.md` is missing
- `experiments/{id}/judge.lock` is missing or drifted
- `experiments/repro_{name}/prereg.md` is missing
- `BASELINES.md` as reproduction proof
- `cannot-audit`
- `falsified-or-unreproducible`
- `<$1` and you cannot prove it was run
- `smokes/` into `RESULTS.md` before resolving the matrix
- `ALLOW WITH CAVEAT` a clean pass

Any one of these means stop. Do not promote. Fix the broken premise first.
Good
You load `HYPOTHESES.md` with `loadHypotheses(cwd)`, recover `hyp-017`, and review the exact sentence in `experiments/hyp-017/RESULTS.md`: "Model A beats Model B by 4.2 points on Dataset D under judge J." The adversary sees that exact sentence.
Bad
You summarize it as "the model seems better on Dataset D" before sending it for review. Now the falsifier is auditing a softer claim than the one that would ship.
Good
You inspect HypothesisEntry.baselineRef, normalize the comparison target, and verify experiments/repro_llama-3/prereg.md exists and matches the task and judge.
Bad
You find `## Baseline: llama-3` in `BASELINES.md` and decide that is enough. It is not. That proves the number was recorded, not reproduced.
Good
You require two adversary models different from the drafter model and exclude any same-model self-audit from the pass count.
Bad
You count the drafter model's self-critique as one of the two required adversaries. That is the same bias wearing a hat.
Good
Two adversaries propose disconfirming experiments costing $0.42 and $2.10. The $0.42 check has not been run, so you block with "Claim blocked pending cheapest disconfirming experiment (<$1) requested by falsification review."
Bad
You say the $0.42 check is too minor to delay promotion. The cheaper the killer, the less excuse you have not to run it.
Good
One adversary returns defensible; another returns caveat-required. No audited verdict says falsified-or-unreproducible, and no cheap unrun killer remains. You tag the claim ALLOW WITH CAVEAT and ship the caveat with the sentence.
Bad
You average the two judgments into a clean ALLOW. That is how caveats disappear and overclaiming starts.
Good
You keep provisional results inside experiments/{id}/smokes/ until the matrix resolves. Only then do you promote to experiments/{id}/RESULTS.md, and only then do you mirror anything into root RESULTS.md.
Bad
You copy a promising sentence into RESULTS.md first and promise to run falsification afterward. That is not sequencing. That is laundering provisional numbers.
Most bad claims do not survive because the experiment runner is malicious. They survive because nobody forced the claim through hostile review before it became institutional memory.
This phase matters because it:
- `smokes/` provisional and `RESULTS.md` meaningful
- `experiments/{id}/falsifiers/`
- `ALLOW`, `BLOCK`, and `ALLOW WITH CAVEAT`

Once a sentence lands in `RESULTS.md`, people will reuse it. They will quote it in docs, planning, reviews, and future experiments. If the sentence is wrong, the error compounds.
That is why this phase is strict. If the claim cannot survive reproduced baselines, independent adversaries, and cheap disconfirming checks inside the repo, it has no business representing the repo outside it.
After this, use /skill:surprise-triage.