From epistemic-skills
Extracts competitor claims from source papers and reproduces scores locally before quoting. Prevents citing unverified baseline numbers.
npx claudepluginhub atomicstrata/epistemic --plugin epistemic-skills
Slash command
/epistemic-skills:baseline-reproduction
> **Related skills:** `/skill:research-question`, `/skill:preregistration`, `/skill:experiment-execution`, `/skill:falsification-review`, `/skill:verification-before-publication`
A baseline is not a number you remember. It is a chain of evidence you can inspect.
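This skill has two hard steps, in order:
1. Extract the competitor's claim from the source paper.
2. Reproduce the score locally before quoting it.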
Skip step 1 and you compare against a paraphrase. Skip step 2 and you compare against hearsay. Either way, the claim is not fit to quote.
In this repo, baseline work should leave two artifacts:
- baselines/{name}.md: source note with the paper URL, extracted claim, dataset provenance, Hugging Face validation, and pinned revisions
- experiments/{id}/baselines/{name}.md: reproduction note with the exact run you executed under the active hypothesis and locked judge

Use the state helpers in src/state/repo.ts to stay honest: loadHypotheses(cwd), getActiveHypothesis(entries), getJudgeLock(cwd, id), computeJudgeHash(judgeRef, id), loadBaselines(cwd), getBaselineAgeDays(entry), and fileExists(path).
Core principle: the only baseline you may quote is the one whose source claim you extracted yourself and whose score you reproduced yourself.
| Need | Tool or file | Rule |
|---|---|---|
| Active experiment | loadHypotheses(cwd) + getActiveHypothesis(entries) | No active hypothesis, no baseline claim |
| Competitor paper | AlphaXiv alpha CLI | Find the actual paper before touching the number |
| Long paper | alpha get paper --section results | Read the results section first, then the full paper only if needed |
| Source note | baselines/{name}.md | Save the paper URL and extracted claim before reproduction |
| Judge integrity | getJudgeLock(cwd, id) + computeJudgeHash(judgeRef, id) | Missing or drifted lock blocks the run |
| HF dataset metadata | hf_dataset_info | Verify existence, splits, features, access status, and sha |
| HF dataset tree | hf_repo_files | Inspect repo structure before assuming files or configs |
| HF access gate | HF_TOKEN | Gated dataset without HF_TOKEN means stop |
| Drift detection | pinned dataset revision | Record commit SHA or revision tag, never floating main |
| Local evidence | experiments/{id}/baselines/{name}.md | Only the reproduced number is quoteable |
NO COMPETITOR NUMBER WITHOUT SOURCE EXTRACTION AND LOCAL REPRODUCTION
A paper number is context until you extract it from the source. A source-extracted number is still context until you reproduce it. Only the reproduced score under your locked contract is a baseline.
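The three states above can be sketched as a small classifier. This is a minimal illustration only: `EvidenceLevel`, `CompetitorNumber`, `evidenceLevel`, and `mayQuoteAsBaseline` are hypothetical names, not helpers from src/state/repo.ts.

```typescript
// Hypothetical types for illustration; not part of src/state/repo.ts.
type EvidenceLevel = "context" | "extracted" | "reproduced";

interface CompetitorNumber {
  value: number;
  extractedFromSource: boolean; // claim was pulled from the paper itself
  reproducedUnderLock: boolean; // score was measured locally under the locked judge
}

// A number only climbs the ladder in order: extraction first, then reproduction.
function evidenceLevel(n: CompetitorNumber): EvidenceLevel {
  if (!n.extractedFromSource) return "context";
  if (!n.reproducedUnderLock) return "extracted";
  return "reproduced";
}

// Only a reproduced number may be quoted as a baseline.
function mayQuoteAsBaseline(n: CompetitorNumber): boolean {
  return evidenceLevel(n) === "reproduced";
}
```

Note that a local run without source extraction still classifies as context here, which matches the ordering of the two steps.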
Use this skill before you write any sentence shaped like:
Use it when HypothesisEntry.baselineRef in HYPOTHESES.md points at a paper, competitor repo, benchmark card, leaderboard number, or public checkpoint.
Use it before writing comparison language into experiments/{id}/RESULTS.md, a draft, a PR description, or a decision memo.
Use it when you inherit an old reproduction and need to know whether it is still quoteable.
Use it again whenever the source note or reproduction note is older than 30 days.
Freshness is part of validity here.
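The 30-day rule can be sketched as a date check. This is an assumption-laden illustration: `ageInDays` and `isStale` are hypothetical names, the retrieval date is assumed to be a `YYYY-MM-DD` string as in the note templates, and the repo's own getBaselineAgeDays(entry) may compute age differently.

```typescript
const MAX_AGE_DAYS = 30; // threshold stated in this skill
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Hypothetical helper: whole days elapsed since a YYYY-MM-DD retrieval date (UTC).
function ageInDays(retrieved: string, now: Date): number {
  const then = Date.parse(`${retrieved}T00:00:00Z`);
  if (Number.isNaN(then)) {
    throw new Error(`Unparseable retrieval date: ${retrieved}`);
  }
  return Math.floor((now.getTime() - then) / MS_PER_DAY);
}

// "Older than 30 days" is read here as strictly more than 30 whole days.
function isStale(retrieved: string, now: Date): boolean {
  return ageInDays(retrieved, now) > MAX_AGE_DAYS;
}
```

Passing `now` explicitly keeps the check deterministic, so the same note gives the same answer in review as it did at run time.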
Do not use this skill while the question is still vague.
Use /skill:research-question first.
Do not use this skill before experiments/{id}/prereg.md exists.
Use /skill:preregistration first.
Do not use this skill for internal ablations already scored under the same locked judge in the same run family.
Do not use this skill to rescue a comparison claim after you already wrote it.
Delete the unsupported claim first.
Then reproduce the baseline.
Do not use this skill as a paper-reading substitute.
Reading is input gathering.
Baseline reproduction is evidence generation.
A reproduced baseline means all of this is true:
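- The source claim is saved in baselines/{name}.md
- The dataset is validated with hf_dataset_info and hf_repo_files
- You verified experiments/{id}/judge.lock instead of assuming it was fine
- The reproduced run is recorded in experiments/{id}/baselines/{name}.md

If any one of those is false, you do not have a reproduced baseline yet.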
- Call loadHypotheses(cwd).
- Call getActiveHypothesis(entries).
- Read the HypothesisEntry carefully: id, claim, judgeRef, baselineRef, costCap, and status.
- Treat baselineRef as a lead, not proof.
- [...] HYPOTHESES.md and experiments/{id}/prereg.md.
- [...] {name}.
- [...] baselines/{name}.md and experiments/{id}/baselines/{name}.md.
- Use the alpha CLI to search for the competitor paper.
- If the paper is long, start with alpha get paper --section results.
- Save the paper URL and extracted claim to baselines/{name}.md.
- Call getJudgeLock(cwd, active.id).
- If it returns null, stop.
- Call computeJudgeHash(active.judgeRef, active.id).
- Compare the result with experiments/{id}/judge.lock.
- Record active.judgeRef in the reproduction note.
- Do not call writeJudgeLock(...) here just to paper over drift.
- active.judgeRef is the contract you are trying to preserve.
- [...] not comparable and stop pretending it is head-to-head.
- Call hf_dataset_info before you run anything.
- [...] url and sha.
- If the dataset is gated and HF_TOKEN is not set, stop.
- Use hf_repo_files to inspect the repo structure before assuming config names, file layout, or split assets.
- Use the sha returned by hf_dataset_info as the pin.
- Never use main, latest, or "current" as the dataset version.
- [...] baselines/{name}.md.
- [...] experiments/{id}/baselines/{name}.md.
- latest, main, and current are not versions.
- [...] unknown and say why.
- [...] baselines/{name}.md.
- [...] experiments/{id}/baselines/{name}.md.
- Use fileExists(path) if you need a direct existence check.
- [...] BASELINES.md, call loadBaselines(cwd).
- [...] BaselineEntry.
- Call getBaselineAgeDays(entry).
- Write baselines/{name}.md as soon as source extraction is complete and again if reproduction clarifies provenance.
- Write experiments/{id}/baselines/{name}.md as soon as the run ends.
- Include active.judgeRef, verified judge.lock hash, full reproduction command, and any material contract differences.
- [...] unknown, not omitted.
- [...] not comparable.
- Update HYPOTHESES.md or downstream result drafts if the reproduced baseline changes the story.

baselines/{name}.md

Use a structure that makes source drift obvious. Do not hide uncertainty in prose. Make the unknowns explicit.
# Baseline Source: <name>
- **Paper URL:** <exact paper url>
- **Date retrieved:** <YYYY-MM-DD>
- **Extraction path:** <alpha results section first | full paper required>
- **Extracted claim:** <verbatim or tightly quoted claim from the paper>
- **Table or figure:** <table 2 | figure 4 | unknown>
- **Metric:** <metric>
- **Dataset:** <dataset>
- **Split:** <split>
- **Competitor version:** <tag|commit|checkpoint|snapshot|unknown>
- **HF dataset repo:** <org/name|none>
- **HF access status:** <public|gated|private|unknown>
- **HF pinned revision:** <sha|tag|none|unknown>
- **HF splits verified:** <yes|no|unknown>
- **HF features verified:** <yes|no|unknown>
- **HF repo structure checked:** <yes|no>
- **Notes:** <caveats, ambiguities, or why full paper was needed>
This file exists so you can later prove what the competitor claimed before you touched their code. If the source claim itself is muddy, the reproduction note should inherit that uncertainty instead of pretending it vanished.
experiments/{id}/baselines/{name}.md

Use a structure that makes missing provenance obvious.
# Baseline: <name>
- **Hypothesis ID:** <id>
- **Claim under test:** <claim>
- **Source note:** `baselines/<name>.md`
- **Source URL:** <url>
- **Extracted claim:** <reported number and wording from source note>
- **Competitor version:** <tag|commit|checkpoint|snapshot|unknown>
- **Date retrieved:** <YYYY-MM-DD>
- **Task:** <task>
- **Dataset:** <dataset>
- **Split:** <split>
- **Metric:** <metric>
- **Dataset revision:** <sha|tag|unknown>
- **Judge ref:** <active.judgeRef>
- **Judge lock hash:** <contents of experiments/{id}/judge.lock>
- **Source score:** <reported number or not stated>
- **Reproduced score:** <measured number or failed to reproduce>
- **Reproduction command:** `<full command>`
- **Environment pins:** <versions, commit hashes, dataset revision>
- **Contract differences:** <none or exact differences>
- **Quoteable:** <yes|no>
- **Quoteability reason:** <why>
## Notes
<plain-language explanation of mismatches, failures, or caveats>
If you also mirror the baseline into a repo-level BASELINES.md, keep it parseable by loadBaselines(cwd).
That means fields compatible with BaselineEntry: name, url, score, judge, version, and retrieved.
The durable run-specific artifact is still experiments/{id}/baselines/{name}.md.
That file is the local evidence for the quoted number.
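Because both notes use `- **Field:** value` lines, a completeness check can be sketched in a few lines. This is a minimal illustration, not the repo's parser: `parseNoteFields` and `missingFields` are hypothetical names, and the repo's loadBaselines(cwd) may read its files differently.

```typescript
// Parses lines shaped like "- **Field:** value" into a plain field map.
function parseNoteFields(markdown: string): Record<string, string> {
  const fields: Record<string, string> = {};
  const pattern = /^- \*\*(.+?):\*\*\s*(.*)$/;
  for (const line of markdown.split(/\r?\n/)) {
    const match = pattern.exec(line.trim());
    if (!match) continue;
    const key = match[1] ?? "";
    const value = match[2] ?? "";
    if (key) fields[key] = value.trim();
  }
  return fields;
}

// Returns required fields that are absent or left empty.
// An explicit "unknown" counts as present: unknowns are stated, not omitted.
function missingFields(fields: Record<string, string>, required: string[]): string[] {
  return required.filter((name) => (fields[name] ?? "") === "");
}
```

A check like this only confirms that every field was filled in, including fields deliberately marked unknown. It says nothing about whether the values are true.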
Someone copied the number from a blog post, README, issue comment, or leaderboard row.
That is not source extraction.
Use the paper itself.
If the paper is long, start with alpha get paper --section results.
Only read the full paper when the results section does not settle the claim.
Reading the whole paper first feels thorough. Often it is just undisciplined. If the critical number lives in the results section, extract it there first so you do not let surrounding narrative rewrite the claim in your head.
The dataset name appears in the paper, so somebody assumes the Hub repo still exists, still has the same splits, and is still public.
That is how silent dataset drift becomes fake progress.
Call hf_dataset_info and hf_repo_files.
Verify the actual repo, structure, and access state.
Public metadata is not public access.
If the dataset is gated and HF_TOKEN is not set, stop.
Do not build a comparison on a dataset you cannot honestly inspect or fetch.
main is not a pin.
latest is not a pin.
If you did not record a commit SHA or revision tag, you created an invisible moving part in the baseline.
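A pin check along these lines can be sketched as follows. It assumes a commit SHA is a full 40-character hexadecimal string and treats any other non-floating value as a tag, which it cannot verify against the Hub. `classifyRevision` and `isPinned` are hypothetical names.

```typescript
type PinKind = "commit-sha" | "tag" | "floating" | "missing";

// Names this skill calls out as moving targets rather than versions.
const FLOATING_REFS = ["main", "latest", "current"];

function classifyRevision(revision: string): PinKind {
  const rev = revision.trim();
  const lowered = rev.toLowerCase();
  if (rev === "" || lowered === "unknown" || lowered === "none") return "missing";
  if (FLOATING_REFS.indexOf(lowered) !== -1) return "floating";
  // Assumption: a pinned commit is recorded as the full 40-hex SHA.
  if (/^[0-9a-f]{40}$/.test(lowered)) return "commit-sha";
  // Anything else is treated as a tag; this sketch cannot confirm it exists.
  return "tag";
}

function isPinned(revision: string): boolean {
  const kind = classifyRevision(revision);
  return kind === "commit-sha" || kind === "tag";
}
```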
A different judge is a different experiment.
Human preference, a provider eval, and active.judgeRef are not interchangeable.
Re-score under your lock or drop the head-to-head claim.
A dev-set number is not a test-set number. Macro average is not micro average. Pairwise win rate is not exact-match accuracy. If the contract moved, the baseline moved.
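One way to make contract drift explicit is to compare the source's evaluation contract with yours field by field. In this sketch, `EvalContract`, `contractDifferences`, and `comparability` are hypothetical names, and the four fields are an assumed minimum rather than the full contract.

```typescript
// Hypothetical minimal contract: the fields that must match for a head-to-head claim.
interface EvalContract {
  dataset: string;
  split: string;
  metric: string;
  judge: string;
}

// Lists every field where the source contract and the local contract disagree.
function contractDifferences(source: EvalContract, local: EvalContract): string[] {
  const keys: (keyof EvalContract)[] = ["dataset", "split", "metric", "judge"];
  const diffs: string[] = [];
  for (const key of keys) {
    if (source[key] !== local[key]) {
      diffs.push(`${key}: source="${source[key]}" local="${local[key]}"`);
    }
  }
  return diffs;
}

// Any difference at all downgrades the comparison.
function comparability(source: EvalContract, local: EvalContract): "head-to-head" | "not comparable" {
  return contractDifferences(source, local).length === 0 ? "head-to-head" : "not comparable";
}
```

The returned strings are the kind of text the Contract differences field of the reproduction note can hold.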
A paper number without a local run is context, not a baseline. You may cite it as published context. You may not use it as the thing you beat.
| Excuse | Reality |
|---|---|
| "The paper already reports the number." | Reported is not extracted and extracted is not reproduced. |
| "I found the score on a leaderboard." | Leaderboards copy claims; they do not replace the source paper. |
| "The paper is short enough that I can just skim it." | Skimming is how table notes, split caveats, and metric qualifiers disappear. |
| "The results section probably says the same thing as the rest." | Then prove it quickly with alpha get paper --section results before spending more time. |
| "The HF dataset repo looks standard." | Standard-looking repos still drift, gate, rename configs, and change files. Verify it. |
| "The dataset is gated but the card is public." | Public metadata is not usable data. No HF_TOKEN, no honest validation. |
| "I can pin the dataset revision later." | Later means after the comparison already floated. Pin it now. |
| "Using the source score is more conservative." | No. Use the score you actually measured under your contract. |
| "The artifact already exists." | Existing and quoteable are different states. Check completeness, lock, freshness, and pins. |
| "I can reproduce it after I finish my own run." | Then you are incentivized to move the baseline to fit the story. |
Stop immediately if any of these are true:
- getActiveHypothesis(...) returns nothing
- experiments/{id}/prereg.md is missing
- you skipped alpha get paper --section results and are guessing from memory anyway
- getJudgeLock(cwd, id) returns null
- computeJudgeHash(active.judgeRef, active.id) does not match experiments/{id}/judge.lock
- you have not called hf_dataset_info
- you have not called hf_repo_files
- the dataset is gated and HF_TOKEN is not set

All of these mean the comparison is not ready to quote.
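Several of these stop conditions can be sketched as a single preflight gate. The inputs here are plain values you would gather beforehand. `BaselinePreflight` and `blockingReasons` are hypothetical names, and the sketch deliberately does not call the repo's helpers, whose exact signatures are not shown here.

```typescript
// Hypothetical snapshot of the state gathered before a baseline run.
interface BaselinePreflight {
  activeHypothesisId: string | null; // null when no hypothesis is active
  preregExists: boolean;             // experiments/{id}/prereg.md is present
  judgeLockHash: string | null;      // stored lock contents, null when missing
  computedJudgeHash: string;         // hash freshly computed from the judge
  datasetGated: boolean;
  hfTokenSet: boolean;
}

// Collects every reason to stop; an empty list means none of these checks blocked.
function blockingReasons(p: BaselinePreflight): string[] {
  const reasons: string[] = [];
  if (p.activeHypothesisId === null) reasons.push("no active hypothesis");
  if (!p.preregExists) reasons.push("prereg.md is missing");
  if (p.judgeLockHash === null) {
    reasons.push("judge lock is missing");
  } else if (p.judgeLockHash !== p.computedJudgeHash) {
    reasons.push("judge hash does not match the lock");
  }
  if (p.datasetGated && !p.hfTokenSet) {
    reasons.push("dataset is gated and HF_TOKEN is not set");
  }
  return reasons;
}
```

Returning all reasons at once, rather than failing on the first, lets the note record everything that blocked the run.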
Good:
"I found the competitor paper with the alpha CLI, used alpha get paper --section results, extracted the exact figure from Table 3, and saved the URL plus claim to baselines/model-x.md."
Bad:
"I remembered the paper was around 84 and copied the leaderboard number into the baseline note."
Good:
"I called hf_dataset_info to verify the dataset exists, inspected splits and features, saw it returns sha, and used hf_repo_files to confirm the repo layout before pinning the dataset revision."
Bad:
"The paper says it uses org/dataset, so I assumed the Hub repo still matches the paper."
Good:
"hf_dataset_info showed gated: true, HF_TOKEN was unset, so I blocked the baseline instead of pretending the dataset was usable."
Bad:
"The metadata page loads in a browser, so I treated the gated dataset as effectively public."
Good:
"The source reports 84.1, but my locked-judge reproduction is 86.4, so the comparison uses 86.4."
Bad:
"The paper headline is lower, so I used that to stay consistent with the literature."
Good:
"The competitor reports dev-set human ratings, while this hypothesis is about test-set locked-judge win rate, so I recorded the source and marked the comparison not comparable."
Bad:
"The tasks are similar, so the comparison is directionally fair."
Most fake wins are not outright fabrication. They are baseline discipline failures. A paraphrased paper claim, a gated dataset you never really inspected, a floating dataset revision, a different judge, or a stale reproduction is enough to manufacture progress that is not real.
This skill forces the comparison onto one explicit contract. It makes the source claim auditable. It makes dataset drift visible. It gives later review real evidence instead of folklore. And it saves you from shipping a claim that only existed because nobody made you extract the paper number and run the baseline yourself.
After this, use /skill:experiment-execution