Skill

experiment-execution

Executes preregistered hypotheses under locked methods, generating provisional evidence without contaminating headline outputs. For runs that respect preregistration, judge locks, and cost accounting.

developer-tools

npx claudepluginhub atomicstrata/epistemic --plugin epistemic-skills

Slash command

/epistemic-skills:experiment-execution

> **Related skills:** `/skill:preregistration`, `/skill:baseline-reproduction`, `/skill:falsification-review`, `/skill:kill-or-ship`

Stats

Language: TypeScript
Stars: 7
Maintenance: Excellent
Last commit: Jun 4, 2026

Experiment Execution

Overview

Execution is where method turns into evidence. It is not where you redesign the method, switch carriers because today is inconvenient, or leak a promising number into a headline file.

Your job is narrow:

- load the active hypothesis from repo state
- obey its computeTarget
- enforce prereg.md, judge.lock, and environment.lock before launch
- run the committed sample size n
- keep outputs provisional under experiments/{id}/smokes/
- append a real compute row after every run with appendCostRecord(...)

Core principle: execution is measurement under a locked contract. Routing, environment, and cost accounting are part of that contract. If you improvise on any of them, downstream review cannot rescue the result.

Quick Reference

| Need | File or API | Rule |
| --- | --- | --- |
| Load hypothesis state | `loadHypotheses(cwd)` | Start from repo state, not memory |
| Select the live experiment | `getActiveHypothesis(entries)` | Do not guess the active `id` |
| Read execution target | `HypothesisEntry.computeTarget` | `local`, `docker`, or `modal` is part of the method |
| Mark execution start | `updateHypothesisStatus(cwd, id, "RUNNING")` | Bookkeeping only |
| Check preregistration | `fileExists("experiments/{id}/prereg.md")` | No prereg, no run |
| Read judge lock | `getJudgeLock(cwd, id)` | Missing or drifted judge lock blocks scored execution |
| Read environment lock | `getEnvironmentLock(cwd, id)` | Missing or drifted environment lock blocks all execution |
| Compare judge hash | `computeJudgeHash(judgeRef, id)` | Check before the first scored call |
| Read spend | `getHypothesisSpend(cwd, id)` | Watch the cap before more runs |
| Split spend by type | `getHypothesisSpendByCategory(cwd, id)` | Separate `llm` from `compute` burn |
| Append compute cost | `appendCostRecord(cwd, record)` | Use `category: "compute"` after every attempted run |
| Local or docker env contract | active `Dockerfile` + `requirements.txt` | Hash the exact pair used for the run |
| Modal env contract | `experiments/{id}/modal-app.py` | Hash the exact file you will execute |
| Docker writable path | `experiments/{id}/` | Mount this read-write; everything else read-only |
| Provisional artifacts | `experiments/{id}/smokes/` | Logs and raw outputs live here |
| Headline files | `experiments/{id}/RESULTS.md` and any root `RESULTS.md` | Do not write here yet |

The Iron Law

NO RUN WITHOUT LOCKS; NO RUN WITHOUT A LEDGER ROW

Every legitimate run leaves four traces:

- experiments/{id}/prereg.md
- the relevant lock files
- provisional artifacts under experiments/{id}/smokes/
- a cost row in .epistemic/cost-ledger.jsonl

If any trace is missing, you do not have evidence. You have a story.

When to Use

Use this skill when:

- experiments/{id}/prereg.md already exists
- a hypothesis in HYPOTHESES.md is OPEN or RUNNING
- you are about to write experiment code, launch a benchmark, or call a model-backed evaluator
- the run will create provisional outputs that must stay out of headline files
- the active hypothesis specifies computeTarget: local, docker, or modal
- the run consumes real budget and every attempt must land in the ledger
- environment.lock and, if applicable, judge.lock must be enforced before launch
- you need to finish the preregistered sample size n and compute the promised summary

When NOT to Use

Do not use this skill when:

- the claim is still being framed; use /skill:research-question
- the method is still being locked; use /skill:preregistration
- you are reproducing an external baseline; use /skill:baseline-reproduction
- you are attacking the finished result; use /skill:falsification-review
- you are making the terminal decision; use /skill:kill-or-ship
- you want one “quick run” before preregistration or lock files are settled

A quick run that changes what you believe is not a harmless preview. It is a real run with missing governance.

The Process

1. Identify the active hypothesis and its carrier

Call loadHypotheses(cwd).
Call getActiveHypothesis(entries).
Read the execution-critical fields from the live entry: id, claim, falsifier, n, judgeRef, baselineRef, costCap, computeTarget, and status.
Treat computeTarget as part of the preregistered method.
Valid values are local, docker, and modal.
Do not infer the target from convenience, machine state, installed tools, or personal preference.
Derive the working paths immediately:
- experiments/{id}/prereg.md
- experiments/{id}/judge.lock
- experiments/{id}/environment.lock
- experiments/{id}/modal-app.py
- experiments/{id}/smokes/
- experiments/{id}/RESULTS.md
- .epistemic/cost-ledger.jsonl
If execution is legitimately starting now, move OPEN to RUNNING with updateHypothesisStatus(cwd, id, "RUNNING").
Do not mark anything CONFIRMED here.
Execution measures.
Review interprets.
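
The path derivation in this step can be sketched as a single pure function. This is a minimal illustration, not the repo's API: `deriveExperimentPaths`, the `ExperimentPaths` shape, and the id validation rule are all assumptions made for the example.

```typescript
// Hypothetical helper: name, return shape, and validation rule are illustrative only.
interface ExperimentPaths {
  prereg: string;
  judgeLock: string;
  environmentLock: string;
  modalApp: string;
  smokesDir: string;
  results: string;
  costLedger: string;
}

function deriveExperimentPaths(id: string): ExperimentPaths {
  // Assumption: ids are simple slugs. Reject anything that could escape experiments/.
  if (id.length === 0 || id.includes("/") || id.includes("..")) {
    throw new Error(`Refusing to derive paths for suspicious id: "${id}"`);
  }
  const base = `experiments/${id}`;
  return {
    prereg: `${base}/prereg.md`,
    judgeLock: `${base}/judge.lock`,
    environmentLock: `${base}/environment.lock`,
    modalApp: `${base}/modal-app.py`,
    smokesDir: `${base}/smokes/`,
    results: `${base}/RESULTS.md`,
    // The ledger is repo-wide, not per experiment.
    costLedger: ".epistemic/cost-ledger.jsonl",
  };
}
```

Deriving every path once, from the active id, keeps later steps from rebuilding paths by hand and drifting to the wrong experiment directory.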

2. Refuse to run without `prereg.md`

Confirm experiments/{id}/prereg.md exists with fileExists(path).
If it does not exist, stop immediately.
Do not run bun, python, pytest, benchmark, train, docker, or modal commands and promise yourself you will document the method later.
Read the preregistration and extract the operational contract:
- planned sample size n
- task or dataset slice
- metric definition
- judge configuration if scoring is involved
- stopping rule
- retry or exclusion policy
- execution carrier if the prereg makes it explicit
Compare that contract against the active HypothesisEntry.
If n, judgeRef, baselineRef, or computeTarget disagree between preregistration and HYPOTHESES.md, repair the inconsistency before running.
Treat preregistration as executable governance.
It is not a diary entry.
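
The contract comparison can be sketched as a field-by-field diff. The `ExecutionContract` type below is a hypothetical subset containing only the four fields this step names; the real `HypothesisEntry` carries more, and how the values are extracted from prereg.md is assumed to happen elsewhere.

```typescript
type ComputeTarget = "local" | "docker" | "modal";

// Hypothetical subset: only the fields that must agree before a run.
interface ExecutionContract {
  n: number;
  judgeRef: string;
  baselineRef: string;
  computeTarget: ComputeTarget;
}

// Returns the names of the fields that disagree; an empty array means the
// preregistration and the hypothesis entry are consistent.
function findContractMismatches(
  prereg: ExecutionContract,
  entry: ExecutionContract,
): string[] {
  const fields: (keyof ExecutionContract)[] = [
    "n",
    "judgeRef",
    "baselineRef",
    "computeTarget",
  ];
  return fields.filter((field) => prereg[field] !== entry[field]);
}
```

A non-empty result is a reason to stop and repair the inconsistency, not a warning to log and ignore.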

3. Enforce the judge lock before any scored evaluation

Read judgeRef from the active hypothesis.
Load the stored lock with getJudgeLock(cwd, id).
Compute the expected value with computeJudgeHash(judgeRef, id).
Compare the two before the first scored call, not after the number is already on your screen.
If the lock is missing and this is the first legitimate locked run, create it deliberately with writeJudgeLock(cwd, id, judgeRef).
If the lock exists and does not match, stop.
Do not rationalize that the prompt change was cosmetic or the model version shift was basically the same.
Judge drift is method drift.
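
The three outcomes of this step (create, proceed, stop) can be sketched as a small decision function. It assumes the stored lock is a hash string and that a missing lock is represented as `null`; `decideJudgeLock` and `JudgeLockDecision` are names invented for this example.

```typescript
type JudgeLockDecision = "create" | "proceed" | "stop";

// storedHash: assumed result of reading judge.lock (null when the file is missing).
// expectedHash: assumed result of hashing the current judge configuration.
function decideJudgeLock(
  storedHash: string | null,
  expectedHash: string,
  isFirstLockedRun: boolean,
): JudgeLockDecision {
  if (storedHash === null) {
    // A missing lock is only acceptable on the first legitimate locked run.
    return isFirstLockedRun ? "create" : "stop";
  }
  // An existing lock must match exactly; there is no "close enough".
  return storedHash === expectedHash ? "proceed" : "stop";
}
```

Making the decision before any scored call keeps the comparison independent of whatever number the judge would have produced.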

4. Enforce the environment lock before any compute launches

Load the stored lock with getEnvironmentLock(cwd, id).
If it returns null, stop.
Execution reads environment.lock.
It does not mint or rewrite it mid-run.
For local and docker, the environment contract is the exact Dockerfile and requirements.txt registered for this run.
If the repo could plausibly contain more than one pair, preregistration must name the pair.
Execution does not guess.
Hash the current contents of that Dockerfile plus requirements.txt in a deterministic order and compare the result to environment.lock.
If either file is missing, stop.
Missing dependency files are a contract failure, not a setup detail.
For modal, the environment contract is experiments/{id}/modal-app.py.
Write that file from the locked protocol if the modal target requires it, then treat the exact file you will execute as frozen.
Hash the current contents of modal-app.py and compare the result to environment.lock.
If the stored lock and current hash differ, stop.
That is environment drift.
Do not “just refresh” environment.lock after editing Dockerfile, requirements.txt, or modal-app.py.
The lock exists to prevent exactly that move.
Local and docker share an environment contract.
The carrier may differ.
The locked dependencies do not.
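
Hashing files "in a deterministic order" can be sketched as follows. SHA-256, the name sorting, and the NUL separators are assumptions made for this example; this document does not show how the repo actually builds environment.lock, so a hash produced by this sketch should not be expected to match an existing lock.

```typescript
import { createHash } from "node:crypto";

// files maps a file name to its full text content, for example
// { "Dockerfile": "...", "requirements.txt": "..." } or { "modal-app.py": "..." }.
function hashEnvironmentFiles(files: Record<string, string>): string {
  const hash = createHash("sha256");
  // Sort by name so the result does not depend on object insertion order.
  const entries = Object.entries(files).sort(([a], [b]) =>
    a < b ? -1 : a > b ? 1 : 0,
  );
  for (const [name, content] of entries) {
    // NUL separators keep "ab" + "c" from colliding with "a" + "bc".
    hash.update(name);
    hash.update("\0");
    hash.update(content);
    hash.update("\0");
  }
  return hash.digest("hex");
}
```

Any one-character edit to a dependency file changes the digest, which is the property the lock comparison relies on.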

5. Arm the cost ledger before the first run

The canonical ledger is .epistemic/cost-ledger.jsonl.
Read the current total with getHypothesisSpend(cwd, id).
If you need the split, read getHypothesisSpendByCategory(cwd, id).
If you need portfolio context, read getAllHypothesisSpends(cwd).
Treat compute cost as first-class spend, not invisible overhead.
Plan to append one CostRecord with category: "compute" after every attempted run.
Failed launches still count as attempts and still get a ledger row.
Do not batch cost entries at the end of the day.
If the remaining budget cannot support the committed run set, stop and hand off to the override or kill flow.
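
The budget check at the end of this step is simple arithmetic. A minimal sketch, assuming a fixed per-run cost estimate decided before launch; `canAffordRunSet` and its parameters are illustrative names, with `spentSoFar` standing in for the value read from `getHypothesisSpend(cwd, id)`.

```typescript
// Returns true when the remaining budget covers all remaining committed runs.
function canAffordRunSet(
  costCap: number,
  spentSoFar: number,
  remainingRuns: number,
  estimatedCostPerRun: number,
): boolean {
  if (remainingRuns < 0 || estimatedCostPerRun < 0) {
    throw new Error("remainingRuns and estimatedCostPerRun must be non-negative");
  }
  const remainingBudget = costCap - spentSoFar;
  return remainingRuns * estimatedCostPerRun <= remainingBudget;
}
```

For example, with a cap of 10, spend of 2, and 5 remaining runs estimated at 1.12 each, the planned 5.6 fits inside the remaining 8. With spend of 8 it does not, and the run set should be handed to the override or kill flow rather than started.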

6. Route execution from `computeTarget`, not from convenience

Branch strictly on h.computeTarget.
local means run in a virtual environment under the locked dependency contract.
Use the venv as the carrier, not as permission to install extras or patch dependencies ad hoc.
Capture the merged stdout/stderr and persist it to experiments/{id}/smokes/run-{n}.log, where {n} is the run index (for example run-001.log), not the sample size n.
docker means build from the registered Dockerfile, then execute inside a container.
Run from /work inside the container.
Mount the repo read-only and mount experiments/{id}/ read-write.
The safe shape is:

docker run --rm \
  -v "$(pwd):/work:ro" \
  -v "$(pwd)/experiments/{id}:/work/experiments/{id}:rw" \
  -w /work \
  <image> <command>

That mount pattern is containment, not decoration.
The container must not rewrite unrelated files.
Capture build and run output.
If you use one log per run, it must be complete enough to audit the launch.
modal means write experiments/{id}/modal-app.py, defining a modal.App and registering the remote function with the @app.function() decorator.
Install dependencies inside the Modal image definition, not ad hoc on the host.
Launch with modal run experiments/{id}/modal-app.py.
Capture the merged stdout/stderr from that command and persist it to experiments/{id}/smokes/run-{n}.log.
Never switch carriers mid-experiment because another path feels easier today.
Carrier drift is method drift.
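
The docker mount pattern above can be sketched as an argument builder. `dockerRunArgs` is a hypothetical helper; it returns an argv array rather than a shell string, which could be passed to a process spawner such as Node's `child_process.spawn("docker", args)` without shell quoting concerns.

```typescript
// Builds the arguments for `docker run` with the repo mounted read-only and
// only the experiment directory mounted read-write.
function dockerRunArgs(
  repoRoot: string,
  id: string,
  image: string,
  command: string[],
): string[] {
  const experimentDir = `experiments/${id}`;
  return [
    "run",
    "--rm",
    "-v",
    `${repoRoot}:/work:ro`,
    "-v",
    `${repoRoot}/${experimentDir}:/work/${experimentDir}:rw`,
    "-w",
    "/work",
    image,
    ...command,
  ];
}
```

Because the only `:rw` mount is derived from the hypothesis id, a run for one experiment cannot write into another experiment's directory or into the rest of the repo.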

7. Append compute cost immediately after each run

After every attempted run, call appendCostRecord(cwd, record).
Use the real CostRecord shape from src/state/repo.ts.
Set category: "compute".
For local and docker, record estimatedCost: 0.
In this repo, those carriers are tracked as compute work with zero billed external cost.
For modal, record estimatedCost = gpuSeconds × rate.
Fix rate before the run.
Do not reverse-engineer it after seeing the result.
Set toolName to the actual backend.
Set isError to reflect whether the run failed.
Append the row immediately after the run ends.
Do not rely on memory.
Do not leave Modal compute blank because estimating it is annoying.

8. Finish the committed sample size

Do not summarize after 3 of 30 runs because the trend looks obvious.
Do not stop early because the result is already good enough for a slide.
Do not stop early because the result is ugly and you would rather redesign the method.
Complete the full n unless the preregistered stopping rule explicitly says otherwise.
Keep the important invariants fixed across runs:
- same dataset slice
- same prompt and extraction rule
- same judge
- same model settings
- same metric definition
- same computeTarget
Do not peek at early outputs and then revise the method mid-run.
That means no prompt edits, no seed swaps, no ad hoc retries, no threshold tweaks, no carrier changes, and no selective cleanup after seeing weak runs.
If infrastructure actually fails, follow the retry rule that was already written in preregistration.
If preregistration did not define the failure policy, stop and repair the protocol before continuing.

9. Store provisional artifacts in `smokes/` and nowhere else

Use experiments/{id}/smokes/ as the provisional artifact directory.
Put raw per-run outputs there with deterministic names such as:
- run-001.log
- run-001.json
- run-002.log
- aggregate.md
- notes.md
Keep experiment-local evidence under the experiment directory even if the repo scaffold also contains a top-level smokes/.
Treat everything in experiments/{id}/smokes/ as provisional and non-quotable.
Non-quotable means the number does not belong in:
- experiments/{id}/RESULTS.md
- a root RESULTS.md
- a README
- a PR description
- a commit message
- a polished claim sentence
Working notes belong in smokes/ too if they mention provisional numbers.
Do not write to experiments/{id}/falsifiers/{model}.md here.
That directory belongs to later adversarial review.
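
Deterministic artifact names can be sketched with zero-padded run indexes. The three-digit width follows the run-001 examples above; `runArtifactName` and `smokesPath` are hypothetical helpers, not part of the repo.

```typescript
// Produces names such as run-001.log or run-012.json.
function runArtifactName(runIndex: number, extension: "log" | "json"): string {
  if (!Number.isInteger(runIndex) || runIndex < 1) {
    throw new Error(`Run index must be a positive integer, got ${runIndex}`);
  }
  return `run-${String(runIndex).padStart(3, "0")}.${extension}`;
}

// Joins a file name onto the experiment-local provisional directory.
function smokesPath(id: string, fileName: string): string {
  return `experiments/${id}/smokes/${fileName}`;
}
```

Zero padding keeps lexical and numeric ordering identical for up to 999 runs, so a plain directory listing shows gaps in the sequence at a glance.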

10. Compute the prescribed summary only after collection ends

Once the full sample is complete, compute the summary statistics named in preregistration.
Mean, median, pass rate, win rate, confidence interval, and error rate are valid only if they were planned.
Do not invent a new flattering metric because the original one underperformed.
Keep raw observations and summary outputs side by side inside experiments/{id}/smokes/.
Raw data without a summary is hard to audit.
Summary without raw data is easy to manipulate.
If the completed summary is surprising, note the surprise.
Do not retroactively rewrite the method.
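
One way to enforce "only planned statistics" is to make the summary function take the planned list as input and compute nothing else. A minimal sketch covering mean and median only; the `PlannedStatistic` type and `summarize` function are illustrative, and the planned list is assumed to come from prereg.md.

```typescript
type PlannedStatistic = "mean" | "median";

function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

function median(values: number[]): number {
  // Sort a copy so the raw observations are left untouched.
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Computes only the statistics named in `planned`; unplanned ones stay absent.
function summarize(
  values: number[],
  planned: PlannedStatistic[],
): Partial<Record<PlannedStatistic, number>> {
  if (values.length === 0) throw new Error("No observations to summarize.");
  const summary: Partial<Record<PlannedStatistic, number>> = {};
  for (const stat of planned) {
    summary[stat] = stat === "mean" ? mean(values) : median(values);
  }
  return summary;
}
```

If the preregistration named only the mean, the median never appears in the output, so there is nothing to switch to after seeing the data.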

11. Stop at the publication boundary

When execution is complete, stop.
Do not write to experiments/{id}/RESULTS.md yet.
If the repo also maintains a root RESULTS.md, do not write there either.
Do not upgrade the hypothesis to CONFIRMED just because the mean looks good.
Do not write a headline sentence like “we beat baseline X by Y%” before later review.
The output of execution is a clean handoff package, not a claim:
- experiments/{id}/prereg.md
- experiments/{id}/judge.lock
- experiments/{id}/environment.lock
- experiments/{id}/modal-app.py if computeTarget is modal
- experiments/{id}/smokes/ artifacts
- the relevant rows in .epistemic/cost-ledger.jsonl
- the active hypothesis entry in HYPOTHESES.md
If execution exhausted the cost cap, say so plainly.
The next question is not whether this can be published.
The next question is whether the provisional result survives scrutiny.

Reading the Cost Ledger

.epistemic/cost-ledger.jsonl is JSON Lines: one JSON object per line, append-only. Do not treat it like one array and do not rewrite history to make spend look cleaner.

The CostRecord shape defined in src/state/repo.ts is:

interface CostRecord {
  timestamp: string;
  hypothesisId: string;
  toolName: string;
  estimatedCost: number;
  category: "llm" | "compute";
  isError: boolean;
}

Execution adds category: "compute" rows. Examples:

{"timestamp":"2026-05-31T18:04:11.233Z","hypothesisId":"h-rag-precision","toolName":"compute:local","estimatedCost":0,"category":"compute","isError":false}
{"timestamp":"2026-05-31T18:12:44.901Z","hypothesisId":"h-rag-precision","toolName":"compute:docker","estimatedCost":0,"category":"compute","isError":true}
{"timestamp":"2026-05-31T18:19:52.918Z","hypothesisId":"h-rag-precision","toolName":"compute:modal:a10g","estimatedCost":1.12,"category":"compute","isError":false}

Read it with concrete questions:

- How much has this hypothesis spent in total? Use getHypothesisSpend(cwd, id).
- How much of that is compute rather than LLM spend? Use getHypothesisSpendByCategory(cwd, id).
- Did every attempted run leave a compute row? Missing rows usually mean somebody assumed logging instead of verifying it.
- Are failures clustering in one carrier? Repeated isError: true rows are execution evidence, not bookkeeping noise.
- Did Modal GPU time get recorded with a real rate? If not, cost accountability is already broken.
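
To make the JSON Lines format concrete, here is a minimal sketch that parses ledger text and answers two of the questions above. It is not the repo's implementation of `getHypothesisSpendByCategory`; `parseLedger`, `spendByCategory`, and `errorCountByTool` are hypothetical names, and the `CostRecord` interface is repeated so the block is self-contained.

```typescript
interface CostRecord {
  timestamp: string;
  hypothesisId: string;
  toolName: string;
  estimatedCost: number;
  category: "llm" | "compute";
  isError: boolean;
}

// One JSON object per line; blank lines are skipped. The file is never parsed as one array.
function parseLedger(text: string): CostRecord[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as CostRecord);
}

// Sums estimatedCost per category for one hypothesis.
function spendByCategory(
  records: CostRecord[],
  hypothesisId: string,
): Record<"llm" | "compute", number> {
  const totals = { llm: 0, compute: 0 };
  for (const record of records) {
    if (record.hypothesisId === hypothesisId) {
      totals[record.category] += record.estimatedCost;
    }
  }
  return totals;
}

// Counts isError rows per toolName, to show whether failures cluster in one carrier.
function errorCountByTool(
  records: CostRecord[],
  hypothesisId: string,
): Map<string, number> {
  const counts = new Map<string, number>();
  for (const record of records) {
    if (record.hypothesisId === hypothesisId && record.isError) {
      counts.set(record.toolName, (counts.get(record.toolName) ?? 0) + 1);
    }
  }
  return counts;
}
```

Run against the three example rows above, this reports all spend under compute and a single failure attributed to the docker carrier.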

What not to do:

- Do not delete expensive rows because they are embarrassing.
- Do not log only successful runs.
- Do not leave local or docker blank because the cost is zero. Zero is still a recorded decision.
- Do not leave Modal compute blank because the estimate takes work.
- Do not replace atomic per-run entries with one rounded final total.
- Do not keep a second private spreadsheet and call the official ledger “good enough later.”

The ledger is methodology, not bookkeeping theater. Untracked cost usually means untracked execution.

Common Rationalizations

| Excuse | Reality |
| --- | --- |
| “The hypothesis says `local`, but Docker is cleaner on this machine.” | Carrier choice is part of the registered method. Convenience does not overrule it. |
| “I only changed `requirements.txt` a little.” | A little dependency drift is still dependency drift. |
| “I can rewrite `environment.lock` after I finish debugging.” | Retroactive compliance is theater. Stop and repair the protocol first. |
| “I will let the container write anywhere in the repo because it is faster.” | Wide write access destroys containment and makes the run harder to audit. |
| “Modal cost is hard to estimate, so I will leave compute blank.” | Untracked compute is hidden spend. Estimate it and record it. |
| “The first five runs already prove the point.” | Your preregistered `n` exists to stop exactly that impulse. |
| “I can switch carriers midway to reduce infra noise.” | Mid-run carrier changes are methodology changes after peeking. |
| “Failed launches do not count because no result file was produced.” | They still consumed time, budget, and feasibility. Log them. |
| “I only wrote the number into `RESULTS.md` as a placeholder.” | Headline files are claims, not scratchpads. |
| “I found a better metric after seeing the data.” | Then it is a new analysis, not this preregistered execution. |

Red Flags - STOP

Stop immediately if any of these are true:

- You are about to launch before checking experiments/{id}/prereg.md.
- You cannot say which hypothesis id is active.
- You cannot say which computeTarget the active hypothesis specifies.
- judge.lock exists but you have not compared it against computeJudgeHash(judgeRef, id).
- environment.lock exists but you have not compared it against the current environment hash.
- environment.lock does not match and you are tempted to “just proceed once.”
- The Docker container can write outside experiments/{id}/.
- modal-app.py changed after the lock was recorded and you are still planning to run it.
- You have already seen numbers that are not logged and not stored under smokes/.
- A run finished and no compute CostRecord was appended.
- You want to switch carriers because another path feels easier.
- You are about to mention a provisional number in RESULTS.md, a PR, or a commit message.

All of those mean the same thing: stop, return to the contract, and repair the method before generating more evidence.

Good vs Bad

Good: route from the active hypothesis

const entries = await loadHypotheses(cwd);
const h = getActiveHypothesis(entries);
if (!h) throw new Error("No OPEN or RUNNING hypothesis.");

switch (h.computeTarget) {
  case "local":
  case "docker":
  case "modal":
    break;
  default:
    throw new Error(`Unknown compute target: ${h.computeTarget}`);
}

Good because the carrier comes from repo state, not from vibes.

Bad: pick the carrier from convenience

const target = process.env.USE_DOCKER ? "docker" : "local";

Bad because the hypothesis already owns that decision.

Good: enforce the environment lock before launch

const lockedEnv = await getEnvironmentLock(cwd, h.id);
if (!lockedEnv) throw new Error("Missing environment.lock.");

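// computeCurrentEnvironmentHash stands in for the target-specific hash from step 4
// (Dockerfile + requirements.txt for local and docker, modal-app.py for modal).
const currentEnvHash = computeCurrentEnvironmentHash();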
if (lockedEnv !== currentEnvHash) {
  throw new Error("Environment drift detected.");
}

Good because the environment is checked before results exist.

Bad: rewrite the lock after editing the environment

await writeFile(`experiments/${h.id}/environment.lock`, currentEnvHash, "utf8");

Bad because you are laundering drift into compliance.

Good: contain Docker writes to the experiment directory

-v "$(pwd):/work:ro"
-v "$(pwd)/experiments/h-rag-precision:/work/experiments/h-rag-precision:rw"

Good because the container can write evidence without rewriting the repo.

Bad: mount the whole repo read-write

-v "$(pwd):/work:rw"

Bad because the run can silently mutate unrelated files.

Good: append compute cost immediately

await appendCostRecord(cwd, {
  timestamp: new Date().toISOString(),
  hypothesisId: h.id,
  toolName: `compute:${h.computeTarget}`,
  estimatedCost: h.computeTarget === "modal" ? gpuSeconds * rate : 0,
  category: "compute",
  isError: runFailed,
});

Good because compute burn is explicit and auditable.

Bad: keep cost in your head until later

const estimatedTotal = 4.0; // roughly what all runs cost

Bad because roughly is not a ledger.

Why This Matters

Clean execution buys you things improvisation never will.

- Carrier discipline: the run happened on the registered target, not the convenient one.
- Environment reproducibility: environment.lock means the dependencies were frozen before launch.
- Containment: Docker can write evidence without rewriting the repository.
- Cost accountability: every run leaves a compute row, including zero-cost local and docker runs and billable Modal runs.
- Interpretive separation: measurement stays in smokes/ until later review decides what, if anything, deserves a headline.

Execution is not where you prove brilliance. Execution is where you prove restraint.

After execution is complete, use /skill:statistical-rigor, then /skill:falsification-review.

Similar Skills

ui-ux-pro-max

90.2k

ui-ux-pro-max

context7-mcp

55.5k

Fetches up-to-date documentation from Context7 for libraries and frameworks like React, Next.js, Prisma. Use for setup questions, API references, and code examples.

context7-plugin

gitnexus-exploring

38.9k

Explores codebases via GitNexus: discover repos, query execution flows, trace processes, inspect symbol callers/callees, and review architecture.

1 file

gitnexus

Stats

LanguageTypeScript

Stars7

MaintenanceExcellent

Last CommitJun 4, 2026

Actions

View Source View Plugin View on GitHub View README

Help us improve

Share bugs, ideas, or general feedback.

Related skills: /skill:preregistration, /skill:baseline-reproduction, /skill:falsification-review, /skill:kill-or-ship

Experiment Execution

Overview

Execution is where method turns into evidence. Not where you redesign the method, switch carriers because today is inconvenient, or leak a promising number into a headline file.

Your job is narrow:

load the active hypothesis from repo state
obey its computeTarget
enforce prereg.md, judge.lock, and environment.lock before launch
run the committed sample size n
keep outputs provisional under experiments/{id}/smokes/
append a real compute row after every run with appendCostRecord(...)

Quick Reference

Need	File or API	Rule
Load hypothesis state	`loadHypotheses(cwd)`	Start from repo state, not memory
Select the live experiment	`getActiveHypothesis(entries)`	Do not guess the active `id`
Read execution target	`HypothesisEntry.computeTarget`	`local`, `docker`, or `modal` is part of the method
Mark execution start	`updateHypothesisStatus(cwd, id, "RUNNING")`	Bookkeeping only
Check preregistration	`fileExists("experiments/{id}/prereg.md")`	No prereg, no run
Read judge lock	`getJudgeLock(cwd, id)`	Missing or drifted judge lock blocks scored execution
Read environment lock	`getEnvironmentLock(cwd, id)`	Missing or drifted environment lock blocks all execution
Compare judge hash	`computeJudgeHash(judgeRef, id)`	Check before the first scored call
Read spend	`getHypothesisSpend(cwd, id)`	Watch the cap before more runs
Split spend by type	`getHypothesisSpendByCategory(cwd, id)`	Separate `llm` from `compute` burn
Append compute cost	`appendCostRecord(cwd, record)`	Use `category: "compute"` after every attempted run
Local or docker env contract	active `Dockerfile` + `requirements.txt`	Hash the exact pair used for the run
Modal env contract	`experiments/{id}/modal-app.py`	Hash the exact file you will execute
Docker writable path	`experiments/{id}/`	Mount this read-write; everything else read-only
Provisional artifacts	`experiments/{id}/smokes/`	Logs and raw outputs live here
Headline files	`experiments/{id}/RESULTS.md` and any root `RESULTS.md`	Do not write here yet

The Iron Law

NO RUN WITHOUT LOCKS; NO RUN WITHOUT A LEDGER ROW

Every legitimate run leaves four traces:

experiments/{id}/prereg.md
the relevant lock files
provisional artifacts under experiments/{id}/smokes/
a cost row in .epistemic/cost-ledger.jsonl

If any trace is missing, you do not have evidence. You have a story.

When to Use

Use this skill when:

experiments/{id}/prereg.md already exists
a hypothesis in HYPOTHESES.md is OPEN or RUNNING
you are about to write experiment code, launch a benchmark, or call a model-backed evaluator
the run will create provisional outputs that must stay out of headline files
the active hypothesis specifies computeTarget: local, docker, or modal
the run consumes real budget and every attempt must land in the ledger
environment.lock and, if applicable, judge.lock must be enforced before launch
you need to finish the preregistered sample size n and compute the promised summary

When NOT to Use

Do not use this skill when:

the claim is still being framed — use /skill:research-question
the method is still being locked — use /skill:preregistration
you are reproducing an external baseline — use /skill:baseline-reproduction
you are attacking the finished result — use /skill:falsification-review
you are making the terminal decision — use /skill:kill-or-ship
you want one “quick run” before preregistration or lock files are settled

A quick run that changes what you believe is not a harmless preview. It is a real run with missing governance.

The Process

1. Identify the active hypothesis and its carrier

Call loadHypotheses(cwd).
Call getActiveHypothesis(entries).
Read the execution-critical fields from the live entry: id, claim, falsifier, n, judgeRef, baselineRef, costCap, computeTarget, and status.
Treat computeTarget as part of the preregistered method.
Valid values are local, docker, and modal.
Do not infer the target from convenience, machine state, installed tools, or personal preference.
Derive the working paths immediately:
- experiments/{id}/prereg.md
- experiments/{id}/judge.lock
- experiments/{id}/environment.lock
- experiments/{id}/modal-app.py
- experiments/{id}/smokes/
- experiments/{id}/RESULTS.md
- .epistemic/cost-ledger.jsonl
If execution is legitimately starting now, move OPEN to RUNNING with updateHypothesisStatus(cwd, id, "RUNNING").
Do not mark anything CONFIRMED here.
Execution measures.
Review interprets.

2. Refuse to run without `prereg.md`

Confirm experiments/{id}/prereg.md exists with fileExists(path).
If it does not exist, stop immediately.
Do not run bun, python, pytest, benchmark, train, docker, or modal commands and promise yourself you will document the method later.
Read the preregistration and extract the operational contract:
- planned sample size n
- task or dataset slice
- metric definition
- judge configuration if scoring is involved
- stopping rule
- retry or exclusion policy
- execution carrier if the prereg makes it explicit
Compare that contract against the active HypothesisEntry.
If n, judgeRef, baselineRef, or computeTarget disagree between preregistration and HYPOTHESES.md, repair the inconsistency before running.
Treat preregistration as executable governance.
It is not a diary entry.

3. Enforce the judge lock before any scored evaluation

Read judgeRef from the active hypothesis.
Load the stored lock with getJudgeLock(cwd, id).
Compute the expected value with computeJudgeHash(judgeRef, id).
Compare the two before the first scored call, not after the number is already on your screen.
If the lock is missing and this is the first legitimate locked run, create it deliberately with writeJudgeLock(cwd, id, judgeRef).
If the lock exists and does not match, stop.
Do not rationalize that the prompt change was cosmetic or the model version shift was basically the same.
Judge drift is method drift.

4. Enforce the environment lock before any compute launches

Load the stored lock with getEnvironmentLock(cwd, id).
If it returns null, stop.
Execution reads environment.lock.
It does not mint or rewrite it mid-run.
For local and docker, the environment contract is the exact Dockerfile and requirements.txt registered for this run.
If the repo could plausibly contain more than one pair, preregistration must name the pair.
Execution does not guess.
Hash the current contents of that Dockerfile plus requirements.txt in a deterministic order and compare the result to environment.lock.
If either file is missing, stop.
Missing dependency files are a contract failure, not a setup detail.
For modal, the environment contract is experiments/{id}/modal-app.py.
Write that file from the locked protocol if the modal target requires it, then treat the exact file you will execute as frozen.
Hash the current contents of modal-app.py and compare the result to environment.lock.
If the stored lock and current hash differ, stop.
That is environment drift.
Do not “just refresh” environment.lock after editing Dockerfile, requirements.txt, or modal-app.py.
The lock exists to prevent exactly that move.
Local and docker share an environment contract.
The carrier may differ.
The locked dependencies do not.

5. Arm the cost ledger before the first run

The canonical ledger is .epistemic/cost-ledger.jsonl.
Read the current total with getHypothesisSpend(cwd, id).
If you need the split, read getHypothesisSpendByCategory(cwd, id).
If you need portfolio context, read getAllHypothesisSpends(cwd).
Treat compute cost as first-class spend, not invisible overhead.
Plan to append one CostRecord with category: "compute" after every attempted run.
Failed launches still count as attempts and still get a ledger row.
Do not batch cost entries at the end of the day.
If the remaining budget cannot support the committed run set, stop and hand off to the override or kill flow.

6. Route execution from `computeTarget`, not from convenience

Branch strictly on h.computeTarget.
local means run in a virtual environment under the locked dependency contract.
Use the venv as the carrier, not as permission to install extras or patch dependencies ad hoc.
Capture the merged stdout/stderr and persist it to experiments/{id}/smokes/run-{n}.log.
docker means build from the registered Dockerfile, then execute inside a container.
Run from /work inside the container.
Mount the repo read-only and mount experiments/{id}/ read-write.
The safe shape is:

docker run --rm \
  -v "$(pwd):/work:ro" \
  -v "$(pwd)/experiments/{id}:/work/experiments/{id}:rw" \
  -w /work \
  <image> <command>

That mount pattern is containment, not decoration.
The container must not rewrite unrelated files.
Capture build and run output.
If you use one log per run, it must be complete enough to audit the launch.
modal means write experiments/{id}/modal-app.py as a Modal app: create a modal.App and register each remote function with the @app.function() decorator.
Install dependencies inside the Modal image definition, not ad hoc on the host.
Launch with modal run experiments/{id}/modal-app.py.
Capture the merged stdout/stderr from that command and persist it to experiments/{id}/smokes/run-{n}.log.
Never switch carriers mid-experiment because another path feels easier today.
Carrier drift is method drift.
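
For the docker carrier, the mount pattern above can be built programmatically so the read-only and read-write paths are never typed by hand. This is a sketch under assumptions: buildDockerRunArgs is a hypothetical helper, and the id pattern it enforces is an assumed convention, chosen only so that an id cannot widen the read-write mount.

```typescript
// Builds the argument list for `docker run` with the containment mounts:
// whole repo read-only, experiment directory read-write.
export function buildDockerRunArgs(
  repoRoot: string,
  id: string,
  image: string,
  command: string[],
): string[] {
  // Assumed id convention: no path separators and no "..", so the
  // read-write mount stays inside experiments/.
  if (!/^[A-Za-z0-9][A-Za-z0-9._-]*$/.test(id) || id.includes("..")) {
    throw new Error(`Unsafe hypothesis id: ${id}`);
  }
  const experimentDir = `${repoRoot}/experiments/${id}`;
  return [
    "run",
    "--rm",
    "-v", `${repoRoot}:/work:ro`,
    "-v", `${experimentDir}:/work/experiments/${id}:rw`,
    "-w", "/work",
    image,
    ...command,
  ];
}
```

The returned array can be passed as the argument list when spawning docker as a child process, which avoids shell quoting problems with paths that contain spaces.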

7. Append compute cost immediately after each run

After every attempted run, call appendCostRecord(cwd, record).
Use the real CostRecord shape from src/state/repo.ts.
Set category: "compute".
For local and docker, record estimatedCost: 0.
In this repo, those carriers are tracked as compute work with zero billed external cost.
For modal, record estimatedCost = gpuSeconds × rate.
Fix rate before the run.
Do not reverse-engineer it after seeing the result.
Set toolName to the actual backend.
Set isError to reflect whether the run failed.
Append the row immediately after the run ends.
Do not rely on memory.
Do not leave Modal compute blank because estimating it is annoying.

8. Finish the committed sample size

Do not summarize after 3 of 30 runs because the trend looks obvious.
Do not stop early because the result is already good enough for a slide.
Do not stop early because the result is ugly and you would rather redesign the method.
Complete the full n unless the preregistered stopping rule explicitly says otherwise.
Keep the important invariants fixed across runs:
- same dataset slice
- same prompt and extraction rule
- same judge
- same model settings
- same metric definition
- same computeTarget
Do not peek at early outputs and then revise the method mid-run.
That means no prompt edits, no seed swaps, no ad hoc retries, no threshold tweaks, no carrier changes, and no selective cleanup after seeing weak runs.
If infrastructure actually fails, follow the retry rule that was already written in preregistration.
If preregistration did not define the failure policy, stop and repair the protocol before continuing.
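
One way to make the invariants checkable is to record a small config fingerprint with every run and compare each run against the first. The RunConfig shape and its field names below are hypothetical stand-ins for whatever the preregistration actually fixes.

```typescript
// Hypothetical per-run record of the settings that must not change.
interface RunConfig {
  datasetSlice: string;
  promptAndExtraction: string;
  judge: string;
  modelSettings: string;
  metric: string;
  computeTarget: string;
}

// Returns the names of fields that differ from the first run.
// An empty array means the invariants held across all runs.
export function findInvariantDrift(runs: RunConfig[]): string[] {
  const [first, ...rest] = runs;
  if (!first) return [];
  const drifted = new Set<string>();
  for (const run of rest) {
    for (const key of Object.keys(first) as (keyof RunConfig)[]) {
      if (run[key] !== first[key]) drifted.add(key);
    }
  }
  return [...drifted];
}
```

Any non-empty result means the method changed mid-run and the collection should stop for protocol repair.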

9. Store provisional artifacts in `smokes/` and nowhere else

Use experiments/{id}/smokes/ as the provisional artifact directory.
Put raw per-run outputs there with deterministic names such as:
- run-001.log
- run-001.json
- run-002.log
- aggregate.md
- notes.md
Keep experiment-local evidence under the experiment directory even if the repo scaffold also contains a top-level smokes/.
Treat everything in experiments/{id}/smokes/ as provisional and non-quotable.
Non-quotable means the number does not belong in:
- experiments/{id}/RESULTS.md
- a root RESULTS.md
- a README
- a PR description
- a commit message
- a polished claim sentence
Working notes belong in smokes/ too if they mention provisional numbers.
Do not write to experiments/{id}/falsifiers/{model}.md here.
The falsifiers/ directory belongs to later adversarial review.
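
The deterministic names above can be generated rather than typed. This is a small sketch: runArtifactName and smokesPath are hypothetical helpers, and the three-digit zero padding is inferred from the run-001 examples rather than stated as a rule.

```typescript
// Produces names like run-001.log. Padding width of 3 is an assumption
// taken from the examples; larger run numbers are not truncated.
export function runArtifactName(n: number, ext: "log" | "json"): string {
  if (!Number.isInteger(n) || n < 1) {
    throw new Error(`Run number must be a positive integer, got ${n}`);
  }
  return `run-${String(n).padStart(3, "0")}.${ext}`;
}

// Keeps every provisional artifact under the experiment-local smokes/.
export function smokesPath(id: string, fileName: string): string {
  return `experiments/${id}/smokes/${fileName}`;
}
```

For example, `smokesPath("h-rag-precision", runArtifactName(2, "json"))` yields `experiments/h-rag-precision/smokes/run-002.json`.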

10. Compute the prescribed summary only after collection ends

Once the full sample is complete, compute the summary statistics named in preregistration.
Mean, median, pass rate, win rate, confidence interval, and error rate are valid only if they were planned.
Do not invent a new flattering metric because the original one underperformed.
Keep raw observations and summary outputs side by side inside experiments/{id}/smokes/.
Raw data without a summary is hard to audit.
Summary without raw data is easy to manipulate.
If the completed summary is surprising, note the surprise.
Do not retroactively rewrite the method.

11. Stop at the publication boundary

When execution is complete, stop.
Do not write to experiments/{id}/RESULTS.md yet.
If the repo also maintains a root RESULTS.md, do not write there either.
Do not upgrade the hypothesis to CONFIRMED just because the mean looks good.
Do not write a headline sentence like “we beat baseline X by Y%” before later review.
The output of execution is a clean handoff package, not a claim:
- experiments/{id}/prereg.md
- experiments/{id}/judge.lock
- experiments/{id}/environment.lock
- experiments/{id}/modal-app.py if computeTarget is modal
- experiments/{id}/smokes/ artifacts
- the relevant rows in .epistemic/cost-ledger.jsonl
- the active hypothesis entry in HYPOTHESES.md
If execution exhausted the cost cap, say so plainly.
The next question is not whether this can be published.
The next question is whether the provisional result survives scrutiny.
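
The handoff package can be checked mechanically before declaring execution complete. This sketch only lists the expected paths and reports which are absent; the function names are hypothetical, and how the set of present paths is gathered is left to the caller.

```typescript
type ComputeTarget = "local" | "docker" | "modal";

// Lists the handoff paths named above for one hypothesis.
export function expectedHandoffPaths(
  id: string,
  target: ComputeTarget,
): string[] {
  const base = `experiments/${id}`;
  const paths = [
    `${base}/prereg.md`,
    `${base}/judge.lock`,
    `${base}/environment.lock`,
  ];
  // modal-app.py is part of the package only for the modal carrier.
  if (target === "modal") paths.push(`${base}/modal-app.py`);
  paths.push(`${base}/smokes/`, ".epistemic/cost-ledger.jsonl", "HYPOTHESES.md");
  return paths;
}

// Returns the expected paths that are not in `present`.
export function missingHandoffPaths(
  expected: string[],
  present: ReadonlySet<string>,
): string[] {
  return expected.filter((path) => !present.has(path));
}
```

Note that neither RESULTS.md file appears in the list: the package ends at the publication boundary.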

Reading the Cost Ledger

.epistemic/cost-ledger.jsonl is JSON Lines: one JSON object per line, append-only. Do not treat it like one array and do not rewrite history to make spend look cleaner.

The CostRecord shape defined in src/state/repo.ts is:

interface CostRecord {
  timestamp: string;
  hypothesisId: string;
  toolName: string;
  estimatedCost: number;
  category: "llm" | "compute";
  isError: boolean;
}

Execution adds category: "compute" rows. Examples:

{"timestamp":"2026-05-31T18:04:11.233Z","hypothesisId":"h-rag-precision","toolName":"compute:local","estimatedCost":0,"category":"compute","isError":false}
{"timestamp":"2026-05-31T18:12:44.901Z","hypothesisId":"h-rag-precision","toolName":"compute:docker","estimatedCost":0,"category":"compute","isError":true}
{"timestamp":"2026-05-31T18:19:52.918Z","hypothesisId":"h-rag-precision","toolName":"compute:modal:a10g","estimatedCost":1.12,"category":"compute","isError":false}

Read it with concrete questions:

- How much has this hypothesis spent in total? Use getHypothesisSpend(cwd, id).
- How much of that is compute rather than LLM spend? Use getHypothesisSpendByCategory(cwd, id).
- Did every attempted run leave a compute row? Missing rows usually mean somebody assumed logging instead of verifying it.
- Are failures clustering in one carrier? Repeated isError: true rows are execution evidence, not bookkeeping noise.
- Did Modal GPU time get recorded with a real rate? If not, cost accountability is already broken.

What not to do:

- Do not delete expensive rows because they are embarrassing.
- Do not log only successful runs.
- Do not leave local or docker blank because the cost is zero. Zero is still a recorded decision.
- Do not leave Modal compute blank because the estimate takes work.
- Do not replace atomic per-run entries with one rounded final total.
- Do not keep a second private spreadsheet and call the official ledger “good enough later.”

The ledger is methodology, not bookkeeping theater. Untracked cost usually means untracked execution.
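
For an ad hoc audit, the ledger can also be read directly, one JSON object per line. The sketch below is a standalone illustration with hypothetical function names; the repo's own readers such as getHypothesisSpend remain the canonical path, and this code does not claim to match their behavior.

```typescript
// Same fields as the CostRecord shape shown above.
interface CostRecord {
  timestamp: string;
  hypothesisId: string;
  toolName: string;
  estimatedCost: number;
  category: "llm" | "compute";
  isError: boolean;
}

// Parses JSON Lines text: one record per non-empty line.
export function parseLedger(jsonl: string): CostRecord[] {
  return jsonl
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as CostRecord);
}

// Totals spend by category and counts failed runs per tool for one
// hypothesis, which answers the spend and failure-clustering questions.
export function summarizeLedger(records: CostRecord[], hypothesisId: string) {
  const byCategory = { llm: 0, compute: 0 };
  const errorsByTool: Record<string, number> = {};
  for (const record of records) {
    if (record.hypothesisId !== hypothesisId) continue;
    byCategory[record.category] += record.estimatedCost;
    if (record.isError) {
      errorsByTool[record.toolName] = (errorsByTool[record.toolName] ?? 0) + 1;
    }
  }
  const total = byCategory.llm + byCategory.compute;
  return { total, byCategory, errorsByTool };
}
```

Reading is the only operation here on purpose: the ledger is append-only, so an audit script should never write back to it.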

Common Rationalizations

| Excuse | Reality |
| --- | --- |
| “The hypothesis says `local`, but Docker is cleaner on this machine.” | Carrier choice is part of the registered method. Convenience does not overrule it. |
| “I only changed `requirements.txt` a little.” | A little dependency drift is still dependency drift. |
| “I can rewrite `environment.lock` after I finish debugging.” | Retroactive compliance is theater. Stop and repair the protocol first. |
| “I will let the container write anywhere in the repo because it is faster.” | Wide write access destroys containment and makes the run harder to audit. |
| “Modal cost is hard to estimate, so I will leave compute blank.” | Untracked compute is hidden spend. Estimate it and record it. |
| “The first five runs already prove the point.” | Your preregistered `n` exists to stop exactly that impulse. |
| “I can switch carriers midway to reduce infra noise.” | Mid-run carrier changes are methodology changes after peeking. |
| “Failed launches do not count because no result file was produced.” | They still consumed time, budget, and feasibility. Log them. |
| “I only wrote the number into `RESULTS.md` as a placeholder.” | Headline files are claims, not scratchpads. |
| “I found a better metric after seeing the data.” | Then it is a new analysis, not this preregistered execution. |

Red Flags - STOP

Stop immediately if any of these are true:

- You are about to launch before checking experiments/{id}/prereg.md.
- You cannot say which hypothesis id is active.
- You cannot say which computeTarget the active hypothesis specifies.
- judge.lock exists but you have not compared it against computeJudgeHash(judgeRef, id).
- environment.lock exists but you have not compared it against the current environment hash.
- environment.lock does not match and you are tempted to “just proceed once.”
- The Docker container can write outside experiments/{id}/.
- modal-app.py changed after the lock was recorded and you are still planning to run it.
- You have already seen numbers that are not logged and not stored under smokes/.
- A run finished and no compute CostRecord was appended.
- You want to switch carriers because another path feels easier.
- You are about to mention a provisional number in RESULTS.md, a PR, or a commit message.

All of those mean the same thing: stop, return to the contract, and repair the method before generating more evidence.

Good vs Bad

Good: route from the active hypothesis

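// Read the carrier from repo state; never infer it from the machine.
const entries = await loadHypotheses(cwd);
const h = getActiveHypothesis(entries);
if (!h) throw new Error("No OPEN or RUNNING hypothesis.");

// Accept only the registered carriers.
switch (h.computeTarget) {
  case "local":
  case "docker":
  case "modal":
    break;
  default:
    // An unrecognized target is a contract failure, not a cue to pick a fallback.
    throw new Error(`Unknown compute target: ${h.computeTarget}`);
}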

Good because the carrier comes from repo state, not from vibes.

Bad: pick the carrier from convenience

const target = process.env.USE_DOCKER ? "docker" : "local";

Bad because the hypothesis already owns that decision.

Good: enforce the environment lock before launch

const lockedEnv = await getEnvironmentLock(cwd, h.id);
if (!lockedEnv) throw new Error("Missing environment.lock.");

const currentEnvHash = computeCurrentEnvironmentHash();
if (lockedEnv !== currentEnvHash) {
  throw new Error("Environment drift detected.");
}

Good because the environment is checked before results exist.

Bad: rewrite the lock after editing the environment

await writeFile(`experiments/${h.id}/environment.lock`, currentEnvHash, "utf8");

Bad because you are laundering drift into compliance.

Good: contain Docker writes to the experiment directory

-v "$(pwd):/work:ro"
-v "$(pwd)/experiments/h-rag-precision:/work/experiments/h-rag-precision:rw"

Good because the container can write evidence without rewriting the repo.

Bad: mount the whole repo read-write

-v "$(pwd):/work:rw"

Bad because the run can silently mutate unrelated files.

Good: append compute cost immediately

await appendCostRecord(cwd, {
  timestamp: new Date().toISOString(),
  hypothesisId: h.id,
  toolName: `compute:${h.computeTarget}`,
  estimatedCost: h.computeTarget === "modal" ? gpuSeconds * rate : 0,
  category: "compute",
  isError: runFailed,
});

Good because compute burn is explicit and auditable.

Bad: keep cost in your head until later

const estimatedTotal = 4.0; // roughly what all runs cost

Bad because roughly is not a ledger.

Why This Matters

Clean execution buys you things improvisation never will.

- Carrier discipline: the run happened on the registered target, not the convenient one.
- Environment reproducibility: environment.lock means the dependencies were frozen before launch.
- Containment: Docker can write evidence without rewriting the repository.
- Cost accountability: every run leaves a compute row, including zero-cost local and docker runs and billable Modal runs.
- Interpretive separation: measurement stays in smokes/ until later review decides what, if anything, deserves a headline.

Execution is not where you prove brilliance. Execution is where you prove restraint.

After execution is complete, use /skill:statistical-rigor, then /skill:falsification-review.
