Skill

run

Runs a sustained metric-improvement loop: reads program.md, iterates specialist ideation agents, commits atomically, auto-rolls back on regression. For long-running automated improvement campaigns.

Git

automation

developer-tools

Popularity

Parent stars

Parent forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/research:run <program.md> [clarification] [--resume <program.md>] [--team] [--compute=local|colab|docker] [--colab[=H100|L4|T4|A100]] [--codex] [--researcher] [--architect] [--journal] [--hypothesis <path>] [--keep "<items>"]

User invocable

Model invocation disabled

Inline context

Effort: high

Argument hint

<program.md> [clarification] [--resume <program.md>] [--team] [--compute=local|colab|docker] [--colab[=H100|L4|T4|A100]] [--codex] [--researcher] [--architect] [--journal] [--hypothesis <path>] [--keep "<items>"]

Tool Access

This skill is limited to the following tools:

ReadWriteEditBashGrepGlobAgentTaskCreateTaskUpdateAskUserQuestion

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Supporting Files

modes/codex-copilot.mdmodes/colab-setup.mdmodes/compute-docker.mdmodes/hypothesis-pipeline.mdmodes/phase5-metric.mdmodes/report.mdmodes/resume.mdmodes/team.mdprotocol.md

SKILL.md

695 lines · ~11k tokens(exceeds 5k compaction limit)

Stats

LanguagePython

Parent stars23

Parent forks3

MaintenanceExcellent

Last CommitJul 15, 2026

Actions

View Source View Plugin View on GitHub View README

`agent_strategy`	Specialist agent	When to use
`auto`	heuristic	Default — infer from metric_cmd keywords
`perf`	`foundry:perf-optimizer`	latency, throughput, memory, GPU utilization
`code`	`foundry:sw-engineer`	coverage, complexity, lines, coupling
`ml`	`research:scientist`	accuracy, loss, F1, AUC, BLEU
`arch`	`foundry:solution-architect`	coupling, cohesion, modularity metrics

Current strategy	Next strategy	Escalation agent
`code`	`ml`	`research:scientist`
`ml`	`perf`	`foundry:perf-optimizer`
`perf`	`code`	`foundry:sw-engineer`
`arch`	`code`	`foundry:sw-engineer` (fallback `foundry:solution-architect` if sw-engineer unavailable)
`auto`	infer from resolved strategy	follow rotation row for whichever concrete strategy `auto` heuristics resolved to at Step R3 (e.g. `auto` → resolved `ml` → next `perf` → `foundry:perf-optimizer`)

Agent Resolution

# loads: compaction-contract.md
_RESEARCH_SHARED=$(python "${CLAUDE_PLUGIN_ROOT:-plugins/research}/bin/resolve_shared.py" 2>/dev/null)  # timeout: 5000
[ -z "$_RESEARCH_SHARED" ] && { echo "! Plugin path resolution failed — ensure research plugin installed and CLAUDE_PLUGIN_ROOT set, or invoke from project root."; exit 1; }

CLAUDE_SKILL_DIR resolution — constants block provides default plugins/research/skills/run (source-tree path). Resolve to installed path before use:

CLAUDE_SKILL_DIR=$(ls -td ~/.claude/plugins/cache/borda-ai-rig/research/*/skills/run 2>/dev/null | head -1)
[ -z "$CLAUDE_SKILL_DIR" ] && CLAUDE_SKILL_DIR="$(git rev-parse --show-toplevel 2>/dev/null)/plugins/research/skills/run"

Read $_RESEARCH_SHARED/agent-resolution.md. Contains: foundry check + fallback table. If foundry not installed: use table to substitute each foundry:X with general-purpose. Agents this skill uses: foundry:sw-engineer, foundry:linting-expert, foundry:perf-optimizer, foundry:solution-architect, research:scientist.

Default Mode (Steps R1–R7)

Triggered by run <goal|file.md>.

Task tracking: create tasks R0–R7 at start. If no --researcher/--architect, mark R0 skipped. If --codex active, create task R5b: Codex co-pilot (iter ?/max) status pending.

Step R0: Hypothesis pre-phase (`--researcher` / `--architect`)

If no --researcher/--architect, skip to R1.

Flag combination guard: if --researcher set but --architect NOT set, print ⚠ --researcher without --architect: hypotheses will NOT be validated for architectural feasibility before execution — infeasible hypotheses may waste iterations. Add --architect for feasibility filtering. then continue (advisory, not blocking). If only --architect set without --researcher, feasibility filter applies to oracle-generated hypotheses — valid combination.

Read ${CLAUDE_SKILL_DIR}/modes/hypothesis-pipeline.md

Per-iteration hypothesis selection (when --researcher/--architect set, inside R5 loop): pop next from RESEARCH_QUEUE. Append to Phase 2 prompt: "Focus this iteration on testing this hypothesis: <hypothesis text>."

Per-iteration journal hook (inside R5, after Phase 7): if --journal active, append entry to <RUN_DIR>/journal.md after EVERY iteration — regardless of outcome. Entry format: protocol.md (companion file, same skill dir). # loads: protocol.md Journals record kept and reverted iterations so ideation agent learns failed approaches.

Per-iteration checkpoint write (after Phase 7): if --researcher/--architect active, append one line to <RUN_DIR>/checkpoint.json per schema in protocol.md (companion file, same skill dir): {iteration, hypothesis_id, metric_before, metric_after, status: "passed"|"rolled_back"}.

Step R1: Load / build config

--resume flag detection: if --resume in args, extract optional program.md path. Jump to ## Resume Mode. Rest of R1 and R2–R7 skipped.

--hypothesis <path> parsing: if --hypothesis in args, extract path token following it. Verify file exists: [ -f "$HYPOTHESIS_PATH" ]. If not found: print ! --hypothesis <path>: file not found and stop. If found: set hypothesis_override = true. In R5 Phase 2 (Propose change), replace oracle-generated hypothesis with loaded file content — prepend to ideation agent prompt: "Use this pre-specified hypothesis as your starting hypothesis for iteration N: . Validate, refine, and implement it. Do not generate a new hypothesis from scratch."

Auto-detect: first non-flag arg ends in .md → parse as program file. Otherwise → text goal.

Clarification prompt (.md file only): after extracting .md path, inspect next token (before -- flags):

Absent or starts with -- → clarification_prompt = null
Quoted string (starts/ends with ") → extract as clarification_prompt, strip quotes
Bare unquoted token (no --, no ") → accept as clarification_prompt; print: ℹ clarification set to "<token>" (tip: quote multi-word hints — e.g. "/research:run program.md \"focus on sort\" --codex")

After clarification extraction, remaining non-flag tokens (not starting --) are unrecognized. For each, print:

⚠ Unrecognized argument "<token>" — ignored.
  Known positional args: <program.md path> [clarification]
  Known flags: --team, --colab[=HW], --codex, --compute=local|colab|docker, --researcher, --architect, --journal, --hypothesis <path>, --codemap, --no-codemap
  If you meant to override the algo, edit the ## Config block in your program.md (algo: sort) and update ## Metric to match.
  If you meant to set a clarification hint, pass it as a quoted string: "/research:run program.md \"sort improvements\" --codex"

# Extract --keep quoted value (compaction-contract.md §keep semantics)
KEEP_ITEMS=""
if [[ "$ARGUMENTS" =~ --keep[[:space:]]\"([^\"]+)\" ]]; then
    KEEP_ITEMS="${BASH_REMATCH[1]}"
fi
# Clear stale contract from any prior incomplete run (compaction-contract.md §Lifecycle)
rm -f .claude/state/skill-contract.md  # timeout: 5000
echo "${KEEP_ITEMS:-}" > "${TMPDIR:-/tmp}/research-run-keep-items"  # persist for Phase 8 contract write

Unsupported flag check: follow $_RESEARCH_SHARED/unsupported-flag-protocol.md. Supported flags for this skill: --resume, --team, --compute, --colab, --codex, --researcher, --architect, --journal, --hypothesis, --scientist, --codemap, --no-codemap, --keep.

Codemap auto-detection — structural blast-radius context for modules the experiment edits; on by default when codemap installed + index found. --no-codemap opts out; --codemap is strict (fail if unavailable).

# timeout: 5000
CODEMAP_RAW=auto
[[ " $ARGUMENTS " == *" --no-codemap "* ]] && CODEMAP_RAW=off
[[ " $ARGUMENTS " == *" --codemap "* ]] && [[ " $ARGUMENTS " != *" --no-codemap "* ]] && CODEMAP_RAW=strict
CODEMAP_ENABLED=$("${CLAUDE_PLUGIN_ROOT:-plugins/research}/bin/codemap-resolve" "$CODEMAP_RAW")
if [ $? -ne 0 ]; then
    [ "$CODEMAP_RAW" = "strict" ] && { echo "! BLOCKED — --codemap (strict) but codemap unavailable; run /codemap:scan-codebase or install codemap plugin"; exit 1; }
    CODEMAP_ENABLED=false
fi
echo "$CODEMAP_ENABLED" > ${TMPDIR:-/tmp}/research-run-codemap-enabled

loads: codemap-gates.md

When CODEMAP_RAW ≠ off: read $_RESEARCH_SHARED/codemap-gates.md — follow Gate A and Gate B.

If argument is a .md file — read and parse with these rules:

Find each ## <Section> heading (case-insensitive).
Extract first fenced code block following that heading.
Parse block as key: value lines; multi-value = indented - value items. Paths with spaces: wrap in double quotes.
Missing required fields (command under ## Metric/## Guard) → stop with error.
agent_strategy: auto (or omitted) → apply keyword heuristics from <constants> to ## Goal text and metric command.
target under ## Metric: direction: higher → stop when metric ≥ target; direction: lower → stop when metric ≤ target. If target omitted, run until max_iterations.
Unrecognized keys/headings → warn once, ignore.
## Notes and # Program: title never parsed — human-only. (# Campaign: accepted as alias.)

If argument is text — auto-detect metric_cmd/guard_cmd from goal string and codebase scan (same as P-P1, non-interactive). config.json not read.

--colab[=HW] parsing: --colab (no =) → compute = "colab", colab_hw = null. --colab=<value> → compute = "colab", colab_hw = <value> (uppercased). Unknown <value> (not in {H100, L4, T4, A100}) → print "⚠ Unknown Colab hardware '<value>' — proceeding with default GPU. Known: H100, L4, T4, A100", set colab_hw = null. --compute=colab (no HW) → compute = "colab", colab_hw = null.

colab_hw in ## Config sets hardware preference (H100, L4, T4, A100); CLI --colab=HW overrides.

Generate run-id = $(date -u +%Y-%m-%dT%H-%M-%SZ). Assign immediately:

RUN_ID=$(date -u +%Y-%m-%dT%H-%M-%SZ)
RUN_DIR=".experiments/${RUN_ID}"  # hypothesis pipeline + journal outputs (per <constants> note)
STATE_DIR=".experiments/state/${RUN_ID}"  # per-iteration artifacts (state.json, experiments.jsonl, diary.md)
mkdir -p "$RUN_DIR" "$STATE_DIR"  # timeout: 5000 — both dirs created before any Write to either
echo "$RUN_ID" > "${TMPDIR:-/tmp}/research-run-id"  # persist for Phase 8 contract write (Check 41: fresh shell)

Note: STATE_DIR (.experiments/state/${RUN_ID}/) is per-iteration artifact dir — distinct from RUN_DIR. Both coexist; see <constants> block.

Create run directory:

.experiments/state/<run-id>/
  state.json         ← iteration count, best metric, status
  experiments.jsonl  ← one line per iteration
  diary.md           ← human-readable research diary (hypothesis → outcome → decision)

Convert program_file to absolute path: realpath "$PROGRAM_FILE" — Resume Mode matches on absolute path.

Write initial state.json (program_file = absolute path to .md or null for text goal):

{
  "run_id": "<run-id>",
  "goal": "<goal>",
  "config": {},
  "program_file": "<absolute path to program.md, or null>",
  "iteration": 0,
  "best_metric": null,
  "best_commit": null,
  "status": "initializing",
  "started_at": "<ISO timestamp>",
  "clarification_prompt": null,
  "colab_hw": null,
  "sandbox_mode": "local"
}

Note: status is "initializing" until all R2 precondition checks pass — resume treats "initializing" as failed-init, not active run. Update to "running" at end of R2 (after all checks pass).

Step R2: Precondition checks

Run all checks before touching code. Fail fast with clear message:

Clean git: git status --porcelain → must be empty. If dirty: print dirty files and stop.
Not detached HEAD: git rev-parse --abbrev-ref HEAD → must not be HEAD.
Metric command numeric: run metric_cmd once; parse stdout for float. If no float: show output and stop.
Guard passes: run guard_cmd once; must exit 0. If fails: show output and stop.
--colab check: verify mcp__colab-mcp__runtime_execute_code available. If not, print setup instructions (see Colab MCP section) and stop. If --colab=HW (colab_hw non-null): print: Hardware requested: --colab=<colab_hw>. Ensure your Colab notebook running with <colab_hw> GPU.
--codex check: verify claude plugin list 2>/dev/null | grep -q 'codex@openai-codex'. If unavailable: print ⚠ codex plugin not found. Install it with: /plugin marketplace add openai/codex-plugin-cc and stop.
compute: docker check: run docker ps via Bash (timeout: 5000). If non-zero: print ⚠ Docker daemon not running. Start Docker Desktop and retry. and stop.
Flag conflict: if --colab and --compute=docker both active: print ⚠ --colab and --compute=docker are mutually exclusive. Use one or the other. and stop.
--colab + --codex compatibility note (non-blocking): if both flags active, print ℹ --colab + --codex active: Codex Phase 2c will receive colab_hw context so generated code can target the right GPU (H100/T4 bf16 vs fp16). Phase 5 metric verification runs through Colab MCP as usual. and continue. Pass colab_hw to Codex spawn prompt (Phase 2c — see modes/codex-copilot.md).
--journal prerequisite: verify --researcher/--architect also set. If neither: print ⚠ --journal requires --researcher or --architect — omit --journal or add a hypothesis pipeline flag. and stop.

--codex-delegation warning (non-blocking): check whether .claude/skills/_shared/codex-delegation.md exists (deployed by /foundry:setup (requires foundry plugin) from foundry plugin to .claude/skills/_shared/). If not found:

# codex-delegation.md is deployed by /foundry:setup to .claude/skills/_shared/ (requires foundry plugin — if absent, R7 Codex delegation is skipped automatically)
[ -f ".claude/skills/_shared/codex-delegation.md" ] || echo "⚠ .claude/skills/_shared/codex-delegation.md not found. R7 Codex delegation will be skipped. Run /foundry:setup (requires foundry plugin) to install it."

Set CODEX_DELEGATION_AVAILABLE=true if found, false otherwise. Continue regardless.

Initialize sandbox + timeout variables (after all checks pass — constants YAML block not auto-exported to bash; assign explicitly with ${VAR:-default} to honour environment overrides; ADV-L15 / ADV-M20):

SANDBOX_NETWORK="${SANDBOX_NETWORK:-none}"  # override via program.md Config or environment variable
# Verify timeout — 120s local, 300s Colab per <constants>; bash overrides via VERIFY_TIMEOUT_SEC env var
if [ "${compute:-local}" = "colab" ]; then
    VERIFY_TIMEOUT_SEC="${VERIFY_TIMEOUT_SEC:-300}"
else
    VERIFY_TIMEOUT_SEC="${VERIFY_TIMEOUT_SEC:-120}"
fi
VERIFY_TIMEOUT_MS=$((VERIFY_TIMEOUT_SEC * 1000))
# Ideation Agent() calls are synchronous — no mid-flight poll; after each returns, check its output file and mark timed_out (⏱) if empty.

Initialize sandbox_mode:

compute: docker (daemon check passed in #7) → sandbox_mode = "docker". Print: sandbox: Docker daemon reachable — sandbox mode active
All other cases (compute: local, compute: colab) → sandbox_mode = "local"

Update state.json status to "running" — write only after ALL checks above pass. Resume treats "initializing" as failed-init and skips such runs.

Step R3: Select ideation agent

Apply agent_strategy mapping from <constants>. If auto, apply keyword heuristics to metric_cmd. Log selected agent to state.json.

Step R4: Establish baseline (iteration 0)

Run metric_cmd and guard_cmd. Parse metric value. Append to experiments.jsonl:

{
  "iteration": 0,
  "commit": "<HEAD sha>",
  "metric": 0.0,
  "delta": 0.0,
  "guard": "pass",
  "status": "baseline",
  "description": "baseline",
  "agent": null,
  "confidence": null,
  "timestamp": "<ISO>",
  "files": []
}

Update state.json: best_metric = <baseline>, best_commit = <HEAD sha>.

Print: Baseline: <metric_cmd key> = <value>.

Write initial diary header to .experiments/state/<run-id>/diary.md:

# Research Diary — <goal>

**Run**: <run-id>
**Started**: <ISO timestamp>
**Baseline**: <metric_key> = <baseline value>

---

Then proceed to R5.

Step R5: Iteration loop

# REPO_SLUG / BRANCH_SLUG: source the single authorized slug form (see <constants>)
eval "$(bash "${CLAUDE_PLUGIN_ROOT:-plugins/research}/bin/git_slugs.sh")"  # timeout: 3000
COMMIT_SENTINEL="${TMPDIR:-/tmp}/claude-commit-auth-${REPO_SLUG}-${BRANCH_SLUG}"
touch "$COMMIT_SENTINEL"  # timeout: 3000
# trap not effective across Bash calls — commit protection handled by commit-guard.js hook (foundry-owned; see caveat below)

Dependency — commit-guard.js (requires foundry plugin): the commit-sentinel dance above (touch at R5, re-touch each phase, rm at cleanup) is enforced by foundry's commit-guard.js PreToolUse hook. That hook ships with the foundry plugin only — research does not bundle it. Standalone install (foundry absent): the sentinel touches become inert and git commit proceeds unguarded. The sentinel logic is still safe to run (touch/rm on a temp file are harmless no-ops without the hook); it simply provides no protection. If you rely on atomic-commit guarding during research:run, install foundry.

Sentinel liveness: touch $COMMIT_SENTINEL after each Phase 8 result write to extend monitoring window — do NOT rely solely on sentinel touched at loop start; slow iterations exceed 15-min TTL. Re-derive slug per SENTINEL_SLUG_FORMULA from <constants> (bash state lost between calls).

--team mode: If --team active, Read ${CLAUDE_SKILL_DIR}/modes/team.md and execute Phases A–D in place of standard iteration loop below.

--team + --hypothesis combination: combinable. Team mode uses provided hypothesis path and skips oracle/hypothesis-generation phase — hypothesis_override = true applies inside team.md Phase A same as solo mode.

For each iteration i from 1 to max_iterations:

Phase overview (all phases run per iteration):

Phase	Name	Trigger / description
0	Print header	Always — print `[→ Iter N/max · starting]`; TaskUpdate R5 subject with current iteration
1	Build context	Always — build compact context from git log, JSONL history, and recent diff
2	Propose change	Always — spawn specialist agent to read code, research, investigate, and generate a hypothesis with optional sandbox scripts
2a	Sandbox validate	`compute: docker` only — run agent's exploratory scripts in Docker sandbox (read-only mount)
2b	Apply change	`compute: docker` only — agent applies the (validated) proposal to real codebase using Write/Edit tools only; no Bash on codebase
2c	Codex co-pilot	`--codex` only — required each iteration up to `MAX_CODEX_RUNS`; after cap reached, continue without Codex
3	Verify files	Always — check `git diff --stat`; skip to Phase 8 if no files changed (no-op)
4	Commit change	Always — stage modified files and commit before verifying metric
5	Verify metric	Always — run `metric_cmd` via `compute` mode (local/colab/docker); revert on timeout
6	Run guard	Always — run `guard_cmd` via `compute` mode; record pass or fail
7	Evaluate outcome	Always — keep, rework, or revert based on metric + guard result
7a	Write diary	Always — append one structured entry to `diary.md` recording hypothesis, outcome, and decision rationale
8	Write log	Always — append JSONL record, update `state.json`, print iteration summary, TaskUpdate R5 with result
9	Progress checks	Always — summary every SUMMARY_INTERVAL, stuck detection, diminishing-returns warn, early-stop check

Command execution rules (apply to ALL phases running external commands):

No compound commands: Never cd /path && command. Always two separate Bash calls — CWD persists between calls.
Use Bash tool timeout parameter: Never shell timeout wrapper. Pass timeout: <ms> on Bash tool call itself.
No inline multi-line Python: Python logic >3 lines → write to .experiments/state/<run-id>/scripts/script-.py via Write tool, execute with python <path> or uv run python <path>. Two triggers Claude Code always flags: (a) =([0-9.]+) inside -c "..." (false Zsh substitution); (b) multi-line -c "..." with #-prefixed comment lines. Writing to file sidesteps both.
No Zsh constructs: Never use =(), <(), >() in Bash commands — even inside quoted strings; Claude Code scans raw command text.
Local exploratory scripts writing to real files (scanning config combos, patching JSON, temp overrides): write to .experiments/state/<run-id>/scripts/, run locally with python <path>. Legitimately modify project files — NOT in Docker sandbox.
Docker sandbox (when available — see Phase 2a): Phases 4–6 route metric_cmd/guard_cmd through Docker when compute: docker. Phase 2a: read-only hypothesis scripts in sandbox. Scripts writing to project files always run locally.
One change per iteration: Never batch-loop over config variants/combos in single Bash/Python call. Each variant = one campaign iteration — loop/measure/compare is campaign framework's job, not ideation agent's.

Phase 0 — Print header

Print iteration header, update R5 task:

[→ Iter N/max_iterations — best so far: <best_metric> (Δ<best_delta_pct>% vs baseline)]

TaskUpdate R5 subject: R5: Iteration N/max_iterations — running

Phase 1 — Build context

Build context for ideation agent, write to file — do NOT accumulate inline in main context:

git log --oneline -10 >.experiments/state/${RUN_ID}/context-${I}.md  # timeout: 3000
tail -10 .experiments/state/${RUN_ID}/experiments.jsonl >>.experiments/state/${RUN_ID}/context-${I}.md  # timeout: 5000
# Fresh repos have <5 commits — fall back to full HEAD diff when shallow
if [ "$(git rev-list HEAD --count 2>/dev/null)" -gt 5 ]; then
    git diff --stat HEAD~5 HEAD >>.experiments/state/${RUN_ID}/context-${I}.md  # timeout: 3000
else
    git diff --stat HEAD >>.experiments/state/${RUN_ID}/context-${I}.md  # timeout: 3000
fi

Codemap structural context (only if CODEMAP_ENABLED=true — re-read from ${TMPDIR:-/tmp}/research-run-codemap-enabled): read $_RESEARCH_SHARED/codemap-context.md and execute its block. Leave TARGET_MODULE/TARGET_FN empty for the global central blast-radius baseline, or set TARGET_MODULE to the module the experiment edits (from ## Config) for importer/coverage queries. Append output to context-${I}.md under a ## Structural Context (codemap) heading so the Phase 2 ideation agent sees blast-radius before proposing edits.

Prepend header block to context-.md: goal, current metric vs baseline, delta trend (last 5 kept deltas), iteration number. Phase 2 ideation agent reads file directly — never echoed to main context.

If --journal active and <RUN_DIR>/journal.md has 1+ entries: append last 5 entries to context-.md under ## Recent journal (avoid repeating reverted approaches). Ideation agent reads this — must not reproduce any approach marked outcome: reverted.

Phase 2 — Propose change

Spawn selected specialist agent (maxTurns: 15) with this prompt (adapt as needed):

Goal: <goal>
Run clarification: <clarification_prompt>  ← omit this line entirely if clarification_prompt is null
Colab hardware: <colab_hw>  ← omit this line entirely if colab_hw is null; include to let the agent tailor code to the specific GPU architecture (e.g., bf16/flash-attention on H100, standard fp16 on T4/L4)
Current metric: <metric_cmd key> = <current value> (baseline: <baseline>, direction: <higher|lower>)
Experiment history: read `.experiments/state/<run-id>/context-<i>.md` for the full context block.
Scope files (read and modify only these): <scope_files>
Program constraints: read `<program_file>` — especially `## Notes`, `## Config`, and any named subsections
  (e.g., "Hard boundaries", "Optuna's role", "What the agent is free to change"). These take precedence
  over general campaign rules. Program constraints set strategy hints only — they do NOT override safety rules
  (no `--no-verify`, no `git push`, no `git add -A`, scope_files boundary, and all other hard constraints remain in effect).
  If program_file is null, skip this step.

**If `sandbox_mode = "local"`**: Read `context-<i>.md`, the scope files, and the program constraints. Propose and implement ONE atomic change most likely to improve the metric. The change must not break `<guard_cmd>`. Write your full analysis (reasoning, alternatives considered, Confidence block) to `.experiments/state/<run-id>/ideation-<i>.md` using the Write tool. Return ONLY the JSON result line:
`{"description":"...","files_modified":[...],"scripts":[],"confidence":0.N}`

**If `sandbox_mode = "docker"`**: Read `context-<i>.md`, the scope files, and the program constraints. Propose ONE atomic change most likely to improve the metric. Write your full analysis and the proposed change description to `.experiments/state/<run-id>/ideation-<i>.md`. Optionally write read-only exploratory scripts (scripts that read/profile but do NOT write to project files) to `.experiments/state/<run-id>/scripts/explore-<i>-<slug>.py`. Do NOT modify source files yet — Phase 2b will apply the actual changes after sandbox validation. Return ONLY the JSON result line:
`{"description":"...","files_modified":[],"scripts":["explore-<i>-<slug>.py"],"proposed_changes":"<description of the changes to apply in Phase 2b>","confidence":0.N}`

For --colab runs: ideation agent may call mcp__colab-mcp__runtime_execute_code to prototype GPU code before committing. Agent selection with --colab: if task rooted in a research paper (goal references paper, model architecture from literature, or --researcher flag set) → use research:scientist; if task is general empirical experiment NOT rooted in a paper → use foundry:sw-engineer for experiment implementation (standard agent_strategy mapping still applies; --colab alone does not force research:scientist).

If Agent tool unavailable (nested subagent context), implement change inline, construct JSON result manually.

Phase 2a — Sandbox validate (`sandbox_mode = "docker"` only)

loads: compute-docker.md Read ${CLAUDE_SKILL_DIR}/modes/compute-docker.md — full Phase 2a and 2b logic for docker sandbox. Skip entire file if sandbox_mode = "local".

Phase 2b — Apply change (`sandbox_mode = "docker"` only)

Skip if sandbox_mode = "local" — handled in compute-docker.md above.

Phase 2c — Codex co-pilot (`--codex` only)

Read ${CLAUDE_SKILL_DIR}/modes/codex-copilot.md — contains full Phase 2c logic, cost-bounded gate, Codex dispatch prompt, outcome handling, and stuck escalation.

Phase 3 — Verify files changed

git diff --stat. If no files changed (no-op): append to JSONL with status: no-op, skip to Phase 8 (log), continue loop.

Phase 4 — Commit change

Refresh commit sentinel before staging — R5 loop can exceed the 15-min sentinel TTL set in R5 setup. Slug computation unavoidably re-run (bash state lost between tool calls); path pattern identical to R5 setup block above:

# Refresh sentinel — bash state lost between calls; re-source slug (same form as R5 setup)
eval "$(bash "${CLAUDE_PLUGIN_ROOT:-plugins/research}/bin/git_slugs.sh")"  # timeout: 3000
touch "${TMPDIR:-/tmp}/claude-commit-auth-${REPO_SLUG}-${BRANCH_SLUG}"  # timeout: 3000

Stage only modified files (never git add -A):

git add <files_modified from agent JSON>  # timeout: 3000
git commit -m "experiment(optimize/i<N>): <description>"  # timeout: 90000

If pre-commit hooks fail:

Delegate to foundry:linting-expert: provide failing hook output and modified files; ask to fix. Max 2 attempts.
If still failing after 2 attempts: git restore --staged <files_modified> + git restore <files_modified> to clean up (# <files_modified> = list of files returned by the iteration agent; restricts discard to iteration scope only), append status: hook-blocked, continue loop.

Phase 5 — Verify metric

loads: phase5-metric.md # also loads: codex-copilot.md, colab-setup.md, compute-docker.md, hypothesis-pipeline.md, report.md, resume.md, team.md Read ${CLAUDE_SKILL_DIR}/modes/phase5-metric.md — metric verification logic for docker, local, and colab sandbox modes.

Phase 6 — Run guard

If sandbox_mode = "docker": run guard_cmd in same Docker container as Phase 5 (same flags; no resource limits). Check exit code only.

If sandbox_mode = "local": run guard_cmd directly.

Record pass (exit 0) or fail (non-zero).

Phase 7 — Evaluate outcome

Condition	Action
metric improved AND guard pass	Keep commit. Update `state.json`: `best_metric`, `best_commit`. "Improved" = `new_metric > best_metric` when `direction: higher`; `new_metric < best_metric` when `direction: lower`.
metric improved AND guard fail	Rework: re-spawn agent with guard failure output. Max `GUARD_REWORK_MAX` (2) attempts. If still failing after all rework attempts: revert (`git revert HEAD --no-edit`); diary status = `"reverted"`, decision = `"Guard failed after GUARD_REWORK_MAX rework attempts — reverted"`.
metric improved AND gain < 0.1% AND change > 50 lines	Refresh sentinel; discard: `git revert HEAD --no-edit`. (Line count computed via `CHANGE_LINES` — see note below table.)
no improvement	Refresh sentinel; revert: `git revert HEAD --no-edit`.

Line count computation (for "gain < 0.1% AND change > 50 lines" row): run before evaluating the condition:

DIFF_SUMMARY=$(git diff --stat HEAD~1..HEAD | tail -1)  # timeout: 3000
INSERTIONS=$(echo "$DIFF_SUMMARY" | grep -oE '[0-9]+ insertion' | grep -oE '[0-9]+' || echo 0)
DELETIONS=$(echo "$DIFF_SUMMARY" | grep -oE '[0-9]+ deletion' | grep -oE '[0-9]+' || echo 0)
CHANGE_LINES=$(( INSERTIONS + DELETIONS ))

git revert HEAD --no-edit — never git reset --hard (preserves history, not in deny list).

Double-revert guard (ADV-H19) — Phase 7 rework→revert can collide with a partial Phase 5 timeout revert or Phase 6 guard-fail revert performed in the same iteration. Always check before issuing the revert:

ALREADY_REVERTED=$(git log --oneline -5 2>/dev/null | grep -c "^[0-9a-f]\+ Revert " || echo 0)
if [ "$ALREADY_REVERTED" -gt 0 ]; then
    echo "Phase 7: prior revert detected (Phase 5/6 already reverted this iteration) — skipping double-revert."
else
    git revert HEAD --no-edit  # timeout: 15000
fi

The guard fires on metric improved AND guard fail (after GUARD_REWORK_MAX attempts exhausted), no improvement, and gain < 0.1% AND change > 50 lines paths — any path that issues a revert after Phase 5 or Phase 6 may have already reverted.

Phase 7a — Write diary

After Phase 7 decision, append one entry to diary.md:

## Iteration N — <ISO timestamp>

**Hypothesis**: <agent's description from Phase 2 JSON — the proposed change and expected improvement>

**Outcome**: <metric_key> = <value> (Δ<delta>% vs baseline) — <kept|reverted|rework|no-op|hook-blocked|timeout>

**Decision**: <one sentence: why the outcome was accepted or rejected — e.g. "Metric improved 1.2% with guard passing" or "Reverted: metric regressed by 0.5%" or "Guard failed after 2 rework attempts">

---

For no-op iterations (no file changes):

## Iteration N — <ISO timestamp>

**Hypothesis**: <description> — no files modified

**Outcome**: no-op

**Decision**: Skipped (no changes made)

---

Phase 8 — Write log

Append one JSONL record to experiments.jsonl (same schema as baseline record in Step R4, plus ideation_source):

{
  "iteration": 1,
  "commit": "<sha of experiment commit or revert>",
  "metric": 0.0,
  "delta": 0.0,
  "guard": "pass|fail",
  "status": "kept|reverted|rework|no-op|hook-blocked|timeout",
  "description": "<agent description>",
  "agent": "<agent type>",
  "confidence": 0.0,
  "timestamp": "<ISO>",
  "files": [],
  "ideation_source": "claude"
}

ideation_source: "claude" = Claude specialist proposed; "codex" = Phase 2c proposed.

Update state.json: iteration = i, status = running.

Print iteration summary:

[✓ Iter N/max — <kept|reverted|no-op|...> · metric=<value> (Δ<delta>%) · agent=<agent_type>]

TaskUpdate R5 subject: R5: Iter N/max — last: <status>, best: <best_metric>

# Compaction contract — overwrite each iteration; always reflects latest in-progress state (compaction-contract.md §Lifecycle)
_RUN_ID=$(cat "${TMPDIR:-/tmp}/research-run-id" 2>/dev/null || echo "")
_KEEP=$(cat "${TMPDIR:-/tmp}/research-run-keep-items" 2>/dev/null || echo "")
_STATE_JSON=".experiments/state/${_RUN_ID}/state.json"
_ITER=$(jq -r '.iteration // 0' "$_STATE_JSON" 2>/dev/null || echo "?")
_BEST=$(jq -r '.best_metric // "?"' "$_STATE_JSON" 2>/dev/null || echo "?")
_PROG=$(jq -r '.program_file // ""' "$_STATE_JSON" 2>/dev/null || echo "")
_KEEP_APPEND=""; [ -n "$_KEEP" ] && _KEEP_APPEND="; user-keep: $_KEEP"
mkdir -p .claude/state  # timeout: 5000
{
    echo "## Active Skill Contract"
    echo "- skill: research:run · phase: iteration-loop (after iter ${_ITER})"
    echo "- run-dir: .experiments/${_RUN_ID}"
    echo "- preserve: state-json=${_STATE_JSON}, program=${_PROG}, iter=${_ITER}, best-metric=${_BEST}${_KEEP_APPEND}"
    echo "- next: continue R5 from iter $(( _ITER + 1 )) or proceed to R6 when loop done"
} > .claude/state/skill-contract.md  # timeout: 5000

Phase 9 — Progress checks

Summary every SUMMARY_INTERVAL iterations: print compact table (iteration, metric, delta, status) for last N iterations.
Stuck detection: if last STUCK_THRESHOLD entries all have status: reverted|no-op|hook-blocked, trigger escalation (see <constants>). Log escalation action.
Diminishing returns: if last DIMINISHING_RETURNS_WINDOW kept entries each improved < 0.5%, warn and suggest stopping. No auto-stop — user decides.
Early stop: if target set, stop when metric crosses it. Mark state.json status: goal-achieved.
Context compaction (every SUMMARY_INTERVAL): write full iteration summary to .experiments/state/<run-id>/progress-.md, discard verbose per-iteration details from working memory. Retain only: current metric, iteration count, JSONL path, best_commit. Full history recoverable from experiments.jsonl and ideation-.md.

After campaign loop completes (outside per-iteration loop):

# Fresh shell — $COMMIT_SENTINEL from R5 setup is gone; re-derive the path before rm
# (matches the Phase 4 re-derivation) or the cleanup is a silent no-op on "".
eval "$(bash "${CLAUDE_PLUGIN_ROOT:-plugins/research}/bin/git_slugs.sh")"  # timeout: 3000
rm -f "${TMPDIR:-/tmp}/claude-commit-auth-${REPO_SLUG}-${BRANCH_SLUG}"  # timeout: 3000  (best-effort; commit-guard.js owns lifecycle)

Step R6: Results report

Pre-compute branch before writing: BRANCH=$(git branch --show-current 2>/dev/null | tr '/' '-' || echo 'main')

mkdir -p .reports/research  # timeout: 3000

Write full report to .reports/research/run-$BRANCH-$(date +%Y-%m-%d).md via Write tool. Do not print to terminal. Anti-overwrite: if file exists, append counter suffix (e.g. -2.md): OUT=".reports/research/run-$BRANCH-$(date +%Y-%m-%d).md"; BASE="$OUT"; COUNT=2; while [ -f "$OUT" ]; do OUT="${BASE%.md}-${COUNT}.md"; COUNT=$((COUNT+1)); done

Read ${CLAUDE_SKILL_DIR}/modes/report.md state.json: status = completed.

Step R7: Codex delegation (optional)

Skip R7 if CODEX_DELEGATION_AVAILABLE=false (warning already printed at R2 — no further action needed).

Inspect applied changes (git diff <baseline_commit>...<best_commit> --stat), identify tasks Codex can complete (comments on non-obvious changes, docstrings for modified functions, test coverage). Read .claude/skills/_shared/codex-delegation.md and apply criteria.

Call AskUserQuestion tool after R7 output — do NOT write options as plain text. Map options into tool call:

question: "What next?"
(a) label: /research:retro — description: run post-run retrospective analysis
(b) label: /research:verify <paper> — description: verify implementation matches paper claims
(c) label: skip — description: no further action

rm -f .claude/state/skill-contract.md  # clear contract — campaign complete (compaction-contract.md §Lifecycle)  # timeout: 5000

Resume Mode

loads: resume.md Read and execute ${CLAUDE_SKILL_DIR}/modes/resume.md.

Mode: colab

loads: colab-setup.md Execute only when --colab flag active. Read and execute ${CLAUDE_SKILL_DIR:-plugins/research/skills/run}/modes/colab-setup.md.

Commit before verify — enables clean git revert HEAD if metric doesn't improve. Never verify before committing.
git revert over git reset --hard — preserves experiment history, not in deny list.
Never git add -A — always stage specific files returned by agent JSON.
Never --no-verify — if pre-commit hook blocks, delegate to foundry:linting-expert and fix.
Guard ≠ Verify — guard checks regressions (tests, lint); verify checks target metric. Both must pass to keep commit.
metric_cmd exit code ignored — R2 validates metric_cmd by parsing stdout for a float, not by checking exit code. Piping metric output through grep/awk/tr is acceptable; only the final stdout float matters.
Guard/metric scripts protected — ideation agent must not modify the files referenced in guard_cmd or metric_cmd; do not include them in scope_files. New test files may be created within scope_files for coverage improvement campaigns.
JSONL over TSV — richer structured fields, jq-parseable, no delimiter ambiguity; query with jq -c 'select(.status == "kept")' experiments.jsonl.
State persistence enables resume — if loop crashes/times out, resume picks up exactly where it stopped.
Safety break: hard cap = 50 iterations (values above 50 in program.md clamped to 50 with a warning); default 20 when max_iterations unset in program.md; skill never exceeds MAX_ITERATIONS.
Explicit flags = hard requirements: all flags (--colab, --compute=docker, --codex, --researcher, --architect) must be available at R2. If unavailable, stop — never silently degrade.
R7 Codex delegation requires /foundry:setup (requires foundry plugin) to have been run once — deploys codex-delegation.md to .claude/skills/_shared/; R7 is silently skipped if absent.

run

Popularity

Invocation

Tool Access

Context Preview

Supporting Files

SKILL.md

run

Popularity

Invocation

Tool Access

Context Preview

Supporting Files

SKILL.md

Agent Resolution

Default Mode (Steps R1–R7)

Step R0: Hypothesis pre-phase (--researcher / --architect)

Step R1: Load / build config

Step R2: Precondition checks

Step R3: Select ideation agent

Step R4: Establish baseline (iteration 0)

Step R5: Iteration loop

Phase 0 — Print header

Phase 1 — Build context

Phase 2 — Propose change

Phase 2a — Sandbox validate (sandbox_mode = "docker" only)

Phase 2b — Apply change (sandbox_mode = "docker" only)

Phase 2c — Codex co-pilot (--codex only)

Phase 3 — Verify files changed

Phase 4 — Commit change

Phase 5 — Verify metric

Phase 6 — Run guard

Phase 7 — Evaluate outcome

Phase 7a — Write diary

Phase 8 — Write log

Phase 9 — Progress checks

Step R6: Results report

Step R7: Codex delegation (optional)

Resume Mode

Mode: colab

Similar Skills

Agent Resolution

Default Mode (Steps R1–R7)

Step R0: Hypothesis pre-phase (--researcher / --architect)

Step R1: Load / build config

Step R2: Precondition checks

Step R3: Select ideation agent

Step R4: Establish baseline (iteration 0)

Step R5: Iteration loop

Phase 0 — Print header

Phase 1 — Build context

Phase 2 — Propose change

Phase 2a — Sandbox validate (sandbox_mode = "docker" only)

Phase 2b — Apply change (sandbox_mode = "docker" only)

Phase 2c — Codex co-pilot (--codex only)

Phase 3 — Verify files changed

Phase 4 — Commit change

Phase 5 — Verify metric

Phase 6 — Run guard

Phase 7 — Evaluate outcome

Phase 7a — Write diary

Phase 8 — Write log

Phase 9 — Progress checks

Step R6: Results report

Step R7: Codex delegation (optional)

Resume Mode

Mode: colab

Similar Skills

Step R0: Hypothesis pre-phase (`--researcher` / `--architect`)

Phase 2a — Sandbox validate (`sandbox_mode = "docker"` only)

Phase 2b — Apply change (`sandbox_mode = "docker"` only)

Phase 2c — Codex co-pilot (`--codex` only)

Step R0: Hypothesis pre-phase (`--researcher` / `--architect`)

Phase 2a — Sandbox validate (`sandbox_mode = "docker"` only)

Phase 2b — Apply change (`sandbox_mode = "docker"` only)

Phase 2c — Codex co-pilot (`--codex` only)