Run metric-driven iterative optimization loops. Define a measurable goal, build measurement scaffolding, then run parallel experiments that try many approaches, measure each against hard gates and/or LLM-as-judge quality scores, keep improvements, and converge toward the best solution. Use when optimizing clustering quality, search relevance, build performance, prompt quality, or any measurable outcome that benefits from systematic experimentation. Inspired by Karpathy's autoresearch, generalized for multi-file code changes and non-ML domains.
npx claudepluginhub wangrenzhu-ola/galeharnesscodingcli --plugin galeharness-cli

This skill uses the workspace's default tool permissions.
Run metric-driven iterative optimization. Define a goal, build measurement scaffolding, then run parallel experiments that converge toward the best solution.
README.md
references/example-hard-spec.yaml
references/example-judge-spec.yaml
references/experiment-log-schema.yaml
references/experiment-prompt-template.md
references/judge-prompt-template.md
references/optimize-spec-schema.yaml
references/usage-guide.md
scripts/experiment-worktree.sh
scripts/measure.sh
scripts/parallel-probe.sh
Use the platform's blocking question tool when available (AskUserQuestion in Claude Code, request_user_input in Codex, ask_user in Gemini, ask_user in Pi (requires the pi-ask-user extension)). Otherwise, present numbered options in chat and wait for the user's reply before proceeding.
<optimization_input> #$ARGUMENTS </optimization_input>
If the input above is empty, ask: "What would you like to optimize? Describe the goal, or provide a path to an optimization spec YAML file."
Reference the spec schema for validation:
references/optimize-spec-schema.yaml
Reference the experiment log schema for state management:
references/experiment-log-schema.yaml
For a first run, optimize for signal and safety, not maximum throughput:
- Start from references/example-hard-spec.yaml when the metric is objective and cheap to measure
- Use references/example-judge-spec.yaml only when actual quality requires semantic judgment
- Set execution.mode: serial and execution.max_concurrent: 1
- Set stopping.max_iterations: 4 and stopping.max_hours: 1
- Use sample_size: 10, batch_size: 5, and max_total_cost_usd: 5

For a friendly overview of what this skill is for, when to use hard metrics vs LLM-as-judge, and example kickoff prompts, see:
references/usage-guide.md
CRITICAL: The experiment log on disk is the single source of truth. The conversation context is NOT durable storage. Results that exist only in the conversation WILL be lost.
The files under .context/galeharness-cli/gh-optimize/<spec-name>/ are local scratch state. They are ignored by git, so they survive local resumes on the same machine but are not preserved by commits, branches, or pushes unless the user exports them separately.
This skill runs for hours. Context windows compact, sessions crash, and agents restart. Every piece of state that matters MUST live on disk, not in the agent's memory.
If you produce a results table in the conversation without writing those results to disk first, you have a bug. The conversation is for the user's benefit. The experiment log file is for durability.
Write each experiment result to disk IMMEDIATELY after measurement — not after the batch, not after evaluation, IMMEDIATELY. Append the experiment entry to the experiment log file the moment its metrics are known, before evaluating the next experiment. This is the #1 crash-safety rule.
VERIFY every critical write — after writing the experiment log, read the file back and confirm the entry is present. This catches silent write failures. Do not proceed to the next experiment until verification passes.
Re-read from disk at every phase boundary and before every decision — never trust in-memory state across phase transitions, batch boundaries, or after any operation that might have taken significant time. Re-read the experiment log and strategy digest from disk.
The experiment log is append-only during Phase 3 — never rewrite the full file. Append new experiment entries. Update the best section in place only when a new best is found. This prevents data loss if a write is interrupted.
Per-experiment result markers for crash recovery — each experiment writes a result.yaml marker in its worktree immediately after measurement. On resume, scan for these markers to recover experiments that were measured but not yet logged.
Strategy digest is written after every batch, before generating new hypotheses — the agent reads the digest (not its memory) when deciding what to try next.
Never present results to the user without writing them to disk first — the pattern is: measure -> write to disk -> verify -> THEN show the user. Not the reverse.
These are non-negotiable write-then-verify steps. At each checkpoint, the agent MUST write the specified file and then read it back to confirm the write succeeded.
| Checkpoint | File Written | Phase |
|---|---|---|
| CP-0: Spec saved | spec.yaml | Phase 0, after user approval |
| CP-1: Baseline recorded | experiment-log.yaml (initial with baseline) | Phase 1, after baseline measurement |
| CP-2: Hypothesis backlog saved | experiment-log.yaml (hypothesis_backlog section) | Phase 2, after hypothesis generation |
| CP-3: Each experiment result | experiment-log.yaml (append experiment entry) | Phase 3.3, immediately after each measurement |
| CP-4: Batch summary | experiment-log.yaml (outcomes + best) + strategy-digest.md | Phase 3.5, after batch evaluation |
| CP-5: Final summary | experiment-log.yaml (final state) | Phase 4, at wrap-up |
Format of a verification step:
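A minimal sketch of the write-then-verify pattern, assuming a grep-based check and an illustrative spec name (demo) and entry:

```shell
# Append the experiment entry, then read the log back to confirm it landed.
LOG=".context/galeharness-cli/gh-optimize/demo/experiment-log.yaml"  # illustrative path
mkdir -p "$(dirname "$LOG")"
cat >> "$LOG" <<'EOF'
- iteration: 3
  outcome: measured
EOF
# Verification: do not proceed until the entry is confirmed on disk.
if grep -q "iteration: 3" "$LOG"; then
  verified=yes
else
  verified=no
fi
echo "verified=$verified"
```

Only after the read-back succeeds may the agent present results or move on to the next experiment.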
All scratch files live under .context/galeharness-cli/gh-optimize/<spec-name>/.

| File | Purpose | Written When |
|---|---|---|
| spec.yaml | Optimization spec (immutable during run) | Phase 0 (CP-0) |
| experiment-log.yaml | Full history of all experiments | Initialized at CP-1, appended at CP-3, updated at CP-4 |
| strategy-digest.md | Compressed learnings for hypothesis generation | Written at CP-4 after each batch |
| <worktree>/result.yaml | Per-experiment crash-recovery marker | Immediately after measurement, before CP-3 |
When Phase 0.4 detects an existing run:
- Scan experiment worktrees for result.yaml markers not yet in the log

Before any other action, log the skill start event so this execution appears on the task board:
- Run gale-task log skill_started --skill gh:optimize --title "<optimization-goal>" to register this execution on the task board.
- If gale-task is not on PATH or the command fails, skip and continue — this must never block the skill.

Check whether the input is:
- A path to a spec file (ending in .yaml or .yml): read and validate it
- A free-text description of the optimization goal

If a spec file is provided:
references/optimize-spec-schema.yaml:
- name is lowercase kebab-case and safe to use in git refs / worktree paths
- metric.primary.type is hard or judge
- if judge, the metric.judge section exists with rubric and scoring
- measurement.command is non-empty
- scope.mutable and scope.immutable each have at least one entry
- each gate uses a valid comparison operator (>=, <=, >, <, ==, !=)
- execution.max_concurrent is at least 1
- execution.max_concurrent does not exceed 6 when the backend is worktree

If a description is provided:
Analyze the project to understand what can be measured
Detect whether the optimization target is qualitative or quantitative — this determines type: hard vs type: judge and is the single most important spec decision:
Use type: hard when the metric is objective and cheap to measure.
Use type: judge when actual quality requires semantic judgment.
IMPORTANT: If the target is qualitative, strongly recommend type: judge. Explain that hard metrics alone will optimize proxy numbers without checking actual quality. Show the user the three-tier approach:
If the user insists on type: hard for a qualitative target, proceed but warn that the results may optimize a misleading proxy.
Design the sampling strategy (for type: judge):
Guide the user through defining stratified sampling. The key question is: "What parts of the output space do you need to check quality on?"
Walk through these questions:
Example stratified sampling for clustering:
stratification:
- bucket: "top_by_size" # largest clusters — check for degenerate mega-clusters
count: 10
- bucket: "mid_range" # middle of non-solo cluster size range — representative quality
count: 10
- bucket: "small_clusters" # clusters with 2-3 items — check if connections are real
count: 10
singleton_sample: 15 # singletons — check for false negatives (items that should cluster)
The sampling strategy is domain-specific. For search relevance, strata might be "top-3 results", "results 4-10", "tail results". For summarization, strata might be "short documents", "long documents", "multi-topic documents".
Singleton evaluation is critical when the goal involves coverage — sampling singletons with the singleton rubric checks whether the system is missing obvious groupings.
Design the rubric (for type: judge):
Help the user define the scoring rubric. A good rubric:
- Requests structured diagnostics alongside the score (distinct_topics, outlier_count)

Example for clustering:
rubric: |
Rate this cluster 1-5:
- 5: All items clearly about the same issue/feature
- 4: Strong theme, minor outliers
- 3: Related but covers 2-3 sub-topics that could reasonably be split
- 2: Weak connection — items share superficial similarity only
- 1: Unrelated items grouped together
Also report: distinct_topics (integer), outlier_count (integer)
Guide the user through the remaining spec fields:
- For a first run: execution.mode: serial, execution.max_concurrent: 1, stopping.max_iterations: 4, and stopping.max_hours: 1
- For type: judge: recommend sample_size: 10, batch_size: 5, and max_total_cost_usd: 5 until the rubric and harness are trusted

Write the spec to .context/galeharness-cli/gh-optimize/<spec-name>/spec.yaml
Present the spec to the user for approval before proceeding
Before searching prior learnings, query the vector memory database for related optimization work:
hkt-memory retrieve \
--query "<extracted query>" \
--layer all --limit 10 --min-similarity 0.35 \
--vector-weight 0.7 --bm25-weight 0.3
## Related historical optimization experiences from HKTMemory
Source: vector database. Treat as additional context, not primary evidence.
[results here, each tagged with (similarity: X.XX)]
Integration with Phase 2: When HKTMemory returns relevant results, cross-reference them during hypothesis generation. Look for:
Dispatch galeharness-cli:learnings-researcher to search for prior optimization work on similar topics. If relevant learnings exist, incorporate them into the approach.
Check if optimize/<spec-name> branch already exists:
git rev-parse --verify "optimize/<spec-name>" 2>/dev/null
If branch exists, check for an existing experiment log at .context/galeharness-cli/gh-optimize/<spec-name>/experiment-log.yaml.
Present the user with a choice via the platform question tool:
- Resume: recover any result.yaml markers, then continue from the last iteration number in the log.
- Start fresh: archive the existing branch to optimize-archive/<spec-name>/archived-<timestamp>, clear the experiment log, and start from scratch.

git checkout -b "optimize/<spec-name>" # or switch to existing if resuming
Create scratch directory:
mkdir -p .context/galeharness-cli/gh-optimize/<spec-name>/
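Since <spec-name> is embedded in branch names and scratch paths, the kebab-case check from Phase 0 can be sketched in shell (the spec name below is illustrative):

```shell
# Accept only lowercase kebab-case: a-z, 0-9, and hyphens, not at the edges.
name="cluster-quality-v2"   # illustrative spec name
case "$name" in
  ""|-*|*-|*--*|*[!a-z0-9-]*) name_ok=no ;;
  *) name_ok=yes ;;
esac
echo "name_ok=$name_ok"
```

A name that fails this check should be rejected before any branch or directory is created with it.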
This phase is a HARD GATE. The user must approve baseline and parallel readiness before Phase 2.
Verify no uncommitted changes to files within scope.mutable or scope.immutable:
git status --porcelain
Filter the output against the scope paths. If any in-scope files have uncommitted changes, stop and ask the user to commit or stash them before proceeding.
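The scope filter can be sketched as a small helper; the scope paths passed in are illustrative:

```shell
# Report uncommitted changes (staged, unstaged, or untracked) limited to the given paths.
check_scope_clean() {
  dirty=$(git status --porcelain -- "$@")
  if [ -n "$dirty" ]; then
    echo "in-scope files have uncommitted changes:"
    echo "$dirty"
    return 1
  fi
  echo "scope clean"
}
```

For example, check_scope_clean src/ harness/ returns non-zero when any file under those paths is dirty.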
If user provides a measurement harness (the measurement.command already exists):
bash scripts/measure.sh "<measurement.command>" <timeout_seconds> "<measurement.working_directory or .>"
If agent must build the harness:
- Write a measurement script (evaluate.py, evaluate.sh, or equivalent)
- List the harness in scope.immutable -- the experiment agent must not modify it

Run the measurement harness on the current code.
If stability mode is repeat:
- Run the measurement repeat_count times and aggregate the results
- If variance exceeds noise_threshold, warn the user and suggest increasing repeat_count

Record the baseline in the experiment log:
baseline:
timestamp: "<current ISO 8601 timestamp>"
gates:
<gate_name>: <value>
...
diagnostics:
<diagnostic_name>: <value>
...
If primary type is judge, also run the judge evaluation on baseline output to establish the starting judge score.
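The repeat-mode aggregation described above can be sketched with awk; the measurement values are illustrative:

```shell
# Aggregate repeated runs of the harness: mean for the metric, spread as a noise check.
runs="0.81 0.79 0.84"   # illustrative metric values from repeat_count runs
stats=$(echo "$runs" | tr ' ' '\n' | awk '
  { sum += $1
    if (NR == 1 || $1 < min) min = $1
    if (NR == 1 || $1 > max) max = $1 }
  END { printf "mean=%.3f spread=%.3f", sum/NR, max-min }')
echo "$stats"
# Compare the spread against measurement.stability.noise_threshold before trusting deltas.
```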
Run the parallelism probe script:
bash scripts/parallel-probe.sh "<project_directory>" "<measurement.command>" "<measurement.working_directory>" <shared_files...>
Read the JSON output. Present any blockers to the user with suggested mitigations. Treat the probe as intentionally narrow: it should inspect the measurement command, the measurement working directory, and explicitly declared shared files, not the entire repository.
Count existing worktrees:
bash scripts/experiment-worktree.sh count
If count + execution.max_concurrent would exceed 12:
- Warn the user and lower max_concurrent so the total stays at or below 12

MANDATORY CHECKPOINT. Before presenting results to the user, write the initial experiment log with baseline metrics to disk:
- Path: .context/galeharness-cli/gh-optimize/<spec-name>/experiment-log.yaml
- Follow references/experiment-log-schema.yaml: spec, run_id, started_at, baseline, experiments, and best
- Initialize experiments as an empty array and seed best from the baseline snapshot (use iteration: 0, baseline metrics, and baseline judge scores if present) so later phases have a valid current-best state to compare against
- Write hypothesis_backlog: [] here as well so the log shape is stable before Phase 2 populates it

Present to the user via the platform question tool:
- The max_total_cost_usd cap (or an explicit note that spend is uncapped)

Options:
Do NOT proceed to Phase 2 until the user explicitly approves.
If primary type is judge and max_total_cost_usd is null, call that out as uncapped spend and require explicit approval before proceeding.
State re-read: After gate approval, re-read the spec and baseline from disk. Do not carry stale in-memory values forward.
Read the code within scope.mutable to understand:
Optionally dispatch galeharness-cli:repo-research-analyst for deeper codebase analysis if the scope is large or unfamiliar.
Generate an initial set of hypotheses. Each hypothesis should have:
Include user-provided hypotheses if any were given as input.
Aim for 10-30 hypotheses in the initial backlog. More can be generated during the loop based on learnings.
Collect all unique new dependencies across all hypotheses.
If any hypotheses require new dependencies:
- Mark each hypothesis's dep_status as approved or needs_approval

Hypotheses with unapproved dependencies remain in the backlog but are skipped during batch selection. They are re-presented at wrap-up for potential approval.
MANDATORY CHECKPOINT. Write the initial backlog to the experiment log file and verify:
hypothesis_backlog:
- description: "Remove template boilerplate before embedding"
category: "signal-extraction"
priority: high
dep_status: approved
required_deps: []
- description: "Try HDBSCAN clustering algorithm"
category: "algorithm"
priority: medium
dep_status: needs_approval
required_deps: ["scikit-learn"]
This phase repeats in batches until a stopping criterion is met.
Select hypotheses for this batch:
- Skip hypotheses with dep_status: needs_approval
- If execution.mode is serial, force batch_size = 1
- Otherwise batch_size = min(runnable_backlog_size, execution.max_concurrent)

If the backlog is empty and no new hypotheses can be generated, proceed to Phase 4 (wrap-up). If the backlog is non-empty but no runnable hypotheses remain because everything needs approval or is otherwise blocked, proceed to Phase 4 so the user can approve dependencies instead of spinning forever.
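The batch-size selection above can be sketched as (the mode and counts are illustrative):

```shell
# Serial mode always runs one experiment; parallel mode is capped by max_concurrent.
mode="parallel" runnable=7 max_concurrent=4   # illustrative values
if [ "$mode" = "serial" ]; then
  batch_size=1
else
  batch_size=$(( runnable < max_concurrent ? runnable : max_concurrent ))
fi
echo "batch_size=$batch_size"
```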
For each hypothesis in the batch, dispatch according to execution.mode. In serial mode, run exactly one experiment to completion before selecting the next hypothesis. In parallel mode, dispatch the full batch concurrently.
Worktree backend:
WORKTREE_PATH=$(bash scripts/experiment-worktree.sh create "<spec_name>" <exp_index> "optimize/<spec_name>" <shared_files...>) # creates optimize-exp/<spec_name>/exp-<NNN>
- Dispatch the experiment agent with a prompt built from references/experiment-prompt-template.md
Codex backend:
# If these exist, we're already in Codex -- fall back to subagent
test -n "${CODEX_SANDBOX:-}" || test -n "${CODEX_SESSION_ID:-}" || test ! -w .git
cat /tmp/optimize-exp-XXXXX.txt | codex exec --skip-git-repo-check - 2>&1
Process experiments as they complete — do NOT wait for the entire batch to finish before writing results.
For each completed experiment, immediately:
Run measurement in the experiment's worktree:
bash scripts/measure.sh "<measurement.command>" <timeout_seconds> "<worktree_path>/<measurement.working_directory or .>" <env_vars...>
- If stability mode is repeat, run the measurement harness repeat_count times in that working directory and aggregate the results exactly as in Phase 1 before evaluating gates or ranking the experiment.
- If variance exceeds noise_threshold, record that in learnings so the operator knows the result is noisy.

Write crash-recovery marker — immediately after measurement, write result.yaml in the experiment worktree containing the raw metrics. This ensures the measurement is recoverable even if the agent crashes before updating the main log.
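Writing the crash-recovery marker can be sketched as follows; the worktree path and metric name are illustrative:

```shell
# Persist raw metrics into the worktree the moment measurement finishes.
WT="optimize-exp/demo/exp-003"   # illustrative worktree path
mkdir -p "$WT"
cat > "$WT/result.yaml" <<'EOF'
iteration: 3
metrics:
  silhouette: 0.42
measured_at: "2024-01-01T00:00:00Z"
EOF
# A later resume scans for these files to recover unlogged measurements.
grep -q "silhouette" "$WT/result.yaml" && echo "marker written"
```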
Read raw JSON output from the measurement script
Evaluate degenerate gates:
- For each gate in metric.degenerate_gates, parse the operator and threshold
- If any gate fails, mark the experiment degenerate, skip judge evaluation, save money

If gates pass AND primary type is judge:
- Sample outputs according to the metric.judge.stratification config (using sample_seed)
- Split the sample into batches of metric.judge.batch_size
- Build a judge prompt (from references/judge-prompt-template.md) for each batch
- Dispatch ceil(sample_size / batch_size) parallel judge sub-agents
- Aggregate metric.judge.scoring.primary (which should match metric.primary.name) plus any scoring.secondary values
- If singleton_sample > 0: also dispatch singleton evaluation sub-agents

If gates pass AND primary type is hard:
IMMEDIATELY append to experiment log on disk (CP-3) — do not defer this to batch evaluation. Write the experiment entry (iteration, hypothesis, outcome, metrics, learnings) to .context/galeharness-cli/gh-optimize/<spec-name>/experiment-log.yaml right now. Use the transitional outcome measured once the experiment has valid metrics but has not yet been compared to the current best. Update the outcome to kept, reverted, or another terminal state in the evaluation step, but the raw metrics are on disk and safe from context compaction.
VERIFY the write (CP-3 verification) — read the experiment log back from disk and confirm the entry just written is present. If verification fails, retry the write. Do NOT proceed to the next experiment until this entry is confirmed on disk.
Why immediately + verify? The agent's context window is NOT a durable store. Context compaction, session crashes, and restarts are expected during long runs. If results only exist in the agent's memory, they are lost. Karpathy's autoresearch writes to results.tsv after every single experiment — this skill must do the same with the experiment log. The verification step catches silent write failures that would otherwise lose data.
After all experiments in the batch have been measured:
Rank experiments by primary metric improvement:
- Hard metrics: compare according to metric.primary.direction (maximize means higher is better, minimize means lower is better), and require the absolute improvement to exceed measurement.stability.noise_threshold before treating it as a real win
- Judge metrics: compare the primary judge score (metric.judge.scoring.primary / metric.primary.name) to the current best, and require it to exceed minimum_improvement

Identify the best experiment that passes all gates and improves the primary metric.
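The direction-aware comparison with a noise floor can be sketched as (all values illustrative):

```shell
# A candidate only beats the current best if the delta clears the noise threshold.
best=0.78 candidate=0.81 direction="maximize" noise=0.02   # illustrative values
improved=$(awk -v b="$best" -v c="$candidate" -v d="$direction" -v n="$noise" 'BEGIN {
  delta = (d == "maximize") ? c - b : b - c   # positive delta means better
  if (delta > n) print "yes"; else print "no"
}')
echo "improved=$improved"
```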
If best improves on current best: KEEP
- Merge it into the optimization branch, using commit message optimize(<spec-name>): <hypothesis description> for the experiment commit

Check file-disjoint runners-up (up to max_runner_up_merges_per_batch):
- If re-measurement after merging confirms the improvement, keep it (outcome runner_up_kept), then clean up that runner-up's experiment worktree and branch
- Otherwise revert the merge (outcome runner_up_reverted), then clean up the runner-up's experiment worktree and branch

Handle deferred deps: experiments that need unapproved dependencies get outcome deferred_needs_approval
Revert all others: cleanup worktrees, log as reverted
MANDATORY CHECKPOINT. By this point, individual experiment results are already on disk (written in step 3.3). This step updates aggregate state and verifies.
Re-read the experiment log from disk — do not trust in-memory state. The log is the source of truth.
Finalize outcomes — update experiment entries from step 3.4 evaluation (mark kept, reverted, runner_up_kept, etc.). Write these outcome updates to disk immediately.
Update the best section in the experiment log if a new best was found. Write to disk.
Write strategy digest to .context/galeharness-cli/gh-optimize/<spec-name>/strategy-digest.md:
Generate new hypotheses based on learnings:
Write updated hypothesis backlog to disk — the backlog section of the experiment log must reflect newly added hypotheses and removed (tested) ones.
CP-4 Verification: Read the experiment log back from disk. Confirm: (a) all experiment outcomes from this batch are finalized, (b) the best section reflects the current best, (c) the hypothesis backlog is updated. Read strategy-digest.md back and confirm it exists. Only THEN proceed to the next batch or stopping criteria check.
Checkpoint: at this point, all state for this batch is on disk. If the agent crashes and restarts, it can resume from the experiment log without loss.
Stop the loop if ANY of these are true:
- stopping.target_reached is true, metric.primary.target is set, and the primary metric reaches that target according to metric.primary.direction (>= for maximize, <= for minimize)
- Iteration count reaches stopping.max_iterations
- Elapsed wall-clock time exceeds stopping.max_hours
- Cumulative judge spend reaches metric.judge.max_total_cost_usd (if set)
- No improvement over the current best for stopping.plateau_iterations consecutive experiments

If no stopping criterion is met, proceed to the next batch (step 3.1).
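The target-reached criterion can be sketched as (the target, current value, and direction are illustrative):

```shell
# direction decides whether reaching the target means >= or <=.
target=0.85 current=0.86 direction="maximize"   # illustrative values
reached=$(awk -v t="$target" -v c="$current" -v d="$direction" 'BEGIN {
  if (d == "maximize") result = (c >= t) ? "yes" : "no"
  else                 result = (c <= t) ? "yes" : "no"
  print result
}')
echo "reached=$reached"
```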
Codex failure cascade: Track consecutive Codex delegation failures. After 3 consecutive failures, auto-disable Codex for remaining experiments and fall back to subagent dispatch. Log the switch.
Error handling: If an experiment's measurement command crashes, times out, or produces malformed output:
- Log the experiment with outcome error or timeout, including the error message

Progress reporting: After each batch, report:
Crash recovery: See Persistence Discipline section. Per-experiment result.yaml markers are written in step 3.3. Individual experiment results are appended to the log immediately in step 3.3. Batch-level state (outcomes, best, digest) is written in step 3.5. On resume (Phase 0.4), the log on disk is the ground truth — scan for any result.yaml markers not yet reflected in the log.
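The resume-time marker scan can be sketched as a helper; the log and worktree paths are illustrative:

```shell
# Print markers whose experiment directory name does not yet appear in the log.
scan_unlogged_markers() {
  log="$1"; shift
  for marker in "$@"; do
    [ -e "$marker" ] || continue
    exp=$(basename "$(dirname "$marker")")   # e.g. exp-003
    grep -q "$exp" "$log" 2>/dev/null || echo "unlogged: $marker"
  done
}
```

For example, scan_unlogged_markers experiment-log.yaml optimize-exp/<spec-name>/exp-*/result.yaml lists every measured-but-unlogged experiment to recover.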
If any hypotheses were deferred due to unapproved dependencies:
Present a comprehensive summary:
Optimization: <spec-name>
Duration: <wall-clock time>
Total experiments: <count>
Kept: <count> (including <runner_up_kept_count> runner-up merges)
Reverted: <count>
Degenerate: <count>
Errors: <count>
Deferred: <count>
Baseline -> Final:
<primary_metric>: <baseline_value> -> <final_value> (<delta>)
<gate_metrics>: ...
<diagnostics>: ...
Judge cost: $<total_judge_cost_usd> (if applicable)
Key improvements:
1. <kept experiment 1 hypothesis> (+<delta>)
2. <kept experiment 2 hypothesis> (+<delta>)
...
The optimization branch (optimize/<spec-name>) is preserved with all commits from kept experiments.
The experiment log and strategy digest remain in local .context/... scratch space for resume and audit on this machine only; they do not travel with the branch because .context/ is gitignored.
Present post-completion options via the platform question tool:
- Run /gh:review on the cumulative diff (baseline to final). Load the gh:review skill with mode:autofix on the optimization branch.
- Run /gh:compound to document the winning strategy as an institutional learning.

After the wrap-up summary is presented:
hkt-memory store \
--content "<summary with key metrics>" \
--title "<spec-name> optimization" \
--topic "optimize" \
--layer all
- Report Stored to HKTMemory: [title] on success, or note the error (non-blocking — do not fail the optimization workflow if HKTMemory is unavailable).

Rationale: Optimization results are highly reusable — the winning strategy for one metric often applies to similar targets. Storing the approach and outcome helps future optimization sessions discover and build upon this work.
Clean up scratch space:
# Keep the experiment log for local resume/audit on this machine
# Remove temporary batch artifacts
rm -f .context/galeharness-cli/gh-optimize/<spec-name>/strategy-digest.md
Do NOT delete the experiment log if the user may resume locally or wants a local audit trail. If they need a durable shared artifact, summarize or export the results into a tracked path before cleanup. Do NOT delete experiment worktrees that are still being referenced.
After completing the optimization workflow, log the completion event:
- Run gale-task log skill_completed to record the completion event.
- If gale-task is not on PATH or the command fails, skip and continue — this must never block the skill.