Run metric-driven iterative optimization loops. Define a measurable goal, build measurement scaffolding, then run parallel experiments that try many approaches, measure each against hard gates and/or LLM-as-judge quality scores, keep improvements, and converge toward the best solution. Use when optimizing clustering quality, search relevance, build performance, prompt quality, or any measurable outcome that benefits from systematic experimentation. Inspired by Karpathy's autoresearch, generalized for multi-file code changes and non-ML domains.
npx claudepluginhub wangrenzhu-ola/galeharnesscodingcli --plugin galeharness-cli

This skill uses the workspace's default tool permissions.
Run metric-driven iterative optimization. Define a goal, build measurement scaffolding, then run parallel experiments that converge toward the best solution.
README.md
references/example-hard-spec.yaml
references/example-judge-spec.yaml
references/experiment-log-schema.yaml
references/experiment-prompt-template.md
references/judge-prompt-template.md
references/optimize-spec-schema.yaml
references/usage-guide.md
scripts/experiment-worktree.sh
scripts/measure.sh
scripts/parallel-probe.sh
Use the platform's blocking question tool when available (AskUserQuestion in Claude Code, request_user_input in Codex, ask_user in Gemini, ask_user in Pi (requires the pi-ask-user extension)). Otherwise, present numbered options in chat and wait for the user's reply before proceeding.
<optimization_input> #$ARGUMENTS </optimization_input>
If the input above is empty, ask: "What would you like to optimize? Describe the goal, or provide a path to an optimization spec YAML file."
Reference the spec schema for validation:
references/optimize-spec-schema.yaml
Reference the experiment log schema for state management:
references/experiment-log-schema.yaml
For a first run, optimize for signal and safety, not maximum throughput:
- Start from references/example-hard-spec.yaml when the metric is objective and cheap to measure
- Use references/example-judge-spec.yaml only when actual quality requires semantic judgment
- Set execution.mode: serial and execution.max_concurrent: 1
- Set stopping.max_iterations: 4 and stopping.max_hours: 1
- Use sample_size: 10, batch_size: 5, and max_total_cost_usd: 5

For a friendly overview of what this skill is for, when to use hard metrics vs LLM-as-judge, and example kickoff prompts, see:
references/usage-guide.md
CRITICAL: The experiment log on disk is the single source of truth. The conversation context is NOT durable storage. Results that exist only in the conversation WILL be lost.
The files under .context/galeharness-cli/gh-optimize/<spec-name>/ are local scratch state. They are ignored by git, so they survive local resumes on the same machine but are not preserved by commits, branches, or pushes unless the user exports them separately.
This skill runs for hours. Context windows compact, sessions crash, and agents restart. Every piece of state that matters MUST live on disk, not in the agent's memory.
If you produce a results table in the conversation without writing those results to disk first, you have a bug. The conversation is for the user's benefit. The experiment log file is for durability.
Write each experiment result to disk IMMEDIATELY after measurement — not after the batch, not after evaluation, IMMEDIATELY. Append the experiment entry to the experiment log file the moment its metrics are known, before evaluating the next experiment. This is the #1 crash-safety rule.
VERIFY every critical write — after writing the experiment log, read the file back and confirm the entry is present. This catches silent write failures. Do not proceed to the next experiment until verification passes.
Re-read from disk at every phase boundary and before every decision — never trust in-memory state across phase transitions, batch boundaries, or after any operation that might have taken significant time. Re-read the experiment log and strategy digest from disk.
The experiment log is append-only during Phase 3 — never rewrite the full file. Append new experiment entries. Update the best section in place only when a new best is found. This prevents data loss if a write is interrupted.
Per-experiment result markers for crash recovery — each experiment writes a result.yaml marker in its worktree immediately after measurement. On resume, scan for these markers to recover experiments that were measured but not yet logged.
Strategy digest is written after every batch, before generating new hypotheses — the agent reads the digest (not its memory) when deciding what to try next.
Never present results to the user without writing them to disk first — the pattern is: measure -> write to disk -> verify -> THEN show the user. Not the reverse.
These are non-negotiable write-then-verify steps. At each checkpoint, the agent MUST write the specified file and then read it back to confirm the write succeeded.
| Checkpoint | File Written | Phase |
|---|---|---|
| CP-0: Spec saved | spec.yaml | Phase 0, after user approval |
| CP-1: Baseline recorded | experiment-log.yaml (initial with baseline) | Phase 1, after baseline measurement |
| CP-2: Hypothesis backlog saved | experiment-log.yaml (hypothesis_backlog section) | Phase 2, after hypothesis generation |
| CP-3: Each experiment result | experiment-log.yaml (append experiment entry) | Phase 3.3, immediately after each measurement |
| CP-4: Batch summary | experiment-log.yaml (outcomes + best) + strategy-digest.md | Phase 3.5, after batch evaluation |
| CP-5: Final summary | experiment-log.yaml (final state) | Phase 4, at wrap-up |
Format of a verification step:
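A minimal sketch of the write-then-verify pattern, assuming a grep-based check and an illustrative spec name (demo) and entry:

```shell
# Append the experiment entry, then read the log back to confirm it landed.
LOG=".context/galeharness-cli/gh-optimize/demo/experiment-log.yaml"  # illustrative path
mkdir -p "$(dirname "$LOG")"
cat >> "$LOG" <<'EOF'
- iteration: 3
  outcome: measured
EOF
# Verification: do not proceed until the entry is confirmed on disk.
if grep -q "iteration: 3" "$LOG"; then
  verified=yes
else
  verified=no
fi
echo "verified=$verified"
```

Only after the read-back succeeds may the agent present results or move on to the next experiment.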
All scratch files live under .context/galeharness-cli/gh-optimize/<spec-name>/.

| File | Purpose | Written When |
|---|---|---|
| spec.yaml | Optimization spec (immutable during run) | Phase 0 (CP-0) |
| experiment-log.yaml | Full history of all experiments | Initialized at CP-1, appended at CP-3, updated at CP-4 |
| strategy-digest.md | Compressed learnings for hypothesis generation | Written at CP-4 after each batch |
| <worktree>/result.yaml | Per-experiment crash-recovery marker | Immediately after measurement, before CP-3 |
When Phase 0.4 detects an existing run:
- Scan experiment worktrees for result.yaml markers not yet in the log

Before any other action, log the skill start event so this execution appears on the task board:
- Run gale-task log skill_started --skill gh:optimize --title "<optimization-goal>" to register this execution on the task board.
- If gale-task is not on PATH or the command fails, skip and continue — this must never block the skill.

Check whether the input is:
- A path to a spec file (ending in .yaml or .yml): read and validate it
- A free-text description of the optimization goal

If a spec file is provided:
references/optimize-spec-schema.yaml:
- name is lowercase kebab-case and safe to use in git refs / worktree paths
- metric.primary.type is hard or judge
- if judge, the metric.judge section exists with rubric and scoring
- measurement.command is non-empty
- scope.mutable and scope.immutable each have at least one entry
- each gate uses a valid comparison operator (>=, <=, >, <, ==, !=)
- execution.max_concurrent is at least 1
- execution.max_concurrent does not exceed 6 when the backend is worktree

If a description is provided:
Analyze the project to understand what can be measured
Detect whether the optimization target is qualitative or quantitative — this determines type: hard vs type: judge and is the single most important spec decision:
Use type: hard when the metric is objective and cheap to measure.
Use type: judge when actual quality requires semantic judgment.
IMPORTANT: If the target is qualitative, strongly recommend type: judge. Explain that hard metrics alone will optimize proxy numbers without checking actual quality. Show the user the three-tier approach:
If the user insists on type: hard for a qualitative target, proceed but warn that the results may optimize a misleading proxy.
Design the sampling strategy (for type: judge):
Guide the user through defining stratified sampling. The key question is: "What parts of the output space do you need to check quality on?"
Walk through these questions:
Example stratified sampling for clustering:
stratification:
- bucket: "top_by_size" # largest clusters — check for degenerate mega-clusters
count: 10
- bucket: "mid_range" # middle of non-solo cluster size range — representative quality
count: 10
- bucket: "small_clusters" # clusters with 2-3 items — check if connections are real
count: 10
singleton_sample: 15 # singletons — check for false negatives (items that should cluster)
The sampling strategy is domain-specific. For search relevance, strata might be "top-3 results", "results 4-10", "tail results". For summarization, strata might be "short documents", "long documents", "multi-topic documents".
Singleton evaluation is critical when the goal involves coverage — sampling singletons with the singleton rubric checks whether the system is missing obvious groupings.
Design the rubric (for type: judge):
Help the user define the scoring rubric. A good rubric:
- Requests structured diagnostics alongside the score (distinct_topics, outlier_count)

Example for clustering:
rubric: |
Rate this cluster 1-5:
- 5: All items clearly about the same issue/feature
- 4: Strong theme, minor outliers
- 3: Related but covers 2-3 sub-topics that could reasonably be split
- 2: Weak connection — items share superficial similarity only
- 1: Unrelated items grouped together
Also report: distinct_topics (integer), outlier_count (integer)
Guide the user through the remaining spec fields:
- For a first run: execution.mode: serial, execution.max_concurrent: 1, stopping.max_iterations: 4, and stopping.max_hours: 1
- For type: judge: recommend sample_size: 10, batch_size: 5, and max_total_cost_usd: 5 until the rubric and harness are trusted

Write the spec to .context/galeharness-cli/gh-optimize/<spec-name>/spec.yaml
Present the spec to the user for approval before proceeding
Before searching prior learnings, query the vector memory database for related optimization work:
hkt-memory retrieve \
--query "<extracted query>" \
--layer all --limit 10 --min-similarity 0.35 \
--vector-weight 0.7 --bm25-weight 0.3
## Related historical optimization experiences from HKTMemory
Source: vector database. Treat as additional context, not primary evidence.
[results here, each tagged with (similarity: X.XX)]
Integration with Phase 2: When HKTMemory returns relevant results, cross-reference them during hypothesis generation. Look for:
Dispatch galeharness-cli:learnings-researcher to search for prior optimization work on similar topics. If relevant learnings exist, incorporate them into the approach.
Check if optimize/<spec-name> branch already exists:
git rev-parse --verify "optimize/<spec-name>" 2>/dev/null
If branch exists, check for an existing experiment log at .context/galeharness-cli/gh-optimize/<spec-name>/experiment-log.yaml.
Present the user with a choice via the platform question tool:
- Resume: recover any result.yaml markers, then continue from the last iteration number in the log.
- Start fresh: archive the existing branch to optimize-archive/<spec-name>/archived-<timestamp>, clear the experiment log, and start from scratch.

git checkout -b "optimize/<spec-name>" # or switch to existing if resuming
Create scratch directory:
mkdir -p .context/galeharness-cli/gh-optimize/<spec-name>/
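Since <spec-name> is embedded in branch names and scratch paths, the kebab-case check from Phase 0 can be sketched in shell (the spec name below is illustrative):

```shell
# Accept only lowercase kebab-case: a-z, 0-9, and hyphens, not at the edges.
name="cluster-quality-v2"   # illustrative spec name
case "$name" in
  ""|-*|*-|*--*|*[!a-z0-9-]*) name_ok=no ;;
  *) name_ok=yes ;;
esac
echo "name_ok=$name_ok"
```

A name that fails this check should be rejected before any branch or directory is created with it.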
This phase is a HARD GATE. The user must approve baseline and parallel readiness before Phase 2.
Verify no uncommitted changes to files within scope.mutable or scope.immutable:
git status --porcelain
Filter the output against the scope paths. If any in-scope files have uncommitted changes, stop and ask the user to commit or stash them before proceeding.
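The scope filter can be sketched as a small helper; the scope paths passed in are illustrative:

```shell
# Report uncommitted changes (staged, unstaged, or untracked) limited to the given paths.
check_scope_clean() {
  dirty=$(git status --porcelain -- "$@")
  if [ -n "$dirty" ]; then
    echo "in-scope files have uncommitted changes:"
    echo "$dirty"
    return 1
  fi
  echo "scope clean"
}
```

For example, check_scope_clean src/ harness/ returns non-zero when any file under those paths is dirty.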
If user provides a measurement harness (the measurement.command already exists):
bash scripts/measure.sh "<measurement.command>" <timeout_seconds> "<measurement.working_directory or .>"
If agent must build the harness:
- Write a measurement script (evaluate.py, evaluate.sh, or equivalent)
- List the harness in scope.immutable -- the experiment agent must not modify it

Run the measurement harness on the current code.
If stability mode is repeat:
- Run the measurement repeat_count times and aggregate the results
- If variance exceeds noise_threshold, warn the user and suggest increasing repeat_count

Record the baseline in the experiment log:
baseline:
timestamp: "<current ISO 8601 timestamp>"
gates:
<gate_name>: <value>
...
diagnostics:
<diagnostic_name>: <value>
...
If primary type is judge, also run the judge evaluation on baseline output to establish the starting judge score.
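The repeat-mode aggregation described above can be sketched with awk; the measurement values are illustrative:

```shell
# Aggregate repeated runs of the harness: mean for the metric, spread as a noise check.
runs="0.81 0.79 0.84"   # illustrative metric values from repeat_count runs
stats=$(echo "$runs" | tr ' ' '\n' | awk '
  { sum += $1
    if (NR == 1 || $1 < min) min = $1
    if (NR == 1 || $1 > max) max = $1 }
  END { printf "mean=%.3f spread=%.3f", sum/NR, max-min }')
echo "$stats"
# Compare the spread against measurement.stability.noise_threshold before trusting deltas.
```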
Run the parallelism probe script:
bash scripts/parallel-probe.sh "<project_directory>" "<measurement.command>" "<measurement.working_directory>" <shared_files...>
Read the JSON output. Present any blockers to the user with suggested mitigations. Treat the probe as intentionally narrow: it should inspect the measurement command, the measurement working directory, and explicitly declared shared files, not the entire repository.
Count existing worktrees:
bash scripts/experiment-worktree.sh count
If count + execution.max_concurrent would exceed 12:
- Warn the user and lower max_concurrent so the total stays at or below 12

MANDATORY CHECKPOINT. Before presenting results to the user, write the initial experiment log with baseline metrics to disk:
- Path: .context/galeharness-cli/gh-optimize/<spec-name>/experiment-log.yaml
- Follow references/experiment-log-schema.yaml: spec, run_id, started_at, baseline, experiments, and best
- Initialize experiments as an empty array and seed best from the baseline snapshot (use iteration: 0, baseline metrics, and baseline judge scores if present) so later phases have a valid current-best state to compare against
- Write hypothesis_backlog: [] here as well so the log shape is stable before Phase 2 populates it

Present to the user via the platform question tool:
- The max_total_cost_usd cap (or an explicit note that spend is uncapped)

Options:
Do NOT proceed to Phase 2 until the user explicitly approves.
If primary type is judge and max_total_cost_usd is null, call that out as uncapped spend and require explicit approval before proceeding.
State re-read: After gate approval, re-read the spec and baseline from disk. Do not carry stale in-memory values forward.
Read the code within scope.mutable to understand:
Optionally dispatch galeharness-cli:repo-research-analyst for deeper codebase analysis if the scope is large or unfamiliar.
Generate an initial set of hypotheses. Each hypothesis should have:
Include user-provided hypotheses if any were given as input.
Aim for 10-30 hypotheses in the initial backlog. More can be generated during the loop based on learnings.
Collect all unique new dependencies across all hypotheses.
If any hypotheses require new dependencies:
- Mark each hypothesis's dep_status as approved or needs_approval

Hypotheses with unapproved dependencies remain in the backlog but are skipped during batch selection. They are re-presented at wrap-up for potential approval.
MANDATORY CHECKPOINT. Write the initial backlog to the experiment log file and verify:
hypothesis_backlog:
- description: "Remove template boilerplate before embedding"
category: "signal-extraction"
priority: high
dep_status: approved
required_deps: []
- description: "Try HDBSCAN clustering algorithm"
category: "algorithm"
priority: medium
dep_status: needs_approval
required_deps: ["scikit-learn"]
This phase repeats in batches until a stopping criterion is met.
Select hypotheses for this batch:
- Skip hypotheses with dep_status: needs_approval
- If execution.mode is serial, force batch_size = 1
- Otherwise batch_size = min(runnable_backlog_size, execution.max_concurrent)

If the backlog is empty and no new hypotheses can be generated, proceed to Phase 4 (wrap-up). If the backlog is non-empty but no runnable hypotheses remain because everything needs approval or is otherwise blocked, proceed to Phase 4 so the user can approve dependencies instead of spinning forever.
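The batch-size selection above can be sketched as (the mode and counts are illustrative):

```shell
# Serial mode always runs one experiment; parallel mode is capped by max_concurrent.
mode="parallel" runnable=7 max_concurrent=4   # illustrative values
if [ "$mode" = "serial" ]; then
  batch_size=1
else
  batch_size=$(( runnable < max_concurrent ? runnable : max_concurrent ))
fi
echo "batch_size=$batch_size"
```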
For each hypothesis in the batch, dispatch according to execution.mode. In serial mode, run exactly one experiment to completion before selecting the next hypothesis. In parallel mode, dispatch the full batch concurrently.
Worktree backend:
WORKTREE_PATH=$(bash scripts/experiment-worktree.sh create "<spec_name>" <exp_index> "optimize/<spec_name>" <shared_files...>) # creates optimize-exp/<spec_name>/exp-<NNN>
- Dispatch the experiment agent with a prompt built from references/experiment-prompt-template.md
Codex backend:
# If these exist, we're already in Codex -- fall back to subagent
test -n "${CODEX_SANDBOX:-}" || test -n "${CODEX_SESSION_ID:-}" || test ! -w .git
cat /tmp/optimize-exp-XXXXX.txt | codex exec --skip-git-repo-check - 2>&1
Process experiments as they complete — do NOT wait for the entire batch to finish before writing results.
For each completed experiment, immediately:
Run measurement in the experiment's worktree:
bash scripts/measure.sh "<measurement.command>" <timeout_seconds> "<worktree_path>/<measurement.working_directory or .>" <env_vars...>
- If stability mode is repeat, run the measurement harness repeat_count times in that working directory and aggregate the results exactly as in Phase 1 before evaluating gates or ranking the experiment.
- If variance exceeds noise_threshold, record that in learnings so the operator knows the result is noisy.

Write crash-recovery marker — immediately after measurement, write result.yaml in the experiment worktree containing the raw metrics. This ensures the measurement is recoverable even if the agent crashes before updating the main log.
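Writing the crash-recovery marker can be sketched as follows; the worktree path and metric name are illustrative:

```shell
# Persist raw metrics into the worktree the moment measurement finishes.
WT="optimize-exp/demo/exp-003"   # illustrative worktree path
mkdir -p "$WT"
cat > "$WT/result.yaml" <<'EOF'
iteration: 3
metrics:
  silhouette: 0.42
measured_at: "2024-01-01T00:00:00Z"
EOF
# A later resume scans for these files to recover unlogged measurements.
grep -q "silhouette" "$WT/result.yaml" && echo "marker written"
```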
Read raw JSON output from the measurement script
Evaluate degenerate gates:
- For each gate in metric.degenerate_gates, parse the operator and threshold
- If any gate fails, mark the experiment degenerate, skip judge evaluation, save money

If gates pass AND primary type is judge:
- Sample outputs according to the metric.judge.stratification config (using sample_seed)
- Split the sample into batches of metric.judge.batch_size
- Build a judge prompt (from references/judge-prompt-template.md) for each batch
- Dispatch ceil(sample_size / batch_size) parallel judge sub-agents
- Aggregate metric.judge.scoring.primary (which should match metric.primary.name) plus any scoring.secondary values
- If singleton_sample > 0: also dispatch singleton evaluation sub-agents

If gates pass AND primary type is hard:
IMMEDIATELY append to experiment log on disk (CP-3) — do not defer this to batch evaluation. Write the experiment entry (iteration, hypothesis, outcome, metrics, learnings) to .context/galeharness-cli/gh-optimize/<spec-name>/experiment-log.yaml right now. Use the transitional outcome measured once the experiment has valid metrics but has not yet been compared to the current best. Update the outcome to kept, reverted, or another terminal state in the evaluation step, but the raw metrics are on disk and safe from context compaction.
VERIFY the write (CP-3 verification) — read the experiment log back from disk and confirm the entry just written is present. If verification fails, retry the write. Do NOT proceed to the next experiment until this entry is confirmed on disk.
Why immediately + verify? The agent's context window is NOT a durable store. Context compaction, session crashes, and restarts are expected during long runs. If results only exist in the agent's memory, they are lost. Karpathy's autoresearch writes to results.tsv after every single experiment — this skill must do the same with the experiment log. The verification step catches silent write failures that would otherwise lose data.
After all experiments in the batch have been measured:
Rank experiments by primary metric improvement:
- Hard metrics: compare according to metric.primary.direction (maximize means higher is better, minimize means lower is better), and require the absolute improvement to exceed measurement.stability.noise_threshold before treating it as a real win
- Judge metrics: compare the primary judge score (metric.judge.scoring.primary / metric.primary.name) to the current best, and require it to exceed minimum_improvement

Identify the best experiment that passes all gates and improves the primary metric.
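The direction-aware comparison with a noise floor can be sketched as (all values illustrative):

```shell
# A candidate only beats the current best if the delta clears the noise threshold.
best=0.78 candidate=0.81 direction="maximize" noise=0.02   # illustrative values
improved=$(awk -v b="$best" -v c="$candidate" -v d="$direction" -v n="$noise" 'BEGIN {
  delta = (d == "maximize") ? c - b : b - c   # positive delta means better
  if (delta > n) print "yes"; else print "no"
}')
echo "improved=$improved"
```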
If best improves on current best: KEEP
- Merge it into the optimization branch, using commit message optimize(<spec-name>): <hypothesis description> for the experiment commit

Check file-disjoint runners-up (up to max_runner_up_merges_per_batch):
- If re-measurement after merging confirms the improvement, keep it (outcome runner_up_kept), then clean up that runner-up's experiment worktree and branch
- Otherwise revert the merge (outcome runner_up_reverted), then clean up the runner-up's experiment worktree and branch

Handle deferred deps: experiments that need unapproved dependencies get outcome deferred_needs_approval
Revert all others: cleanup worktrees, log as reverted
MANDATORY CHECKPOINT. By this point, individual experiment results are already on disk (written in step 3.3). This step updates aggregate state and verifies.
Re-read the experiment log from disk — do not trust in-memory state. The log is the source of truth.
Finalize outcomes — update experiment entries from step 3.4 evaluation (mark kept, reverted, runner_up_kept, etc.). Write these outcome updates to disk immediately.
Update the best section in the experiment log if a new best was found. Write to disk.
Write strategy digest to .context/galeharness-cli/gh-optimize/<spec-name>/strategy-digest.md:
Generate new hypotheses based on learnings:
Write updated hypothesis backlog to disk — the backlog section of the experiment log must reflect newly added hypotheses and removed (tested) ones.
CP-4 Verification: Read the experiment log back from disk. Confirm: (a) all experiment outcomes from this batch are finalized, (b) the best section reflects the current best, (c) the hypothesis backlog is updated. Read strategy-digest.md back and confirm it exists. Only THEN proceed to the next batch or stopping criteria check.
Checkpoint: at this point, all state for this batch is on disk. If the agent crashes and restarts, it can resume from the experiment log without loss.
Stop the loop if ANY of these are true:
- stopping.target_reached is true, metric.primary.target is set, and the primary metric reaches that target according to metric.primary.direction (>= for maximize, <= for minimize)
- Iteration count reaches stopping.max_iterations
- Elapsed wall-clock time exceeds stopping.max_hours
- Cumulative judge spend reaches metric.judge.max_total_cost_usd (if set)
- No improvement over the current best for stopping.plateau_iterations consecutive experiments

If no stopping criterion is met, proceed to the next batch (step 3.1).
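The target-reached criterion can be sketched as (the target, current value, and direction are illustrative):

```shell
# direction decides whether reaching the target means >= or <=.
target=0.85 current=0.86 direction="maximize"   # illustrative values
reached=$(awk -v t="$target" -v c="$current" -v d="$direction" 'BEGIN {
  if (d == "maximize") result = (c >= t) ? "yes" : "no"
  else                 result = (c <= t) ? "yes" : "no"
  print result
}')
echo "reached=$reached"
```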
Codex failure cascade: Track consecutive Codex delegation failures. After 3 consecutive failures, auto-disable Codex for remaining experiments and fall back to subagent dispatch. Log the switch.
Error handling: If an experiment's measurement command crashes, times out, or produces malformed output:
- Log the experiment with outcome error or timeout, including the error message

Progress reporting: After each batch, report:
Crash recovery: See Persistence Discipline section. Per-experiment result.yaml markers are written in step 3.3. Individual experiment results are appended to the log immediately in step 3.3. Batch-level state (outcomes, best, digest) is written in step 3.5. On resume (Phase 0.4), the log on disk is the ground truth — scan for any result.yaml markers not yet reflected in the log.
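The resume-time marker scan can be sketched as a helper; the log and worktree paths are illustrative:

```shell
# Print markers whose experiment directory name does not yet appear in the log.
scan_unlogged_markers() {
  log="$1"; shift
  for marker in "$@"; do
    [ -e "$marker" ] || continue
    exp=$(basename "$(dirname "$marker")")   # e.g. exp-003
    grep -q "$exp" "$log" 2>/dev/null || echo "unlogged: $marker"
  done
}
```

For example, scan_unlogged_markers experiment-log.yaml optimize-exp/<spec-name>/exp-*/result.yaml lists every measured-but-unlogged experiment to recover.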
If any hypotheses were deferred due to unapproved dependencies:
Present a comprehensive summary:
Optimization: <spec-name>
Duration: <wall-clock time>
Total experiments: <count>
Kept: <count> (including <runner_up_kept_count> runner-up merges)
Reverted: <count>
Degenerate: <count>
Errors: <count>
Deferred: <count>
Baseline -> Final:
<primary_metric>: <baseline_value> -> <final_value> (<delta>)
<gate_metrics>: ...
<diagnostics>: ...
Judge cost: $<total_judge_cost_usd> (if applicable)
Key improvements:
1. <kept experiment 1 hypothesis> (+<delta>)
2. <kept experiment 2 hypothesis> (+<delta>)
...
The optimization branch (optimize/<spec-name>) is preserved with all commits from kept experiments.
The experiment log and strategy digest remain in local .context/... scratch space for resume and audit on this machine only; they do not travel with the branch because .context/ is gitignored.
Present post-completion options via the platform question tool:
- Run /gh:review on the cumulative diff (baseline to final). Load the gh:review skill with mode:autofix on the optimization branch.
- Run /gh:compound to document the winning strategy as an institutional learning.

After the wrap-up summary is presented:
hkt-memory store \
--content "<summary with key metrics>" \
--title "<spec-name> optimization" \
--topic "optimize" \
--layer all
- Report Stored to HKTMemory: [title] on success, or note the error (non-blocking — do not fail the optimization workflow if HKTMemory is unavailable).

Rationale: Optimization results are highly reusable — the winning strategy for one metric often applies to similar targets. Storing the approach and outcome helps future optimization sessions discover and build upon this work.
Clean up scratch space:
# Keep the experiment log for local resume/audit on this machine
# Remove temporary batch artifacts
rm -f .context/galeharness-cli/gh-optimize/<spec-name>/strategy-digest.md
Do NOT delete the experiment log if the user may resume locally or wants a local audit trail. If they need a durable shared artifact, summarize or export the results into a tracked path before cleanup. Do NOT delete experiment worktrees that are still being referenced.
After completing the optimization workflow, log the completion event:
- Run gale-task log skill_completed to record the completion event.
- If gale-task is not on PATH or the command fails, skip and continue — this must never block the skill.