# arcforge
Autonomously optimizes numeric metrics like build times, algorithm efficiency, prompt quality, or model performance via hypothesis-driven experiments with fixed evaluation and git-based resets.
Install:

```
npx claudepluginhub gregoryho/arcforge --plugin arcforge
```

This skill uses the workspace's default tool permissions.
Autonomous iterative research: define a measurable optimization target, establish a baseline, then run a hypothesis-driven experiment loop until interrupted.
Core principle: "Fixed judge + free player" — the evaluation method is immutable (the judge), while the implementation is free to change (the player). By locking what you measure, you prevent moving goalposts during optimization.
Failed experiments are reverted with `git reset --hard HEAD~1` immediately.

Phase 1 (Lock the Contract): agent proposes, human reacts, refine iteratively, then lock.
Step 1: Analyze Target
Step 2: Propose Draft Contract
Draft a `research-config.md` covering all 6 sections (below).
Step 3: Refine with Human
Step 4: Lock the Contract
Write `research-config.md` to disk:

```markdown
# Research Config: {target}

## Scope
CAN modify: {files/dirs the agent may change}
CANNOT modify: {files/dirs that are off-limits}

## Goal
Metric: {name, e.g., "build_time_seconds", "val_bpb"}
Direction: {lower-is-better | higher-is-better}
Target: {optional, e.g., "< 30s" or "none"}

## Strategy
Hypothesis playbook: {domain-specific approaches, ordered by likelihood}
Research sources: {docs URLs, reference implementations, config files}
First moves: {2-3 concrete experiments after baseline}

## Evaluation
Run command: {exact shell command, e.g., "npm run build 2>&1"}
Extract metric: {grep pattern, e.g., "grep -oP 'Time: \K[\d.]+' build.log"}
Timeout: {seconds per experiment}
Trials: {1 | 3 | 5 — runs per experiment; default 1 if omitted}
Aggregation: {median | mean — default median}

## Constraints
{secondary considerations, e.g., "keep memory under 4GB"}

## Autonomy
Mode: {run-until-interrupted | run-N-times | run-until-target}

## Simplicity Criterion
{Prefer simpler code when results are similar. Removing code for equal results is a win. "0.1% + 20 hacky lines? No." "0.1% from deleting code? Yes." "No improvement but simpler? Keep."}
```
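A filled-in contract for a hypothetical build-time target might begin like this (all concrete values below are illustrative, not part of the skill):

```markdown
# Research Config: webpack build time

## Scope
CAN modify: webpack.config.js, src/build/
CANNOT modify: src/app/, package.json dependencies

## Goal
Metric: build_time_seconds
Direction: lower-is-better
Target: < 30s

## Evaluation
Run command: npm run build 2>&1
Extract metric: grep -oP 'Time: \K[\d.]+' run.log
Timeout: 300
Trials: 1
Aggregation: median
```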
| Judge Type | Signal Stability | Recommended Trials |
|---|---|---|
| Deterministic (build time, algorithm) | Stable ±2% | 1 |
| Semi-stochastic (E2E tests, flaky metrics) | Varies ±10% | 3 |
| Stochastic (LLM-graded eval, model behavior) | Varies ±30% | 5 with median |
The contract author decides at lock time, not the loop at runtime. If Trials is omitted from an existing contract, default to 1.
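For multi-trial judges, median aggregation needs nothing beyond standard tools; a minimal sketch (the function name is illustrative):

```shell
# median: read one metric value per line on stdin, print the median.
# Intended for aggregating values grepped out of run-1.log .. run-N.log.
median() {
  sort -n | awk '
    { vals[NR] = $1 }
    END {
      if (NR % 2) print vals[(NR + 1) / 2]
      else print (vals[NR / 2] + vals[NR / 2 + 1]) / 2
    }'
}
```

For example, `grep -hoP 'Time: \K[\d.]+' run-*.log | median` yields one aggregated value per experiment.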
Setup:
- Create a research branch: `git checkout -b research/{tag}`
- Record the baseline in `results.tsv` with status `baseline` — do NOT commit `results.tsv` (keep it untracked so experiment history survives resets)
- Monitor live: `node scripts/cli.js research dashboard --results results.tsv --config research-config.md`

This is the heart of the skill. NEVER STOP — run until interrupted or the stop condition from the contract is met.
LOOP (until stop condition):
1. READ STATE — git log, results.tsv, research-config.md
2. HYPOTHESIZE — pick a direction based on results so far
3. IMPLEMENT — modify files within declared scope only
4. COMMIT — git commit with descriptive message
5. RUN — execute command `trials` times → run-1.log, run-2.log, ... (never tee or raw stdout)
6. EXTRACT — grep metric from each log, compute aggregation (median/mean)
7. DECIDE — aggregated value improved? keep. Same/worse? revert. Crash? log + revert.
8. LOG — append row to results.tsv (every experiment, no exceptions)
9. ANALYZE — 3+ failures in same direction? change direction entirely
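The LOG step is a single appended row; a sketch (the helper name is hypothetical, the column order matches the `results.tsv` format described later in this document):

```shell
# log_experiment: append one tab-separated row to results.tsv.
# Columns: commit, metric_value, status, description.
log_experiment() {
  printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4" >> results.tsv
}
```

Appending (rather than rewriting) keeps every experiment on record, including discards and crashes.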
| Outcome | Action | Git | results.tsv Status |
|---|---|---|---|
| Metric improved | Keep the change | Keep commit | keep |
| Metric same or worse | Discard the change | git reset --hard HEAD~1 | discard |
| Command crashed/timed out | Log and discard | git reset --hard HEAD~1 | crash |
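The DECIDE rule is strictly binary; a sketch that honors the contract's direction (the function and argument names are illustrative):

```shell
# decide: print "keep" only on strict improvement over the best value so far.
# Same-or-worse values and crashed runs (metric = NaN) are discarded.
decide() {
  value=$1 best=$2 direction=$3   # direction: lower-is-better | higher-is-better
  [ "$value" = "NaN" ] && { echo discard; return; }
  if [ "$direction" = "lower-is-better" ]; then
    awk -v v="$value" -v b="$best" 'BEGIN { exit !(v + 0 < b + 0) }' && echo keep || echo discard
  else
    awk -v v="$value" -v b="$best" 'BEGIN { exit !(v + 0 > b + 0) }' && echo keep || echo discard
  fi
}
```

Note that an exact tie prints `discard`: "same or worse" means revert.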
If 3 or more consecutive experiments fail in the same direction (e.g., all trying to reduce allocations):
Idea generation when stuck:
Two types of crashes — handle differently:
Dumb bug (typo, missing import, syntax error, off-by-one):
Fundamentally broken idea (OOM, algorithm doesn't converge, approach is wrong):
Log `crash` with the error in the description, then `git reset --hard HEAD~1`.

Timeout: If the run exceeds the timeout, kill it and treat as a fundamentally broken idea.
Never count crashes toward the "3 failures → change direction" rule — crashes indicate broken code, not a bad hypothesis.
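Timeout enforcement can lean on `timeout` from GNU coreutils, which exits with status 124 when it kills the command; a sketch (the wrapper name is illustrative):

```shell
# run_with_timeout: run one trial under the contract's timeout, capture all
# output to run.log, and classify the result for the experiment log.
run_with_timeout() {
  limit=$1; shift
  timeout "$limit" "$@" > run.log 2>&1
  case $? in
    0)   echo ok ;;             # extract the metric from run.log
    124) echo crash-timeout ;;  # killed: treat as fundamentally broken idea
    *)   echo crash ;;          # nonzero exit: log status crash and revert
  esac
}
```

Redirecting into `run.log` here also serves the context-protection rules below: nothing reaches raw stdout.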
Long-running research burns context. Protect it:
- `command > run.log 2>&1` — never tee, never raw stdout
- `grep "metric_pattern" run.log` — never `cat run.log`
- `tail -n 50 run.log` — read stack traces, not full logs

When the loop ends (interrupted, target reached, or max iterations):
`results.tsv` is tab-separated with a header row:

```
commit   metric_value  status    description
a1b2c3d  0.997         baseline  Initial baseline measurement
b2c3d4e  0.891         keep      Reduced learning rate by 50%
c3d4e5f  0.912         discard   Added dropout layer 0.3 — regression from 0.891
d4e5f6g  NaN           crash     Segfault in custom allocator — timeout after 300s
```
- `metric_value` is `NaN` for crashes
- `status` is one of `baseline`, `keep`, `discard`, `crash`

Git status: Keep `results.tsv` untracked. If committed, `git reset` after failed experiments will erase the log. The TSV is your persistent memory — it must survive resets.
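Because the TSV survives resets, the best commit for the final report can be recovered from it alone; a sketch assuming lower-is-better (the function name is illustrative):

```shell
# best_commit: print the commit hash of the best surviving result.
# Skips the header row, discarded experiments, and NaN crash rows.
best_commit() {
  awk -F'\t' 'NR > 1 && ($3 == "keep" || $3 == "baseline") && $2 != "NaN"' results.tsv |
    sort -t "$(printf '\t')" -k2,2n | head -n 1 | cut -f1
}
```

For higher-is-better metrics, the same sketch would sort with `-k2,2nr` instead.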
If the agent is interrupted and resumes in a new session:
- `research-config.md` → if it exists, the contract is already locked (skip Phase 1)
- `results.tsv` → understand all prior experiments
- `git log` → understand current code state
- `research/` branch → confirm research context

Never:
If results are suspicious:
| Rationalization | What to Do Instead |
|---|---|
| "The eval has a bug, let me fix it" | You're the player, not the judge. Stop and tell the human. |
| "The metric barely regressed, I'll keep it" | Binary rule: improved or not. Revert. |
| "I should ask the human about this" | You are autonomous. Decide, log reasoning, keep going. |
```
✓ RESEARCH COMPLETE
Target: {target name}
Baseline: {baseline value}
Best: {best value} ({improvement}% {direction})
Experiments: {total} ({kept} kept, {discarded} discarded, {crashed} crashed)
Branch: research/{tag}
Best commit: {hash}
```

```
✗ RESEARCH BLOCKED
Reason: {why the loop cannot continue}
Last experiment: {commit hash}
Suggestion: {what the human should investigate}
```
- Before: `arc-brainstorming` → explore what to optimize and identify measurable targets
- During: `arc research dashboard` for live monitoring
- After: Review the `research/{tag}` branch, cherry-pick or merge to main, run project tests