Autonomous iterative optimization loop for Python, SQL, ML, and Spark/Databricks. Define a measurable goal; the skill runs code changes, measures, and keeps/reverts until you stop it or the budget runs out. Use for: Python runtime (cProfile, hyperfine), SQL latency, pytest pass rate/coverage, ty/pyright errors, ML metrics (loss/accuracy/F1), memory, PySpark/Spark Connect. Trigger on: "optimize this", "autoopt", "autoresearch", "keep trying until it's faster", "hill-climb this", "run experiments", "iterate autonomously", "optimize SQL performance", "tune this model", "optimize this Spark job", "tune my PySpark pipeline", "improve test coverage automatically". DO NOT USE FOR: one-shot fixes, code review without a metric, tasks with no measurable outcome.
npx claudepluginhub bmsuisse/skills --plugin writing
Slash command: /bms:autoresearch
You define the goal and how to measure it. The agent does the rest: hypothesize, edit, measure, keep or revert — running autonomously until you stop it or the budget runs out.
Inspired by Karpathy's autoresearch, adapted for Python, SQL, pytest, ty, and ML workflows.
[!IMPORTANT] Every experiment is committed before running, and reverted on failure. The branch only ever advances on real improvements.
Work through these questions with the user before touching any code. Do not skip or assume answers.
Ask:
What are you trying to improve?
Examples: execution time, memory usage, pytest pass rate, ty error count, SQL query latency, model accuracy, training throughput, bundle size.
Ask:
What command produces the metric, and how do I read the number from its output?
- Command — the exact shell command to run
- Extraction — regex, line number, JSON path, or description of what to parse
- Direction — lower is better, or higher is better?
Refer to the domain quick-reference below for ready-made commands if the user is unsure.
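The three pieces above (command, extraction, direction) can be sketched in Python as follows. This is a minimal illustration, not the skill's actual implementation: the function names `extract_metric`, `run_metric`, and `is_better` are hypothetical, and the extraction is assumed to be a regex with one capture group.

```python
import re
import subprocess


def extract_metric(output: str, pattern: str) -> float:
    """Return the first number captured by `pattern` (one capture group assumed)."""
    match = re.search(pattern, output)
    if match is None:
        raise ValueError(f"pattern {pattern!r} not found in command output")
    return float(match.group(1))


def run_metric(command: str, pattern: str) -> float:
    """Run METRIC_COMMAND in a shell and extract the metric from stdout + stderr."""
    proc = subprocess.run(command, shell=True, capture_output=True, text=True)
    return extract_metric(proc.stdout + proc.stderr, pattern)


def is_better(new: float, best: float, direction: str) -> bool:
    """Compare two metric values according to METRIC_DIRECTION."""
    if direction == "lower_is_better":
        return new < best
    if direction == "higher_is_better":
        return new > best
    raise ValueError(f"unknown direction: {direction}")


# Example on captured output rather than a live command.
baseline = extract_metric("peak_kb=142.3", r"peak_kb=([0-9.]+)")
print(baseline, is_better(138.1, baseline, "lower_is_better"))
```

Rejecting an unknown direction early avoids silently optimizing the wrong way.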
Record:
- METRIC_COMMAND
- METRIC_EXTRACTION
- METRIC_DIRECTION (lower_is_better | higher_is_better)
Ask:
Which files or directories may I edit? Which are off-limits?
Record:
- IN_SCOPE: files/dirs the agent may modify
- OUT_OF_SCOPE: must not be touched
Ask:
Any constraints I should respect?
Examples: no new dependencies, must keep existing tests green, public API must stay stable, max 2 min per run, must stay type-clean (ty 0 errors), VRAM budget, complexity budget.
Record as CONSTRAINTS.
Ask:
May I search the web for optimization ideas, documentation, or techniques?
Web search lets me look up library docs, algorithm papers, Stack Overflow answers, and benchmarking guides to generate better hypotheses — especially useful for ML tuning, SQL optimization, and unfamiliar libraries.
Options:
yes (search freely), ask (propose each query before running), no (stay offline, codebase only).
Record as WEB_SEARCH (yes | ask | no, default no).
Ask:
How many experiments, or keep going until interrupted?
Record as MAX_EXPERIMENTS (number or unlimited).
State the default and ask for adjustments:
Default: simpler beats marginally faster. Removing code while holding or improving the metric is a win. Complexity has a cost — weigh it honestly against the gain. OK to proceed with this policy, or do you want to adjust it?
Record any adjustment as SIMPLICITY_POLICY.
Present a summary table and wait for explicit confirmation before continuing.
| Parameter | Value |
|---|---|
| Goal | |
| Metric command | |
| Metric extraction | |
| Direction | |
| In-scope | |
| Out-of-scope | |
| Constraints | |
| Max experiments | |
| Simplicity policy | |
| Web search | |
Run multiple experiment loops simultaneously, each on its own branch and working
directory, using git worktree. Useful when you want to race two hypotheses, explore
independent scopes concurrently, or keep your main editor on main while experiments run.
For each parallel run, create a dedicated worktree before starting its loop:
# In the repo root — repeat for each parallel run
RUN_ID_A="<goal-a>" # e.g. etl-runtime
RUN_ID_B="<goal-b>" # e.g. ty-errors
git worktree add ../<repo>-${RUN_ID_A} -b autoresearch/${RUN_ID_A}
git worktree add ../<repo>-${RUN_ID_B} -b autoresearch/${RUN_ID_B}
Each worktree is a sibling directory with its own HEAD and index — commits,
resets, and file edits in one worktree are completely invisible to the others.
| Rule | Reason |
|---|---|
| Non-overlapping IN_SCOPE | Two loops editing the same file will overwrite each other's changes, and the merge conflict is yours to resolve |
| Each run gets its own worktree | Never share a directory between two loops |
| Separate RESULTS_FILE per run | Use the unique RUN_ID so the files never collide |
| Merge order matters | When both runs finish, merge the one with the larger improvement first; re-run the other's baseline before merging it to get an honest combined measurement |
| VS Code workspace | File → Add Folder to Workspace for each worktree directory — all runs visible side-by-side |
When a parallel run finishes, remove its worktree:
git worktree remove ../<repo>-${RUN_ID_A} # removes directory + unregisters
# or, if the run produced no improvement:
git worktree remove --force ../<repo>-${RUN_ID_A}
git branch -D autoresearch/${RUN_ID_A}
List active worktrees at any time:
git worktree list
Generate a run ID — a short goal slug, e.g. pytest-passrate.
This ID is used for all artifacts of this run so multiple runs never collide.
RUN_ID="<goal-slug>" # e.g. pytest-passrate
RESULTS_FILE="autoresearch-${RUN_ID}.tsv"
LOG_FILE="autoresearch-${RUN_ID}.log"
Create a branch — propose autoresearch/<run-id>, create it:
git checkout -b autoresearch/${RUN_ID}
Read in-scope files — build full context before making any changes.
Initialize $RESULTS_FILE in the repo root:
experiment commit metric status description
Register both files in .git/info/exclude (append only — never modify tracked files):
echo "${RESULTS_FILE}" >> .git/info/exclude
echo "${LOG_FILE}" >> .git/info/exclude
Run the baseline — execute METRIC_COMMAND, extract the value, record as
experiment 0 with status baseline.
Report to the user:
Baseline: [metric] = [value]. Run ID: <run-id>. Starting experiment loop.
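The results-file steps above (create the TSV with its header, then record the baseline as experiment 0) might look like this in Python. It is a sketch under assumptions: `init_results` and `append_result` are hypothetical helper names, and a temporary directory stands in for the repo root.

```python
import csv
import tempfile
from pathlib import Path

HEADER = ["experiment", "commit", "metric", "status", "description"]


def init_results(path: Path) -> None:
    """Create the results file with its header row if it does not exist yet."""
    if not path.exists():
        with path.open("w", newline="") as f:
            csv.writer(f, delimiter="\t", lineterminator="\n").writerow(HEADER)


def append_result(path: Path, experiment, commit, metric, status, description) -> None:
    """Append one experiment row; earlier rows are never rewritten."""
    with path.open("a", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow([experiment, commit, metric, status, description])


# Usage: a temporary directory stands in for the repo root (assumption).
workdir = Path(tempfile.mkdtemp())
results_file = workdir / "autoresearch-pytest-passrate.tsv"
init_results(results_file)
append_result(results_file, 0, "a1b2c3d", 142.3, "baseline", "unmodified code")
print(results_file.read_text())
```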
Run continuously. Never pause to ask "should I continue?". Stop only when:
- MAX_EXPERIMENTS is reached, or
- the user interrupts the run

QUESTION (Questioner) Profile the current state before hypothesizing.
Ask structured questions about the code/data to surface bottlenecks:
1. What changed since last iteration? (diff awareness)
2. Where is time/resource actually spent? (profile or re-read metrics)
3. What patterns in prior results suggest a direction?
- 2+ consecutive keeps in same area → probe deeper
- 3+ discards → pivot strategy
- crash → structural issue, not parameter issue
4. Are there external signals to incorporate?
(error messages, profiler output, log patterns)
Record answers as QUESTIONER_NOTES for this iteration.
THINK Synthesize QUESTIONER_NOTES into a hypothesis.
Use domain reasoning vocabulary (see below) to sharpen the hypothesis.
Form: "X should improve Y because Z."
If WEB_SEARCH is `yes` or `ask` and you are stuck or entering a new
domain (e.g. unfamiliar library, ML algorithm, SQL planner behavior),
search the web for relevant techniques, docs, or benchmarks before
forming the hypothesis. For `ask`, state the proposed query and wait
for confirmation. Log the source URL in the description column of the
TSV when a web result directly inspired the experiment.
Follow experiment strategy priority below.
SCORE Rate the hypothesis before acting (1–10 each):
- Impact: how much metric improvement expected?
- Feasibility: how likely to work without breaking things?
- Novelty: how different from what was already tried? (check $RESULTS_FILE)
Average ≥ 5 → proceed. Below 5 → generate a better hypothesis.
Skip SCORE on experiment #1 (no prior data to compare against).
REFLECT Self-check before editing:
- What assumption am I making that could be wrong?
- Has something similar already been tried and failed? (scan $RESULTS_FILE)
- Am I stuck in a local optimum? (3+ keeps in same area → try different axis)
- Could this change make things worse in a way I won't measure?
If reflection reveals a flaw → revise hypothesis and re-SCORE.
EDIT Make one focused change to in-scope files.
Keep it minimal — one idea per experiment.
COMMIT Stage only in-scope files, then commit:
git add <IN_SCOPE files> && git commit
Message format: "experiment: <short description>"
RUN Execute METRIC_COMMAND, redirect all output:
<command> > $LOG_FILE 2>&1
MEASURE Extract the metric from $LOG_FILE.
On failure: read the last 50 lines of $LOG_FILE for the error.
INSPECT (Inspector) Validate the experiment beyond just the metric.
Checklist — all must pass for a "keep" decision:
☐ Metric improved (or held, if simplification pass)
☐ Change is minimal and focused (one idea, not a kitchen sink)
☐ No unrelated regressions introduced (test suite, type checker)
☐ Code complexity did not increase disproportionately to the gain
☐ The change is understandable — would a reviewer accept it?
☐ No hardcoded magic values that only work for the benchmark
☐ Description matches actual change (no hallucinated improvements)
☐ Metric improvement is real, not positive spin on noise
If any check fails, downgrade to "discard" even if metric improved.
Record inspector verdict as INSPECTOR_NOTES.
DECIDE Compare to current best (incorporating INSPECTOR_NOTES):
✅ IMPROVED → keep commit, update best, log status = "keep"
❌ SAME/WORSE → revert only in-scope files:
git reset HEAD~1 # mixed reset (the default): undo commit and unstage, keep working tree
git restore <IN_SCOPE files> # discard changes to in-scope files only
log status = "discard"
⚠️ METRIC UP BUT INSPECTOR FAIL → revert, log status = "inspector-reject"
Record why: "metric +12% but added 40 LOC of unmaintainable code"
💥 CRASH → attempt up to 2 quick fixes (typo, import, simple error),
amend commit, re-run. If still broken, undo the commit with git reset HEAD~1 and
restore only in-scope files; log status = "crash".
LOG Append to $RESULTS_FILE:
<N> <commit> <value> <status> <description>
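The SCORE gate and the DECIDE branches of the loop above can be condensed into a small Python sketch. The names `passes_score_gate`, `decide`, and `Outcome` are hypothetical, and the sketch deliberately omits the "held metric on a simplification pass" case, which would need an extra flag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Outcome:
    status: str   # "keep", "discard", "inspector-reject", or "crash"
    revert: bool  # True when the experiment commit should be undone


def passes_score_gate(impact: int, feasibility: int, novelty: int) -> bool:
    """SCORE step: proceed only when the average of the three ratings is >= 5."""
    return (impact + feasibility + novelty) / 3 >= 5


def decide(new: Optional[float], best: float, direction: str,
           inspector_ok: bool, crashed: bool = False) -> Outcome:
    """DECIDE step: map the measurement and inspector verdict to a status."""
    if crashed or new is None:
        return Outcome("crash", True)
    improved = new < best if direction == "lower_is_better" else new > best
    if not improved:
        return Outcome("discard", True)
    if not inspector_ok:
        # Metric moved the right way, but the INSPECT checklist failed.
        return Outcome("inspector-reject", True)
    return Outcome("keep", False)


print(decide(138.1, 142.3, "lower_is_better", inspector_ok=True))
print(decide(128.9, 131.4, "lower_is_better", inspector_ok=False))
```

Checking the metric before the inspector verdict keeps the two revert reasons distinct in the log.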
Insert these domain-specific keywords into your THINK step to sharpen hypothesis quality. Using precise terminology activates better reasoning patterns.
| Domain | Keywords to use in hypotheses |
|---|---|
| Python perf | bottleneck, hot path, cache locality, allocation pressure, GIL contention, vectorize, amortize |
| SQL perf | cardinality, selectivity, partition pruning, predicate pushdown, shuffle, skew, broadcast threshold |
| ML training | learning rate schedule, gradient norm, batch size saturation, overfitting signal, regularization strength |
| Test coverage | uncovered branch, edge case, boundary condition, error path, mock boundary |
| Memory | peak allocation, fragmentation, object lifetime, reference cycle, weak reference |
| General | invariant, precondition, tight loop, early exit, short-circuit, amortized cost |
Follow this priority order:
- Run uv run ty check after each change and revert if new errors appear

When the loop ends (budget reached or interrupted), work through all four sub-phases.
Print $RESULTS_FILE as a formatted table.
Summarize:
Show the git log of kept experiments:
git log --oneline <start-commit>..HEAD
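One possible way to print the results file as a formatted table is sketched below. The helper name `format_table` is hypothetical, and the rows are inlined here instead of being read from $RESULTS_FILE.

```python
def format_table(rows: list[list[str]]) -> str:
    """Pad every column to its widest cell and underline the header row."""
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    lines = []
    for n, row in enumerate(rows):
        lines.append(" | ".join(cell.ljust(w) for cell, w in zip(row, widths)))
        if n == 0:
            lines.append("-+-".join("-" * w for w in widths))
    return "\n".join(lines)


# Inlined sample; a real run would split each line of the TSV on tabs.
tsv_text = (
    "experiment\tcommit\tmetric\tstatus\tdescription\n"
    "0\ta1b2c3d\t142.3\tbaseline\tunmodified code\n"
)
rows = [line.split("\t") for line in tsv_text.splitlines()]
table = format_table(rows)
print(table)
```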
After reporting results, perform a structured review of the entire optimization run. Rate the run on these axes (1–10):
| Axis | Question |
|---|---|
| Soundness | Are the metric improvements real and reproducible, or could they be measurement artifacts? |
| Quality | Is the final code better than baseline? Would a senior engineer approve the diff? |
| Significance | Is the improvement meaningful enough to justify the complexity introduced? |
| Completeness | Were the most promising directions explored, or did the loop get stuck early? |
Present scores and a 2–3 sentence summary. Flag:
If the run-level review reveals concerns, note them in the report before cleanup.
Remove run artifacts that are no longer needed. Ask the user once before deleting:
Run complete. Clean up $RESULTS_FILE and $LOG_FILE from the working directory? (They are listed in .git/info/exclude and were never committed, so copy them elsewhere first if you need them later.)
If confirmed:
rm -f "${RESULTS_FILE}" "${LOG_FILE}"
Also remove the excludes entries added in Phase 2 so the file is left tidy:
# removes the two lines added during setup (-x matches whole lines, -F treats the names literally; no tracked files touched)
grep -vxF -e "${RESULTS_FILE}" -e "${LOG_FILE}" .git/info/exclude > /tmp/_exclude_tmp \
&& mv /tmp/_exclude_tmp .git/info/exclude
Present these as a checklist. Mark which apply based on what the run actually changed.
Code quality
- Run /deslop on changed files: automated optimization often leaves mechanical patterns, naming inconsistencies, or removed comments that need a pass
- Run ty across the full project to confirm no new type errors leaked in
Commit hygiene
- git rebase -i <start-commit>: replace all experiment: commits with meaningful messages describing what changed and why it helped
Integration
- Merge autoresearch/<run-id> into your base branch
- Attach $RESULTS_FILE (or paste the summary table) so reviewers understand the methodology
Further experimentation
If the run produced no net improvement, offer to delete the branch cleanly:
git checkout <base-branch>
git branch -D autoresearch/${RUN_ID}
If improvements were made, leave the branch for PR review.
Use these when the user isn't sure how to phrase their metric command.
# hyperfine (install: brew install hyperfine / pip install hyperfine)
hyperfine --warmup 3 'python my_script.py'
# → parse: "mean" field (lower is better)
# built-in timing
python -m timeit -n 100 -r 5 "import my_module; my_module.run()"
# cProfile summary
python -m cProfile -s cumtime my_script.py 2>&1 | head -20
# tracemalloc (add to script, or use wrapper)
python -c "
import tracemalloc, my_module
tracemalloc.start()
my_module.run()
current, peak = tracemalloc.get_traced_memory()
print(f'peak_kb={peak/1024:.1f}')
"
# → parse: peak_kb= line (lower is better)
# memory_profiler (pip install memory_profiler)
python -m memory_profiler my_script.py
# pass rate
pytest --tb=no -q 2>&1 | tail -1
# → parse: "X passed" (higher is better)
# coverage
pytest --cov=src --cov-report=term-missing --tb=no -q 2>&1 | grep "TOTAL"
# → parse: last percentage (higher is better)
# duration
pytest --tb=no -q 2>&1 | grep "passed"
# → parse duration from summary line (lower is better)
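The pytest commands above print a summary line such as "1 failed, 3 passed in 0.12s". A hedged sketch of turning that line into a pass rate and a duration follows; `parse_pytest_summary` is a hypothetical helper, and the summary format is assumed to match pytest's quiet-mode output.

```python
import re


def parse_pytest_summary(line: str) -> dict:
    """Extract outcome counts, pass rate, and duration from a pytest summary line."""
    counts = {"passed": 0, "failed": 0, "error": 0, "skipped": 0}
    # "2 errors" also matches because "error" is a prefix of "errors".
    for number, kind in re.findall(r"(\d+) (passed|failed|error|skipped)", line):
        counts[kind] = int(number)
    executed = counts["passed"] + counts["failed"] + counts["error"]
    duration = re.search(r"in ([0-9.]+)s", line)
    return {
        **counts,
        "pass_rate": counts["passed"] / executed if executed else 0.0,
        "duration_s": float(duration.group(1)) if duration else None,
    }


summary = parse_pytest_summary("1 failed, 3 passed in 0.12s")
print(f"pass_rate={summary['pass_rate']:.2f} duration_s={summary['duration_s']}")
```

A rate is usually a safer metric than the raw "X passed" count, because the count also rises when tests are merely added.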
# count type errors (lower is better)
uv run ty check 2>/dev/null | grep -c "^error\["
# exit code only: 0 = clean, 1 = errors
uv run ty check 2>/dev/null; echo "exit=$?"
# query duration via psql (wrap your query in EXPLAIN ANALYZE)
# -qAt prints only the unaligned result, so the output is plain JSON
psql -qAt "$DATABASE_URL" -c "EXPLAIN (ANALYZE, FORMAT JSON) <your query>" \
| python -c "import sys,json; d=json.load(sys.stdin); print(d[0]['Execution Time'])"
# → parse: Execution Time (lower is better)
# if using pgbench
pgbench -c 5 -T 30 $DATABASE_URL 2>&1 | grep "tps ="
# → parse: tps value (higher is better)
# training run — capture final metric from stdout
python train.py --epochs 10 2>&1 | grep "val_loss" | tail -1
# → parse: val_loss= value (lower is better)
# sklearn cross-validation
python -c "
from sklearn.model_selection import cross_val_score
import numpy as np, my_model, my_data
X, y = my_data.load()
scores = cross_val_score(my_model.build(), X, y, cv=5, scoring='f1_macro')
print(f'f1={np.mean(scores):.4f}')
"
# → parse: f1= (higher is better)
Run PySpark scripts locally against a remote Databricks cluster using
databricks-connect. The cluster does the heavy lifting; the local process is
your experiment coordinator. See the spark-connect skill for setup details.
# Time a PySpark script end-to-end (local process + cluster execution)
hyperfine --warmup 1 'uv run python notebooks/my_pipeline.py'
# → parse: "mean" field (lower is better)
# Capture a row count or computed metric from Spark SQL
uv run python -c "
from utils.databricks import get_spark_session
spark = get_spark_session()
result = spark.sql('''
SELECT COUNT(*) as n FROM my_catalog.my_schema.my_table
WHERE condition = true
''').collect()[0]['n']
print(f'metric={result}')
"
# → parse: metric= (direction depends on goal)
# Capture MLflow metric from a training run
uv run python -c "
import mlflow
client = mlflow.tracking.MlflowClient()
run = client.search_runs(['<experiment_id>'], order_by=['metrics.val_f1 DESC'], max_results=1)[0]
print(f'val_f1={run.data.metrics[\"val_f1\"]:.4f}')
"
# → parse: val_f1= (higher is better)
Autoresearch loop model with Spark Connect:
Local machine (autoresearch loop)
└─ edits notebook / script
└─ commits the change
└─ runs: uv run python notebooks/my_script.py
└─ get_spark_session() → Spark Connect → Databricks Cluster
└─ reads metric from stdout / MLflow
└─ keep or revert based on metric
└─ next iteration
Constraints to consider for Spark runs:
- Use hyperfine --warmup 1 (not 3+) to avoid excessive cluster cost
- 2× baseline is fine for inter-cluster runs
- Prefer spark.sql(...) over PySpark DataFrame API for readability and planability
- If DATABRICKS_CLUSTER_ID or DATABRICKS_PROFILE are not set, fail fast with a clear message rather than hanging

Filename: autoresearch-<run-id>.tsv (e.g. autoresearch-pytest-passrate.tsv)
Tab-separated, 5 columns:
experiment commit metric status description
0 a1b2c3d 142.3 baseline unmodified code
1 b2c3d4e 138.1 keep replace list comprehension with generator
2 c3d4e5f 145.0 discard switch to numpy vectorization (slower on small data)
3 d4e5f6g 0.0 crash add numba jit (import error, unfixable)
4 e5f6g7h 131.4 keep cache repeated db lookups with lru_cache
5 f6g7h8i 128.9 inspector-reject inline dict — metric +2% but +40 LOC unmaintainable
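Reading this file back to find the current best can be sketched as below. The helper `best_kept` is hypothetical, and the sample is a shortened copy of the rows above. Only baseline and keep rows count toward the best, which is why the crash placeholder 0.0 and the rejected 128.9 do not win.

```python
import csv
import io
from collections import Counter

SAMPLE_TSV = (
    "experiment\tcommit\tmetric\tstatus\tdescription\n"
    "0\ta1b2c3d\t142.3\tbaseline\tunmodified code\n"
    "1\tb2c3d4e\t138.1\tkeep\tgenerator instead of list comprehension\n"
    "2\tc3d4e5f\t145.0\tdiscard\tnumpy vectorization\n"
    "3\td4e5f6g\t0.0\tcrash\tnumba jit import error\n"
    "4\te5f6g7h\t131.4\tkeep\tlru_cache on db lookups\n"
    "5\tf6g7h8i\t128.9\tinspector-reject\tinline dict\n"
)


def best_kept(rows: list[dict], direction: str) -> dict:
    """Return the best row among those that actually stayed on the branch."""
    candidates = [r for r in rows if r["status"] in ("baseline", "keep")]
    pick = min if direction == "lower_is_better" else max
    return pick(candidates, key=lambda r: float(r["metric"]))


rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
status_counts = Counter(r["status"] for r in rows)
best = best_kept(rows, "lower_is_better")
print(dict(status_counts))
print(f"best: experiment {best['experiment']} metric={best['metric']}")
```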
| Principle | Why it matters |
|---|---|
| Measure everything | An unmeasured change is a guess. Every experiment has a number. |
| Revert failures | The branch tells the true story — only improvements survive. |
| Stay autonomous | Stopping to ask breaks the loop. Think harder instead. |
| Simplicity costs | Every line added is future maintenance. Weigh it honestly. |
| Log everything | The TSV is the research journal. Future you will thank present you. |