# Researcher
Orchestrates autonomous experiments that optimize measurable metrics such as build time, latency, or accuracy, using git branches for code state and `.lab/` for experiment logging.
```
npx claudepluginhub krzysztofdudek/researcherskill --plugin researcher
```

This skill uses the workspace's default tool permissions.
Karpathy-style autoresearch: the skill guides interactive setup of optimization goals, metrics, and scope, builds the measurement scaffolding it needs (eval script, test data), then runs autonomous git-committed experiment loops (change code, test, measure, keep improvements or revert) against a numerical metric. It suits performance tuning, prompts, clustering, search relevance, or any other measurable target in a git repo, evaluating via hard gates or LLM judges and iterating toward the best solution.
You are entering researcher mode. This skill is for YOU — the main agent. You orchestrate the entire research process: planning, implementing, committing, measuring, logging. When you need independent work done (evaluation, analysis), you spawn subagents with specific, scoped tasks. You control what each subagent knows through the prompt you give it.
You have complete freedom in how you navigate the problem space. The strategies and signals later in this document are tools when you need them, not rails you must follow.
## .lab/ is Sacred

`.lab/` is an untracked, local directory — the single source of truth for all experiment history. It survives all git operations because it is in `.gitignore`. Git manages code state; `.lab/` manages experiment knowledge. They are independent.
Structure:
- `.lab/config.md`, `results.tsv`, `log.md`, `branches.md`, `parking-lot.md` — experiment metadata
- `.lab/workspace/` — scratch space for experiment files (scripts, test data, generated output, per-experiment subdirectories). Create whatever you need here — it's yours, untracked, and safe from git operations.

Always protect `.lab/`. When cleaning the repo, use targeted commands that preserve untracked directories. When resetting, use `git reset` and `git checkout`, which leave `.lab/` intact.
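As a sketch, scaffolding this layout could look like the following (run in a throwaway directory here; the `results.tsv` columns are the ones the setup phase below specifies):

```shell
# Sketch: scaffold the .lab/ layout described above in a throwaway directory.
cd "$(mktemp -d)"
mkdir -p .lab/workspace .lab/bin
printf 'experiment\tbranch\tparent\tcommit\tmetric\tsecondary_metrics\tstatus\tduration_s\tdescription\n' > .lab/results.tsv
touch .lab/config.md .lab/log.md .lab/branches.md .lab/parking-lot.md
grep -qxF '.lab/' .gitignore 2>/dev/null || echo '.lab/' >> .gitignore
echo 'run.log' >> .gitignore
ls .lab
```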
Check if .lab/ already exists in the project root.
If it exists:
- Read `.lab/config.md`, `.lab/results.tsv`, `.lab/branches.md`, and the tail of `.lab/log.md`, then resume the experiment loop from where it left off.
- If the user wants a fresh start instead, move the old directory to `.lab.bak.<timestamp>/`, then proceed to Phase 1.

If it does not exist: proceed to Phase 1.
Before any experiment, understand the problem. Ask these questions conversationally — skip what's obvious from context, use the defaults shown when the user has no preference:
- Should setup generate context-ignore files (`.claudeignore`, `.cursorrules`, or other tool-specific config files) and helper scripts to save on token usage? If yes, what agentic ecosystem are we using?
- Whatever the budget, `.lab/` persists — the next session resumes via Phase 0.
Once you have answers, repeat the configuration back minimally and get explicit confirmation before proceeding. Use a compact table. If something is default, say “default” rather than restating details.
After confirmation:
- Create a research branch `research/<slug>` from current HEAD.
- Create `.lab/` in the project root.
- Create `.lab/config.md` with all agreed parameters (objective, metrics with measure commands and directions, run command, scope, constraints, wall-clock budget, termination condition, baseline and best placeholders).
- Create `.lab/results.tsv` with tab-separated columns: experiment, branch, parent, commit, metric, secondary_metrics, status, duration_s, description. Status values: keep, discard, crash, thought, keep*, interesting.
- Create `.lab/log.md`.
- Create `.lab/parking-lot.md` for deferred ideas.
- Create `.lab/branches.md` with columns: Branch, Forked from, Status, Experiments, Best metric, Notes.
- Create `.lab/workspace/` for scratch files (scripts, test data, generated output). Use per-experiment subdirectories (e.g., `.lab/workspace/exp-3/`) when needed.
- Create `.lab/bin/` with minimalist helpers (run, measure, data_head) and ecosystem-specific ignore files. Follow the [Token Hygiene Standards](#token-hygiene-standards) below.
- Add `.lab/` and `run.log` to `.gitignore`.

THINK — Before anything, read: `.lab/results.tsv`, `.lab/log.md` (last 5 entries if 20+), `.lab/branches.md`, `.lab/parking-lot.md`, and in-scope source files. Re-read the critical rules at the top of this document and the guardrails in the Execution Discipline section. Then write a `## THINK — before Experiment N` entry in `.lab/log.md` covering:
The log entry is mandatory — it is the evidence that you stopped to think. Without it, the THINK phase didn't happen. Stay as long as productive.
TEST — Implement, run, measure. Verify hypotheses. Follow execution discipline (below). Stay as long as you're generating new data.
REFLECT — What confirmed? What surprised? What breaks your model? Log everything. Update parking lot.
Repo-file experiments modify any file in scope (as defined in config). If you change a file that is in scope, it is a repo-file experiment — even if you "just want to test something quickly." No exceptions.
Lab-only experiments only touch .lab/ or files outside scope. The commit rules below apply to repo-file experiments. Lab-only experiments just need logging.
For every real experiment (code change + run):
Commit BEFORE running (repo-file experiments only):
```
experiment #{N}: {short description}

Branch: {research branch name}
Parent: #{parent experiment number}
Hypothesis: {one-line hypothesis}
```
Next experiment number = highest experiment in .lab/results.tsv + 1. Keeps stay on the branch as permanent checkpoints. Discards are reset — their SHA is recorded in results.tsv and remains accessible until git gc runs. Fork from discarded SHAs sooner rather than later.
Execute ALL measure commands (primary + secondary), record raw values
Log first — write a structured entry to .lab/log.md and a row to .lab/results.tsv (including the commit SHA). This must happen before any reset.
Then decide:
- Keep: leave the commit in place as a permanent checkpoint on the branch.
- Discard: `git reset --hard HEAD~1`. The commit disappears from the branch but its SHA is in `.lab/results.tsv`. Want to revisit a discarded idea? Fork a new branch from that SHA.
- Crash: also `git reset --hard HEAD~1`. Only read the last 50 lines of `run.log` or grep it for patterns.
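The commit-then-decide discipline can be sketched in a throwaway repo (all names, numbers, and the hypothesis below are illustrative):

```shell
set -e
# Sketch: commit BEFORE running, then discard via reset while the SHA stays reachable.
cd "$(mktemp -d)" && git init -q
git config user.email lab@example.com && git config user.name lab
git commit -q --allow-empty -m "init"
git checkout -q -b research/demo

echo "experimental change" > module.txt           # the in-scope edit
git add -A
git commit -q -m "experiment #1: demo change

Branch: research/demo
Parent: #0
Hypothesis: a one-line hypothesis"

# ...run and measure, then log to results.tsv and log.md BEFORE any reset...

sha=$(git rev-parse HEAD)                         # record this SHA in results.tsv
git reset --hard -q HEAD~1                        # discard: the commit leaves the branch
git cat-file -e "$sha" && echo "discarded SHA still reachable until git gc"
```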
Guardrails (after every decide/reset):
- After 3 discards in a row, write a `## 3-Discard Guardrail — after Experiment N` entry in `.lab/log.md` reviewing convergence signals and documenting why you are continuing vs. forking. This entry is mandatory — without it, you cannot proceed to the next experiment.
- After 5 discards in a row, write a `## 5-Discard Fork — after Experiment N` entry in `.lab/log.md`. Before forking, check `.lab/parking-lot.md` — if there are untested ideas there, try one first. Otherwise, to stay on the current branch, you must name a specific, untested hypothesis that is NOT a variant of what you already tried. If you cannot, fork — and follow the strategy diversification rules below.
- Periodically re-measure the current best and log the result in `.lab/log.md` as `## Re-Validation after Experiment N`. If it regressed >2%, log drift and consider forking from the best experiment. This is mandatory — do not skip.
- For every thought experiment: log with status `thought` in both files.
Log entry format — each entry as a heading, followed by labeled fields (one per line or inline, your choice — just be consistent):
```
## Experiment N — <title>
Branch: ... / Type: thought|real / Parent: #M
Hypothesis: ...
Changes: ...
Result: ...
Duration: ...
Status: keep|discard|crash|thought|keep*|interesting
Insight: ...
```
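A filled-in entry under this format might look like the following (every value is illustrative):

```
## Experiment 7 — cache parsed configs
Branch: research/cache-layer / Type: real / Parent: #6
Hypothesis: config parsing dominates startup time
Changes: memoize the config loader in the in-scope module
Result: build 41.2s -> 36.8s (-10.7%)
Duration: 312s
Status: keep
Insight: parsing was ~12% of wall time; caching helps every invocation
```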
Default: complete autonomy. You do not return to the user with progress updates. You work, you log, the user observes.
Consult the user ONLY when:
When the user intervenes: accept the direction, log the intervention, continue.
The experiment history is non-linear. Fork branches to explore divergent approaches.
When to fork: fundamentally different approach from an earlier state, current branch stagnating, combining keeps from different branches into a new line of experimentation, or promising divergence.
How to fork:
- Find the source commit. For keeps: `git log --oneline --grep="experiment #N:"`. For discards: find the SHA in `.lab/results.tsv`.
- `git checkout <SHA>` → `git checkout -b research/<descriptive-slug>`
- Register the new branch in `.lab/branches.md` (the "Forked from" column tracks genealogy — branch names don't need to encode it).

Always consider results from ALL branches when thinking. Mark exhausted branches as closed in `.lab/branches.md`.
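The fork flow can be sketched in a throwaway repo (the SHA retrieval, experiment numbers, and branch slug are illustrative; in practice the SHA comes from the commit column of `.lab/results.tsv`):

```shell
set -e
# Sketch: fork a new research branch from an earlier experiment's SHA.
cd "$(mktemp -d)" && git init -q
git config user.email lab@example.com && git config user.name lab
git commit -q --allow-empty -m "experiment #3: baseline tweak"
sha=$(git rev-parse HEAD)                  # in practice: read from .lab/results.tsv
git commit -q --allow-empty -m "experiment #4: later work"
git checkout -q "$sha"                     # detach at the old experiment state
git checkout -q -b research/invert-caching # new branch from that state
git rev-parse --abbrev-ref HEAD            # prints: research/invert-caching
```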
When forking due to stagnation, you are probably stuck in a local optimum. Tweaking the same variables from the same starting point will not escape it. Before creating the fork:
- List your priors in `.lab/log.md`: what does the current best strategy assume? (e.g., "verbose prompts score better", "caching is the bottleneck", "users prefer shorter messages").
- Pick one prior to invert, and name the fork after the new approach rather than the tweak (`research/low-alpha-approach`, not `research/tweak-delta`).

When the current metric is flawed — dimensions are unmeasurable from output, the scale doesn't differentiate quality, or the rubric misses what actually matters — revise it mid-series:
- In `.lab/log.md`, describe what is wrong with the current metric and why (e.g., which dimensions always score neutral, what the metric fails to capture).
- In `.lab/config.md`, add a `## Metric v2` section (keep v1 intact). Include: date, what changed, rationale for each dropped/added/modified dimension.
- Re-score every keep under the new metric — otherwise `results.tsv` is meaningless because you cannot tell whether improvement came from the experiment or the metric change.
- Record re-scored rows in `results.tsv` with a version suffix on the experiment number (e.g., `2v2` for experiment #2 re-scored under metric v2). Original rows stay untouched for audit.

Metric revision is expensive (re-scoring every keep), so do it once and get it right. If you suspect the metric is flawed, run a thought experiment first to confirm before triggering a full revision.
When termination is met or user interrupts:
- Write `.lab/summary.md`: total experiments, keeps, and discards per branch and globally; best vs. baseline; top 3 impactful changes; branch history; experiment genealogy; key insights; failed approaches; remaining parking-lot ideas.
- Make a final commit: `research complete: {short description of best result}`.

When the primary metric is qualitative, define a rubric in `.lab/config.md` during Phase 2:
- Composite score = `sum(criterion_score × weight)`.

This composite becomes the quantitative proxy. Log it in `results.tsv`, with per-criterion scores in the log entries.
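As a sketch, the composite can be computed from a criterion/score/weight table with a one-liner (the criteria, scores, and weights below are illustrative):

```shell
# Sketch: composite = sum(criterion_score * weight) over criterion<TAB>score<TAB>weight rows.
printf 'clarity\t8\t0.5\nbrevity\t6\t0.3\ntone\t9\t0.2\n' |
awk -F'\t' '{ total += $2 * $3 } END { printf "%.2f\n", total }'
# prints: 7.60
```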
When the metric is qualitative (agent judgment), a single evaluator introduces bias — the same agent that made the change also judges it. To counteract this:
- Spawn several independent evaluator subagents; each scores the output without being told what changed.
- Record all scores in `.lab/log.md` and the median in `results.tsv`.

This protocol is mandatory for qualitative metrics. Quantitative metrics (command output) do not need it.
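The median aggregation can be sketched as follows (the five scores stand in for independent judge outputs):

```shell
# Sketch: median of independent judge scores (all scores in log.md, this value in results.tsv).
printf '7\n9\n6\n8\n8\n' | sort -n |
awk '{ a[NR] = $1 } END { if (NR % 2) print a[(NR + 1) / 2]; else print (a[NR/2] + a[NR/2 + 1]) / 2 }'
# prints: 8
```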
Tools when you're stuck, not a menu to follow. You have complete freedom to invent your own.
| Strategy | When it helps |
|---|---|
| Ablation — remove something | Unsure what's actually helping |
| Amplification — push what works further | After a keep |
| Combination — merge wins from separate experiments | Multiple keeps in different areas |
| Inversion — try the opposite | String of discards |
| Isolation — change one variable | Unclear what helped |
| Analogy — borrow from adjacent domains | Truly stuck |
| Simplification — remove complexity, preserve metric | Accumulated cruft |
| Scaling — change by order of magnitude | Small tweaks plateaued |
| Decomposition — split big change into parts | Promising change discarded |
| Sweep — test parameter across a range | Right value unknown |
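For example, the Sweep strategy can be sketched as a loop over order-of-magnitude values; `MEASURE` here is a hypothetical stand-in for the project's real measure command (`.lab/bin/measure` in this document):

```shell
# Sketch: sweep a parameter by order of magnitude, one TSV row per value.
MEASURE() { echo $((1000 / $1)); }   # toy metric that improves as the parameter grows
for size in 1 10 100 1000; do
  printf 'sweep-demo\t%s\t%s\n' "$size" "$(MEASURE "$size")"
done
```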
| Signal | Meaning |
|---|---|
| 5+ discards in a row | Current approach exhausted |
| Thought experiments repeating | Go empirical |
| Results consistently confirm theory | Go deeper |
| Results contradict theory | Model is wrong — rethink |
| Metric plateau (<0.5% over 5 keeps) | Try something radically different |
| Same code area modified 3+ times | Explore elsewhere |
| Alternating keep/discard on similar changes | Isolate variables |
| 2+ timeouts in a row | Approach too expensive |
| Branch stagnating, other thriving | Switch or combine |
| Best results split across branches | Fork to combine |
| Change only tested in one direction | Test the opposite to confirm the assumption holds |
| 5+ discards with increasingly desperate variants | Locally optimal — fork from baseline, invert assumptions |
| All branches share the same core assumptions | Anchored — fork from baseline and invert |
| Global best unchanged for 8+ experiments | Plateau — fork from baseline with inverted assumptions |
| Dimension always scores neutral (e.g., 5/10) | Dimension unmeasurable — consider metric revision |
## Token Hygiene Standards

- Put repeated commands behind short helper scripts (`.lab/bin/`).
- Pipe metric output through `tr -d '[:space:]'` to strip invisible characters.
- Preview data with `head` for text. For dataframes (parquet/arrow), use a `python -c` snippet if dependencies exist; otherwise, fall back to a file metadata summary. Never dump raw binary.
- Never exclude `.lab/workspace/` or any files explicitly listed in the research Scope via token-hygiene / context-ignore patterns. This is separate from version-control ignores (e.g., it is still correct to add `.lab/` to `.gitignore`).
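A `data_head` helper consistent with these rules might look like this (only the helper's name comes from this document; its body is an assumed sketch):

```shell
# Sketch: preview a file without flooding context or dumping raw binary.
data_head() {
  f="$1"
  case "$f" in
    *.parquet|*.arrow)
      # Dataframe: try a pandas one-liner; fall back to file metadata if unavailable.
      python3 -c "import sys, pandas as pd; print(pd.read_parquet(sys.argv[1]).head())" "$f" 2>/dev/null \
        || ls -l "$f" ;;
    *)
      head -n 5 "$f" ;;   # text: first lines only
  esac
}
```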