By wjgoarxiv
Run Karpathy-style autonomous research loops across any domain: propose hypotheses, execute experiments, mechanically evaluate results, keep improvements, discard failures, and iterate until the target metric is reached or the budget is exhausted. Includes structured debugging, security auditing, adversarial deliberation, scenario stress-testing, and a deployment pipeline with human-gated final release.
Core autonomous research loop. Reads research.md, proposes hypotheses, runs experiments, evaluates results mechanically, keeps improvements, discards failures, and iterates until the target metric is achieved or the iteration budget is exhausted. TRIGGER when: user invokes "autoresearch" (no subcommand); research.md exists; user wants the 5-stage loop; user wants iterative optimization overnight.
Scientific bug hunting using falsifiable hypotheses. Forms hypotheses, designs falsifying tests, eliminates candidates systematically, and logs the full investigation trail in a structured debug/ folder. TRIGGER when: user has a bug to investigate scientifically; user wants systematic root-cause analysis; user says "debug", "investigate", "root cause", "why is this failing"; user invokes /autoresearch:debug. DO NOT TRIGGER when: user wants to optimize a metric (use /autoresearch); user wants to fix a known error automatically (use /autoresearch:fix); user just wants a quick one-line answer about what a function does.
Iterative error-crusher loop that auto-stops at 0 errors. Cascade-aware: fixes dependency errors before their dependents. Refuses anti-patterns that hide errors instead of fixing them. TRIGGER when: user has errors or failures to fix iteratively; user asks to "fix all errors"; user has a failing test suite; user has compilation errors; user has linter errors; user wants systematic error elimination; user invokes /autoresearch:fix. DO NOT TRIGGER when: user wants a one-shot fix for a single obvious bug; user wants debugging guidance only; user wants code review without fixing.
Feedback-driven self-improvement protocol for autoresearch-skill. Converts failed runs, confusing transcripts, bad outputs, or user feedback into a bounded improvement plan, an eval scenario, and a patch checklist without executing the patch automatically. TRIGGER when: user says the skill failed, wants to improve autoresearch-skill, provides feedback on confusing behavior, asks to turn a failure into an eval, or invokes /autoresearch:learn. DO NOT TRIGGER when: the user wants to run an optimization loop (use /autoresearch), fix code errors (use /autoresearch:fix), or ship a finished artifact (use /autoresearch:ship).
7-step setup wizard that produces a complete, ready-to-run research.md without executing the research loop. Walks the user through goal, metric, search space, constraints, evaluator design, and baseline measurement, then writes the file. TRIGGER when: user wants to set up a research project; user wants to plan before running the loop; user says "plan my research"; user has a goal but no research.md; user invokes /autoresearch:plan. DO NOT TRIGGER when: research.md already exists and the user wants to run the loop; user wants a one-shot answer; user wants to debug, not optimize.
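The cascade-aware ordering mentioned for the error-fixing skill (fix dependency errors before their dependents) can be pictured as a topological sort. The sketch below is only an illustration under that assumption: the error ids and the dependency map are hypothetical and are not taken from the skill's real output.

```python
from graphlib import TopologicalSorter

# Hypothetical error records: each error id maps to the ids it depends on.
# These names are illustrative only, not the skill's actual data format.
errors = {
    "import_error_utils": set(),
    "type_error_parser": {"import_error_utils"},
    "test_failure_cli": {"type_error_parser"},
    "lint_unused_var": set(),
}

def fix_order(deps):
    """Return error ids so that every dependency precedes its dependents."""
    return list(TopologicalSorter(deps).static_order())

order = fix_order(errors)
for error_id in order:
    print("fix", error_id)
```

Fixing in this order means a downstream failure is only examined after the upstream error that may have caused it is gone, which avoids patching symptoms.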

Define a goal. Let the agent research, experiment, and iterate -- autonomously.
| # | Example | Result | Iterations | Evaluator |
|---|---|---|---|---|
| 1 | Code Optimization — Sort 1M integers faster | 2.12s → 0.15s (−93%) | 8 | benchmark.py |
| 2 | Function Fitting — Discover hidden math function | RMSE 2.11 → 0.030 (−99%) | 8 | evaluate.py |
| 3 | Skill Elaboration — Improve P&ID analysis skill | 0.28 → 0.98 composite (+255%) | 2 | evaluate.py |
| 4 | Literature Review — Exercise timing papers | 1/8 → 8/8 categories, 19 papers | 4 | Agent (Tier 2) |
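
The percentages in the Result column read as relative change against the baseline. A minimal sketch of that arithmetic, using only the rounded before/after values from the table (the row labels are my own shorthand):

```python
def pct_change(before, after):
    """Relative change of `after` against `before`, in percent."""
    return (after - before) / before * 100.0

# Before/after values copied from the table above; labels are shorthand.
rows = {
    "sort runtime (s)": (2.12, 0.15),
    "fit RMSE": (2.11, 0.030),
    "skill composite": (0.28, 0.98),
}

changes = {name: round(pct_change(b, a)) for name, (b, a) in rows.items()}
for name, change in changes.items():
    print(f"{name}: {change:+d}%")
```

With these rounded inputs the third row works out to roughly +250%. The published +255% may come from unrounded scores, which are not shown here.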
[!NOTE] An LLM skill that turns natural-language research goals into autonomous experiment-evaluate-iterate loops -- inspired by Karpathy's autoresearch. Write a research.md, and the agent handles hypothesis generation, experimentation, evaluation, and iteration. Works with Claude Code, Codex CLI, OpenCode, and Gemini CLI.
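
The propose, evaluate, keep-or-discard cycle described above can be sketched in a few lines. This is a toy illustration, not the skill's implementation: `run_loop`, its parameters, and the quadratic objective are all assumptions made for the example, and the metric is assumed to be "lower is better".

```python
import random

def run_loop(baseline, propose, evaluate, target, budget):
    """Keep a candidate only if it lowers the metric; stop at target or budget."""
    best, best_score = baseline, evaluate(baseline)
    history = [("baseline", best_score, "keep")]
    for i in range(1, budget + 1):
        if best_score <= target:
            break  # target metric reached
        candidate = propose(best)    # hypothesis: a small change to the best so far
        score = evaluate(candidate)  # mechanical evaluation, no judgment calls
        if score < best_score:
            best, best_score = candidate, score
            history.append((f"iter-{i}", score, "keep"))
        else:
            history.append((f"iter-{i}", score, "discard"))
    return best, best_score, history

# Toy problem (hypothetical): minimize (x - 3)^2 by random perturbation.
rng = random.Random(0)
best, score, history = run_loop(
    baseline=0.0,
    propose=lambda x: x + rng.uniform(-1.0, 1.0),
    evaluate=lambda x: (x - 3.0) ** 2,
    target=0.01,
    budget=200,
)
print(f"best={best:.3f} score={score:.4f} steps={len(history) - 1}")
```

Because discarded candidates never replace the current best, the kept scores can only go down, which is what makes an unattended overnight run safe to resume from.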
Each run leaves behind human-readable reports, machine-readable logs, and visual evidence. These examples are checked into the repo so you can see the shape of a completed autoresearch loop before running your own.
| Example | Goal | Metric | Before → After | Iterations | Visual preview | Artifacts |
|---|---|---|---|---|---|---|
| Code Optimization | Sort 1M integers faster | median runtime ↓ | 2.12s → 0.15s | 8 | results.png | research.md, autoresearch-results.tsv, final_report.md |
| Function Fitting | Recover an unknown function from data | RMSE ↓ | 2.11 → 0.030 | 8 | results.png | train_data.csv, test_data.csv, evaluate.py, final_report.md |
| Skill Elaboration | Improve a PDF/P&ID analysis skill | structural score ↑ | 0.28 → 0.98 | 2 | results.png | original/improved SKILL.md, evaluate.py, final_report.md |
| Literature Review | Fill exercise-timing literature coverage gaps | categories covered ↑ | 1/8 → 8/8 | 4 | results.png | research_log.md, autoresearch-results.tsv, final_report.md |
Typical final directory shape:
my-research/
├── research.md # living state + iteration history
├── research_log.md # append-only reasoning and evidence log
├── autoresearch-results.tsv # machine-readable metric table
├── progress.png # convergence plot refreshed during runs
└── final_report.md # final result, failures, and next steps
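
To show how a machine-readable metric table such as autoresearch-results.tsv could be written and queried, here is a small sketch. The column names and the sample rows are assumptions for illustration. The file's real layout is not shown on this page and may differ.

```python
import csv
import io

# Hypothetical column layout; the real autoresearch-results.tsv may differ.
FIELDS = ["iteration", "hypothesis", "metric", "decision"]

def append_row(handle, row):
    """Append one result row, writing the header first if the table is empty."""
    writer = csv.DictWriter(
        handle, fieldnames=FIELDS, delimiter="\t", lineterminator="\n"
    )
    if handle.tell() == 0:
        writer.writeheader()
    writer.writerow(row)

def best_kept(handle):
    """Return the kept row with the lowest metric (lower is better here)."""
    handle.seek(0)
    rows = csv.DictReader(handle, delimiter="\t")
    kept = [r for r in rows if r["decision"] == "keep"]
    return min(kept, key=lambda r: float(r["metric"]))

# An in-memory buffer stands in for the file; the values are illustrative.
buf = io.StringIO()
append_row(buf, {"iteration": 0, "hypothesis": "baseline", "metric": 1.00, "decision": "keep"})
append_row(buf, {"iteration": 1, "hypothesis": "idea A", "metric": 0.80, "decision": "keep"})
append_row(buf, {"iteration": 2, "hypothesis": "idea B", "metric": 0.95, "decision": "discard"})

winner = best_kept(buf)
print(winner["iteration"], winner["metric"])
```

Keeping discarded rows in the table, rather than dropping them, preserves the record of what was tried and failed.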
research.md is your program: define goals, metrics, and constraints in plain English.
research_log.md is the append-only log, with timestamps, changes, and results.

Install: `npx claudepluginhub wjgoarxiv/autoresearch-skill`