AutoResearch
A Claude Code plugin for iterative optimization through automated evaluation.
AutoResearch evaluates an artifact (prompt, code, config, or anything else) against a test suite, analyzes failures, generates targeted variants, and promotes winners — repeating until it hits your target pass rate.
How It Works
Each optimization cycle:
- Assess — Run the current artifact against all test cases using binary assertions
- Analyze — Identify which assertions fail most and what patterns cause failures
- Generate — Create 3 candidate variants, each changing exactly ONE thing
- Compare — Assess all candidates against the full test suite
- Promote — If a candidate beats the current best, it becomes the new baseline
- Repeat — Continue until pass rate exceeds 90% or 15 cycles are exhausted
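The cycle above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the plugin's actual implementation: `optimize`, `evaluate`, and `make_variants` are hypothetical names, and the toy scorer and variant generator only stand in for the real Assess and Generate steps.

```python
from typing import Callable


def optimize(
    artifact: str,
    evaluate: Callable[[str], float],
    make_variants: Callable[[str], list[str]],
    target: float = 0.90,
    max_cycles: int = 15,
) -> tuple[str, float, int]:
    """Return (best artifact, its pass rate, cycles used)."""
    best, best_score = artifact, evaluate(artifact)  # Assess
    cycles = 0
    # Repeat: stop once the pass rate exceeds the target or cycles run out.
    while best_score <= target and cycles < max_cycles:
        cycles += 1
        candidates = make_variants(best)  # Analyze + Generate
        if not candidates:
            break  # nothing left to try
        scored = [(evaluate(c), c) for c in candidates]  # Compare
        top_score, top = max(scored, key=lambda pair: pair[0])
        if top_score > best_score:  # Promote only a strict improvement
            best, best_score = top, top_score
    return best, best_score, cycles


# Toy stand-ins (hypothetical): the "artifact" is a string and the pass
# rate is the fraction of required keywords it contains.
REQUIRED = ["summary", "risks", "steps", "tests"]


def toy_evaluate(text: str) -> float:
    return sum(word in text for word in REQUIRED) / len(REQUIRED)


def toy_variants(text: str) -> list[str]:
    # Up to 3 candidates, each adding exactly one missing keyword.
    missing = [word for word in REQUIRED if word not in text]
    return [f"{text} {word}" for word in missing[:3]]


best, score, cycles = optimize("summary", toy_evaluate, toy_variants)
print(best, score, cycles)
```

Because each candidate changes exactly one thing, a promoted variant's gain can be attributed to that single change.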
Setup
Install the plugin
In Claude Code, add this repo as a plugin marketplace, then install:
/plugin marketplace add sighup/autoresearch
/plugin install autoresearch@autoresearch
For local development, point Claude Code at your clone:
/plugin marketplace add /path/to/autoresearch
/plugin install autoresearch@autoresearch
Prerequisites
- Python 3.10+
- uv (for automatic dependency management)
- An ANTHROPIC_API_KEY environment variable (required only for prompt mode; not needed when using a custom runner)
The Agent SDK is installed automatically into .autoresearch/.venv on first run when using prompt mode.
Configure your optimization target
You need three things (and optionally a fourth):
1. An artifact
The thing you want to optimize — a prompt file, source code, config, or any file. It can live anywhere in your project.
2. Test cases
A JSONL file with one test case per line. Each line is a JSON object with id, input, and category:
{"id": "api-health", "input": "Add a /health endpoint to our Express.js API that returns server status and uptime.", "category": "api"}
{"id": "cli-export", "input": "Add a --format flag to our CLI tool for JSON and CSV export.", "category": "cli"}
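Before starting a run, it can help to check the test case file yourself. The sketch below is one possible validator, not part of the plugin: `parse_test_cases` is a hypothetical helper, and skipping blank lines and rejecting duplicate ids are assumptions rather than documented behavior.

```python
import json

REQUIRED_FIELDS = ("id", "input", "category")


def parse_test_cases(text: str) -> list[dict]:
    """Parse JSONL text into test cases, checking the three expected fields."""
    cases: list[dict] = []
    seen_ids: set[str] = set()
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        case = json.loads(line)
        if not isinstance(case, dict):
            raise ValueError(f"line {lineno}: expected a JSON object")
        missing = [field for field in REQUIRED_FIELDS if field not in case]
        if missing:
            raise ValueError(f"line {lineno}: missing fields {missing}")
        if case["id"] in seen_ids:  # assumption: ids should be unique
            raise ValueError(f"line {lineno}: duplicate id {case['id']!r}")
        seen_ids.add(case["id"])
        cases.append(case)
    return cases


sample = "\n".join([
    '{"id": "api-health", "input": "Add a /health endpoint.", "category": "api"}',
    '{"id": "cli-export", "input": "Add a --format flag.", "category": "cli"}',
])
for case in parse_test_cases(sample):
    print(case["id"], case["category"])
```

To check a real file, pass its contents, for example `parse_test_cases(Path("tests/summarizer_cases.jsonl").read_text())`.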
3. Assertions
A Python file defining binary assertion functions. Each function takes the runner's output as a string and returns True or False. Register them in an ASSERTIONS list:
import re

def assert_has_summary(response: str) -> bool:
    """Response contains a Summary section."""
    return bool(re.search(r"## Summary", response, re.IGNORECASE))

def assert_min_length(response: str) -> bool:
    """Response is at least 500 characters."""
    return len(response.strip()) >= 500

ASSERTIONS = [
    assert_has_summary,
    assert_min_length,
]
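To see how binary assertions could turn into a pass rate and a list of the most frequent failures, here is a rough sketch. The helpers `grade`, `pass_rate`, and `failure_counts` are hypothetical, and two choices are assumptions rather than documented plugin behavior: an assertion that raises counts as a failure, and the pass rate is the fraction of all (test case, assertion) checks that passed.

```python
import re
from collections import Counter
from typing import Callable

Assertion = Callable[[str], bool]


def grade(response: str, assertions: list[Assertion]) -> dict[str, bool]:
    """Run every assertion on one response, keyed by function name."""
    results: dict[str, bool] = {}
    for check in assertions:
        try:
            results[check.__name__] = bool(check(response))
        except Exception:
            results[check.__name__] = False  # assumption: errors count as fails
    return results


def pass_rate(graded: list[dict[str, bool]]) -> float:
    """Fraction of all (test case, assertion) checks that passed."""
    outcomes = [ok for result in graded for ok in result.values()]
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def failure_counts(graded: list[dict[str, bool]]) -> Counter:
    """How often each assertion failed across all graded responses."""
    return Counter(
        name for result in graded for name, ok in result.items() if not ok
    )


# The two example assertions from this README.
def assert_has_summary(response: str) -> bool:
    return bool(re.search(r"## Summary", response, re.IGNORECASE))


def assert_min_length(response: str) -> bool:
    return len(response.strip()) >= 500


ASSERTIONS = [assert_has_summary, assert_min_length]

graded = [
    grade("## Summary\nToo short.", ASSERTIONS),
    grade("x" * 600, ASSERTIONS),
]
print(pass_rate(graded))
print(failure_counts(graded).most_common())
```

Keeping each assertion binary makes the failure counts directly comparable across test cases.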
4. A custom runner (optional)
For non-prompt artifacts, provide a shell command that assesses your artifact. It receives context via environment variables:
AUTORESEARCH_ARTIFACT — path to the artifact being optimized
AUTORESEARCH_TEST_ID — test case ID
AUTORESEARCH_TEST_INPUT — test case input text
Its stdout becomes the response text that assertions grade. Exit 0 on success; non-zero is treated as an error.
Concurrency requirement: Your runner may be invoked concurrently for different test cases (one subprocess per test case, running simultaneously). Use the AUTORESEARCH_TEST_ID environment variable to isolate per-run state — write to test-specific temp directories, use separate database transactions, etc. If your runner cannot handle concurrent invocation, set "parallel": false in your config (see below).
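As an illustration of this contract, a runner could be written in Python along the following lines. It is a sketch, not a reference runner: `scratch_dir`, `run`, and `main` are hypothetical names, the scratch location under the system temp directory is an assumption, and the "work" (counting characters) is a placeholder for whatever your runner really does.

```python
import os
import sys
import tempfile
from pathlib import Path


def scratch_dir(test_id: str) -> Path:
    """Return a scratch directory that belongs to this test case only."""
    # Replace characters that are awkward in directory names.
    safe = "".join(c if c.isalnum() or c in "-_" else "_" for c in test_id)
    path = Path(tempfile.gettempdir()) / "autoresearch-runner" / safe
    path.mkdir(parents=True, exist_ok=True)
    return path


def run(env) -> str:
    """Build the response text from the three AUTORESEARCH_* variables."""
    artifact = Path(env["AUTORESEARCH_ARTIFACT"])
    test_id = env["AUTORESEARCH_TEST_ID"]
    test_input = env["AUTORESEARCH_TEST_INPUT"]

    # Per-test state lives in its own directory, so concurrent
    # invocations for different test cases do not overwrite each other.
    workdir = scratch_dir(test_id)
    (workdir / "input.txt").write_text(test_input, encoding="utf-8")

    # Placeholder work: report sizes instead of exercising the artifact.
    size = len(artifact.read_text(encoding="utf-8"))
    return f"test={test_id} artifact_chars={size} input_chars={len(test_input)}"


def main() -> int:
    try:
        print(run(os.environ))  # stdout is what the assertions grade
    except Exception as exc:
        print(f"runner error: {exc}", file=sys.stderr)
        return 1  # non-zero exit is treated as an error
    return 0
```

To use this as a script, end the file with `sys.exit(main())` and point the `"runner"` field at it, for example `"runner": "python my_runner.py"` (a hypothetical file name).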
These files can live anywhere in your project. Point to them from .autoresearch/config.json:
Prompt mode (default):
{
  "artifact": "src/prompts/summarizer.txt",
  "assertions": "tests/summarizer_assertions.py",
  "test_cases": "tests/summarizer_cases.jsonl"
}
Custom runner mode:
{
  "artifact": "pytest.ini",
  "runner": "bash ./run_tests_timed.sh",
  "assertions": "tests/perf_assertions.py",
  "test_cases": "tests/perf_cases.jsonl"
}
Parallelism: Test cases within a variant are assessed concurrently by default in prompt mode, and sequentially in custom runner mode. Override this with the "parallel" config field:
{
  "artifact": "pytest.ini",
  "runner": "bash ./run_tests_timed.sh",
  "parallel": true,
  "assertions": "tests/perf_assertions.py",
  "test_cases": "tests/perf_cases.jsonl"
}
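To make the effect of "parallel" concrete, the sketch below shows how a harness might invoke a runner command once per test case, either one at a time or simultaneously. It is an assumption about the general shape, not the plugin's code: `run_case`, `run_all`, and `max_workers` are hypothetical, and only the three environment variable names come from this README.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_case(command: str, artifact: str, case: dict) -> tuple[str, bool]:
    """Run the runner once for one test case; return (stdout, succeeded)."""
    env = {
        **os.environ,
        "AUTORESEARCH_ARTIFACT": artifact,
        "AUTORESEARCH_TEST_ID": case["id"],
        "AUTORESEARCH_TEST_INPUT": case["input"],
    }
    proc = subprocess.run(
        command, shell=True, env=env, capture_output=True, text=True
    )
    # stdout is the response text; a non-zero exit code marks an error.
    return proc.stdout, proc.returncode == 0


def run_all(
    command: str,
    artifact: str,
    cases: list[dict],
    parallel: bool = False,
    max_workers: int = 4,
) -> list[tuple[str, bool]]:
    """Run every test case, in input order, sequentially or concurrently."""
    if not parallel:
        return [run_case(command, artifact, case) for case in cases]
    # One subprocess per test case; several may be running at once.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda c: run_case(command, artifact, c), cases))
```

With `parallel=True`, several runner processes exist at the same moment, which is why per-test state must be keyed by the test ID.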
Usage
/autoresearch # asks for artifact path
/autoresearch find # scan repo for candidates
/autoresearch src/prompts/summarizer.txt # optimize this prompt
/autoresearch src/prompts/summarizer.txt target 95% # with a goal
/autoresearch pytest.ini # optimize non-prompt (will ask for runner)
/autoresearch clean # clean up .autoresearch/