From autoresearch
Iterative optimization loop. Evaluates an artifact (prompt, code, config) against test cases with binary assertions, analyzes failures, generates targeted variants, and promotes winners. Use when optimizing any artifact for higher eval pass rates.
npx claudepluginhub sighup/autoresearch --plugin autoresearch
How this skill is triggered (by the user, by Claude, or both): slash command
/autoresearch:autoresearch
This skill is limited to the following tools:
The summary Claude sees in its skill listing (used to decide when to auto-load this skill):
You help users set up and run iterative optimization of prompts, code, configs, or any artifact that can be assessed with binary assertions.
User arguments: $ARGUMENTS
All working state lives under .autoresearch/ in the user's current working directory.
If the user's argument is "clean" or "cleanup" (e.g. /autoresearch clean), skip everything else. Instead:
1. Check whether .autoresearch/ exists. If not, tell the user there's nothing to clean up.
2. Show the user what is in .autoresearch/: number of result files, candidates, history entries, and the current config.
3. Remove the .autoresearch/ directory.
After cleanup, confirm what was removed.
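The cleanup steps above can be sketched in Python. This is a minimal illustration, not the skill's own implementation: the function names `summarize_workdir` and `clean_workdir` are hypothetical, and the summary simply counts files under each top-level entry of the working directory.

```python
import shutil
from pathlib import Path
from typing import Optional


def summarize_workdir(root: Path) -> dict:
    """Count regular files under each top-level entry of the working directory."""
    summary = {}
    for entry in sorted(root.iterdir()):
        if entry.is_dir():
            # Count every file below this subdirectory (results, history, ...).
            summary[entry.name] = sum(1 for p in entry.rglob("*") if p.is_file())
        else:
            # A top-level file such as config.json counts as one item.
            summary[entry.name] = 1
    return summary


def clean_workdir(root: Path = Path(".autoresearch")) -> Optional[dict]:
    """Remove the working directory and return what was removed, or None if absent."""
    if not root.is_dir():
        return None  # nothing to clean up
    summary = summarize_workdir(root)
    shutil.rmtree(root)
    return summary
```

The returned summary is what would be reported back to the user after the directory is gone.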
If the user's argument is "find", "scout", or "discover" (e.g. /autoresearch find), skip Phase 1 and help them find candidates in their repo.
Read ${CLAUDE_PLUGIN_ROOT}/skills/autoresearch/references/candidates.md for the discovery heuristics. Present the candidates in this format:
I found [N] candidates in this repo. Ranked by how easy they'd be to start:
1. [Name]: [path]
   - Type: [prompt / performance / quality / LLM integration]
   - Signal: [one-line reason]
   - Setup: [easy (prompt mode) / medium (custom runner)]
   - Start with: /autoresearch [path]
2. [Name] ...
Which one do you want to pursue? I can either start the setup now, or you can run /autoresearch <path> later.
If .autoresearch/config.json already exists and the user didn't pass "clean", the setup is already done. Skip to Phase 2 to launch the optimization loop.
Walk the user through setup interactively. Check what exists, ask about what's missing, and help them build what they need.
1. If the user passed a file path as an argument (e.g. /autoresearch src/prompts/summarizer.txt), use that file.
2. Otherwise, check whether .autoresearch/config.json has an "artifact" or "prompt" field and use that.
3. Otherwise, ask:
What do you want to optimize? This can be a prompt file, code, config, or any artifact. Give me a file path, or describe what it does and I'll help you find it.
Read the artifact once identified. You need to understand what it does to help with assertions and test cases.
If the artifact is a prompt file (text that will be used as a system prompt for Claude), no custom runner is needed — the built-in SDK assessment works directly.
If the artifact is anything else (code, config, scripts), you need a custom runner — a shell command that takes the artifact, runs it or applies it, and produces output for assertions to grade. Ask the user:
This looks like [code/config/etc.], not a prompt. To assess it, I need a command that runs or applies it and produces measurable output.
For example, if you're optimizing test performance, the runner might be bash run_tests.sh, which runs the test suite and reports timing.
What command should I use to assess each variant? It will receive these environment variables:
- AUTORESEARCH_ARTIFACT: path to the artifact
- AUTORESEARCH_TEST_ID: which test case is being run
- AUTORESEARCH_TEST_INPUT: the test case input text
Its stdout becomes the text that assertions check.
If the user needs help writing the runner script, help them create one. Read ${CLAUDE_PLUGIN_ROOT}/skills/autoresearch/references/custom_runner_example.sh for the expected contract.
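As a rough illustration of the runner contract described above, here is a minimal Python runner. It is a sketch under assumptions, not the content of custom_runner_example.sh: the function name `run_case` is hypothetical, and it assumes the artifact is a Python script that reads the test input on stdin. Only the three environment variable names and the "stdout is what assertions check" rule come from the text.

```python
import os
import subprocess
import sys


def run_case(env: dict) -> str:
    """Run the artifact on one test case and return the text for assertions."""
    artifact = env["AUTORESEARCH_ARTIFACT"]      # path to the artifact
    test_id = env["AUTORESEARCH_TEST_ID"]        # which test case is being run
    test_input = env["AUTORESEARCH_TEST_INPUT"]  # the test case input text

    # Assumption: the artifact is a Python script that reads its input on stdin.
    proc = subprocess.run(
        [sys.executable, artifact],
        input=test_input,
        capture_output=True,
        text=True,
        timeout=60,
    )
    # Everything returned here ends up on stdout, where assertions can grade it.
    return f"test_id={test_id}\nexit_code={proc.returncode}\n{proc.stdout}"


# Only act as a runner when the environment variables are actually present.
if __name__ == "__main__" and "AUTORESEARCH_ARTIFACT" in os.environ:
    print(run_case(dict(os.environ)), end="")
```

Saved as, say, runner.py, the "runner" command would then be something like python runner.py. Including the exit code in the output lets an assertion check that the artifact ran cleanly.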
Check .autoresearch/config.json for an "assertions" path, then fall back to .autoresearch/assertions.py.
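The "config first, then default path" lookup can be sketched as below. The helper name `resolve_path` is hypothetical. The same lookup applies to the "test_cases" key with the .autoresearch/test_cases.jsonl fallback.

```python
import json
from pathlib import Path


def resolve_path(workdir: Path, key: str, fallback_name: str) -> Path:
    """Return the path stored under `key` in config.json, else the default location."""
    config_file = workdir / "config.json"
    if config_file.is_file():
        config = json.loads(config_file.read_text(encoding="utf-8"))
        if config.get(key):
            return Path(config[key])
    # No config, or the key is missing or empty: fall back to the default file.
    return workdir / fallback_name
```

For example, `resolve_path(Path(".autoresearch"), "assertions", "assertions.py")` returns the configured path when one is set. The caller still has to check whether the returned file exists.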
If no assertions file exists, guide the user through creating one. Read the artifact and ask:
I've read your [prompt/code/config]. To optimize it, I need to know what "good output" looks like. Here's what I noticed:
- [list 3-5 properties you observed, e.g. "It should produce markdown with specific sections", "Tests should all pass", "Execution time should be under a threshold"]
Which of these matter most? Are there other properties you want to enforce? I'll turn these into assertions — binary pass/fail checks that each test case must satisfy.
Based on the user's response, generate .autoresearch/assertions.py. Read ${CLAUDE_PLUGIN_ROOT}/skills/autoresearch/references/assertions_format.py for the expected format.
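To give a feel for what binary assertions look like, here is a hedged sketch. The authoritative layout is whatever assertions_format.py specifies, which is not shown here. This example assumes each assertion is a function that takes the output text and returns True or False, and the three checks are invented examples.

```python
import json
import re


def assert_has_summary_heading(output: str) -> bool:
    """Pass if the output contains a markdown heading titled 'Summary'."""
    return re.search(r"^#{1,6}\s+Summary\b", output, flags=re.MULTILINE) is not None


def assert_is_valid_json(output: str) -> bool:
    """Pass if the whole output parses as JSON."""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return False
    return True


def assert_under_200_words(output: str) -> bool:
    """Pass if the output has fewer than 200 whitespace-separated words."""
    return len(output.split()) < 200
```

Each check answers one yes/no question, so a failure points at a single property.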
Good assertions are:
- Descriptively named: assert_has_error_handling, not assert_check_3
- Standard library only: re, json, string, etc. No external dependencies.
Check .autoresearch/config.json for a "test_cases" path, then fall back to .autoresearch/test_cases.jsonl.
If no test cases file exists, guide the user through creating them. Ask:
What kinds of inputs will this prompt handle? Describe the categories or give me a few examples. I'll generate a test suite.
Tips for good test cases:
- Cover each category the prompt should handle
- Mix simple and complex inputs
- Include edge cases where the prompt might struggle
- Use realistic inputs — the closer to real usage, the better
Based on the user's response, generate .autoresearch/test_cases.jsonl. Each line should be {"id": "...", "input": "...", "category": "..."}.
Aim for 10-20 test cases across 3-5 categories. Fewer test cases means faster cycles; more means higher confidence.
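A small sketch of writing and sanity-checking a test case file in the one-JSON-object-per-line shape described above. The helper names are hypothetical, and the unique-id check is an assumption of this sketch rather than a stated requirement.

```python
import json
from collections import Counter
from pathlib import Path

REQUIRED_KEYS = ("id", "input", "category")


def write_test_cases(path: Path, cases: list) -> None:
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as fh:
        for case in cases:
            fh.write(json.dumps(case) + "\n")


def load_test_cases(path: Path) -> list:
    """Read the file back, checking required keys and (by assumption) unique ids."""
    cases = []
    seen_ids = set()
    with path.open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            case = json.loads(line)
            missing = [k for k in REQUIRED_KEYS if k not in case]
            if missing:
                raise ValueError(f"line {lineno}: missing keys {missing}")
            if case["id"] in seen_ids:
                raise ValueError(f"line {lineno}: duplicate id {case['id']!r}")
            seen_ids.add(case["id"])
            cases.append(case)
    return cases


def category_counts(cases: list) -> Counter:
    """Count test cases per category, to check coverage across categories."""
    return Counter(case["category"] for case in cases)
```

Checking `len(cases)` and `category_counts(cases)` against the 10-20 cases and 3-5 categories guideline is a quick way to spot a lopsided suite before the loop starts.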
Summarize the setup for the user:
Here's what I've set up:
- Artifact: [path] — [brief description of what it does]
- Runner: [built-in SDK / custom command]
- Assertions: [count] checks — [list names]
- Test cases: [count] cases across [count] categories — [list categories]
Ready to start the optimization loop?
Wait for confirmation before proceeding.
Then initialize .autoresearch/:
Prompt mode (no custom runner):
mkdir -p .autoresearch/prompts/candidates .autoresearch/prompts/history .autoresearch/results
cp <source-prompt> .autoresearch/prompts/current.txt
Then write .autoresearch/config.json:
{
  "artifact": "<path to source prompt>",
  "assertions": "<path to assertions file>",
  "test_cases": "<path to test cases file>",
  "model": "sonnet"
}
Custom runner mode:
mkdir -p .autoresearch/history .autoresearch/results
Then write .autoresearch/config.json:
{
  "artifact": "<path to artifact being optimized>",
  "runner": "<shell command to run assessment>",
  "assertions": "<path to assertions file>",
  "test_cases": "<path to test cases file>"
}
The model field is optional (defaults to "sonnet", only used in prompt mode). The runner field triggers custom runner mode.
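The mode selection described above could be read from the config roughly as follows. This is an illustrative sketch: `load_run_settings` and the shape of the returned dictionary are hypothetical. Only the field names and the "sonnet" default come from the text.

```python
import json
from pathlib import Path


def load_run_settings(config_path: Path) -> dict:
    """Decide between prompt mode and custom runner mode from config.json."""
    config = json.loads(config_path.read_text(encoding="utf-8"))
    if config.get("runner"):
        # A runner command is present: custom runner mode, model is not used.
        return {"mode": "custom_runner", "runner": config["runner"]}
    # No runner: prompt mode, with the model defaulting to "sonnet".
    return {"mode": "prompt", "model": config.get("model", "sonnet")}
```

A prompt-mode config that omits "model" yields {"mode": "prompt", "model": "sonnet"}.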
After setup is complete (or when resuming an existing config), launch the optimization loop in a subagent:
1. Read ${CLAUDE_PLUGIN_ROOT}/agents/optimization-loop.md (skip the YAML frontmatter).
2. Launch a subagent with:
   - prompt: The loop instructions you just read, followed by any user constraints from $ARGUMENTS (e.g. "target 95% pass rate", "focus on auth category")
   - run_in_background: false
   - subagent_type: leave unset, so the default general-purpose agent is used
Running in the foreground keeps the user in the loop: they can see progress, approve permissions, and the loop agent can spawn its own parallel subagents for candidate evaluation.