Skill

autoresearch

Iterative optimization loop. Evaluates an artifact (prompt, code, config) against test cases with binary assertions, analyzes failures, generates targeted variants, and promotes winners. Use when optimizing any artifact for higher eval pass rates.

Popularity

Stars

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/autoresearch:autoresearch

User invocable

Model invocable

Inline context

Default effort

Tool Access

This skill is limited to the following tools:

Bash(autoresearch-runner*) Bash(cp *) Bash(rm .autoresearch/*) Bash(mv .autoresearch/*) Bash(mkdir *) Bash(git *) Agent Read Write Edit Glob Grep AskUserQuestion

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

You help users set up and run iterative optimization of prompts, code, configs, or any artifact that can be assessed with binary assertions.

Supporting Files

references/assertions_format.pyreferences/candidates.mdreferences/custom_runner_example.sh

SKILL.md

192 lines · ~2.2k tokens

Stats

LanguagePython

Stars1

MaintenanceExcellent

Last CommitApr 14, 2026

Actions

View Source View Plugin View on GitHub View README

Stats

Actions

AutoResearch: Iterative Optimization

You help users set up and run iterative optimization of prompts, code, configs, or any artifact that can be assessed with binary assertions.

User arguments: $ARGUMENTS

All working state lives under .autoresearch/ in the user's current working directory.

Cleanup mode

If the user's argument is "clean" or "cleanup" (e.g. /autoresearch clean), skip everything else. Instead:

Check if .autoresearch/ exists. If not, tell the user there's nothing to clean up.
Show the user what's in .autoresearch/ — number of result files, candidates, history entries, and the current config.
Ask what they want to do:
- Remove everything — delete the entire .autoresearch/ directory
- Keep assertions and test cases — delete prompts/, results/, and config.json but preserve assertions.py and test_cases.jsonl (useful for re-running later with a different prompt)
- Cancel — do nothing

After cleanup, confirm what was removed.

Discovery mode

If the user's argument is "find", "scout", or "discover" (e.g. /autoresearch find), skip Phase 1 and help them find candidates in their repo.

Read ${CLAUDE_PLUGIN_ROOT}/skills/autoresearch/references/candidates.md for the discovery heuristics.
Scan the repo using those heuristics:
- Use Glob for file pattern matches (prompts, configs, CI files)
- Use Grep for inline prompt detection and slowness signals
- Check for existing eval infrastructure
Rank candidates by setup effort (easy first) and signal strength. Cap at 5-8 results.
Present findings in this format:

I found [N] candidates in this repo. Ranked by how easy they'd be to start:

1. [Name] — [path]

Type: [prompt / performance / quality / LLM integration]

Signal: [one-line reason]

Setup: [easy (prompt mode) / medium (custom runner)]

Start with: /autoresearch [path]

2. [Name] ...

Which one do you want to pursue? I can either start the setup now, or you can run /autoresearch <path> later.

If the user picks one, proceed to Phase 1 with that artifact. Otherwise, end — they can come back when ready.

Resume mode

If .autoresearch/config.json already exists and the user didn't pass "clean", the setup is already done. Skip to Phase 2 to launch the optimization loop.

Phase 1: Guided Setup

Walk the user through setup interactively. Check what exists, ask about what's missing, and help them build what they need.

Step 1: Identify the artifact

If the user passed a file path as the first argument (e.g. /autoresearch src/prompts/summarizer.txt), use that file.
Otherwise, check if .autoresearch/config.json has an "artifact" or "prompt" field and use that.
If neither, ask the user:

What do you want to optimize? This can be a prompt file, code, config, or any artifact. Give me a file path, or describe what it does and I'll help you find it.

Read the artifact once identified. You need to understand what it does to help with assertions and test cases.

Step 1b: Determine if a custom runner is needed

If the artifact is a prompt file (text that will be used as a system prompt for Claude), no custom runner is needed — the built-in SDK assessment works directly.

If the artifact is anything else (code, config, scripts), you need a custom runner — a shell command that takes the artifact, runs it or applies it, and produces output for assertions to grade. Ask the user:

This looks like [code/config/etc.], not a prompt. To assess it, I need a command that runs or applies it and produces measurable output.

For example, if you're optimizing test performance, the runner might be bash run_tests.sh which runs the test suite and reports timing.

What command should I use to assess each variant? It will receive these environment variables:

AUTORESEARCH_ARTIFACT — path to the artifact

AUTORESEARCH_TEST_ID — which test case is being run

AUTORESEARCH_TEST_INPUT — the test case input text

Its stdout becomes the text that assertions check.

If the user needs help writing the runner script, help them create one. Read ${CLAUDE_PLUGIN_ROOT}/skills/autoresearch/references/custom_runner_example.sh for the expected contract.

Step 2: Identify or create assertions

Check .autoresearch/config.json for an "assertions" path, then fall back to .autoresearch/assertions.py.

If no assertions file exists, guide the user through creating one. Read the artifact and ask:

I've read your [prompt/code/config]. To optimize it, I need to know what "good output" looks like. Here's what I noticed:

[list 3-5 properties you observed, e.g. "It should produce markdown with specific sections", "Tests should all pass", "Execution time should be under a threshold"]

Which of these matter most? Are there other properties you want to enforce? I'll turn these into assertions — binary pass/fail checks that each test case must satisfy.

Based on the user's response, generate .autoresearch/assertions.py. Read ${CLAUDE_PLUGIN_ROOT}/skills/autoresearch/references/assertions_format.py for the expected format.

Good assertions are:

Structural — Check that sections, formats, or patterns exist (not subjective quality)
Binary — Unambiguous pass or fail
Independent — Each tests one thing
Named clearly — assert_has_error_handling not assert_check_3
Pure stdlib — Only use re, json, string, etc. No external dependencies.

Step 3: Identify or create test cases

Check .autoresearch/config.json for a "test_cases" path, then fall back to .autoresearch/test_cases.jsonl.

If no test cases file exists, guide the user through creating them. Ask:

What kinds of inputs will this prompt handle? Describe the categories or give me a few examples. I'll generate a test suite.

Tips for good test cases:

Cover each category the prompt should handle

Mix simple and complex inputs

Include edge cases where the prompt might struggle

Use realistic inputs — the closer to real usage, the better

Based on the user's response, generate .autoresearch/test_cases.jsonl. Each line should be {"id": "...", "input": "...", "category": "..."}.

Aim for 10-20 test cases across 3-5 categories. Fewer test cases means faster cycles; more means higher confidence.

Step 4: Confirm and initialize

Summarize the setup for the user:

Here's what I've set up:

Artifact: [path] — [brief description of what it does]

Runner: [built-in SDK / custom command]

Assertions: [count] checks — [list names]

Test cases: [count] cases across [count] categories — [list categories]

Ready to start the optimization loop?

Wait for confirmation before proceeding.

Then initialize .autoresearch/:

Prompt mode (no custom runner):

mkdir -p .autoresearch/prompts/candidates .autoresearch/prompts/history .autoresearch/results
cp <source-prompt> .autoresearch/prompts/current.txt

{
  "artifact": "<path to source prompt>",
  "assertions": "<path to assertions file>",
  "test_cases": "<path to test cases file>",
  "model": "sonnet"
}

Custom runner mode:

mkdir -p .autoresearch/history .autoresearch/results

{
  "artifact": "<path to artifact being optimized>",
  "runner": "<shell command to run assessment>",
  "assertions": "<path to assertions file>",
  "test_cases": "<path to test cases file>"
}

The model field is optional (defaults to "sonnet", only used in prompt mode). The runner field triggers custom runner mode.

Phase 2: Launch Optimization Loop

After setup is complete (or when resuming an existing config), launch the optimization loop in a subagent:

Read the loop instructions from ${CLAUDE_PLUGIN_ROOT}/agents/optimization-loop.md (skip the YAML frontmatter)
Use the Agent tool with:
- prompt: The loop instructions you just read, followed by any user constraints from $ARGUMENTS (e.g. "target 95% pass rate", "focus on auth category")
- run_in_background: false
- Do NOT set subagent_type — use the default general-purpose agent

Running in the foreground keeps the user in the loop — they can see progress, approve permissions, and the loop agent can spawn its own parallel subagents for candidate evaluation.

autoresearch

Popularity

Invocation

Tool Access

Context Preview

Supporting Files

SKILL.md

autoresearch

Popularity

Invocation

Tool Access

Context Preview

Supporting Files

SKILL.md

AutoResearch: Iterative Optimization

Cleanup mode

Discovery mode

Resume mode

Phase 1: Guided Setup

Step 1: Identify the artifact

Step 1b: Determine if a custom runner is needed

Step 2: Identify or create assertions

Step 3: Identify or create test cases

Step 4: Confirm and initialize

Phase 2: Launch Optimization Loop

Similar Skills

AutoResearch: Iterative Optimization

Cleanup mode

Discovery mode

Resume mode

Phase 1: Guided Setup

Step 1: Identify the artifact

Step 1b: Determine if a custom runner is needed

Step 2: Identify or create assertions

Step 3: Identify or create test cases

Step 4: Confirm and initialize

Phase 2: Launch Optimization Loop

Similar Skills