# autoresearch
Autonomous goal-directed iteration loop that continuously improves prompts, templates, configs, or code. Uses four-way separation — the eval agent writes `eval.py` (the judge script), the test runner produces outputs, the judge script scores them, and the main agent edits the prompt. Inspired by Karpathy's autoresearch.
`npx claudepluginhub naveedharri/benai-skills --plugin autoresearch`

This skill uses the workspace's default tool permissions.
USE WHEN the user runs `/autoresearch`, says "autoresearch", "optimize this prompt", "improve this overnight", "run an optimization loop", "iterate on this", "auto-improve", or wants to autonomously refine a file against measurable criteria.
This skill supports two evaluation modes:
| Role | Who | Knows Eval Code? | Knows Prompt History? |
|---|---|---|---|
| Main Agent | You (optimizer) | NO — reads metric number only | Yes — reads logs, plans changes |
| Eval Agent | autoresearch-eval-agent sub-agent | Yes — writes eval.py | No |
| Test Runner | autoresearch-test-runner sub-agent | NO — fresh context | NO |
| Judge Script | eval.py (deterministic Python) | IS the eval | No |
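The main agent's only window into the eval is `eval.py`'s stdout. A minimal sketch of that interaction (an illustration, not part of the skill itself; it assumes the `METRIC pass_rate=…` stdout convention used later in this document):

```python
import re
import subprocess
import sys

def run_eval(outputs_dir: str = "outputs/") -> float:
    """Run the read-only judge script and read back only the metric number.

    Equivalent to `python eval.py outputs/`; the optimizer never opens
    eval.py itself, so it cannot learn or game the heuristics.
    """
    result = subprocess.run(
        [sys.executable, "eval.py", outputs_dir],
        capture_output=True, text=True, check=True,
    )
    match = re.search(r"METRIC pass_rate=([\d.]+)", result.stdout)
    if match is None:
        raise RuntimeError("eval.py did not print a METRIC line")
    return float(match.group(1))
```

The metric number is the entire interface: everything else about the eval stays opaque to the optimizer.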
For creative/subjective tasks, the user can opt into an additional LLM judge tier:
| Role | Who | Knows Eval? | Knows Rubric? | Knows Iterations? |
|---|---|---|---|---|
| Main Agent | You (optimizer) | Metric numbers only | Criteria names only | Yes |
| Eval Agent | autoresearch-eval-agent | Writes eval.py + rubric.md | — | No |
| Test Runner | autoresearch-test-runner | No | No | No |
| Judge Script | eval.py (deterministic) | IS eval | No | No |
| Judge Agent | autoresearch-judge sub-agent | No | Follows rubric | No |
In hybrid mode:
- `eval.py` → deterministic `pass_rate`
- `rubric.md` → `quality_score`
- `combined_score = (0.6 × pass_rate) + (0.4 × quality_score)` (weights adjustable)
- Keep/discard decisions use `combined_score`

Key isolation:
Whenever you need user input or confirmation, use the AskUserQuestion tool. This shows an interactive popup in Cowork.
CRITICAL: Keep popup text SHORT. The popup is small — long text becomes unreadable. Follow this pattern:
Good:
[In chat]: Here are 7 proposed assertions:
1. Covers all 4 search clusters
2. Each entry has title, channel, views
3. Contains opportunity mapping
...
[AskUserQuestion]: "Do these assertions look right?"
Options: "These look good" / "Adjust some" / [free text]
Bad:
[AskUserQuestion]: "Here are 7 proposed assertions for what a 'good' output looks like: 1. Covers all 4 search clusters (Core, Tools, Niche, Competitors)... [giant paragraph]"
Never put lists, explanations, or details inside AskUserQuestion. The popup is for the QUESTION only.
1. Save it to a file in the working directory (e.g., `target-skill.md`).
2. Analyze the content and propose 5-7 binary assertions. Write the list in chat (not in the popup).
3. Use AskUserQuestion — short text only:
   - Question: "Do these assertions look right?"
   - Options: "These look good" / "Adjust some"
4. SEPARATE STEP — always ask this. Use AskUserQuestion:
   - Question: "Which evaluation mode?"
   - Options: "Deterministic only" / "Hybrid (deterministic + LLM judge)"
5. SEPARATE STEP — always ask this. Use AskUserQuestion:
   - Question: "How many iterations? (Recommended: 5)"
   - Options: "5 (recommended)" / "10" / "20"
6. Spawn the autoresearch-eval-agent sub-agent to generate the eval system:
   - `eval.py` and `test_cases.json` (always)
   - `rubric.md` with subjective scoring criteria (hybrid only)
7. Show the user the generated `eval.py` (and `rubric.md` if hybrid) in chat. Then use AskUserQuestion:
   - Question: "Does this eval look right?"
   - Options: "Looks good" / "Adjust"
8. Mark `eval.py`, `test_cases.json`, and `rubric.md` (if hybrid) as READ-ONLY — you MUST NOT modify them during the loop.
9. Run the baseline (iteration 0): generate outputs via the test runner (always live — using real tools and data), run `eval.py` (and the judge agent if hybrid), and record the score.
10. Start the loop — stop after the chosen iteration count.
If the user says "improve this" without assertions, suggest 5 reasonable ones and use AskUserQuestion to confirm.
If the user says "go" or "start", default to deterministic mode + 5 iterations, spawn the eval agent, and begin immediately.
When the user pastes a skill, prompt, or template, save it to a file in the working directory (e.g., `target-skill.md`).

Then record the session config:

| Field | Description | Example |
|---|---|---|
| Goal | What are we optimizing? | "Improve cold email reply-rate signals" |
| Metric | Always pass_rate for prompt/skill optimization | pass_rate |
| Direction | Higher is better | higher |
| Assertions | The binary checks (confirmed by user) | "Under 150 words", "Contains a question CTA" |
| Modifiable File | The prompt/skill file ONLY | target-skill.md |
| Guard | Optional regression check | None |
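For illustration, a hypothetical `autoresearch-session.md` built from these fields might look like this (values taken from the examples in the table above; the exact layout is up to the agent):

```markdown
# Autoresearch Session

- Goal: Improve cold email reply-rate signals
- Metric: pass_rate (higher is better)
- Eval mode: deterministic
- Iterations: 5
- Modifiable file: target-skill.md
- Guard: none

## Assertions
1. Under 150 words
2. Contains a question CTA
```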
Before any optimization begins, spawn the autoresearch-eval-agent sub-agent:
> Design an eval system for the prompt/skill at [path to target skill].
>
> Assertions (confirmed by user):
> 1. [assertion 1]
> 2. [assertion 2]
> ...
>
> The prompt expects these inputs: [list input fields from the prompt].
> Save eval.py and test_cases.json to [working directory path].
The eval agent will:
- `test_cases.json` — 10+ realistic, diverse test inputs
- `eval.py` — deterministic Python with proxy heuristics for each assertion

IMPORTANT: You (the optimizer) MUST NOT read the `eval.py` source code. The eval agent writes it. You only interact with `eval.py` by running it and reading its stdout output (the metric number and assertion names). This prevents you from knowing the exact heuristics and gaming them.
After the eval agent finishes:
See references/example-eval.py for what a generated eval.py looks like.
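For orientation only, a minimal sketch of the shape such a script might take. The assertion names and heuristics here are hypothetical — the real `eval.py` is written by the eval agent, and the optimizer must never read it:

```python
import json
import sys
from pathlib import Path

# Hypothetical proxy heuristics for two illustrative assertions.
# A real eval.py's heuristics must stay unknown to the optimizer.
ASSERTIONS = {
    "word_count": lambda text: len(text.split()) <= 150,
    "has_question": lambda text: "?" in text,
}

def main(outputs_dir: str) -> None:
    files = sorted(Path(outputs_dir).glob("output_*.txt"))
    passes = {name: 0 for name in ASSERTIONS}
    for f in files:
        text = f.read_text()
        for name, check in ASSERTIONS.items():
            if check(text):
                passes[name] += 1
    total = len(files) * len(ASSERTIONS)
    pass_rate = sum(passes.values()) / total if total else 0.0
    print(json.dumps({"assertion_breakdown": passes}))
    # The optimizer reads only this line:
    print(f"METRIC pass_rate={pass_rate:.4f}")

if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Note the two stdout lines: the per-assertion breakdown (named counts only) and the single METRIC line the optimizer acts on.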
Create an `outputs/` directory, then spawn the autoresearch-test-runner sub-agent with this prompt:
> Generate outputs using the prompt at [path to target skill].
> Test cases are at [path to test_cases.json].
> Save each output to outputs/output_00.txt through outputs/output_09.txt (zero-padded index matching test case order).
> Follow the prompt exactly. One output per test case. No commentary — just the raw output in each file.
IMPORTANT: Output file naming must be `output_XX.txt` (zero-padded index). Both the eval agent (when writing `eval.py`) and the test runner must use this convention. If `eval.py` uses a different naming scheme, the files won't be found.

To score the baseline:
- Run `python eval.py outputs/`
- Read the `METRIC pass_rate=X.XXXX` line from stdout
- Record the result in `autoresearch-log.jsonl`
- Initialize `autoresearch-dashboard.html` (use the template from references/dashboard-template.html)

Each iteration follows this exact sequence:
1. Review `autoresearch-log.jsonl`, and `autoresearch-ideas.md` if it exists
2. Back up the target file to `[filename].backup`
3. Gather the skill's dependencies (`references/`, linked files, imported data). List all file paths the skill depends on.
4. Spawn the autoresearch-test-runner sub-agent with:
> Execute the prompt at [path to target skill].
> Test cases are at [path to test_cases.json].
> The working project is at [project path].
>
> Reference files the prompt depends on (read these first):
> - [path to references/file1.md]
> - [path to references/file2.md]
> - [... list ALL referenced files]
>
> Use all available tools (web search, file access, APIs) to produce real outputs.
> Save each output to outputs/output_00.txt through outputs/output_[N].txt.
> Follow the prompt exactly. One output per test case. No commentary.
When listing dependencies, scan the prompt for `references/`, "see also", file paths, `[[wikilinks]]`, or any mentions of other files, and pass their full absolute paths.

Then score the outputs:
- Run `python eval.py outputs/`
- Read the `METRIC pass_rate=X.XXXX` line

If hybrid mode is enabled:
Spawn the autoresearch-judge sub-agent with:
> Score the outputs in [path to outputs/] against the rubric at [path to rubric.md].
> Save your scores to judge-scores.json in the working directory.
- Read `quality_score` from the resulting `judge-scores.json`
- Compute `combined_score = (0.6 × pass_rate) + (0.4 × quality_score)`
- Use `combined_score` for the keep/discard decision instead of `pass_rate` alone.

| Condition | Action |
|---|---|
| Metric improved (pass_rate or combined_score) | KEEP — the change stays, update .backup to the new version |
| Metric same or worse | DISCARD — restore from .backup |
| Eval crashed | CRASH — restore from .backup, note the error |
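The table above can be sketched as a minimal decision helper (the target filename is hypothetical; in hybrid mode the score fed in is `combined_score` rather than `pass_rate`):

```python
import shutil

def combined_score(pass_rate: float, quality_score: float) -> float:
    # Hybrid mode only: blend deterministic and LLM-judge scores
    # (default 0.6/0.4 weights, adjustable).
    return 0.6 * pass_rate + 0.4 * quality_score

def decide(new_score: float, best_score: float,
           target: str = "target-skill.md") -> str:
    """Keep the edit only on strict improvement; otherwise restore.

    Assumes `<target>.backup` was written before the edit was made.
    """
    backup = target + ".backup"
    if new_score > best_score:
        shutil.copy(target, backup)   # KEEP: .backup now tracks the new best
        return "keep"
    shutil.copy(backup, target)       # DISCARD: restore the previous version
    return "discard"
```

Note the strict inequality: a tied score counts as "same or worse" and is discarded, which keeps the loop from accumulating no-op edits.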
Then update the records:
- Append to `autoresearch-log.jsonl` (see JSONL format below)
- Append to `autoresearch-worklog.md` with a human-readable entry
- Rewrite `autoresearch-dashboard.html` with current stats

Rule 1: You are the optimizer. You NEVER generate outputs, write eval code, or score quality.
Rule 2: The eval agent writes eval.py (and rubric.md) once, then disappears.
Rule 3: The test runner sub-agent NEVER sees the eval or rubric.
Rule 4: eval.py is the deterministic judge. It is READ-ONLY.
You only run `python eval.py outputs/` and read the number.

Rule 5: The judge agent NEVER sees eval.py or iteration history (hybrid mode).
All state lives in the working directory:
| File | Modifiable? | Purpose |
|---|---|---|
| Target skill/prompt file | YES | The file being optimized |
| `eval.py` | NO — READ-ONLY | The deterministic judge script |
| `test_cases.json` | NO — READ-ONLY | Test inputs |
| `rubric.md` | NO — READ-ONLY (hybrid only) | LLM judge scoring rubric |
| `outputs/` | Overwritten each iteration (by test runner) | Generated outputs for eval |
| `judge-scores.json` | Overwritten each iteration (hybrid only) | LLM judge scores |
| `autoresearch-session.md` | No | Session config (includes eval mode) |
| `autoresearch-log.jsonl` | Append-only | Machine-readable log |
| `autoresearch-worklog.md` | Append-only | Human-readable narrative |
| `autoresearch-dashboard.html` | Rewritten each iteration | Live visual dashboard |
| `autoresearch-ideas.md` | Yes | Backlog of ideas |
Deterministic mode:
`{"iteration": 1, "timestamp": "2026-04-01T10:30:00Z", "hypothesis": "Add urgency word to subject line", "metric_name": "pass_rate", "metric_value": 0.70, "baseline": 0.50, "best_so_far": 0.70, "delta": "+0.20", "eval_mode": "deterministic", "assertion_breakdown": {"word_count": 10, "no_buzzwords": 9, "has_question": 7}, "guard_pass": true, "status": "keep"}`
Hybrid mode:
`{"iteration": 1, "timestamp": "2026-04-01T10:30:00Z", "hypothesis": "Add personal story opener", "metric_name": "combined_score", "metric_value": 0.72, "pass_rate": 0.80, "quality_score": 0.60, "baseline": 0.50, "best_so_far": 0.72, "delta": "+0.22", "eval_mode": "hybrid", "assertion_breakdown": {"word_count": 10, "no_buzzwords": 9}, "rubric_breakdown": {"emotional_resonance": 3.8, "authenticity": 3.2}, "guard_pass": true, "status": "keep"}`
The `assertion_breakdown` shows per-assertion pass counts. The `rubric_breakdown` (hybrid only) shows average scores per rubric criterion. Both help target the weakest areas.

Status values: `baseline`, `keep`, `discard`, `crash`, `no-op`
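Appending one record can be sketched as a small helper (field values below are taken from the deterministic example above; the timestamp format matches those examples):

```python
import json
from datetime import datetime, timezone

def log_iteration(entry: dict, path: str = "autoresearch-log.jsonl") -> None:
    """Append one machine-readable record; the log file is append-only."""
    entry.setdefault(
        "timestamp",
        datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    )
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")

log_iteration({
    "iteration": 1,
    "hypothesis": "Add urgency word to subject line",
    "metric_name": "pass_rate",
    "metric_value": 0.70,
    "status": "keep",
})
```

Because each record is a single line of JSON, later iterations can re-read the history with one `json.loads` per line.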
Use the template at references/dashboard-template.html. On every iteration, rewrite the full HTML file with updated values:
Use git when available:
- Create a branch `autoresearch/[goal-slug]-[date]` before starting
- Run `git checkout -- [file]` on every "discard" iteration
- Without git, fall back to `[filename].backup`: restore from `.backup` on discard, and update `.backup` to the new version on keep

The loop runs until:
When stopping, write a final summary to autoresearch-worklog.md with:
- any untried ideas (moved to `autoresearch-ideas.md`)

This skill delegates to three sub-agents:
- Output generation: the autoresearch-test-runner sub-agent.
- Eval authoring: the autoresearch-eval-agent sub-agent.
- Hybrid quality scoring: the autoresearch-judge sub-agent.