Autonomous goal-directed iteration loop that continuously improves prompts, templates, configs, or code. Two evaluation modes — deterministic (eval.py with proxy heuristics) or AI judge (LLM rubric scoring). Uses four-way separation in both modes. Inspired by Karpathy's autoresearch.
Slash command: `/autoresearch:autoresearch`
USE WHEN the user runs `/autoresearch`, says "autoresearch", "optimize this prompt", "improve this overnight", "run an optimization loop", "iterate on this", "auto-improve", or wants to autonomously refine a file against measurable criteria.
This skill supports two mutually exclusive evaluation modes. Both maintain four-way separation.
| Role | Who | Knows Eval Code? | Knows Prompt History? |
|---|---|---|---|
| Main Agent | You (optimizer) | NO — reads metric number only | Yes — reads logs, plans changes |
| Eval Agent | autoresearch-eval-agent sub-agent | Yes — writes eval.py | No |
| Test Runner | autoresearch-test-runner sub-agent | NO — fresh context | NO |
| Judge Script | eval.py (deterministic Python) | IS the eval | No |
Metric: pass_rate. Best for tasks with clear, mechanical quality checks (word count, format, keywords, structure).
| Role | Who | Knows Rubric? | Knows Prompt History? |
|---|---|---|---|
| Main Agent | You (optimizer) | Criteria names only | Yes — reads logs, plans changes |
| Eval Agent | autoresearch-eval-agent sub-agent | Yes — writes rubric.md | No |
| Test Runner | autoresearch-test-runner sub-agent | NO — fresh context | NO |
| Judge Agent | autoresearch-judge sub-agent | Follows rubric | NO — fresh context |
Metric: quality_score. Best for creative/subjective tasks where mechanical checks miss the point (tone, authenticity, narrative, emotional resonance).
The modes are mutually exclusive. Deterministic mode uses eval.py only. AI judge mode uses the judge agent only. There is no combined score.
Key isolation (both modes):
Whenever you need user input or confirmation, use the AskUserQuestion tool. This shows an interactive popup in Cowork.
CRITICAL: Keep popup text SHORT. The popup is small — long text becomes unreadable. Follow this pattern:
Good:
[In chat]: Here are 7 proposed assertions:
1. Covers all 4 search clusters
2. Each entry has title, channel, views
3. Contains opportunity mapping
...
[AskUserQuestion]: "Do these assertions look right?"
Options: "These look good" / "Adjust some" / [free text]
Bad:
[AskUserQuestion]: "Here are 7 proposed assertions for what a 'good' output looks like: 1. Covers all 4 search clusters (Core, Tools, Niche, Competitors)... [giant paragraph]"
Never put lists, explanations, or details inside AskUserQuestion. The popup is for the QUESTION only.
Every criterion — whether proposed by you or provided by the user — MUST pass these three rules before entering the loop. This is the most important step in the process. Bad criteria produce bad evals, and bad evals waste every iteration.
Rule 1: State the exact condition, not the goal. Don't describe what you want. Describe what you can measure.
| Bad | Good |
|---|---|
| "Make sure the hook is short" | "The first line must be under 136 characters including spaces" |
| "Should be professional" | "Contains no exclamation marks and no ALL CAPS words (3+ letters)" |
| "Include relevant data" | "Contains at least one specific number or statistic with a source" |
Rule 2: One criterion, one variable. Each criterion tests exactly one thing. If you're tempted to use "and" to connect two checks, split them into two separate criteria.
| Bad | Good |
|---|---|
| "Under 150 words and ends with a question" | Criterion 1: "Under 150 words" / Criterion 2: "Last sentence ends with a question mark" |
| "Professional tone with no jargon" | Criterion 1: "No words from the banned jargon list" / Criterion 2: "No sentences over 25 words" |
Rule 3: Define the test (optional). Describe how to verify the criterion — what to count, what regex to match, what structure to look for. This helps the eval agent write better checks and helps the judge agent score more consistently.
| Criterion | Test definition |
|---|---|
| "First line under 136 characters" | `len(lines[0]) < 136` |
| "Contains at least one statistic" | `re.search(r'\d+[%x]?\s', text)` returns a match |
| "Ends with a question" | `text.rstrip().endswith("?")` |
If the user provides criteria that violate The Three Rules, rewrite them — show the user the before/after so they understand the improvement.
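The test definitions in the table above can be sketched as small Python predicates. This is only an illustration of the idea, not the eval.py that the eval agent generates; the function names are hypothetical, and "under 136 characters" is read as strictly fewer than 136.

```python
import re


def first_line_under_136(text: str) -> bool:
    """True when the first line has strictly fewer than 136 characters."""
    lines = text.splitlines()
    return bool(lines) and len(lines[0]) < 136


def contains_statistic(text: str) -> bool:
    """True when a number (optionally followed by % or x) precedes whitespace."""
    return re.search(r'\d+[%x]?\s', text) is not None


def ends_with_question(text: str) -> bool:
    """True when the text, ignoring trailing whitespace, ends with '?'."""
    return text.rstrip().endswith("?")


sample = "Cut onboarding time by 40% in two weeks.\nWant to see how?"
print({
    "first_line_under_136": first_line_under_136(sample),
    "contains_statistic": contains_statistic(sample),
    "ends_with_question": ends_with_question(sample),
})
```

Each predicate tests exactly one variable, so a failing output points at a single criterion rather than a bundle of them.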
Save it to a file in the working directory (e.g., target-skill.md)
Explain The Three Rules to the user first. Before proposing any criteria, write in chat:
The Three Rules — every criterion must pass these before we start:
- State the exact condition, not the goal. "First line under 136 characters" not "keep the hook short."
- One criterion, one variable. If it has "and", split it into two.
- Define the test (optional). How to check it — what to count, match, or look for.
Propose 5-7 quality criteria. Every criterion MUST pass The Three Rules. Write the list in chat (not in the popup). For each criterion, show it passes the rules — be specific with thresholds, counts, and patterns.
Use AskUserQuestion, short text only:
- Question: "Do these quality criteria look right?"
- Options: "These look good" / "Adjust some"

SEPARATE STEP: always ask this. Use AskUserQuestion:
- Question: "Which evaluation mode?"
- Options: "Deterministic" / "AI Judge"

SEPARATE STEP: always ask this. Use AskUserQuestion:
- Question: "How many iterations? (Recommended: 5)"
- Options: "5 (recommended)" / "10" / "20"

Spawn the autoresearch-eval-agent sub-agent to generate the eval system:
- Deterministic: eval.py and test_cases.json
- AI judge: rubric.md and test_cases.json (no eval.py)

Show the user the generated eval.py (deterministic) or rubric.md (AI judge) in chat. Then use AskUserQuestion:
- Question: "Does this eval look right?"
- Options: "Looks good" / "Adjust"

Mark the eval artifacts as READ-ONLY. You MUST NOT modify them during the loop:
- Deterministic: eval.py + test_cases.json
- AI judge: rubric.md + test_cases.json

Run baseline (iteration 0): generate outputs via test runner (always live, using real tools and data), then evaluate (run eval.py or spawn judge agent), record the score
Start the loop — stop after the chosen iteration count
If the user says "improve this" without criteria, suggest 5 reasonable ones (all passing The Three Rules) and use AskUserQuestion to confirm.
If the user says "go" or "start", default to deterministic mode + 5 iterations, spawn the eval agent, and begin immediately.
When the user pastes a skill, prompt, or template:
Save it to a file in the working directory (e.g., target-skill.md)

| Field | Description | Example |
|---|---|---|
| Goal | What are we optimizing? | "Improve cold email reply-rate signals" |
| Eval Mode | deterministic or ai_judge | deterministic |
| Metric | pass_rate (deterministic) or quality_score (AI judge) | pass_rate |
| Direction | Higher is better | higher |
| Criteria | The quality checks (confirmed by user) | "Under 150 words", "Contains a question CTA" |
| Modifiable File | The prompt/skill file ONLY | target-skill.md |
| Guard | Optional regression check | None |
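One way to picture the session fields in the table is as a small typed record. The sketch below is an assumption about shape only: the class name `SessionConfig` and the `METRIC_FOR_MODE` mapping are hypothetical, and the source does not say how autoresearch-session.md is actually serialized.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Each eval mode has exactly one metric; the modes are mutually exclusive.
METRIC_FOR_MODE = {"deterministic": "pass_rate", "ai_judge": "quality_score"}


@dataclass
class SessionConfig:
    goal: str
    eval_mode: str
    criteria: List[str]
    modifiable_file: str
    direction: str = "higher"
    guard: Optional[str] = None
    metric: str = field(init=False)  # derived from eval_mode, never passed in

    def __post_init__(self) -> None:
        if self.eval_mode not in METRIC_FOR_MODE:
            raise ValueError(f"unknown eval mode: {self.eval_mode!r}")
        self.metric = METRIC_FOR_MODE[self.eval_mode]


config = SessionConfig(
    goal="Improve cold email reply-rate signals",
    eval_mode="deterministic",
    criteria=["Under 150 words", "Contains a question CTA"],
    modifiable_file="target-skill.md",
)
print(config.metric)  # prints: pass_rate
```

Deriving the metric from the mode, rather than accepting both, keeps a session from mixing pass_rate with AI judge mode or the reverse.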
Before any optimization begins, spawn the autoresearch-eval-agent sub-agent:
Deterministic mode:
Design an eval system for the prompt/skill at [path to target skill].
Eval mode: Deterministic.
Assertions (confirmed by user):
1. [assertion 1]
2. [assertion 2]
...
The prompt expects these inputs: [list input fields from the prompt].
Save eval.py and test_cases.json to [working directory path].
AI Judge mode:
Design an eval system for the prompt/skill at [path to target skill].
Eval mode: AI Judge.
Quality criteria (confirmed by user):
1. [criterion 1]
2. [criterion 2]
...
The prompt expects these inputs: [list input fields from the prompt].
Save rubric.md and test_cases.json to [working directory path].
The eval agent will:
- Deterministic: generate test_cases.json + eval.py (deterministic Python with proxy heuristics), verify no syntax errors
- AI judge: generate test_cases.json + rubric.md (scoring rubric with 1-5 scale per criterion)

IMPORTANT: You (the optimizer) MUST NOT read the eval artifacts in detail. You only interact with them by running the evaluation and reading the output (metric number and criteria names). This prevents you from gaming the evaluation.
After the eval agent finishes:
See references/example-eval.py for what a generated eval.py looks like.
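For orientation, here is a rough sketch of the general shape a deterministic eval script could take. It is not the content of references/example-eval.py. The two assertions are hypothetical, and averaging over every (output, assertion) pair is an assumption, since the source does not define how pass_rate is computed. The sketch does follow the two conventions stated in this document: outputs are read from output_XX.txt files, and the result is printed as a `METRIC pass_rate=X.XXXX` line.

```python
import tempfile
from pathlib import Path

# Hypothetical assertions; a real eval.py encodes the user's confirmed criteria.
ASSERTIONS = {
    "word_count": lambda text: len(text.split()) < 150,
    "has_question": lambda text: text.rstrip().endswith("?"),
}


def evaluate(outputs_dir: Path) -> float:
    """Return the fraction of (output, assertion) checks that pass.

    Assumption: every check carries equal weight across all output files.
    """
    files = sorted(outputs_dir.glob("output_*.txt"))
    if not files:
        return 0.0
    passed = sum(
        check(path.read_text(encoding="utf-8"))
        for path in files
        for check in ASSERTIONS.values()
    )
    return passed / (len(files) * len(ASSERTIONS))


def main(argv: list) -> None:
    # The optimizer reads only this line, never the checks themselves.
    print(f"METRIC pass_rate={evaluate(Path(argv[1])):.4f}")


# Demo with a temporary directory standing in for outputs/
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "output_00.txt").write_text("Short note. Interested?", encoding="utf-8")
    Path(tmp, "output_01.txt").write_text("Short note with no question.", encoding="utf-8")
    main(["eval.py", tmp])  # prints: METRIC pass_rate=0.7500
```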
- Prepare the outputs/ directory
- Spawn the autoresearch-test-runner sub-agent with this prompt:
Generate outputs using the prompt at [path to target skill].
Test cases are at [path to test_cases.json].
Save each output to outputs/output_00.txt through outputs/output_09.txt (zero-padded index matching test case order).
Follow the prompt exactly. One output per test case. No commentary — just the raw output in each file.
IMPORTANT: Output file naming must be output_XX.txt (zero-padded index). Both the eval agent (when writing eval.py) and the test runner must use this convention. If eval.py uses a different naming scheme, the files won't be found.
- Deterministic: run python eval.py outputs/, parse the METRIC pass_rate=X.XXXX line from stdout
- AI judge: spawn the autoresearch-judge sub-agent against outputs/ + rubric.md, parse quality_score from judge-scores.json
- Record the baseline in autoresearch-log.jsonl
- Generate autoresearch-dashboard.html (use template from references/dashboard-template.html)

Each iteration follows this exact sequence:
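- Read autoresearch-log.jsonl
- Read autoresearch-ideas.md if it exists
- Back up the target file to [filename].backup
- Check what the target file references (references/, linked files, imported data). List all file paths the skill depends on.
- Spawn the autoresearch-test-runner sub-agent with: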
Execute the prompt at [path to target skill].
Test cases are at [path to test_cases.json].
The working project is at [project path].
Reference files the prompt depends on (read these first):
- [path to references/file1.md]
- [path to references/file2.md]
- [... list ALL referenced files]
Use all available tools (web search, file access, APIs) to produce real outputs.
Save each output to outputs/output_00.txt through outputs/output_[N].txt.
Follow the prompt exactly. One output per test case. No commentary.
Scan the prompt for references/, see also, file paths, [[wikilinks]], or any mentions of other files, and pass their full absolute paths.
- Deterministic: run python eval.py outputs/, parse the METRIC pass_rate=X.XXXX line. If eval.py crashes, mark this iteration as "crash".
- AI judge: spawn the autoresearch-judge sub-agent with:
Score the outputs in [path to outputs/] against the rubric at [path to rubric.md].
Save your scores to judge-scores.json in the working directory.
The judge agent has fresh context: it does not know iteration count, prompt changes, or optimization goals. Parse quality_score from judge-scores.json.

| Condition | Action |
|---|---|
| Metric improved (pass_rate or quality_score) | KEEP: the change stays, update .backup to the new version |
| Metric same or worse | DISCARD: restore from .backup |
| Eval crashed | CRASH: restore from .backup, note the error |
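The decision table can be sketched in a few lines of Python. This is a sketch under stated assumptions: the helper names are hypothetical, and treating a missing METRIC line as a crash is one possible reading, since the source only says to mark the iteration as "crash" when eval.py crashes.

```python
import re
import shutil
import tempfile
from pathlib import Path
from typing import Optional

# Matches lines such as "METRIC pass_rate=0.7000" anywhere in stdout.
METRIC_RE = re.compile(r"^METRIC (\w+)=([0-9]*\.?[0-9]+)\s*$", re.MULTILINE)


def parse_metric(stdout: str) -> Optional[float]:
    """Return the metric value, or None when no METRIC line is present."""
    match = METRIC_RE.search(stdout)
    return float(match.group(2)) if match else None


def decide(new_value: Optional[float], best_so_far: float) -> str:
    """Map an evaluation result to keep / discard / crash."""
    if new_value is None:
        return "crash"
    # Only a strict improvement is kept; "same" counts as discard.
    return "keep" if new_value > best_so_far else "discard"


def apply_decision(status: str, target: Path, backup: Path) -> None:
    """On keep, refresh the backup; otherwise restore the target from it."""
    if status == "keep":
        shutil.copyfile(target, backup)
    else:
        shutil.copyfile(backup, target)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp, "target-skill.md")
    backup = Path(tmp, "target-skill.md.backup")
    backup.write_text("v1", encoding="utf-8")
    target.write_text("v2", encoding="utf-8")
    status = decide(parse_metric("METRIC pass_rate=0.5000\n"), best_so_far=0.7)
    apply_decision(status, target, backup)
    print(status, target.read_text(encoding="utf-8"))  # prints: discard v1
```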
- Append to autoresearch-log.jsonl (see JSONL format below)
- Update autoresearch-worklog.md with a human-readable entry
- Update autoresearch-dashboard.html with current stats

Rule 1: You are the optimizer. You NEVER generate outputs, write eval code, or score quality.
Rule 2: The eval agent writes the eval system once, then disappears.
Rule 3: The test runner sub-agent NEVER sees the eval or rubric.
Rule 4: The judge is READ-ONLY.
- eval.py is the judge. Run it and read the number. Never modify it.
- rubric.md is the judge's instruction. Never modify it.
- test_cases.json is always READ-ONLY.

Rule 5: The judge agent NEVER sees iteration history (AI judge mode).
All state lives in the working directory:
| File | Modifiable? | Purpose |
|---|---|---|
| Target skill/prompt file | YES | The file being optimized |
| eval.py | NO, READ-ONLY (deterministic only) | The deterministic judge script |
| rubric.md | NO, READ-ONLY (AI judge only) | LLM judge scoring rubric |
| test_cases.json | NO, READ-ONLY | Test inputs |
| outputs/ | Overwritten each iteration (by test runner) | Generated outputs for eval |
| judge-scores.json | Overwritten each iteration (AI judge only) | LLM judge scores |
| autoresearch-session.md | No | Session config (includes eval mode) |
| autoresearch-log.jsonl | Append-only | Machine-readable log |
| autoresearch-worklog.md | Append-only | Human-readable narrative |
| autoresearch-dashboard.html | Rewritten each iteration | Live visual dashboard |
| autoresearch-ideas.md | Yes | Backlog of ideas |
Deterministic mode:
{"iteration": 1, "timestamp": "2026-04-01T10:30:00Z", "hypothesis": "Add urgency word to subject line", "metric_name": "pass_rate", "metric_value": 0.70, "baseline": 0.50, "best_so_far": 0.70, "delta": "+0.20", "eval_mode": "deterministic", "assertion_breakdown": {"word_count": 10, "no_buzzwords": 9, "has_question": 7}, "guard_pass": true, "status": "keep"}
AI Judge mode:
{"iteration": 1, "timestamp": "2026-04-01T10:30:00Z", "hypothesis": "Add personal story opener", "metric_name": "quality_score", "metric_value": 0.72, "baseline": 0.50, "best_so_far": 0.72, "delta": "+0.22", "eval_mode": "ai_judge", "rubric_breakdown": {"emotional_resonance": 3.8, "authenticity": 3.2, "narrative_arc": 3.5}, "guard_pass": true, "status": "keep"}
Deterministic logs include assertion_breakdown (per-assertion pass counts). AI judge logs include rubric_breakdown (average scores per criterion). Both help target the weakest areas.
Status values: baseline, keep, discard, crash, no-op
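A minimal sketch of working with the append-only log, assuming one JSON object per line as in the examples above. The helper names are hypothetical; `weakest_criterion` simply picks the lowest value from whichever breakdown the entry carries, which is one way to target the weakest area.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def append_entry(log_path: Path, entry: dict) -> None:
    """Append one JSON object as a single line; earlier lines are never rewritten."""
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")


def read_entries(log_path: Path) -> list:
    """Load every non-empty line of the log as a dict."""
    with log_path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def weakest_criterion(entry: dict) -> Optional[str]:
    """Name of the lowest-scoring key in the entry's breakdown, if any."""
    breakdown = entry.get("assertion_breakdown") or entry.get("rubric_breakdown") or {}
    return min(breakdown, key=breakdown.get) if breakdown else None


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp, "autoresearch-log.jsonl")
    append_entry(log, {"iteration": 0, "metric_value": 0.50, "status": "baseline"})
    append_entry(log, {
        "iteration": 1, "metric_value": 0.70, "status": "keep",
        "assertion_breakdown": {"word_count": 10, "no_buzzwords": 9, "has_question": 7},
    })
    entries = read_entries(log)
    print(len(entries), weakest_criterion(entries[-1]))  # prints: 2 has_question
```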
Use the template at references/dashboard-template.html. On every iteration, rewrite the full HTML file with updated values:
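- Create a git branch autoresearch/[goal-slug]-[date] before starting
- Run git checkout -- [file] on every "discard" iteration
- Back up the target file as [filename].backup
- On discard or crash, restore from .backup
- On keep, update .backup to the new version

The loop runs until: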
When stopping, write a final summary to autoresearch-worklog.md with:
- Backlog of ideas (autoresearch-ideas.md)

- Outputs are generated only by the autoresearch-test-runner sub-agent.
- The eval system is written only by the autoresearch-eval-agent sub-agent.
- Scoring in AI judge mode is done only by the autoresearch-judge sub-agent.
npx claudepluginhub benaios/benai-skills-main --plugin autoresearch