Autonomously optimizes Claude Code skills by iteratively running them on test inputs, scoring against binary evals, reflecting on failures to mutate prompts, and archiving improvements. Invoke via /auto-optimize for skill enhancement or autoresearch.
From skill-creator-pro. Install: `npx claudepluginhub leejuoh/claude-code-zero --plugin skill-creator-pro`. This skill uses the workspace's default tool permissions.
Most skills work about 70% of the time. The other 30% you get garbage. The fix isn't to rewrite the skill from scratch. It's to let an agent run it dozens of times, score every output, and tighten the prompt until that 30% disappears.
This skill adapts Andrej Karpathy's autoresearch methodology to Claude Code skills, enhanced with:
Take any existing skill, define what "good output" looks like as binary yes/no checks, then run an autonomous loop that executes the skill, scores every output, reflects on failures, and mutates the prompt, keeping only the changes that raise the score.
Output: An improved SKILL.md + results.json log + structured changelog.md + a live HTML dashboard.
STOP. Do not run any experiments until the fields below are confirmed with the user. Ask for any missing fields before proceeding.
Do NOT ask the user for eval criteria yet. Evals come from observing real failures, not from guessing upfront.
Before changing anything, read and understand the target skill completely.
- Any `references/` files that the skill links to
- `${CLAUDE_SKILL_DIR}/../skill-creator-pro/references/design-patterns.md` -- gotcha patterns, progressive disclosure, description-as-trigger
- `docs/reference/skill-lessons-from-anthropic.md` -- Anthropic's practical lessons from building hundreds of skills
- `https://code.claude.com/docs/llms.txt` and the relevant page to verify against the latest spec

Do NOT skip this. You need to understand both the skill AND what makes skills work before you can improve it.
Run the skill 3-5 times AS-IS using the test inputs. Do NOT score anything yet -- just collect outputs and observe.
Create `autoresearch-[skill-name]/` as a sibling to the skill and save the original prompt as `SKILL.md.baseline`.

While reviewing outputs, identify failure patterns:
The highest-signal content comes from real failure points, not theoretical checklists.
Based on the failure patterns you observed in discovery runs, propose 3-6 binary eval criteria to the user.
Format each eval as:
EVAL [number]: [Short name]
Question: [Yes/no question about the output]
Pass condition: [What "yes" looks like -- be specific]
Fail condition: [What triggers a "no"]
Present the proposed evals and explain which observed failures each one targets. The user confirms, adjusts, or adds their own.
Rules for good evals (see eval writing guide for details):
Max score calculation:
max_score = [number of evals] x [runs per experiment]
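The arithmetic is trivial but worth pinning down, since both numbers feed results.json and the dashboard. A minimal sketch (function names are illustrative):

```python
def max_score(num_evals: int, runs_per_experiment: int) -> int:
    # Every run is scored against every eval, one point per pass.
    return num_evals * runs_per_experiment

def pass_rate(score: int, maximum: int) -> float:
    # Percentage shown in results.json and on the dashboard.
    return round(100.0 * score / maximum, 1)
```

For example, 4 evals x 5 runs gives a max score of 20, and a raw score of 14 is a 70.0% pass rate.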
IMPORTANT: Do not proceed to the experiment loop until the user confirms the eval criteria.
Before running experiments, create a live HTML dashboard at autoresearch-[skill-name]/dashboard.html and open it.
The dashboard must:
Generate as a single self-contained HTML file with inline CSS and JavaScript. Use Chart.js from CDN for the chart. The JS should fetch results.json and re-render.
Open it immediately after creating it: `open dashboard.html`
Update results.json after every experiment so the dashboard stays current:
{
"skill_name": "[name]",
"status": "running",
"current_experiment": 3,
"baseline_score": 70.0,
"best_score": 90.0,
"consecutive_discards": 0,
"experiments": [
{
"id": 0,
"score": 14,
"max_score": 20,
"pass_rate": 70.0,
"status": "baseline",
"description": "original skill -- no changes",
"reflection": null,
"per_eval": [
{"name": "Text legibility", "passed": 4, "total": 5},
{"name": "Color contrast", "passed": 3, "total": 5}
]
}
]
}
When the run finishes, update status to "complete".
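One way to keep results.json current after each experiment, sketched in Python (the helper name is illustrative; field names follow the schema above):

```python
import json
from pathlib import Path

def record_experiment(results_path: Path, experiment: dict) -> None:
    """Append one experiment record and refresh the summary fields."""
    data = json.loads(results_path.read_text())
    data["experiments"].append(experiment)
    data["current_experiment"] = experiment["id"]
    # Track the best pass rate seen so far so the dashboard can plot it.
    data["best_score"] = max(data["best_score"], experiment["pass_rate"])
    results_path.write_text(json.dumps(data, indent=2))
```

Because the dashboard re-fetches results.json on a timer, writing the file is all it takes to refresh the browser view.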
Now score the discovery run outputs (from Step 2) against the confirmed evals. This is experiment #0.
IMPORTANT: After establishing baseline, confirm the score with the user before proceeding. If baseline is already 90%+, the skill may not need optimization -- ask if they want to continue.
This is the core autoresearch loop. Once started, run autonomously until stopped.
LOOP:
This is where auto-optimize diverges from blind mutation. Instead of guessing what might help, you read the evidence.
Collect failed outputs. From the most recent experiment, gather every output that failed at least one eval.
Read the failures. Actually read the failing outputs -- not just the scores. Look for:
Diagnose the root cause. Map each failure pattern back to a specific gap or ambiguity in the SKILL.md. Ask: "What instruction caused this, or what missing instruction allowed this?"
Form a targeted hypothesis. Not "maybe I should add something about X" but a specific, falsifiable claim:
Bad: "Adding a color instruction might help"
Good: "Eval 3 fails because the skill says 'use appropriate colors' -- the model picks neon green 60% of the time. Replacing with a specific hex palette should fix this."
The hypothesis must be specific enough that after running the experiment, you can say whether it was right or wrong. If you can't articulate what you expect to change and why, you're guessing.
For detailed reflection techniques and examples, read ${CLAUDE_SKILL_DIR}/references/reflection-guide.md.
Based on the reflection, make ONE targeted change to SKILL.md.
For mutation strategies (mechanical, principle-based, structural), consult ${CLAUDE_SKILL_DIR}/../skill-creator-pro/references/design-patterns.md. Key principle: one change at a time so you know what helped.
Bad mutations:
Execute the skill [N] times with the same test inputs. Score every output against every eval. Record both total score and per-eval scores.
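Scoring reduces to a simple tally. A sketch, assuming each eval is expressed as a yes/no predicate over one output (the predicate functions themselves are whatever judgment the eval question encodes):

```python
def score_outputs(outputs, evals):
    """outputs: list of run outputs. evals: {name: predicate(output) -> bool}.
    Returns the total score and per-eval pass counts."""
    per_eval = {name: 0 for name in evals}
    for output in outputs:
        for name, check in evals.items():
            if check(output):
                per_eval[name] += 1
    total = sum(per_eval.values())
    return total, per_eval
```

The per-eval counts are what make the regression check below possible; recording only the total would hide which eval broke.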
Before deciding, check two things:
Per-eval regression check. Compare each eval's pass count against the current best. If any single eval dropped by 2+ passes while the total score went up, that's the "balloon effect" -- one area improving while another quietly breaks. Flag it in the log and discard unless the overall gain clearly outweighs the regression.
Marginal improvement check. If the total score improved by less than 5% absolute (e.g., 70% to 73%), mark the experiment as "marginal" in the log. Still keep it, but note the uncertainty. If you accumulate 3+ marginal keeps in a row without a clear win, be suspicious -- the score may have drifted up through luck, not real improvement. Consider re-running the current SKILL.md against baseline to verify the cumulative gain is real.
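The two checks above can be sketched as small predicates (thresholds taken from the rules above; the dict shapes match the `per_eval` entries in results.json):

```python
def balloon_regressions(best_per_eval: dict, new_per_eval: dict) -> list:
    """Names of evals whose pass count dropped by 2+ versus the current best."""
    return [name for name, best in best_per_eval.items()
            if best - new_per_eval.get(name, 0) >= 2]

def is_marginal(best_rate: float, new_rate: float) -> bool:
    """True if the total pass rate improved, but by less than 5 points absolute."""
    return 0.0 < new_rate - best_rate < 5.0
```

A non-empty regression list means the balloon effect fired even if the total went up; `is_marginal` flags keeps that should be counted toward the 3-in-a-row suspicion threshold.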
Decision:
After deciding, update results.json and the structured archive.
Track consecutive discards. When the count hits thresholds, escalate:
3 consecutive discards -- soft reset:
5 consecutive discards -- hard reset:
8 consecutive discards -- plateau:
Reset the consecutive discard counter whenever a mutation is kept.
NEVER STOP. Once the loop starts, do not pause to ask the user. They may be away. Run autonomously until:
If you run out of ideas: Re-read the reflection guide. Try a completely different approach. Try removing things instead of adding them. Simplification that maintains the score is a win. Re-read the design principles from Step 1 -- there may be a pattern you haven't tried yet. If the skill uses platform features (hooks, frontmatter, allowed-tools) and failures seem structural, fetch the official docs (https://code.claude.com/docs/en/skills.md or hooks.md) to check if the skill's usage matches the current spec.
The changelog is NOT just a flat log -- it's a structured document that future sessions (or fresh contexts) can read to continue where you left off.
Maintain autoresearch-[skill-name]/changelog.md with these sections:
## Current Understanding
**What works:**
- Specific hex color codes prevent neon color failures (Exp 3)
- Worked examples are more effective than rules for formatting (Exp 5)
**What doesn't work:**
- Font size instructions alone don't fix legibility -- model ignores px values (Exp 2)
- Vague color descriptions ("pastel", "soft") are unreliable
**Remaining failures:**
- Eval 4 (label formatting) still fails 30% -- labels overlap on dense diagrams
## Experiment [N] -- [keep/discard/marginal]
**Score:** [X]/[max] ([percent]%)
**Per-eval:** [Eval1: 5/5] [Eval2: 3/5 DOWN] [Eval3: 4/5]
**Hypothesis:** [What you diagnosed from reflection]
**Change:** [One sentence describing what was changed]
**Result:** [What actually happened -- which evals improved/declined]
**Failing outputs:** [Brief description of what still fails]
## Ideas Backlog
- [ ] Try on-demand hook to block destructive operations
- [ ] Move the API reference table to references/ -- 40 lines of noise in main body
- [x] ~~Add worked example for edge case~~ (tried Exp 5, kept)
- [x] ~~Increase font size instruction~~ (tried Exp 2, didn't work)
When the user returns or the loop stops, present:
All files in autoresearch-[skill-name]/:
autoresearch-[skill-name]/
dashboard.html # live browser dashboard (auto-refreshes)
results.json # data file powering the dashboard
changelog.md # structured archive (understanding + log + ideas)
SKILL.md.baseline # original skill before optimization
Plus the improved SKILL.md saved back to its original location.
Baseline: 16/20 (80%) -- 4 evals x 5 runs.
| # | Reflection -> Mutation | Score | Result |
|---|---|---|---|
| 1 | "Outputs include '1.' '2.' prefixes -- skill says 'list steps' which implies numbering" -> Added "Do NOT include step numbers in diagram labels" | 18/20 (90%) | keep |
| 2 | "Text <10px. Tried px minimum but model ignores pixel values" -> Added "minimum 14px font size" | 17/20 (85%) | discard -- Eval 2 regressed, Eval 3 only +1 |
| 3 | "Model picks neon green 60% when skill says 'appropriate colors' -- root cause is ambiguity" -> Replaced with specific hex palette | 19/20 (95%) | keep |
| 4 | "Only 1 failure left. Tried anti-pattern for neon" -> Added neon color anti-pattern | 19/20 (95%) | discard -- hex codes already solved it, zero gain |
| 5 | "Label text correct but overlaps in small boxes -- need visual example" -> Added worked example for label placement | 20/20 (100%) | keep |
Result: 80% -> 100% in 5 experiments (3 kept, 2 discarded). Key: reflection found px values don't work (Exp 2) so Exp 3 tried a different approach instead of retrying the same tactic.
A good autoresearch run:
If the skill "passes" all evals but the actual output quality hasn't improved -- the evals are bad, not the skill. Go back to step 2 and write better evals.