Autonomously optimize any Claude Code skill by running it repeatedly, scoring outputs against binary evals, mutating the prompt, and keeping improvements. Based on Karpathy's autoresearch methodology. Use when: optimize this skill, improve this skill, run autoresearch on, make this skill better, self-improve skill, benchmark skill, eval my skill, run evals on. Outputs: an improved SKILL.md, a results log, and a changelog of every mutation tried.
npx claudepluginhub iamladi/cautious-computing-machine --plugin primitives
How this skill is triggered (by the user, by Claude, or both):
Slash command
/primitives:autoresearch
The summary Claude sees in its skill listing, used to decide when to auto-load this skill:
Adapt Karpathy's autoresearch methodology (autonomous experimentation loops) to Claude Code skills. Instead of optimizing ML training code, optimize skill prompts: run the target skill repeatedly against a fixed eval suite, mutate the prompt one change at a time, keep what improves the score, discard what doesn't.
Related rails: karpathy-principles collects the four Karpathy coding-with-AI principles this skill's methodology comes from.
Most skills work ~70% of the time. The remaining 30% is usually a handful of recurring failure patterns that one targeted mutation can fix. The point of this skill is to find and apply those mutations empirically, not to rewrite the whole prompt from guesswork.
Priorities, in order: Empirical improvement > Reproducibility > Simplicity > Runtime
Run at high thinking effort. The core loop is hypothesis → mutation → score; each step benefits from careful reasoning about which failure pattern to target next, which is exactly what high effort is for.
This skill mutates a working copy of the target skill, never the original. The original SKILL.md stays untouched throughout — the user can review, diff, and manually apply changes once autoresearch finishes.
Confirm all of the following with the user before starting any experiments. If any field is missing, ask; don't guess.
SKILL.md to optimize.
references/eval-guide.md for how to write evals that don't get gamed.
Once confirmed, read the target skill in full: the SKILL.md and every file it references under references/. Identify its core job, process steps, output format, and any existing quality checks or anti-patterns. You can't improve a skill you haven't understood.
Every eval is binary — pass or fail, no scales. Scales compound variability; yes/no is consistent across runs.
Format each eval as:
EVAL [number]: [Short name]
Question: [Yes/no question about the output]
Pass condition: [What "yes" looks like — specific enough to be reproducible]
Fail condition: [What triggers a "no"]
Good evals are specific ("Are all words spelled correctly with no truncated sentences?"), bad evals are vague ("Is the text readable?") or gameable ("Contains fewer than 200 words" — will make the skill optimize for brevity at the expense of everything else).
Cap at 3–6 evals. More than that and the skill starts parroting eval criteria back instead of improving. See references/eval-guide.md for detailed examples.
Max score = (number of evals) × (runs per experiment). Example: 4 evals × 5 runs = 20.
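The scoring arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill itself: the `tally` helper and the eval names are hypothetical, and it assumes each run's results are already available as one pass/fail boolean per eval.

```python
def tally(results: list[list[bool]], eval_names: list[str]) -> dict:
    """Summarize binary eval results; results[r][e] is True when run r passed eval e."""
    runs = len(results)
    max_score = len(eval_names) * runs  # (number of evals) x (runs per experiment)
    score = sum(sum(row) for row in results)  # each pass counts as one point
    per_eval = {
        name: sum(row[i] for row in results)  # how many runs passed this eval
        for i, name in enumerate(eval_names)
    }
    return {
        "score": score,
        "max_score": max_score,
        "pass_rate": round(100 * score / max_score, 1),
        "per_eval": per_eval,
    }


# Hypothetical data: 4 evals x 5 runs, so the max score is 20.
example = tally(
    [
        [True, True, True, False],
        [True, True, False, True],
        [True, False, True, True],
        [True, True, False, False],
        [True, True, True, False],
    ],
    ["A", "B", "C", "D"],
)
print(example["score"], example["max_score"], example["pass_rate"])  # 14 20 70.0
```

The per-eval counts are what the failure analysis needs: the eval with the lowest count is usually the first one worth targeting.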
Everything lives in autoresearch-[skill-name]/ inside the skill's folder:
autoresearch-[skill-name]/
├── dashboard.html # live browser dashboard (auto-refreshes every 10s)
├── results.json # data file powering the dashboard
├── results.tsv # score log, one row per experiment
├── changelog.md # detailed mutation log
├── SKILL.md.baseline # original skill snapshot — revert target
└── [user-chosen-name].md # the mutating copy — all edits land here
The original SKILL.md is never modified. When asking the user what to name the mutating copy, suggest [original]-v2 or [original]-optimized as the default, but let the user pick.
Generate dashboard.html as a single self-contained file: inline CSS and JavaScript, Chart.js loaded from CDN, auto-refreshes every 10 seconds by fetching results.json. Open it immediately (open dashboard.html on macOS) so the user sees progress live.
It should show: score progression line chart (experiment # on X, pass rate % on Y), one colored bar per experiment (green = keep, red = discard, blue = baseline), a table of all experiments with score and description, per-eval breakdown of which evals pass most/least across runs, and a status line (Running experiment N… / Idle / Done). Styling: soft colors, white background, clean sans-serif.
{
"skill_name": "[name]",
"status": "running",
"current_experiment": 3,
"baseline_score": 70.0,
"best_score": 90.0,
"experiments": [
{
"id": 0,
"score": 14,
"max_score": 20,
"pass_rate": 70.0,
"status": "baseline",
"description": "original skill — no changes"
}
],
"eval_breakdown": [
{"name": "Text legibility", "pass_count": 8, "total": 10},
{"name": "Pastel colors", "pass_count": 9, "total": 10}
]
}
Update results.json after every experiment so the dashboard stays current. On termination, set status: "complete".
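One way to keep results.json current is sketched below in Python. The field names come from the example above; the `record_experiment` helper is hypothetical, and the rule that only baseline and kept experiments can raise `best_score` is an assumption. Because the dashboard polls the file every 10 seconds, the sketch writes to a temporary file and swaps it in, so a poll should not see a half-written file.

```python
import json
import os
import tempfile
from pathlib import Path


def record_experiment(results_path: Path, experiment: dict) -> dict:
    """Append one experiment to results.json and refresh best_score."""
    data = json.loads(results_path.read_text(encoding="utf-8"))
    data["experiments"].append(experiment)

    # Assumption: discarded experiments never count toward the best score.
    if experiment["status"] in ("baseline", "keep"):
        data["best_score"] = max(data.get("best_score", 0.0), experiment["pass_rate"])

    # Write to a temp file in the same directory, then replace the original in one step.
    fd, tmp_name = tempfile.mkstemp(dir=results_path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        json.dump(data, tmp, indent=2)
    os.replace(tmp_name, results_path)
    return data
```

On termination, the same load, modify, and replace pattern can set `status` to `"complete"`.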
Tab-separated, one row per experiment:
experiment score max_score pass_rate status description
0 14 20 70.0% baseline original skill — no changes
1 16 20 80.0% keep added explicit instruction to avoid numbering in diagrams
2 16 20 80.0% discard tried enforcing left-to-right layout — no improvement
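Appending a row in this layout could look like the Python sketch below. It assumes the column order shown above; the `append_result` helper is hypothetical. It uses the standard `csv` module with a tab delimiter so that descriptions containing spaces stay in a single column.

```python
import csv
from pathlib import Path

HEADER = ["experiment", "score", "max_score", "pass_rate", "status", "description"]


def append_result(tsv_path: Path, experiment: int, score: int,
                  max_score: int, status: str, description: str) -> None:
    """Append one experiment row, writing the header first if the file is new."""
    is_new = not tsv_path.exists()
    with tsv_path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        if is_new:
            writer.writerow(HEADER)
        pass_rate = f"{100 * score / max_score:.1f}%"  # e.g. 14/20 -> "70.0%"
        writer.writerow([experiment, score, max_score, pass_rate, status, description])
```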
Before mutating anything, measure the starting point. This is experiment #0.
Ask the user what to name the working copy ("What should I call the optimized version?").
Create the working directory autoresearch-[skill-name]/.
Copy the original SKILL.md to the working directory twice: once as [user-chosen-name].md (the file you mutate) and once as SKILL.md.baseline (the revert target).
Create results.tsv with the header row, results.json with baseline scaffolding, and dashboard.html. Open the dashboard.
Run the skill N times using the test inputs: for every test input, a full run of [user-chosen-name].md. Score every output against every eval.
Record the baseline in results.tsv and results.json.
If baseline is already ≥ 90%, confirm with the user whether to proceed, since the skill may not need optimization.
Once started, run autonomously. The loop is: analyze failures → form one hypothesis → mutate → run → score → keep or discard → log → repeat.
Analyze failures. Read the actual outputs that failed. Look at which evals are failing most across every run. Identify the pattern — formatting issue? Missing instruction? Ambiguous directive?
Form one hypothesis. Pick exactly one change to try this round. Changing five things at once means you can't attribute the score delta to any of them.
Mutations that usually help:
Mutations that usually don't:
Apply the change. Edit [user-chosen-name].md only. The original SKILL.md stays untouched.
Run the experiment. Execute the mutated skill N times with the same test inputs used for baseline.
Score every output against every eval. Compute the total score across all runs.
Keep or discard.
[user-chosen-name].md.
Log. Append a row to results.tsv, update results.json, and append a detailed entry to changelog.md (see format below).
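The keep-or-discard step can be sketched as follows. The exact rule is an assumption: a mutation is kept only when the total score strictly improves, and otherwise the working copy is restored to its pre-mutation text. The `keep_or_discard` helper is hypothetical, and it assumes the caller saved the working copy's text before applying the mutation.

```python
from pathlib import Path


def keep_or_discard(working_copy: Path, previous_text: str,
                    new_score: int, best_score: int) -> tuple[str, int]:
    """Return (status, best score so far), reverting the working copy on discard."""
    if new_score > best_score:
        # Strict improvement: the mutated file stays as the new starting point.
        return "keep", new_score
    # Equal or worse: restore the text saved before the mutation was applied.
    working_copy.write_text(previous_text, encoding="utf-8")
    return "discard", best_score
```

Treating an equal score as a discard keeps the prompt from accumulating instructions that add length without adding passes.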
The loop runs autonomously because the user is often away from the computer. It terminates when any of the following are true:
These are the only reasons to stop. Asking the user mid-loop "should I continue?" defeats the point of running autonomously — the user chose autoresearch precisely to avoid that interruption.
If you run out of mutation ideas before termination: re-read the failing outputs for patterns you missed, combine two previous near-miss mutations into one test, try a completely different approach to the same failure, or try removing things instead of adding. Simplification that holds the score is itself a win.
After every experiment (kept or discarded):
## Experiment [N] — [keep/discard]
**Score:** [X]/[max] ([percent]%)
**Change:** [One sentence describing the mutation]
**Reasoning:** [Why this change was expected to help]
**Result:** [What actually happened — which evals improved or declined]
**Failing outputs:** [Brief description of what still fails, if anything]
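A small Python helper can render entries in this format so every experiment is logged consistently. This is a sketch: the `changelog_entry` function and its parameter names are hypothetical, and only the field labels come from the template above.

```python
DASH = "\u2014"  # the dash character used in the entry heading of the template


def changelog_entry(n: int, status: str, score: int, max_score: int,
                    change: str, reasoning: str, result: str, failing: str) -> str:
    """Render one changelog.md entry in the template's field order."""
    percent = 100 * score / max_score
    return "\n".join([
        f"## Experiment {n} {DASH} {status}",
        f"**Score:** {score}/{max_score} ({percent:g}%)",  # :g drops a trailing .0
        f"**Change:** {change}",
        f"**Reasoning:** {reasoning}",
        f"**Result:** {result}",
        f"**Failing outputs:** {failing}",
        "",
    ])
```

Appending the returned string to changelog.md after every experiment, kept or discarded, preserves the full research trail.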
The changelog is the most valuable artifact of an autoresearch run. It's a research log a future model (or a smarter future version of the same model) can pick up and continue from.
When the loop terminates, present:
[user-chosen-name].md (in the working directory; the original SKILL.md is untouched).
results.tsv and changelog.md for reference.
Don't offer to overwrite the original skill. Don't copy the working file over SKILL.md. The whole point is that the original stays safe: the user decides whether and how to apply the improvements.
Context confirmed:
~/.claude/skills/diagram-generator/SKILL.md
Experiment 0 (baseline): 32/40 (80%). Common failures: 3 diagrams had numbered steps, 2 had bright red, 3 had illegible small text.
Experiment 1 — keep (35/40, 87.5%). Change: added an anti-pattern for step numbers, ordinals, and numerical ordering. Numbering failures dropped from 3 to 1.
Experiment 2 — discard (34/40, 85%). Change: added minimum 14px font requirement. Legibility ticked up by 1 but color compliance dropped by 2. Reverted.
Experiment 3 — keep (37/40, 92.5%). Change: replaced vague "pastel colors" with specific hex codes. Color eval went 8/10 → 10/10.
Experiment 4 — discard (37/40, 92.5%). Change: added anti-pattern for specific bright colors. No change — the hex codes already solved the color problem. Reverted to keep the skill simpler.
Experiment 5 — keep (39/40, 97.5%). Change: added a worked example showing a correct diagram. Hit 39/40. One remaining failure: complex diagram with overlapping labels. Diminishing returns — terminated.
Final: baseline 32/40 (80%) → final 39/40 (97.5%). 5 experiments, 3 kept, 2 discarded. Top changes: specific hex codes, explicit anti-numbering rule, worked example. Remaining issue: very complex diagrams occasionally overlap labels (1/40 failure rate).
A good autoresearch run:
If the skill passes every eval but output quality hasn't actually improved, the evals are being gamed; the fault lies with the evals, not the skill. Go back to eval design and write better ones.
Feeds in: any existing skill that needs optimization; user-supplied eval criteria (or help writing them via references/eval-guide.md).
Feeds out: the improved skill copy (user decides whether to adopt it); the changelog (portable — a future model can pick up and continue optimizing); the eval suite (reusable whenever the skill is edited).