From north-starr
Autonomously improves Claude Code skill prompts via measure-change-test loop: runs skill on test inputs, scores output against checklist, iteratively refines changes. Use for 'improve skill', 'autoimprove', or optimization requests.
npx claudepluginhub selcukyucel/north-starr --plugin north-starr
Slash command: /north-starr:autoimprove
Iteratively improve any skill's prompt using the autoresearch pattern: **small change -> measure -> keep/revert -> repeat**. The agent runs the target skill, scores output against a user-defined checklist, makes one targeted prompt change per round, and keeps only changes that improve the score.
Inspired by Karpathy's autoresearch — the same hill-climbing loop applied to ML training, here adapted for skill prompt optimization.
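Use this skill when the user asks to "improve skill", says "autoimprove", or makes a similar optimization request.
Actions:
- Read the target skill's SKILL.md file to understand its current prompt
Validation:
- Confirm the target skill has a SKILL.md file
Ask the user for test inputs: the scenarios the skill will be run against each round.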
What test input should I use when running this skill?
Examples:
- For /generate-commit: "use the current staged changes"
- For /analyze-code: "run it on src/auth/middleware.ts"
- For /generate-pr: "use the current branch diff"
You can provide 1-3 test inputs. More inputs = more robust but slower rounds.
If the user provides no test input, suggest reasonable defaults based on the skill's purpose.
The checklist is the only metric. Each item is a yes/no question that checks one specific quality of the skill's output.
First, offer to help generate the checklist:
I need a scoring checklist — 3-6 yes/no questions that define what "good output" looks like
for this skill. I can help you create one.
Option 1: I'll analyze the skill and propose a checklist (recommended)
Option 2: You provide your own checklist
Option 3: I'll propose one, then you refine it
If generating the checklist (Option 1 or 3):
Checklist anti-patterns to avoid:
Example checklist for /generate-commit:
1. Does the subject line start with a conventional commit prefix (feat/fix/docs/refactor/test/chore)?
2. Is the subject line under 72 characters?
3. Does the body explain WHY the change was made, not just WHAT changed?
4. Is the message free of filler phrases like "various improvements" or "minor updates"?
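To make "one yes/no question per quality" concrete, here is a minimal Python sketch of items 1, 2, and 4 from the example checklist as mechanical checks. It is an illustration only: in this skill the agent does the scoring, item 3 (WHY vs. WHAT) needs judgment and is left out, and the function name, result keys, and the exact prefix pattern (optional scope, optional `!`) are assumptions rather than part of the skill.

```python
import re

# Assumed pattern: one of the listed prefixes, an optional (scope), an optional "!", then ": ".
PREFIX_RE = re.compile(r"^(feat|fix|docs|refactor|test|chore)(\([^)]*\))?!?: ")
FILLER_PHRASES = ("various improvements", "minor updates")


def check_commit_message(message: str) -> dict[str, bool]:
    """Answer three of the example checklist questions with a strict yes/no."""
    lines = message.strip().splitlines()
    subject = lines[0] if lines else ""
    lowered = message.lower()
    return {
        # Q1: subject starts with a conventional commit prefix
        "conventional_prefix": PREFIX_RE.match(subject) is not None,
        # Q2: subject line is under 72 characters (and not empty)
        "subject_under_72": 0 < len(subject) < 72,
        # Q4: no filler phrases anywhere in the message
        "no_filler": not any(phrase in lowered for phrase in FILLER_PHRASES),
    }
```

Each key maps to exactly one checklist question, so a failing round points at a single quality to fix.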
Present the checklist and get user approval before proceeding.
Actions:
- Create the working directory .plans/autoimprove-<skill-name>/
- Save a copy of the current SKILL.md to .plans/autoimprove-<skill-name>/SKILL-original.md as backup
Present the baseline:
Baseline Results
────────────────
Skill: /[skill-name]
Test runs: [count]
Score: [X]/[total] ([percentage]%)
Checklist breakdown:
[x] Question 1 — passed [N]/[N] runs
[ ] Question 2 — passed [N]/[N] runs
[x] Question 3 — passed [N]/[N] runs
[ ] Question 4 — passed [N]/[N] runs
Weakest items: Question 2, Question 4
Starting optimization loop. I'll make one change per round,
targeting the weakest checklist items first.
Write the baseline to .plans/autoimprove-<skill-name>/results.tsv:
round change score kept details
0 baseline [score] - Initial score: [X]/[total] per-item breakdown
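A minimal sketch of how rows could be appended to results.tsv with Python's standard csv module, assuming the five tab-separated columns shown above. The helper name `append_result` is hypothetical; the skill itself only specifies the file path and the columns.

```python
import csv
from pathlib import Path

COLUMNS = ["round", "change", "score", "kept", "details"]


def append_result(path: Path, row: dict[str, str]) -> None:
    """Append one round to the TSV log, writing the header if the file is new."""
    path.parent.mkdir(parents=True, exist_ok=True)
    is_new = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS, delimiter="\t")
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

Appending rather than rewriting keeps every earlier round intact, which matters because the log is the only record of reverted changes.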
Repeat until stop condition is met:
Look at which checklist items are failing most across test runs. Identify the single weakest item — this is the optimization target for this round.
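Picking the weakest item can be sketched as a pass count per question across all test runs. This is a minimal Python illustration; the data shape (one dict per run mapping question to pass/fail) and the tie-break (first question in checklist order wins) are assumptions, not something the skill prescribes.

```python
def weakest_item(runs: list[dict[str, bool]]) -> tuple[str, int]:
    """Return the checklist question with the fewest passes and its pass count.

    `runs` holds one dict per test run, mapping question -> passed.
    """
    passes: dict[str, int] = {}
    for run in runs:
        for question, passed in run.items():
            passes[question] = passes.get(question, 0) + int(passed)
    # min() returns the first minimum, so ties go to the earlier checklist item.
    question = min(passes, key=passes.get)
    return question, passes[question]
```

For two runs where Q2 fails both times, the function returns `("Q2", 0)`, which becomes the target for the round.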
Read the current skill prompt and identify which part of the prompt is responsible for the failing behavior. Common prompt weaknesses:
Formulate exactly one small, targeted change to the skill prompt. The change should directly address the weakest checklist item.
Types of changes (in order of preference):
Rules for changes:
If score improved (new score > previous best score):
If score stayed the same or decreased:
Important edge case: A change might improve one checklist item but worsen another. Only keep if the total score improved. If total is the same but distribution shifted, revert — we want broad improvement, not whack-a-mole.
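The keep/revert rule, including the edge case, can be written down as a small decision function. This is a sketch under the assumption that scores are tracked as pass counts per checklist item; the function name and the reason strings are illustrative.

```python
def decide(best: dict[str, int], new: dict[str, int]) -> tuple[str, str]:
    """Compare per-item pass counts and return (decision, reason)."""
    best_total = sum(best.values())
    new_total = sum(new.values())
    if new_total > best_total:
        # Only a strictly higher total is kept, even if one item got worse.
        return "kept", f"total improved {best_total} -> {new_total}"
    if new_total == best_total and new != best:
        # Same total, different distribution: treat as whack-a-mole and revert.
        return "reverted", "same total, distribution shifted"
    return "reverted", f"total did not improve ({best_total} -> {new_total})"
```

Comparing against the best total so far, rather than the previous round, keeps a lucky revert from lowering the bar.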
Append to .plans/autoimprove-<skill-name>/results.tsv:
[round] [change description] [new score] [kept/reverted] [per-item breakdown]
Stop the loop if ANY of these are true:
After each round, show a brief status:
Round [N]: [kept/reverted]
Change: [one-line description]
Target: Question [X] — [question text]
Score: [old]% -> [new]% [arrow up/down/same]
Status: [X] kept, [Y] reverted so far
Then continue to the next round without waiting for user input (unless at a checkpoint — see Step 6).
After every 5 rounds, pause and ask the user:
Progress Check (Round [N])
──────────────────────────
Starting score: [baseline]%
Current score: [current]%
Changes kept: [count]
Changes reverted: [count]
Continue optimizing? (y/n/adjust checklist)
This prevents runaway loops and lets the user adjust the checklist if the optimization is heading in the wrong direction.
If the user says "autopilot" or "don't ask me" at any checkpoint, skip future checkpoints and run until a stop condition is hit.
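The checkpoint cadence and the autopilot opt-out amount to two small predicates. The sketch below is illustrative: the function names are hypothetical, and matching the opt-out by substring is an assumption about how a reply might be recognized.

```python
CHECKPOINT_EVERY = 5  # pause after rounds 5, 10, 15, ...


def at_checkpoint(round_number: int, autopilot: bool) -> bool:
    """True when the loop should pause and ask the user whether to continue."""
    if autopilot:
        return False
    return round_number > 0 and round_number % CHECKPOINT_EVERY == 0


def is_autopilot_request(reply: str) -> bool:
    """True when a checkpoint reply asks to skip all future checkpoints."""
    text = reply.lower()
    return "autopilot" in text or "don't ask me" in text
```

Once `is_autopilot_request` returns true, the caller would set `autopilot` for the rest of the run so that only a stop condition ends the loop.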
When the loop stops, generate three artifacts:
Save the optimized prompt to .plans/autoimprove-<skill-name>/SKILL-improved.md
Do NOT overwrite the original skill. The user decides whether to adopt the improved version.
The results.tsv file already has every round. Add a final summary row:
FINAL - [final score] - Improved from [baseline]% to [final]%. [kept] changes kept, [reverted] reverted across [total] rounds.
Write .plans/autoimprove-<skill-name>/CHANGELOG.md:
# Autoimprove Changelog: /[skill-name]
**Date:** [date]
**Baseline Score:** [X]%
**Final Score:** [Y]%
**Rounds:** [total] ([kept] kept, [reverted] reverted)
## Changes Applied (in order)
### Round [N] — KEPT
**Target:** [checklist item]
**Change:** [what was changed in the prompt]
**Why:** [what failure this addressed]
**Score:** [before]% -> [after]%
### Round [N] — REVERTED
**Target:** [checklist item]
**Change:** [what was attempted]
**Why it failed:** [why it didn't improve the score]
**Score:** [before]% -> [after]%
[...repeat for all rounds]
## Checklist Performance
| Question | Baseline | Final | Delta |
|----------|----------|-------|-------|
| [Q1] | [X]/[N] | [Y]/[N] | [+/-] |
| [Q2] | [X]/[N] | [Y]/[N] | [+/-] |
## Recommendations
[Any observations about remaining failures that can't be fixed through prompt changes —
e.g., "Question 3 fails when the input is very short; this may be an inherent limitation
rather than a prompt issue."]
Autoimprove Complete: /[skill-name]
────────────────────────────────────
Score: [baseline]% -> [final]% ([+delta]%)
Rounds: [total] ([kept] kept, [reverted] reverted)
Files:
Original backup: .plans/autoimprove-<skill-name>/SKILL-original.md
Improved version: .plans/autoimprove-<skill-name>/SKILL-improved.md
Results log: .plans/autoimprove-<skill-name>/results.tsv
Changelog: .plans/autoimprove-<skill-name>/CHANGELOG.md
To adopt the improved skill:
cp .plans/autoimprove-<skill-name>/SKILL-improved.md skills/<skill-name>/SKILL.md
After presenting the final summary, offer to capture learnings:
The changelog contains [N] insights about what works and what doesn't
for this skill. Want me to run /learn to capture these as pattern rules?
If the user agrees, run /learn with the changelog as context. The patterns discovered during optimization (e.g., "banned buzzword lists improve output quality") become reusable rules.
When scoring skill output against the checklist, follow these rules:
Scoring format per test input:
Test input: [input description]
Q1: [YES/NO] — [brief evidence]
Q2: [YES/NO] — [brief evidence]
Q3: [YES/NO] — [brief evidence]
Score: [X]/[total]
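If the per-input scoring blocks need to be tallied programmatically, a minimal Python sketch could parse the `Q<n>: YES/NO` lines and recompute the score line. The regular expression and function names are assumptions; the evidence text after the verdict is ignored.

```python
import re

# Matches lines such as "Q2: NO - subject is 80 characters".
VERDICT_RE = re.compile(r"^(Q\d+):\s*(YES|NO)\b")


def parse_scores(block: str) -> dict[str, bool]:
    """Extract question -> passed from one scoring block."""
    results: dict[str, bool] = {}
    for line in block.splitlines():
        match = VERDICT_RE.match(line.strip())
        if match:
            results[match.group(1)] = match.group(2) == "YES"
    return results


def score_line(results: dict[str, bool]) -> str:
    """Render the closing 'Score: X/total' line from parsed verdicts."""
    return f"Score: {sum(results.values())}/{len(results)}"
```

Recomputing the score from the verdicts, instead of trusting the written `Score:` line, guards against a block whose total does not match its own answers.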