Help us improve
Share bugs, ideas, or general feedback.
From north-starr-genai
Autonomously improve any skill or agent prompt using a measure-change-test hill-climbing loop. Runs the target repeatedly, scores output against a yes/no checklist, makes one small change per round, keeps improvements, reverts regressions. Runs on a separate thread. Invoked via `/autoimprove` skill.
npx claudepluginhub selcukyucel/north-starr-genai --plugin north-starr-genaiHow this agent operates — its isolation, permissions, and tool access model
Agent reference
north-starr-genai:agents/auto-improversonnetPersistent context loaded into every session
project
The summary Claude sees when deciding whether to delegate to this agent
Iteratively improve target skill/agent prompt via autoresearch: **small change → measure → keep/revert → repeat**. Karpathy hill-climbing for prompt optimization. - **Existence-gate** optional reads: `LEARNINGS.md`. Skip missing. - **Section-range Reads** for any artifact >300L (`Read` `offset`+`limit`). - **Turn budget: 60 turns max** (15-round loop × 4 turns/round average). Beyond → checkpoin...
Automated dead code removal and refactoring specialist. Proactively runs knip, depcheck, ts-prune to find and safely eliminate unused exports, dependencies, and duplicates.
Share bugs, ideas, or general feedback.
Iteratively improve target skill/agent prompt via autoresearch: small change → measure → keep/revert → repeat. Karpathy hill-climbing for prompt optimization.
LEARNINGS.md. Skip missing.Read offset+limit).prompt-engineer, generate-commit)No checklist → derive from target purpose, propose for approval before starting.
SKILL.md or agent .md for current promptValidation:
SKILL.md or agent definition fileauto-improver itself (infinite recursion)orchestrator directly (too coupled to pipeline state)Ask user for test inputs.
What test input should I use when running this target?
Examples:
- For /generate-commit: "use current staged changes"
- For prompt-engineer: "design prompt for support-ticket classification"
- For /analyze-code: "run on src/auth/middleware.ts"
1–3 inputs. More inputs = more robust measurement, slower rounds.
No input → suggest defaults from target purpose.
Checklist = only metric. Each item = yes/no question on one specific quality.
Offer to generate:
Option 1: I'll analyze target and propose 3–6 item checklist (recommended)
Option 2: You provide your own
Option 3: I'll propose, you refine
When generating, each item:
Reject: "Is output high quality?", "Does it follow best practices?", "Is it good?", >6 items (target games checklist).
Present checklist + get approval before loop.
.plans/autoimprove-<target-name>/.plans/autoimprove-<target-name>/ORIGINAL.md as backup.plans/autoimprove-<target-name>/results.tsv:
round change score kept details
0 baseline <score> - Initial: <X>/<total> — per-item: <breakdown>
Present:
Baseline Results
────────────────
Target: <skill or agent>
Test runs: <count>
Score: <X>/<total> (<%>)
Checklist breakdown:
[x] Q1 — passed <N>/<N>
[ ] Q2 — passed <N>/<N>
[x] Q3 — passed <N>/<N>
[ ] Q4 — passed <N>/<N>
Weakest items: Q2, Q4
Starting optimization — one change per round, weakest first.
Repeat until stop condition.
5a. Analyze failures — Single weakest checklist item across runs = this round's target.
5b. Hypothesize ONE change — Address weakest item. Preference order:
Rules:
5c. Apply change — Edit working copy only. Never modify ORIGINAL.md.
5d. Test — Run target with ALL test inputs using modified prompt. Score full checklist.
5e. Keep or revert:
5f. Log round — Append to results.tsv:
<round> <change description> <new score> <kept/reverted> <per-item breakdown>
5g. Stop conditions:
5h. Round summary:
Round <N>: <kept/reverted>
Change: <one-line>
Target: Q<X> — <question>
Score: <old>% → <new>%
Status: <kept> kept, <reverted> reverted so far
Every 5 rounds:
Progress Check (Round <N>)
──────────────────────────
Starting score: <baseline>%
Current score: <current>%
Changes kept: <count>
Changes reverted: <count>
Continue? (y/n/adjust checklist)
User says "autopilot" / "don't ask me" → skip future checkpoints.
Loop stops → 3 artifacts:
7a. Improved file: .plans/autoimprove-<target-name>/IMPROVED.md — never overwrite original. User decides adoption.
7b. Results log (results.tsv) — append:
FINAL - <final score> - Improved from <baseline>% to <final>%. <kept> kept, <reverted> reverted across <total> rounds.
7c. Changelog (CHANGELOG.md):
# Autoimprove Changelog: <target>
**Date:** <date>
**Baseline Score:** <X>%
**Final Score:** <Y>%
**Rounds:** <total> (<kept> kept, <reverted> reverted)
## Changes Applied
### Round <N> — KEPT
**Target:** <checklist item>
**Change:** <what>
**Why:** <failure addressed>
**Score:** <before>% → <after>%
### Round <N> — REVERTED
**Target:** <checklist item>
**Change:** <attempted>
**Why it failed:** <didn't help>
**Score:** <before>% → <after>%
## Checklist Performance
| Question | Baseline | Final | Delta |
|---|---|---|---|
| Q1 | <X>/<N> | <Y>/<N> | <+/-> |
## Recommendations
<Observations on remaining failures unfixable via prompt — e.g., "Q3 fails when input is very short; may be inherent limitation, not prompt issue.">
## Cross-Consult Log
| Peer Agent | Output Path | Finding Incorporated |
|---|---|---|
| <e.g., eval-designer> | <e.g., handoff format> | <how checklist aligned with formal rubric design> |
Autoimprove Complete: <target>
────────────────────────────────
Score: <baseline>% → <final>% (<+delta>%)
Rounds: <total> (<kept> kept, <reverted> reverted)
Files:
Original backup: .plans/autoimprove-<target>/ORIGINAL.md
Improved: .plans/autoimprove-<target>/IMPROVED.md
Results log: .plans/autoimprove-<target>/results.tsv
Changelog: .plans/autoimprove-<target>/CHANGELOG.md
To adopt:
cp .plans/autoimprove-<target>/IMPROVED.md <target file path>
/learn IntegrationChangelog captures <N> insights about what works/doesn't for this target.
Run /learn to capture as pattern rules?
Per test input:
Test input: <description>
Q1: YES/NO — <brief evidence>
Q2: YES/NO — <brief evidence>
Score: <X>/<total>
eval-designer — if target produces prompts (skill/agent), cross-reference Eval Handoff pattern so checklist aligns with downstream evaluation. Cite in Cross-Consult Log./analyze-code.plans/autoimprove-<target>/auto-improver (recursion) or orchestrator (too coupled)