Autonomously improves other skills via diagnosis of weaknesses, A/B experimentation, evaluation against tiered rubric, and reporting. Preserves originals until approved.
npx claudepluginhub whawkinsiv/solo-founder-superpowers --plugin solo-founder-superpowers

This skill uses the workspace's default tool permissions.
Autonomous skill improvement loop inspired by [karpathy/autoresearch](https://github.com/karpathy/autoresearch). You modify a skill, evaluate whether it got better, keep or discard, repeat.
Your originals are always safe. The loop works on a copy. Nothing changes until you say so.
When the user says "improve [skill name]":
Locate the skill. Read skills/<name>/SKILL.md. If it doesn't exist, ask the user which skill they mean.
Back up the original. Copy the full skills/<name>/ directory to eval/experiments/<name>/. Create eval/experiments/ if it doesn't exist.
```shell
mkdir -p eval/experiments/<name>
cp skills/<name>/SKILL.md eval/experiments/<name>/SKILL.md
cp skills/<name>/*.md eval/experiments/<name>/   # if supporting files exist
```
Read the design philosophy. Read CLAUDE.md to ground yourself in the plugin's standards. The rubric below is derived from these.
Read neighbor skills. Identify skills with overlapping scope (check the Related Skills section, check skills in the same category from CLAUDE.md). Read their descriptions — you need to know the boundaries.
Initialize the log. Create eval/results.tsv (or append if it exists from a previous run):
experiment skill tier dimension score_before score_after status change_description
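A minimal sketch of this init step in shell, assuming the loop runs from the plugin root (the append-if-exists behavior comes for free, since the header is only written when the file is missing):

```shell
# Create eval/results.tsv with its header row only if it doesn't already exist
mkdir -p eval
if [ ! -f eval/results.tsv ]; then
  printf 'experiment\tskill\ttier\tdimension\tscore_before\tscore_after\tstatus\tchange_description\n' \
    > eval/results.tsv
fi
```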
Run the baseline diagnosis. Score the original against the rubric (see below). Log it as experiment 0.
Show scores and ask for focus. Print the rubric scores in a compact tiered table:
Tier 1 (Critical): trigger-precision=1 trigger-phrases=1 checklists=2 tell-ai-prompts=2 → 6/12
Tier 2 (Important): boundary=1 tool-specific=N/A founder-pov=2 mistakes=1 → 4/9
Tier 3 (Polish): conciseness=✓ disclosure=✗ scannable=✓ cross-refs=✗
Total: 10/21
(Example shows a knowledge skill where tool-specific is N/A. For implementation skills, all 8 dimensions apply and max is /24.) Then ask:
"I'll focus on the lowest-scoring dimensions, Tier 1 first. Want me to auto-prioritize (default) or focus on specific areas?"
If the user says "go," "auto," or anything non-specific → auto-prioritize (Tier 1 first, then Tier 2). If the user names specific dimensions → focus the loop on those, skip others.
Score each dimension 0-3 (0=missing, 1=weak, 2=adequate, 3=strong). Dimensions are tiered by impact.
Tier 1 (Critical): These determine whether the skill fires and delivers value. All must reach ≥2 before moving to Tier 2.
Tier 2 (Important): These improve quality but only matter if the skill fires correctly.
Tier 3 (Polish): Not scored individually. Handled as a single cleanup pass after the main loop converges.
Loop score: Tier 1 + Tier 2 = applicable dimensions × 3pts. Max is 24 when all 8 dimensions apply, 21 when tool-specific guidance is N/A.
Tier 3 is not part of the loop score. It's a yes/no sweep after convergence.
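The loop-score arithmetic can be sanity-checked in a line of shell; the dimension counts here are just the knowledge-skill example from the baseline table:

```shell
# Loop score ceiling = applicable dimensions × 3 points
tier1=4                 # all four Tier 1 dimensions always apply
tier2_applicable=3      # example: tool-specific guidance is N/A
max=$(( (tier1 + tier2_applicable) * 3 ))
echo "$max"             # 21 here; 24 when all 8 dimensions apply
```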
REPEAT:
1. PICK the lowest-scoring dimension, respecting tier order:
- Tier 1 dimensions below 2 → always first
- Tier 2 dimensions → only after all Tier 1 ≥ 2
- N/A dimensions → skip entirely, never experiment on them
- Tier 3 → never (handled in polish pass)
- User-specified focus areas override this order
2. MAKE ONE CHANGE to eval/experiments/<name>/SKILL.md
- Target the specific weakness identified
- One focused change per iteration, not a rewrite
3. EVALUATE via A/B comparison (do reasoning internally, log to experiment-log.md):
a. Generate 3 representative user prompts for this skill
(realistic things a non-technical founder would say)
b. For each prompt, reason through what guidance the ORIGINAL
skill would produce vs what the MODIFIED version would produce
c. Judge: "Which version gives a non-technical founder better,
more actionable guidance?" Score: Original wins / Modified wins / Tie
d. Majority wins across the 3 prompts
Write the full A/B reasoning to eval/experiments/<name>/experiment-log.md.
Do NOT print it to the terminal. See Output Rules.
4. DECIDE
- Modified wins majority → KEEP the change
- Original wins or tie → DISCARD (revert experiment file to previous version)
5. LOG to eval/results.tsv:
experiment skill tier dimension score_before score_after status change_description
6. RE-SCORE the full rubric after every KEEP (do this internally, don't print)
7. PRINT one line per experiment (see Output Rules)
8. CHECK stopping condition
Stop the main loop when ANY of:
After the main loop stops, run one final experiment targeting all Tier 3 items at once:
P polish KEEP|DISCARD score→score summary

The polish pass does not change the /24 loop score. It's shown as a separate line at the end.
Be honest. You are both the improver and the judge — this only works if you don't fool yourself.
When the loop stops, generate eval/winners-report.md:
# Skill Improvement Results — [skill name] — [date]
## Summary
[skill]: [before]/24 → [after]/24 (+[delta]) | [N] kept, [N] discarded, [N] skipped | [N] experiments + polish
## What Changed
- [Plain English bullet 1 — net effect, not experiment sequence]
- [Plain English bullet 2]
- ...
## Rubric Scores
### Tier 1 — Critical
| Dimension | Before | After |
|-----------|--------|-------|
| Trigger precision | 1 | 3 |
| Trigger phrases | 1 | 3 |
| Actionable checklists | 2 | 3 |
| "Tell AI:" prompts | 2 | 3 |
### Tier 2 — Important
| Dimension | Before | After |
|-----------|--------|-------|
| Boundary clarity | 1 | 2 |
| Tool-specific guidance | 2 | 3 |
| Founder perspective | 2 | 3 |
| Common mistakes | 1 | 2 |

(Tool-specific guidance is N/A for knowledge skills.)
### Tier 3 — Polish
- [x] Conciseness
- [x] Progressive disclosure
- [ ] Scannable
- [x] Cross-references
## Experiment Log
| # | Dimension | Result | Score | Change |
|---|-----------|--------|-------|--------|
| 1 | trigger-precision | KEEP | 12→15 | Narrowed description... |
| 2 | trigger-phrases | KEEP | 15→17 | Added natural phrases... |
| ... | ... | ... | ... | ... |
| P | polish | KEEP | — | Added cross-refs... |
## Review
The improved version is at: eval/experiments/[skill]/SKILL.md
To accept: "promote [skill]"
To compare: "show me the diff for [skill]"
To reject: "discard [skill]"
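The "show me the diff" action is just a comparison between the untouched original and the experiment copy. A sketch, with stand-in files for a hypothetical skill named demo so it runs anywhere:

```shell
# In real use both files already exist; these stand-ins make the sketch self-contained
mkdir -p skills/demo eval/experiments/demo
printf 'original line\n' > skills/demo/SKILL.md
printf 'improved line\n' > eval/experiments/demo/SKILL.md
diff -u skills/demo/SKILL.md eval/experiments/demo/SKILL.md || true  # diff exits 1 when files differ
```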
When the user says "promote" or "accept":
1. Confirm: "This will replace skills/<name>/SKILL.md with the improved version. The original is backed up at eval/experiments/<name>/SKILL.md.original. Proceed?"
2. Rename the original backup to eval/experiments/<name>/SKILL.md.original.
3. Copy eval/experiments/<name>/SKILL.md → skills/<name>/SKILL.md.
4. Confirm: "Promoted. Your previous version is preserved at eval/experiments/<name>/SKILL.md.original."

When the user says "discard":
Delete the experiment copy and confirm that skills/ was never touched.

Never modify skills/<name>/SKILL.md during the loop. Only write to eval/experiments/.

The terminal is not a research paper. Print only what the user needs to see.
The user is not reading a log file. When you speak between experiments — explaining the baseline, describing what you're focusing on, or summarizing results — use plain English.
Do:
Don't:
The rubric is your internal tool. The user wants to know: what's wrong with this skill, what are you fixing, and did it get better. If you can't explain it without jargon, you don't understand it well enough.
Print exactly one line per experiment. No diffs. No multi-line blocks.
1 trigger-precision KEEP 12→15 Narrowed description to exclude neighbor skill prompts
2 trigger-phrases KEEP 15→17 Added "my app is slow," "hosting bill too high"
3 actionable-checklists KEEP 17→19 Added 6-step workflow checklist at top
4 boundary-clarity KEEP 19→21 Routes out-of-scope to build/debug/monitor
5 tool-specific-guidance SKIP 21 Already scoring 3
6 founder-perspective DISCARD 21 Rewrite didn't improve A/B results
P polish KEEP 21→21 Added cross-refs, trimmed verbose sections
Format: experiment# dimension KEEP|DISCARD|SKIP score[→score] one-line summary
- KEEP = experiment won A/B, change retained
- DISCARD = experiment lost A/B, change reverted
- SKIP = dimension already scoring 3, no experiment run
- P = polish pass (Tier 3 batch, after main loop)

DO NOT print: A/B reasoning, rubric re-scoring, prompt details, file diffs, edit explanations.
DO log to file: Full A/B reasoning, prompts, and rubric details go to eval/experiments/<name>/experiment-log.md. This is the audit trail if the user wants to review later.
Print a compact summary, then offer promote/discard:
optimize: 12/24 → 21/24 (+9) | 4 kept, 1 discarded, 1 skipped | 6 experiments + polish
What changed:
- Description now triggers on "my app is slow," "hosting bill too high"
- Routes out-of-scope requests to build/debug/monitor/database
- Speed audit split: Claude Code runs directly vs Lovable/Replit manual measure
- 6-step workflow checklist at top
- Related Skills section linking to 5 adjacent skills
Diff: "show me the diff"
Accept: "promote optimize" | Reject: "discard optimize"
"What changed" is plain English grouped by theme — not by experiment number, not line-by-line diffs. Summarize the net effect, not the sequence of changes.
The detailed winners report is still written to eval/winners-report.md for audit, but the terminal shows only this compact summary.
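The kept/discarded/skipped counts in the summary line fall straight out of the log's status column (the 7th). A sketch against an inline two-row sample; in the real loop you'd read eval/results.tsv instead:

```shell
# Count statuses in column 7 of a TSV; +0 coerces absent counters to zero
sample='experiment\tskill\ttier\tdimension\tscore_before\tscore_after\tstatus\tchange\n1\toptimize\t1\ttrigger-precision\t12\t15\tKEEP\tNarrowed description\n'
summary=$(printf "$sample" | awk -F'\t' 'NR>1 {c[$7]++}
  END {printf "%d kept, %d discarded, %d skipped", c["KEEP"]+0, c["DISCARD"]+0, c["SKIP"]+0}')
echo "$summary"
```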