Auto-improve any agent prompt using Karpathy's autoresearch method. Runs iterative test-measure-improve loops on agent prompts to systematically increase quality. Triggers on: "autoresearch", "auto-improve", "optimize agent", "tune prompt", "improve skill quality".
From super-dev. Install: `npx claudepluginhub jenningsloy318/claude-skill-artifacts --plugin super-dev`. This skill uses the workspace's default tool permissions.
Based on Andrej Karpathy's autoresearch method. Instead of manually improving agent prompts, let an AI agent do it in an iterative loop: try a small change, score the result, keep improvements, revert regressions.
┌─────────────────────────────────────────────────────┐
│                  AUTORESEARCH LOOP                  │
│                                                     │
│ 1. Run agent on test input                          │
│ 2. Score output against checklist (yes/no)          │
│ 3. Record baseline score                            │
│ 4. Analyze weakest checklist items                  │
│ 5. Make ONE small change to the agent prompt        │
│ 6. Re-run agent on same test input                  │
│ 7. Score again                                      │
│ 8. If score improved → KEEP change                  │
│    If score dropped → REVERT change                 │
│ 9. Repeat from step 4 until target score reached    │
│                                                     │
│ Stop condition: 95%+ score 3 times in a row         │
│                 OR max rounds reached (default: 10) │
└─────────────────────────────────────────────────────┘
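The loop above can be sketched in Python. This is a minimal illustration, not the skill's actual implementation: `run_agent` and `propose_change` are hypothetical callables standing in for invoking the agent and editing its prompt, and the checklist is modeled as a list of yes/no predicates over the output.

```python
from typing import Callable, List

def score_output(output: str, checklist: List[Callable[[str], bool]]) -> float:
    """Score = fraction of yes/no checks that pass."""
    return sum(check(output) for check in checklist) / len(checklist)

def autoresearch(prompt: str,
                 test_input: str,
                 checklist: List[Callable[[str], bool]],
                 run_agent: Callable[[str, str], str],
                 propose_change: Callable[[str], str],
                 target: float = 0.95,
                 max_rounds: int = 10):
    """Keep a candidate change only if it improves the score; otherwise revert."""
    best = score_output(run_agent(prompt, test_input), checklist)
    streak = 0  # consecutive rounds at or above target
    for _ in range(max_rounds):
        if streak >= 3:  # stop condition: 95%+ three times in a row
            break
        candidate = propose_change(prompt)               # ONE small change
        score = score_output(run_agent(candidate, test_input), checklist)
        if score > best:
            prompt, best = candidate, score              # KEEP
        # else: candidate is simply discarded            # REVERT
        streak = streak + 1 if best >= target else 0
    return prompt, best
```

The key design point is that a reverted change costs nothing but one round: the working prompt is only ever replaced when the score strictly improves.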
/super-dev:autoresearch
Arguments:
--agent <agent-name> Agent to optimize (e.g., "code-reviewer", "qa-agent")
--test-input <prompt> Test input to run the agent on
--rounds <N> Max improvement rounds (default: 10)
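An example invocation (the `--test-input` value here is illustrative):

```
/super-dev:autoresearch --agent code-reviewer --test-input "Review this diff for bugs" --rounds 10
```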
User specifies which agent prompt to improve. Read the agent's markdown file from agents/<name>.md.
Ask the user for a set of 3-6 yes/no scoring questions, or auto-generate them from the agent's existing quality gates. Each question checks one specific aspect of the agent's output.
Good checklist questions (yes/no only):
- "Does the output identify at least one production-risk bug?"
- "Does the output flag time-related bugs (e.g. timezone-naive datetimes)?"
- "Does the output cite a specific location for each finding?"
Bad checklist questions (avoid):
- "Is the output good?" (subjective, not checkable)
- "How thorough is the review?" (not a yes/no question)
- "Does it find bugs and explain them and suggest fixes?" (three checks bundled into one)
Run the agent on the test input 3 times. Score each run against the checklist. The average score is the baseline.
Baseline: 5/8 checks passing = 62.5%
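The baseline is just the mean pass-rate over the repeated runs. A minimal sketch (the run scores below are illustrative, matching the 5/8 example):

```python
def baseline_score(run_scores):
    """Average checklist pass-rate over repeated runs of the same test input."""
    return sum(run_scores) / len(run_scores)

# Three runs against an 8-item checklist: 5, 4, and 6 checks passing
runs = [5 / 8, 4 / 8, 6 / 8]
print(f"Baseline: {baseline_score(runs):.1%}")  # 62.5%
```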
For each round:
1. Analyze the weakest checklist items from the last scoring pass.
2. Make ONE small change to the agent prompt targeting those items.
3. Re-run the agent on the same test input and score again.
4. If the score improved, keep the change; if it dropped, revert it.
When done (target reached or max rounds), produce:
# Autoresearch Results: [agent-name]
## Summary
- **Baseline score:** 62.5% (5/8)
- **Final score:** 93.75% (7.5/8)
- **Rounds:** 6
- **Changes kept:** 4
- **Changes reverted:** 2
## Changelog
### Round 1: Added gotchas section ✓ KEPT
- **Score:** 62.5% → 75%
- **Change:** Added "Gotchas" section listing 6 common failures
- **Why:** Checklist item "identifies production-risk bugs" was failing
- **Effect:** Bug identification improved in 2/3 test runs
### Round 2: Added explicit timezone rule ✓ KEPT
- **Score:** 75% → 81.25%
- **Change:** Added rule "Always flag timezone-naive datetime operations"
- **Why:** Checklist item "catches time-related bugs" was failing
### Round 3: Reduced prompt verbosity ✗ REVERTED
- **Score:** 81.25% → 75%
- **Change:** Removed 3 paragraphs of explanation, kept only rules
- **Why:** Hypothesized less text = more focused output
- **Effect:** Quality dropped, model needed the context
[...]
## Files
- **Original:** agents/[name].md.backup
- **Improved:** agents/[name].md
- **Results log:** ${CLAUDE_PLUGIN_DATA}/autoresearch/[name]-results.json
Store results in ${CLAUDE_PLUGIN_DATA}/autoresearch/:
${CLAUDE_PLUGIN_DATA}/autoresearch/
├── code-reviewer-results.json
├── code-reviewer-changelog.md
├── qa-agent-results.json
└── qa-agent-changelog.md
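Persisting the results file can be sketched as below. The schema shown is a hypothetical assumption, since the skill does not pin one down; only the directory layout and filename pattern come from the text above.

```python
import json
import os

def save_results(data_dir: str, agent: str, results: dict) -> str:
    """Write round-by-round results under <data_dir>/autoresearch/<agent>-results.json."""
    out_dir = os.path.join(data_dir, "autoresearch")
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"{agent}-results.json")
    with open(path, "w") as f:
        json.dump(results, f, indent=2)
    return path

# Hypothetical record, mirroring the changelog example above
results = {
    "agent": "code-reviewer",
    "baseline": 0.625,
    "final": 0.9375,
    "rounds": [
        {"n": 1, "change": "Added gotchas section", "score": 0.75, "kept": True},
        {"n": 3, "change": "Reduced prompt verbosity", "score": 0.75, "kept": False},
    ],
}
```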
Good changes (one at a time):
- Add a short section targeting a failing checklist item (e.g. a "Gotchas" list)
- Add one explicit, checkable rule ("Always flag timezone-naive datetime operations")
- Tighten one vague instruction into a concrete one
Bad changes (avoid):
- Several unrelated edits in a single round (you can't tell which one moved the score)
- Deleting large chunks of context at once
- Rewriting the whole prompt from scratch
The autoresearch skill is a meta-tool: it improves the tools that build your software. Run it periodically on agents that produce inconsistent results, for example after significant prompt edits or when output quality starts to drift.