deep-evolve
Autonomous experimentation plugin for Claude Code. Specify a goal, and deep-evolve systematically improves your project through measured experiment loops.
Inspiration
This project is inspired by autoresearch by Andrej Karpathy — an experiment to have AI agents do their own research autonomously. The core idea: give an AI agent a codebase, let it experiment overnight — modifying code, evaluating results, keeping improvements, discarding regressions — and wake up to a better project.
The self-evolutionary architecture (v2.0) is inspired by HyperAgents — agents that evolve their own strategies through meta-learning, not just the target code.
deep-evolve generalizes this methodology from ML training to any software project, packaging it as a Claude Code plugin with automatic evaluation harness generation, journal-based crash recovery, multi-domain template support, and self-evolutionary strategy evolution.
Role in Harness Engineering
deep-evolve operates outside the standard Harness Engineering framework — it is an autonomous experimentation protocol that iteratively improves code through measured experiment loops. While the framework focuses on guiding and sensing during normal development, deep-evolve represents a complementary approach: using automated experimentation to discover improvements that no guide or sensor would suggest. It is part of the Deep Suite ecosystem but follows its own experiment→evaluate→keep/discard cycle.
With v2.0's Outer Loop, deep-evolve goes further: it not only improves the target code but also evolves the strategy that drives experiments — and can even expand the evaluation harness itself when convergence is detected. This 3-layer self-evolution (parameters → strategy text → evaluation expansion) makes the system a true meta-optimizer that improves its own improvement process.
Self-Evolutionary Experiment Loop (v2.0)
v2.0 introduces a self-evolutionary architecture in which the system not only improves the target code but also evolves the strategy that drives experiments.
2-Tier Architecture: Outer Loop + Inner Loop
┌─────────────────────────────────────────────────────────────┐
│ Outer Loop (Strategy Evolution) │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ strategy.yaml: evolvable strategy parameters │ │
│ │ (mutation_rate, idea_bank, focus_areas, ...) │ │
│ └───────────────────┬─────────────────────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Inner Loop (Experiment Execution) │ │
│ │ Run N experiments with current strategy │ │
│ │ → measure Q(v) meta-metric │ │
│ └───────────────────┬─────────────────────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Evaluate & Evolve Strategy │ │
│ │ Q(v) = (best_score - baseline) / experiments_used │ │
│ │ → mutate strategy.yaml for next outer iteration │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
- Inner Loop: The original experiment cycle — modify code, evaluate, keep/discard. Now driven by strategy.yaml parameters.
- Outer Loop: Evolves the strategy itself. After each inner loop epoch, measures Q(v) (improvement velocity) and mutates strategy parameters to find better experimental approaches.
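The 2-tier loop above can be sketched in a few lines of Python. This is a minimal illustration, not the plugin's implementation: `run_experiment` and `mutate_strategy` are hypothetical stand-ins for the agent-driven steps, and only `Q(v)` follows the formula given in the diagram.

```python
import random

def q_metric(best_score, baseline, experiments_used):
    """Q(v) meta-metric: improvement over baseline per experiment spent."""
    return (best_score - baseline) / experiments_used

def run_experiment(strategy):
    # Hypothetical stand-in: in the real plugin, an agent modifies code,
    # runs the evaluation harness, and returns a score.
    return random.gauss(strategy["mutation_rate"], 0.1)

def mutate_strategy(strategy):
    # Hypothetical stand-in: nudge one evolvable parameter for the next epoch.
    s = dict(strategy)
    s["mutation_rate"] = min(1.0, max(0.0, s["mutation_rate"] + random.uniform(-0.1, 0.1)))
    return s

def outer_loop(strategy, baseline, epochs=3, n_experiments=5):
    """Outer loop: run inner-loop epochs, score each with Q(v), evolve the strategy."""
    best_q = float("-inf")
    for _ in range(epochs):
        # Inner loop: N experiments under the current strategy
        scores = [run_experiment(strategy) for _ in range(n_experiments)]
        q = q_metric(max(scores), baseline, n_experiments)
        if q > best_q:
            best_q = q  # this strategy becomes a stepping stone
        strategy = mutate_strategy(strategy)
    return best_q

best_q = outer_loop({"mutation_rate": 0.3}, baseline=0.0)
```

The key design point is that the outer loop never scores a strategy directly; it scores the *velocity* of improvement the strategy produced, so a strategy that burns many experiments for a small gain loses to one that finds the same gain cheaply.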
3-Layer Self-Evolution
deep-evolve evolves at three layers simultaneously:
| Layer | File | What Evolves | How |
|---|---|---|---|
| Parameters | strategy.yaml | Mutation rate, focus areas, idea bank | Outer Loop mutates per epoch |
| Strategy Text | program.md | Agent instructions, experiment approach | Meta Analysis auto-revises on convergence |
| Evaluation | prepare.py | Scenarios, difficulty, coverage | Section D auto-triggers on plateau |
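The Parameters layer might look like the following strategy.yaml sketch. The file name and the fields mutation_rate, focus_areas, and idea_bank come from the document; the concrete values and entries are illustrative assumptions only.

```yaml
# strategy.yaml — evolvable strategy parameters (illustrative sketch)
mutation_rate: 0.3          # how aggressively experiments deviate from the baseline
focus_areas:                # areas the inner loop prioritizes this epoch
  - performance
  - error_handling
idea_bank:                  # candidate experiment ideas the agent draws from
  - "cache repeated computations"
  - "replace linear scan with an index"
```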
Strategy & Code Archives (Stepping Stones)
Every strategy that achieves a new best Q(v) is archived as a stepping stone: