From claude-evolve
Conducts Socratic interview with ambiguity scoring across goal, program, evaluation, and constraints to generate initial.py, evaluate.py, config.json for evolution tasks.
`npx claudepluginhub samuelzxu/claude-evolve --plugin claude-evolve`

This skill uses the workspace's default tool permissions.
Ouroboros-inspired Socratic interview that refuses to start an expensive evolution run until the task specification is mathematically clear. Asks targeted questions across four dimensions, scores ambiguity after every answer, and only crystallizes into concrete artifacts (`initial.py`, `evaluate.py`, `config.json`) once ambiguity drops below 20%.
Runs autonomous evolutionary optimization on code blocks marked in seed programs using Claude models for mutations and user-defined evaluators for fitness scoring. Use for performance tuning.
Starts, monitors, or rewinds evolutionary development loops that iteratively refine ontologies and criteria using Ouroboros MCP tools until convergence. For evolving complex project specs.
Runs propose-evaluate-iterate loop to optimize and evolve AI agent performance using LangSmith evaluations and git worktrees for isolation. Requires .evolver.json setup.
When this skill is invoked, immediately execute the workflow below. Do not only restate or summarize these instructions back to the user.
Evolutionary code discovery is expensive. Each generation burns real LLM calls and wall-clock time. If the fitness function is ambiguous, the wrong code region is marked mutable, or the evaluator is unreliable, the entire run produces garbage. This skill prevents that by forcing specification clarity before execution.
Analogous to oh-my-claudecode's deep-interview, but specialized for the four dimensions that actually matter for evolution: goal, program, evaluation, and constraints.
When to use:
- Greenfield: the project has no initial.py / evaluate.py yet, so the interview builds the spec from scratch.
- Already set up: initial.py with EVOLVE-BLOCK markers AND a working evaluate.py exist → run /evolve directly (see also /evolve-status and /evolve-install).
- Brownfield: existing code without evolution scaffolding; map it with an Explore subagent BEFORE asking the user about it.

Ambiguity score computed as:
ambiguity = 1 - (goal * 0.30 + program * 0.25 + evaluation * 0.30 + constraints * 0.15)
Each sub-score is in [0.0, 1.0]:
| Dimension | Weight | What "1.0" looks like |
|---|---|---|
| Goal Clarity | 0.30 | A single scalar fitness function stated in one sentence. No qualifiers. |
| Program Clarity | 0.25 | Exact file path + exact function/region to mark mutable. EVOLVE-BLOCK boundaries known. |
| Evaluation Clarity | 0.30 | Concrete command that runs real code and returns a number. Deterministic or well-averaged. |
| Constraint Clarity | 0.15 | Hard bounds on what must not change, time budget, correctness checks. |
Threshold: 0.20 ambiguity (i.e. 0.80 total clarity) before crystallization.
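For instance, with hypothetical sub-scores goal = 0.9, program = 0.7, evaluation = 0.8, and constraints = 0.6: clarity = 0.9*0.30 + 0.7*0.25 + 0.8*0.30 + 0.6*0.15 = 0.775, so ambiguity = 0.225 and the interview continues, since 0.225 is still above the 0.20 threshold.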
For brownfield projects, spawn an Explore agent to map the relevant area and store codebase_context. Examples of what to extract: function signatures, test files, existing benchmarks, language/runtime, external deps.

Initialize state/interview-state.json:

```json
{
  "active": true,
  "phase": "interview",
  "interview_id": "<uuid-or-timestamp>",
  "type": "greenfield|brownfield",
  "initial_idea": "<user input>",
  "rounds": [],
  "current_ambiguity": 1.0,
  "threshold": 0.20,
  "codebase_context": null,
  "challenge_modes_used": [],
  "ontology_snapshots": []
}
```
Starting evolution task interview. I'll ask targeted questions to understand what you want to evolve before committing to an expensive run. After each answer I'll show your clarity score. We'll proceed to spec crystallization once ambiguity drops below 20%.
Your idea: "<initial_idea>"
Project type: <greenfield|brownfield>
Current ambiguity: 100%
Repeat until ambiguity ≤ 0.20 OR user early-exits:
Select the dimension with the lowest clarity score. If tied, use weight order: Goal > Evaluation > Program > Constraints.
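A minimal sketch of that selection rule, assuming the per-dimension scores are kept in a plain dict (the key names here are illustrative):

```python
# Illustrative only: pick the weakest dimension, breaking score ties
# by weight order Goal > Evaluation > Program > Constraints.
PRIORITY = ["goal", "evaluation", "program", "constraints"]

def next_target(scores: dict[str, float]) -> str:
    return min(PRIORITY, key=lambda dim: (scores[dim], PRIORITY.index(dim)))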
Question styles by dimension:
| Dimension | Question template | Example |
|---|---|---|
| Goal | "What single number defines success?" | "You said 'faster' — are we minimizing wall-clock seconds, CPU time, or something else?" |
| Program | "Which exact region is mutable?" | "I see three functions in solver.py. Which one is the algorithm you want to replace vs. fixed interface?" |
| Evaluation | "How do we measure that without asking an LLM?" | "Do you have a test suite we can run, or do we need a dedicated evaluator that runs the code on sample inputs?" |
| Constraints | "What must remain unchanged or bounded?" | "Is there a memory limit, runtime cap per evaluation, or an API surface that must stay stable?" |
If the scope feels conceptually fuzzy (user keeps redefining the target), switch to an ontology question before returning to normal dimensions:
"Across the last few rounds you've described this as X, Y, and Z. Which of those IS the thing we're evolving, and which are inputs or metrics?"
Use AskUserQuestion to present the question with clickable options when possible. Header format:
Round {n} | Targeting: {weakest_dim} | Why now: {one-sentence rationale} | Ambiguity: {pct}%
{question}
Options should be concrete candidates (e.g. three distinct fitness metric choices) plus "Other" for free text.
After the user answers, compute scores for all four dimensions.
Scoring rubric (apply to every dimension):
Be conservative. If you can write multiple different valid specifications from the user's answers, the score is not above 0.8.
Compute:
clarity = goal*0.30 + program*0.25 + evaluation*0.30 + constraints*0.15
ambiguity = 1 - clarity
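A small sketch of that arithmetic, assuming the four sub-scores have already been judged (bookkeeping only, not part of the generated artifacts):

```python
# Weights match the formula above.
WEIGHTS = {"goal": 0.30, "program": 0.25, "evaluation": 0.30, "constraints": 0.15}

def score_round(scores: dict[str, float]) -> tuple[float, float]:
    """Return (clarity, ambiguity) from sub-scores in [0.0, 1.0]."""
    clarity = sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)
    return clarity, 1.0 - clarity
```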
Each round, extract the key entities mentioned (nouns with meaningful structure): Program, Evaluator, Input, Output, Metric, Constraint, Test, etc.
Track stability across rounds:
- stable_entities — present in both current and previous rounds
- changed_entities — renamed (same type, >50% field overlap)
- new_entities — first seen this round
- removed_entities — in previous but not current
- stability_ratio = (stable + changed) / total. Round 1 is N/A (no previous round to compare).
Append the snapshot to state.ontology_snapshots[].
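A rough sketch of the snapshot comparison, assuming entities are tracked by name and the renamed-entity count is supplied by the interviewer's judgment (field-overlap detection is omitted here):

```python
def ontology_snapshot(previous: set[str], current: set[str], changed: int = 0) -> dict:
    """Compare entity names across rounds; `changed` counts renamed entities."""
    stable = previous & current          # present in both rounds
    new = current - previous             # first seen this round
    removed = previous - current         # dropped since the last round
    total = len(current) or 1            # avoid division by zero on an empty round
    return {
        "stable": sorted(stable),
        "new": sorted(new),
        "removed": sorted(removed),
        "stability_ratio": (len(stable) + changed) / total,
    }
```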
After scoring, show the user:
Round {n} complete.
| Dimension | Score | Weight | Weighted | Gap |
|-----------|-------|--------|----------|-----|
| Goal | {g} | 0.30 | {g*0.30} | {gap} |
| Program | {p} | 0.25 | {p*0.25} | {gap} |
| Evaluation | {e} | 0.30 | {e*0.30} | {gap} |
| Constraints | {c} | 0.15 | {c*0.15} | {gap} |
| **Total clarity** | | | **{total}** | |
| **Ambiguity** | | | **{1-total}** | |
**Ontology:** {n} entities | Stability: {ratio}% | New: {n} | Changed: {n} | Stable: {n}
**Next target:** {weakest} — {rationale}
Append the round to state.rounds[] and rewrite state/interview-state.json.
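One way the per-round persistence could look, assuming the state file keeps the shape shown earlier (the exact fields of a round record are up to the interviewer):

```python
import json
from pathlib import Path

STATE_PATH = Path("state/interview-state.json")

def record_round(round_record: dict, ambiguity: float) -> None:
    """Append one interview round and persist the updated ambiguity."""
    state = json.loads(STATE_PATH.read_text())
    state["rounds"].append(round_record)
    state["current_ambiguity"] = ambiguity
    STATE_PATH.write_text(json.dumps(state, indent=2))
```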
At specific rounds, inject a perspective shift into the question generation prompt. Each mode is used once and then normal questioning resumes.
CONTRARIAN MODE: Your next question should challenge the user's framing. Ask "what if the opposite were true?" or "what if the constraint you're treating as hard were actually negotiable?" Target the assumption most likely to be wrong.
Example: "You've said we must preserve the existing API surface. What if we didn't — would that unlock a dramatically better algorithm?"
SIMPLIFIER MODE: Your next question should probe whether the spec can be reduced. Ask "what's the smallest version that would still be valuable?" or "which dimension could we drop and still ship?" Look for aspirational complexity.
Example: "You mentioned wanting both speed and accuracy. If we optimized purely for speed with a minimum accuracy floor, would that still be useful?"
ONTOLOGIST MODE: Ambiguity is still high. We may be solving the wrong problem. Look at the ontology: {entities}. Ask "which entity is the core thing we're evolving, and which are context or environment?" Force the user to pick a noun.
Example: "You've mentioned the parser, the AST, and the optimizer. Which one is the evolution target? The other two are constraints or fixtures."
Track used modes in state.challenge_modes_used[] to prevent repetition.
When ambiguity ≤ 0.20 (or early exit), generate the concrete artifacts:
initial.py

Based on the interview answers, generate a program that:
- wraps the mutable region in # EVOLVE-BLOCK-START / # EVOLVE-BLOCK-END markers
- exposes a run_experiment(**kwargs) function as the stable interface

If the user provided an existing file, read it and insert EVOLVE-BLOCK markers around the region they identified. Do not rewrite code outside the markers.
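A minimal sketch of what a generated initial.py might look like; the task, data, and baseline algorithm below are placeholders, and only the marker comments and the run_experiment(**kwargs) interface come from the spec above:

```python
def run_experiment(**kwargs):
    """Stable interface: the evolution loop calls this; its signature must not change."""
    data = kwargs.get("data", list(range(100, 0, -1)))  # placeholder input

    # EVOLVE-BLOCK-START
    # Baseline implementation. Everything between these markers is mutable.
    result = sorted(data)
    # EVOLVE-BLOCK-END

    return {"result": result}
```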
evaluate.py

Based on the evaluation answers, produce a standalone evaluator that:
- accepts --program_path <path> as an argument
- imports the candidate and calls run_experiment() (or the specified entry point)
- prints JSON containing combined_score and correct

Important: The evaluator must NEVER use claude or any LLM to judge fitness. It must be pure code execution. If the user tried to specify an LLM-as-judge, reject it and ask for an objective metric.
Example template:
```python
#!/usr/bin/env python3
"""Evaluator for <task description>."""
import argparse
import importlib.util
import json
import sys
import time  # available if a timing-based metric is needed


def load_program(path):
    """Dynamically import the candidate program from its file path."""
    spec = importlib.util.spec_from_file_location("candidate", path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return mod


def main():
    p = argparse.ArgumentParser()
    p.add_argument("--program_path", required=True)
    args = p.parse_args()

    try:
        mod = load_program(args.program_path)
        # <user-specific: call run_experiment, validate, compute metric>
        # _compute_score and _validate are placeholders for generated, task-specific helpers.
        result = mod.run_experiment()
        score = _compute_score(result)
        correct = _validate(result)
    except Exception as e:
        # Any failure yields a zero score so the evolution loop can keep going.
        print(json.dumps({"combined_score": 0.0, "correct": False, "error": str(e)}))
        sys.exit(0)

    print(json.dumps({
        "combined_score": float(score),
        "correct": bool(correct),
        # <additional metrics>
    }))


if __name__ == "__main__":
    main()
```
config.json

Generate an EvolveConfig-compatible JSON. Set reasonable defaults based on the interview (a hedged example config is sketched below, after the baseline-evaluation step):
- task_description: built from the Goal and Constraints answers
- init_program_path: "initial.py"
- eval_program_path: "evaluate.py"
- num_generations: default 50; user can override
- ensemble.arms: default ["sonnet/medium", "sonnet/low", "haiku/high", "haiku/medium", "haiku/low"] (skip sonnet/high and above unless the user explicitly wants extended thinking — they're slow and often exceed token limits)
- patches.types: ["diff", "full", "cross", "fix"] with default probs
- islands.num_islands: 2 (default) or 3 for a larger search
- novelty.similarity_threshold: 0.95
- llm_timeout: 300 (5 min) — generous for medium effort, not so generous that failures waste hours
- eval_timeout: derived from the user's stated time budget per evaluation (default 120)

Display:
✅ Spec crystallized (ambiguity: {pct}%)
Files written:
- initial.py ({lines} lines, EVOLVE-BLOCK marks lines {startline}-{endline})
- evaluate.py ({lines} lines)
- config.json ({gens} generations, {arms} bandit arms, {islands} islands)
Baseline fitness: run `python3 evaluate.py --program_path initial.py` to verify
Next step: run `/evolve` to start the evolutionary loop.
Then run the baseline evaluation yourself (via Bash) and report the baseline score so the user can see what they're starting from.
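As an illustration, a config.json built from those defaults might look roughly like this; the exact key nesting is inferred from the dotted paths listed above and should be checked against EvolveConfig, and patch probabilities are left to their defaults:

```json
{
  "task_description": "Maximize the sum of circle radii in a valid packing (example goal from the interview)",
  "init_program_path": "initial.py",
  "eval_program_path": "evaluate.py",
  "num_generations": 50,
  "ensemble": { "arms": ["sonnet/medium", "sonnet/low", "haiku/high", "haiku/medium", "haiku/low"] },
  "patches": { "types": ["diff", "full", "cross", "fix"] },
  "islands": { "num_islands": 2 },
  "novelty": { "similarity_threshold": 0.95 },
  "llm_timeout": 300,
  "eval_timeout": 120
}
```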
After the spec is ready, use AskUserQuestion to present:
Question: "Spec ready. How would you like to proceed?"
Options:
1. Run /evolve with the generated config (recommended)

On option 1, invoke the evolve_start MCP tool (or equivalent) with the generated config path.
- AskUserQuestion for every interview question — one at a time, never batch
- Agent(subagent_type="Explore") (haiku, short timeout) for brownfield codebase exploration BEFORE asking the user
- Write to create initial.py, evaluate.py, config.json in Phase 4
- Bash to run the baseline evaluation in Phase 4d
- Skill("claude-evolve:evolve") or the evolve_start MCP tool to hand off in Phase 5
- Write to state/interview-state.json after every round. On resume, re-read the file and continue from the last completed round. If the user edits artifacts manually between sessions, detect that and offer to re-score with the new artifacts in mind.
- initial.py written with EVOLVE-BLOCK markers around the mutable region
- evaluate.py written and returns JSON with combined_score and correct
- config.json written and loads correctly via EvolveConfig.from_json()

Round 3 | Targeting: Evaluation | Why now: we know the fitness metric
(sum of radii) and the mutable region (the packing algorithm), but
haven't specified HOW we measure it — ambiguity of the evaluator is
our biggest remaining gap. | Ambiguity: 38%
Do you have a test suite that validates circle packings, or should
I generate a dedicated evaluator that runs the algorithm once and
computes the sum of radii with overlap/boundary checks?
[explore agent finds: solver.py has parse(), plan(), and execute()]
Round 2 | Targeting: Program | Why now: you said "evolve the planner"
but solver.py has three functions — we need to know exactly which to
wrap in EVOLVE-BLOCK markers. | Ambiguity: 60%
I see three functions in solver.py: parse() (130 lines),
plan() (340 lines), and execute() (45 lines). Is `plan()` the
algorithm we're evolving, with parse() and execute() as the fixed
interface?
"What's the fitness metric, and what's the mutable region, and how
long per evaluation, and how many generations should we run?"
Four questions at once → shallow answers, inaccurate scoring.
User: "Have Claude rate the code quality on a scale of 1 to 10."
Interviewer: (accepts this)
Wrong. Push back and explain that claude-evolve requires real code execution for fitness. Ask for an objective metric: does the code pass tests, produce correct output, run faster, use less memory, etc.