Performance subskill for /evolve - optimizes runtime speed for algorithms
Optimizes algorithm runtime speed using population-based evolution with statistical rigor and generalization testing.
/plugin marketplace add asermax/claude-plugins
/plugin install superpowers@asermax-plugins

This is the performance optimization subskill for the /evolve command. It evolves algorithms for runtime speed (ops/sec, latency, throughput).
Note: This subskill is invoked by the master /evolve skill when performance mode is detected. You can also invoke it directly with /evolve-perf.
These requirements are non-negotiable and must be enforced by the evolution loop:
Three-way split: Every candidate MUST be evaluated on TRAIN + VALID + HOLDOUT/TEST datasets.
Selection vs Promotion:
Determinism Requirements:
Fixed build flags (`--release`, no debug builds during eval).

Data Integrity:
Generalization Testing (required for promotion to champion):
To prevent overfitting to a single distribution, evolved algorithms must generalize:
For each problem domain, generate test data from multiple distributions:
| Problem Type | Required Distributions |
|---|---|
| Bin Packing | Weibull, Uniform, Normal, Bimodal, Power-law |
| Sorting | Random, Nearly-sorted, Reversed, Few-unique, Pipe-organ |
| String Search | English text, Random bytes, Repetitive, DNA sequences, Log files |
| Numeric | Uniform, Gaussian, Exponential, Heavy-tail, Clustered |
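As a sketch of what multi-distribution data generation might look like (distribution names and parameters mirror the evolution.json example below; numpy is an assumed dependency):

import numpy as np

def generate_instance(dist, n_items=5000, seed=42):
    """Sketch: draw one instance of item weights from a named distribution."""
    rng = np.random.default_rng(seed)
    p = dist["params"]
    if dist["name"] == "weibull":
        return rng.weibull(p["k"], n_items) * p["lambda"]
    if dist["name"] == "uniform":
        return rng.uniform(p["min"], p["max"], n_items)
    if dist["name"] == "normal":
        return rng.normal(p["mean"], p["std"], n_items)
    if dist["name"] == "bimodal":
        peaks = rng.choice(p["peaks"], n_items)  # pick a peak per item
        return rng.normal(peaks, p["std"])
    if dist["name"] == "power_law":
        return rng.pareto(p["alpha"], n_items) + 1
    raise ValueError(f"unknown distribution: {dist['name']}")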
Store in evolution.json:
"generalization": {
"distributions": [
{"name": "weibull", "params": {"k": 5, "lambda": 50}, "weight": 1.0},
{"name": "uniform", "params": {"min": 1, "max": 100}, "weight": 1.0},
{"name": "normal", "params": {"mean": 50, "std": 15}, "weight": 1.0},
{"name": "bimodal", "params": {"peaks": [25, 75], "std": 10}, "weight": 1.0},
{"name": "power_law", "params": {"alpha": 2.0}, "weight": 1.0}
],
"promotion_threshold": 0.6,
"report_per_distribution": true
}
A candidate can only become champion if:
def passes_generalization_gate(candidate, baseline, distributions):
    wins = 0
    for dist in distributions:
        if candidate.eval(dist) < baseline.eval(dist):
            wins += 1
    # Must win on at least 60% of distributions (promotion_threshold)
    return wins >= len(distributions) * 0.6
Each generation log includes per-distribution breakdown:
{
  "id": "gen4_hybrid_balanced",
  "generalization": {
    "weibull": {"excess": 0.5735, "vs_baseline": "-16.2%", "status": "WIN"},
    "uniform": {"excess": 0.4821, "vs_baseline": "-12.1%", "status": "WIN"},
    "normal": {"excess": 0.5102, "vs_baseline": "-14.8%", "status": "WIN"},
    "bimodal": {"excess": 0.6234, "vs_baseline": "-8.3%", "status": "WIN"},
    "power_law": {"excess": 0.7102, "vs_baseline": "+2.1%", "status": "LOSS"},
    "summary": "4/5 distributions improved (80%)"
  }
}
Flag candidates that show signs of overfitting:
A candidate is accepted into the population only if ALL of the following hold:
TRAIN improvement: Candidate improves mean TRAIN objective by at least ε (epsilon).
Configurable via --epsilon <value> or in evolution.json.

VALID non-regression: Candidate must not regress on the VALID mean.
Instance consistency (at least ONE must hold):
Noise handling:
Correctness: Must pass all correctness tests (implicit, always required)
Statistical confidence (for timing-sensitive benchmarks):
For benchmarks where variance matters (timing, throughput), apply statistical tests:
import statistics

def evaluate_with_confidence(candidate, n_runs=5):
    # run_benchmark: the problem-specific harness returning one scalar per run
    results = [run_benchmark(candidate) for _ in range(n_runs)]
    return {
        "mean": statistics.mean(results),
        "std": statistics.stdev(results),
        "median": statistics.median(results),
        "min": min(results),
        "max": max(results),
        "runs": results
    }
Report all metrics with confidence intervals:
Candidate: gen4_hybrid_balanced
Performance (5 runs):
Excess: 0.5735% ± 0.012% (95% CI: 0.561% - 0.586%)
Bins: 9996.2 ± 1.3
Baseline: funsearch
Excess: 0.6842% ± 0.015% (95% CI: 0.669% - 0.699%)
Bins: 10007.4 ± 1.8
Improvement: 16.2% ± 2.1% (statistically significant, p < 0.01)
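The intervals above can be derived from the per-run samples; a minimal sketch using a t-based confidence interval (scipy assumed, consistent with the significance test below):

import statistics
from scipy import stats

def confidence_interval(runs, confidence=0.95):
    """t-based confidence interval for the mean of a small sample."""
    n = len(runs)
    mean = statistics.mean(runs)
    sem = statistics.stdev(runs) / n ** 0.5          # standard error of the mean
    t = stats.t.ppf((1 + confidence) / 2, df=n - 1)  # two-sided critical value
    return mean, mean - t * sem, mean + t * sem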
import statistics
from scipy import stats

def is_significant_improvement(candidate_runs, baseline_runs, alpha=0.05):
    """Two-sample t-test for improvement significance"""
    t_stat, p_value = stats.ttest_ind(candidate_runs, baseline_runs)
    # One-sided test: candidate < baseline (lower is better)
    p_one_sided = p_value / 2 if t_stat < 0 else 1 - p_value / 2
    # Pooled standard deviation for the Cohen's d effect size
    n_c, n_b = len(candidate_runs), len(baseline_runs)
    pooled_std = (((n_c - 1) * statistics.variance(candidate_runs)
                 + (n_b - 1) * statistics.variance(baseline_runs))
                 / (n_c + n_b - 2)) ** 0.5
    return {
        "significant": p_one_sided < alpha,
        "p_value": p_one_sided,
        "confidence": 1 - alpha,
        "effect_size": (statistics.mean(baseline_runs)
                        - statistics.mean(candidate_runs)) / pooled_std
    }
For deterministic benchmarks (exact bin counts, not timing):
def needs_statistical_testing(benchmark_type):
    # Deterministic benchmarks produce exact, repeatable scores; skip the t-test
    deterministic = ["bin_packing", "sorting_correctness", "exact_count"]
    stochastic = ["throughput", "latency", "timing", "ops_per_second"]
    return benchmark_type in stochastic
// evolution.json acceptance config
"acceptance": {
  "epsilon": 0.001,
  "valid_regression_tolerance": 0.005,
  "instance_threshold_ratio": 0.6,
  "noise_rerun_count": 3,
  "noise_multiplier": 2.0
}
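Putting the criteria together, a minimal sketch of the acceptance gate under these config fields (the candidate/champion objects are assumed to expose TRAIN/VALID means and per-instance scores, lower is better; correctness tests are assumed to have already passed):

def accept_candidate(candidate, champion, cfg):
    # 1. TRAIN improvement by at least epsilon (lower is better)
    if champion.train_mean - candidate.train_mean < cfg["epsilon"]:
        return False, "no_train_improvement"
    # 2. VALID non-regression (within tolerance)
    if candidate.valid_mean > champion.valid_mean + cfg["valid_regression_tolerance"]:
        return False, "valid_regression"
    # 3. Instance consistency: enough individual TRAIN instances improved
    improved = sum(c < b for c, b in zip(candidate.train_per_instance,
                                         champion.train_per_instance))
    if improved < cfg["instance_threshold_ratio"] * len(candidate.train_per_instance):
        return False, "inconsistent_across_instances"
    return True, "accepted"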
Each generation MUST output structured explanations that enable learning from failures:
### Candidate: gen3_simd_radix
**Hypothesis** (1-2 sentences):
SIMD-parallel bucket counting will reduce memory stalls during radix distribution phase.
**Prediction** (specific, measurable):
- Expect 10-20% improvement on large arrays (n > 10000)
- Expect minimal change on small arrays (n < 1000)
- Expect largest gains on uniformly distributed inputs
**Evidence** (per-instance before/after):
| Instance | Baseline Bins | Candidate Bins | Baseline L1 | Candidate L1 | Baseline Excess | Candidate Excess | Delta |
|-------------|---------------|----------------|-------------|--------------|-----------------|------------------|-------|
| train_0 | 2012 | 2008 | 1998 | 1998 | 0.70% | 0.50% | -0.20 |
| train_1 | 1983 | 1985 | 1975 | 1975 | 0.40% | 0.51% | +0.11 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| **TRAIN μ** | 1998.2 | 1996.1 | 1987.8 | 1987.8 | 0.52% | 0.42% | -0.10 |
| **VALID μ** | 2001.4 | 2000.8 | 1990.2 | 1990.2 | 0.56% | 0.53% | -0.03 |
**Decision**: KEEP
- Prediction confirmed: TRAIN improved by 0.10% (> ε=0.001)
- VALID non-regression: 0.53% vs 0.56% (improved)
- Instance check: 4/5 TRAIN instances improved
**Next Mutation Plan** (one concrete change):
Try 8-bit vs 11-bit radix to find optimal bucket count for cache efficiency.
For contrast, a falsified candidate:
**Decision**: DROP
- Prediction falsified: Expected 10-20% gain, observed 2% regression
- Hypothesis update: SIMD overhead dominates for this data size distribution
- Learning: Skip SIMD variants unless median instance size > 50000
The evolution loop MUST maintain population diversity using at least ONE of these mechanisms:
islands = {
    "exploitation": {"focus": "refine_champion", "mutation_rate": 0.1},
    "exploration": {"focus": "radical_changes", "mutation_rate": 0.5},
    "hybrid": {"focus": "crossover_only", "mutation_rate": 0.0}
}
# Migration: Every 3 generations, copy best from each island to others
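A sketch of that migration step (assumes each island dict also carries a "population" list of candidates with a fitness attribute):

def migrate(islands, generation, interval=3):
    # Every `interval` generations, copy each island's best into the others
    if generation % interval != 0:
        return
    bests = {name: max(isl["population"], key=lambda c: c.fitness)
             for name, isl in islands.items()}
    for name, isl in islands.items():
        for other, best in bests.items():
            if other != name:
                isl["population"].append(best)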
Maintain a "probe set" of diverse inputs. Score candidates on behavioral novelty:
def novelty_score(candidate, archive, probe_set, k=3):
    behavior = [candidate.eval(probe) for probe in probe_set]
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5  # euclidean
    distances = [dist(behavior, arch.behavior) for arch in archive]
    return sum(sorted(distances)[:k]) / k  # mean distance to k nearest neighbors
Define 2-3 behavioral dimensions and maintain best-per-bucket:
buckets = {
    "complexity": ["simple (<10 ops)", "medium (10-50 ops)", "complex (>50 ops)"],
    "strategy": ["comparison-based", "distribution-based", "hybrid"],
    "specialization": ["general", "small-input", "large-input"]
}
# Keep best candidate in each bucket; crossover draws from different buckets
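A sketch of the best-per-bucket bookkeeping (bucket_key is a hypothetical helper mapping a candidate to its cell along the dimensions above):

def update_map_elites(archive, candidate, bucket_key):
    # Keep only the best candidate per behavioral bucket
    key = bucket_key(candidate)  # e.g., (complexity, strategy, specialization)
    incumbent = archive.get(key)
    if incumbent is None or candidate.fitness > incumbent.fitness:
        archive[key] = candidate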
Population of 4 must contain at least 2 distinct algorithm_family values. If diversity drops below threshold, force exploration:
Track and report diversity metrics every generation:
"diversity": {
"algorithm_families": {
"harmonic": 2,
"geometric": 1,
"polynomial": 1
},
"unique_families": 3,
"diversity_score": 0.75,
"status": "HEALTHY",
"actions_taken": []
}
import math
from collections import Counter

def calculate_diversity_score(population):
    """Score from 0 (all same) to 1 (all different), via normalized Shannon entropy"""
    families = [p.algorithm_family for p in population]
    total = len(families)
    counts = Counter(families)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    max_entropy = math.log2(total)
    return entropy / max_entropy if max_entropy > 0 else 0
When selecting new population:
def select_with_diversity(candidates, population_size=4):
    selected = []
    # 1. Always keep the champion
    selected.append(candidates[0])
    # 2. Add best from each unique family
    families_seen = {candidates[0].algorithm_family}
    for c in candidates[1:]:
        if c.algorithm_family not in families_seen:
            selected.append(c)
            families_seen.add(c.algorithm_family)
            if len(selected) >= population_size:
                break
    # 3. Fill remaining with best performers
    for c in candidates[1:]:
        if c not in selected:
            selected.append(c)
            if len(selected) >= population_size:
                break
    return selected[:population_size]
When diversity drops below threshold:
⚠️ Diversity Alert (Gen 5)
Population has converged to single algorithm family: "harmonic"
Diversity score: 0.25 (threshold: 0.50)
Automatic action: Spawning 2 alien mutations in Gen 6:
- alien_geometric: Try geometric-based approach
- alien_polynomial: Try polynomial-based approach
Rationale: Maintaining diversity prevents premature convergence
and enables discovering novel hybrid combinations.
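A sketch of the automatic enforcement step (threshold as in the alert above; spawn_alien is a hypothetical helper requesting a new candidate from an unused family):

def enforce_diversity(population, all_families, threshold=0.5, n_aliens=2):
    if calculate_diversity_score(population) >= threshold:
        return []
    used = {p.algorithm_family for p in population}
    unused = [f for f in all_families if f not in used]
    # Spawn alien mutations drawn from families absent in the population
    return [spawn_alien(family) for family in unused[:n_aliens]]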
Evolved algorithms must remain simple and efficient:
priority() / core function must be O(1) per element (no nested loops over input).

Cap maximum complexity to prevent overfitting:
complexity_limits = {
    "max_ast_nodes": 50,     # Approximate limit on expression tree size
    "max_operations": 30,    # +, -, *, /, pow, sqrt, ln, exp, abs, min, max
    "max_branches": 5,       # if/else, match arms
    "max_constants": 10,     # Magic numbers
    "max_nested_depth": 4    # Expression nesting depth
}
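One way such limits might be checked, sketched with Python's ast module (assumes the evolved formula can be rendered as a Python expression; a syn-based pass would play the same role for Rust sources; operation and depth counting omitted for brevity):

import ast

def within_complexity_limits(expr_source, limits):
    tree = ast.parse(expr_source)
    nodes = list(ast.walk(tree))
    branches = sum(isinstance(n, (ast.If, ast.IfExp)) for n in nodes)
    constants = sum(isinstance(n, ast.Constant) for n in nodes)
    return (len(nodes) <= limits["max_ast_nodes"]
            and branches <= limits["max_branches"]
            and constants <= limits["max_constants"])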
Prefer simpler formulations (use as tiebreaker when fitness is equal):
a * x + b over a * x^2 + b * x + c; smooth ln(x), sqrt(x) over piecewise/discontinuous forms.

Flag and penalize:
Write to .evolve/<problem>/generations.jsonl (append-only):
{"gen": 1, "timestamp": "2024-12-26T10:30:00Z", "candidates": [...], "champion_id": "gen1_log", "train_best": 0.5836, "valid_best": 0.5901, "test_best": 0.5842}
{"gen": 2, "timestamp": "2024-12-26T10:35:00Z", "candidates": [...], "champion_id": "gen1_log", "train_best": 0.5836, "valid_best": 0.5901, "test_best": 0.5842}
Each candidate entry in the JSONL:
{
  "id": "gen3_simd_radix",
  "parent_ids": ["gen2_radix", "gen1_quicksort"],
  "mutation_type": "crossover",
  "git_diff_hash": "a1b2c3d4",
  "code_path": "mutations/gen3_simd_radix.rs",
  "metrics": {
    "train": {"mean": 0.5720, "median": 0.5650, "std": 0.0234, "per_instance": [...]},
    "valid": {"mean": 0.5834, "median": 0.5801, "std": 0.0198, "per_instance": [...]},
    "test": {"mean": 0.5756, "median": 0.5712, "std": 0.0201, "per_instance": [...]}
  },
  "acceptance": {
    "result": "KEEP",
    "train_improvement": 0.0116,
    "valid_regression": false,
    "instances_improved": "4/5"
  },
  "explanation": {
    "hypothesis": "SIMD-parallel bucket counting reduces memory stalls",
    "prediction": "10-20% improvement on large arrays",
    "outcome": "Confirmed: 11.6% improvement on TRAIN"
  }
}
Maintain .evolve/<problem>/champion.json:
{
  "id": "gen3_simd_radix",
  "generation": 3,
  "discovered_at": "2024-12-26T10:35:00Z",
  "code_path": "rust/src/evolved.rs",
  "metrics": {
    "train": 0.5720,
    "valid": 0.5834,
    "test": 0.5756
  },
  "lineage": ["baseline", "gen1_radix", "gen2_radix", "gen3_simd_radix"],
  "key_innovations": ["11-bit radix", "SIMD bucket count"],
  "reproduce_command": "cd .evolve/bin-packing/rust && cargo run --release --bin benchmark"
}
Log exact commands to reproduce any state:
# Logged in generations.jsonl
"reproduce": {
"setup": "git checkout a1b2c3d4 && cd .evolve/bin-packing/rust",
"build": "~/.cargo/bin/cargo build --release",
"eval": "~/.cargo/bin/cargo run --release --bin benchmark -- --seed 42",
"expected_output_hash": "sha256:abc123..."
}
- Never edit evaluator AND candidate in same step
- One change at a time
- Automatic revert on regression (sketched below)
- Preserve lineage
- Pre-flight validation
- Checkpoint before risky operations (sketched below)
- Bounds on evolution
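A sketch of how checkpoint-and-revert could be wired up with git (the exact command set is an assumption, not prescribed by this skill):

import subprocess

def checkpoint(label):
    # Snapshot current state before a risky mutation
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", f"checkpoint: {label}"], check=True)

def revert_if_regressed(before_metric, after_metric):
    # Lower is better; roll back the checkpoint commit on regression
    if after_metric > before_metric:
        subprocess.run(["git", "revert", "--no-edit", "HEAD"], check=True)
        return True
    return False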
Evolution runs involve executing untrusted, LLM-generated code. These requirements ensure mutations don't interfere with each other and minimize security risks.
CRITICAL: Each mutation MUST execute in complete isolation to prevent race conditions and result corruption.
.evolve/<problem>/
├── workspace/ # Ephemeral mutation workspaces
│ ├── gen3_mut0_abc123/ # Unique per mutation (gen + index + hash)
│ │ ├── rust/ # Complete copy of benchmark code
│ │ ├── results.json # This mutation's results only
│ │ └── .lock # Mutex file for this workspace
│ ├── gen3_mut1_def456/
│ └── ...
├── rust/ # Template/champion code (READ-ONLY during eval)
├── mutations/ # Archive of all tested mutations
└── results/ # Aggregated results (written AFTER all evals)
└── gen3_results.json
Unique workspace per mutation:
import shutil

def create_mutation_workspace(generation, mutation_index, code_hash):
    # problem and template_path are taken from the run's evolution config
    workspace_id = f"gen{generation}_mut{mutation_index}_{code_hash[:6]}"
    workspace_path = f".evolve/{problem}/workspace/{workspace_id}"
    # Copy the template; never symlink, so mutations cannot touch shared files
    shutil.copytree(template_path, workspace_path)
    return workspace_path
No shared files during evaluation:
- Each mutation reads and writes only within its own workspace/gen_X_mutY_*/ directory
- Never write to the shared rust/src/evolved.rs during parallel evaluation

Atomic result collection:
import json
import os

def collect_results(generation, workspaces):
    results = []
    for ws in workspaces:
        # Read each mutation's isolated results
        result_file = f"{ws}/results.json"
        if os.path.exists(result_file):
            with open(result_file) as f:
                results.append(json.load(f))
    # Write aggregated results atomically (atomic_write sketched below)
    with atomic_write(f"results/gen{generation}_results.json") as f:
        json.dump(results, f)
    return results
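atomic_write above is not a standard-library function; a minimal sketch of such a helper, writing to a temp file and renaming so readers never observe a partial file:

import os
import tempfile
from contextlib import contextmanager

@contextmanager
def atomic_write(path):
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w") as f:
            yield f
        os.replace(tmp_path, path)  # atomic on the same filesystem
    except BaseException:
        os.unlink(tmp_path)
        raise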
Cleanup after aggregation:
import glob
import shutil

def cleanup_workspaces(generation):
    # Only clean up after results are safely aggregated
    workspace_dir = f".evolve/{problem}/workspace"
    for ws in glob.glob(f"{workspace_dir}/gen{generation}_*"):
        shutil.rmtree(ws)
When running mutations in parallel via Task agents:
# CORRECT: Each agent gets unique workspace
for i, mutation in enumerate(mutations):
    workspace = create_mutation_workspace(gen, i, hash(mutation))
    agent.run(
        workspace=workspace,
        evolved_rs=f"{workspace}/rust/src/evolved.rs",  # Isolated
        results_out=f"{workspace}/results.json"         # Isolated
    )

# WRONG: Shared paths cause race conditions
for mutation in mutations:
    agent.run(
        evolved_rs=".evolve/problem/rust/src/evolved.rs",  # COLLISION!
        results_out=".evolve/problem/results.json"         # COLLISION!
    )
Evolution must run unattended. Configure agents for autonomous operation:
When spawning mutation/evaluation agents via the Task tool:
Use dangerouslyDisableSandbox for Bash execution:
Constrain file access to workspace:
mutation_agent_prompt = f"""
You are evaluating a mutation in an ISOLATED workspace.

WORKSPACE: {workspace_path}

CONSTRAINTS:
- ONLY read/write files within {workspace_path}
- Do NOT access parent directories
- Do NOT access network resources
- Do NOT modify system files

Your task: [mutation/evaluation task]
"""
Pre-approve expected operations:
- cargo build --release in workspace
- cargo run --release --bin benchmark in workspace
- Writing .json and .rs files in workspace

All mutation evaluations MUST have timeouts:
evaluation_limits = {
    "compile_timeout_ms": 120000,        # 2 minutes to compile
    "benchmark_timeout_ms": 300000,      # 5 minutes to benchmark
    "total_mutation_timeout_ms": 600000  # 10 minutes total per mutation
}
If a mutation exceeds its timeout, kill the process, record the mutation as FAILED, and preserve its workspace for investigation (see the failure record format below).
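A sketch of enforcing the compile and benchmark limits configured above via subprocess timeouts:

import subprocess

def run_with_limits(workspace, limits):
    try:
        subprocess.run(["cargo", "build", "--release"], cwd=workspace, check=True,
                       timeout=limits["compile_timeout_ms"] / 1000)
        subprocess.run(["cargo", "run", "--release", "--bin", "benchmark"],
                       cwd=workspace, check=True,
                       timeout=limits["benchmark_timeout_ms"] / 1000)
        return "OK"
    except subprocess.TimeoutExpired:
        return "FAILED: timeout"  # workspace preserved for investigation
    except subprocess.CalledProcessError as e:
        return f"FAILED: exit {e.returncode}"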
Evolved code is untrusted. Minimize blast radius:
Deny unsafe by default (unless explicitly required):
# Cargo.toml for mutation workspaces
[profile.release]
# ... optimization settings ...
[lints.rust]
unsafe_code = "deny" # Override only if mutation explicitly uses unsafe
Limit dependencies:
# Only allow pre-approved dependencies
[dependencies]
# Core only - no network, no filesystem beyond std
Static analysis before execution:
import re

def pre_execution_checks(evolved_rs_path):
    code = open(evolved_rs_path).read()
    # Block dangerous patterns
    dangerous_patterns = [
        r'std::process::Command',    # No shell execution
        r'std::fs::(remove|write)',  # No filesystem modification outside workspace
        r'std::net::',               # No network access
        r'std::env::',               # No environment access
        r'include_bytes!',           # No external file inclusion
        r'extern\s+"C"',             # No FFI
    ]
    for pattern in dangerous_patterns:
        if re.search(pattern, code):
            return False, f"Blocked: {pattern}"
    return True, "OK"
Resource limits (if available on platform):
# Example: limit memory and CPU time
ulimit -v 2097152 # 2GB virtual memory
ulimit -t 300 # 5 minute CPU time
Working directory jail:
# Run benchmark from within workspace, not project root
subprocess.run(
    ["cargo", "run", "--release", "--bin", "benchmark"],
    cwd=workspace_path,  # Jail to workspace
    timeout=300
)
Output sanitization:
Before starting a generation, verify:
- [ ] All mutation workspaces created with unique paths
- [ ] Template code copied (not symlinked) to each workspace
- [ ] No shared mutable state between mutations
- [ ] Timeouts configured for all operations
- [ ] Pre-execution security scan enabled
- [ ] Results aggregation happens AFTER all evals complete
After generation completes:
- [ ] All mutation results collected
- [ ] Results aggregated atomically
- [ ] Workspaces cleaned up
- [ ] Champion promoted only from aggregated results
- [ ] Workspace directory is empty (all cleaned up)
If a mutation agent crashes or times out:
{
  "mutation_id": "gen3_mut5_abc123",
  "status": "FAILED",
  "reason": "timeout",
  "workspace": "/path/to/workspace",
  "preserved": true,
  "investigation_needed": true
}
/evolve-perf <problem description>
/evolve-perf <problem description> --budget <tokens|generations>
/evolve-perf --resume # Continue previous evolution
Examples:
/evolve-perf sorting algorithm for integers
/evolve-perf bin packing heuristic --budget 50k
/evolve-perf string search --budget 20gen
/evolve-perf hash function --budget unlimited
/evolve-perf --resume
Budget Options:
| Budget | Meaning | Approx. Generations |
|---|---|---|
| 10k | 10,000 tokens | ~2-3 generations |
| 50k | 50,000 tokens | ~10-12 generations |
| 100k | 100,000 tokens | ~20-25 generations |
| 5gen | 5 generations | Fixed count |
| unlimited | No limit | Until plateau |
| (none) | Default 50k | ~10-12 generations |
┌─────────────────────────────────────────────────────────────┐
│ Step 0: Bootstrap (first run only) │
│ Step 1: Discover baselines │
│ Step 2: Generate benchmark infrastructure │
│ Step 3: Establish baseline │
│ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ EVOLUTION LOOP (adaptive) │ │
│ │ │ │
│ │ while budget_remaining AND improving: │ │
│ │ - Generation N: crossover + mutation │ │
│ │ - Evaluate offspring │ │
│ │ - Update population │ │
│ │ - Check stopping criteria │ │
│ │ - Checkpoint state │ │
│ │ - If plateau: ask user to continue? │ │
│ │ │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
│ Step Final: Report results │
└─────────────────────────────────────────────────────────────┘
Skip if .evolve/.bootstrapped exists.
Check Rust Toolchain:
~/.cargo/bin/cargo --version 2>/dev/null || cargo --version
If missing, offer installation via AskUserQuestion.
Check Python 3.10+:
python3 --version
Create directories and mark complete:
mkdir -p .evolve
touch .evolve/.bootstrapped
Before generating benchmarks, search for existing published results to establish state-of-the-art baselines.
Run these searches to find authoritative baselines:
search_queries = [
    f"{problem} benchmark state of the art",
    f"{problem} algorithm comparison paper",
    f"{problem} best known result",
    f"{problem} github benchmark rust",
    f"{problem} competitive programming",
]
Use WebSearch for each query, then WebFetch to extract specific results.
For each discovered baseline, record:
"discovered_baselines": [
{
"name": "FunSearch",
"source": "Nature 2024",
"url": "https://www.nature.com/articles/s41586-023-06924-6",
"metric": "0.68% excess",
"benchmark": "Weibull 5k",
"verified": true,
"our_reproduction": "0.6842%"
},
{
"name": "Best Fit Decreasing",
"source": "Standard algorithm",
"metric": "~4% excess",
"benchmark": "Weibull 5k",
"verified": true
}
]
CRITICAL: Verify discovered baselines by implementing and running them:
- Implement each discovered baseline in baselines.rs
- Run them on the same benchmark data and compare against the published numbers

Baseline Verification Report:
| Baseline | Published | Our Run | Match | Notes |
|----------|-----------|---------|-------|-------|
| FunSearch | 0.68% | 0.6842% | ✓ | Exact match |
| Best Fit | ~4% | 3.98% | ✓ | Within tolerance |
| First Fit | ~4% | 4.23% | ✓ | As expected |
If no published baselines exist, fall back to naive and standard-library implementations as reference points.
Create in .evolve/<problem-name>/:
.evolve/<problem-name>/
├── rust/
│ ├── Cargo.toml
│ └── src/
│ ├── lib.rs # Trait definition
│ ├── baselines.rs # Known algorithms to beat
│ ├── evolved.rs # Current champion
│ └── benchmark.rs # Benchmark binary
├── data/
│ ├── train/ # Training instances (for selection)
│ ├── valid/ # Validation instances (for promotion gate)
│ └── test/ # Holdout instances (never used for selection)
├── evaluator.py # Fitness evaluation
├── evolution.json # Full evolution state (for resume)
├── champion.json # Best-so-far manifest
├── generations.jsonl # Per-generation log (append-only)
└── mutations/ # All tested mutations
This file enables resumption and tracks all evolution state:
{
  "mode": "perf",
  "problem": "sorting algorithm for integers",
  "created": "2024-01-15T10:30:00Z",
  "updated": "2024-01-15T11:45:00Z",
  "reproducibility": {
    "rust_toolchain": "1.75.0",
    "git_commit": "a1b2c3d4e5f6",
    "platform": "darwin-arm64",
    "train_data_hash": "sha256:abc123...",
    "valid_data_hash": "sha256:def456...",
    "test_data_hash": "sha256:789ghi..."
  },
  "acceptance": {
    "epsilon": 0.001,
    "valid_regression_tolerance": 0.005,
    "instance_threshold_ratio": 0.6,
    "noise_rerun_count": 3
  },
  "problem_analysis": {
    "algorithm_families": ["quicksort", "mergesort", "heapsort", "radix", "counting", "shell", "timsort", "introsort"],
    "optimization_dimensions": ["cache", "simd", "branch_prediction", "small_array", "nearly_sorted", "memory"],
    "viable_strategies": 26,
    "gen1_agents": 26,
    "gen2_agents": 16
  },
  "budget": {
    "type": "tokens",
    "limit": 50000,
    "used": 23450,
    "remaining": 26550
  },
  "generation": 5,
  "status": "running",
  "baseline": {
    "naive": 1289,
    "std": 114592,
    "std_unstable": 168417
  },
  "champion": {
    "id": "gen4_crossover_radix_shell",
    "fitness": 0.94,
    "ops_per_second": 185000,
    "generation_discovered": 4,
    "train_metric": 0.5720,
    "valid_metric": 0.5834,
    "test_metric": 0.5756
  },
  "population": [
    {
      "id": "gen4_crossover_radix_shell",
      "fitness": 0.94,
      "ops_per_second": 185000,
      "algorithm_family": "hybrid_radix_shell",
      "key_innovations": ["11-bit radix", "insertion base case", "gap presort"],
      "parents": ["gen3_radix_quick", "gen2_shellsort"],
      "code_path": "mutations/gen4_crossover_radix_shell.rs"
    }
    // ... 3 more
  ],
  "history": [
    {
      "generation": 1,
      "best_fitness": 0.89,
      "best_ops": 156000,
      "best_id": "gen1_radix",
      "improvement": null,
      "tokens_used": 4500
    },
    {
      "generation": 2,
      "best_fitness": 0.91,
      "best_ops": 172000,
      "best_id": "gen2_crossover_radix_quick",
      "improvement": 0.02,
      "tokens_used": 4200
    },
    {
      "generation": 3,
      "best_fitness": 0.91,
      "best_ops": 173000,
      "best_id": "gen2_crossover_radix_quick",
      "improvement": 0.00,
      "plateau_count": 1,
      "tokens_used": 4100
    }
  ],
  "stopping": {
    "plateau_count": 0,
    "plateau_threshold": 3,
    "min_improvement": 0.005,
    "max_generations": null
  }
}
Analyze the problem to estimate:
Example analysis for "sorting integers":
Algorithm families (8+):
- Comparison: quicksort, mergesort, heapsort, shellsort, timsort
- Distribution: radix sort, counting sort, bucket sort
- Hybrid: introsort, pdqsort, pattern-defeating
Optimization dimensions (6):
- Cache efficiency, SIMD, branch prediction, memory allocation
- Small-array specialization, nearly-sorted detection
Estimated viable strategies: 26 (8 families + 3 × 6 optimization variants)
Recommended agents: 26 (Gen1), 16 (Gen2+)
def estimate_agents(problem_analysis):
    # Base: number of distinct algorithm families
    algo_families = len(problem_analysis["algorithm_families"])
    # Add optimization dimensions (each can be applied to top algos)
    opt_dimensions = len(problem_analysis["optimization_dimensions"])
    # Viable strategies = families + (top_3_families × opt_dimensions)
    viable_strategies = algo_families + min(3, algo_families) * opt_dimensions
    # Gen1: Explore all viable strategies (cap at 32)
    gen1_agents = min(viable_strategies, 32)
    # Gen2+: Crossover pairs + mutations (scale with population)
    gen2_agents = min(viable_strategies // 2 + 4, 24)
    return {
        "gen1_agents": gen1_agents,
        "gen2_agents": gen2_agents,
        "viable_strategies": viable_strategies
    }
Based on analysis, recommend budget and ask user:
Evolution ready for: sorting algorithm for integers
Baselines:
bubble (naive): 1,289 ops/sec
std: 114,592 ops/sec
std_unstable: 168,417 ops/sec ← target
Problem Analysis:
Algorithm families: 8 (comparison, distribution, hybrid)
Optimization dimensions: 6
Viable strategies: 26
Recommended: 16 agents/gen (budget-capped), ~7k tokens/gen
Budget Options:
1. Quick (10k) - 2 gens, 8 agents each [minimal exploration]
2. Standard (50k) - 8 gens, 16 agents Gen1, 12 Gen2+ [Recommended]
3. Deep (100k) - 16 gens, 24 agents Gen1, 16 Gen2+ [thorough]
4. Maximum (200k) - 32 gens, 32 agents Gen1, 24 Gen2+ [exhaustive]
5. Unlimited - run until plateau
⚡ For this problem, Standard (50k) should explore most viable algorithms.
Deep (100k) recommended if you want thorough hybrid combinations.
| Problem | Algo Families | Opt Dims | Viable | Gen1 Agents | Gen2+ Agents |
|---|---|---|---|---|---|
| Fibonacci | 4 | 2 | 10 | 10 | 8 |
| Sorting | 8 | 6 | 26 | 26 | 16 |
| String search | 6 | 4 | 18 | 18 | 12 |
| Hash function | 10 | 5 | 25 | 25 | 16 |
| Integer parsing | 3 | 4 | 15 | 15 | 10 |
Store analysis and agent counts in evolution.json.
Tokens scale with agent count:
def estimate_tokens_per_gen(agent_count):
    return agent_count * 400 + 800  # ~400 tokens/agent + overhead

# Examples:
#  8 agents → ~4,000 tokens/gen
# 16 agents → ~7,200 tokens/gen
# 32 agents → ~13,600 tokens/gen
Display budget status after every generation:
┌───────────────────────────────────────────────┐
│ Generation 4 Complete │
│ │
│ Budget: ████████████░░░░░░░░ 58% (29k/50k) │
│ │
│ This gen: 7,200 tokens (16 agents) │
│ Remaining: ~2-3 more generations │
└───────────────────────────────────────────────┘
"budget": {
"type": "tokens",
"limit": 50000,
"used": 29000,
"remaining": 21000,
"per_generation": [
{"gen": 1, "tokens": 8200, "agents": 16},
{"gen": 2, "tokens": 6800, "agents": 12},
{"gen": 3, "tokens": 7100, "agents": 12},
{"gen": 4, "tokens": 6900, "agents": 12}
],
"avg_per_gen": 7250,
"estimated_remaining_gens": 2.9
}
Display warnings at thresholds:
def check_budget_warnings(budget):
    remaining_pct = budget.remaining / budget.limit * 100
    if remaining_pct <= 10:
        return "⚠️ CRITICAL: <10% budget remaining. Final generation."
    elif remaining_pct <= 25:
        return "⚠️ WARNING: <25% budget remaining. Consider stopping soon."
    elif remaining_pct <= 50:
        return "ℹ️ Over half of budget used. On track."
    else:
        return None
Spawn N parallel mutation agents where N = gen1_agents from problem analysis.
Generate strategy list dynamically based on problem:
def generate_strategies(problem_analysis, agent_count):
    strategies = []
    # 1. One agent per algorithm family
    for family in problem_analysis["algorithm_families"]:
        strategies.append(f"implement_{family}")
    # 2. Optimization variants of top families
    top_families = problem_analysis["algorithm_families"][:3]
    for family in top_families:
        for opt in problem_analysis["optimization_dimensions"]:
            strategies.append(f"{family}_{opt}")
    # 3. Fill remaining with general strategies
    general = ["tweak", "unroll", "specialize", "vectorize",
               "memoize", "restructure", "hybrid", "alien",
               "simd", "branch_free", "cache_friendly", "unsafe_opt"]
    while len(strategies) < agent_count:
        strategies.extend(general)
    return strategies[:agent_count]
Evaluate all, extract innovations, select top 4 with diversity.
Update evolution.json after each generation.
Crossover is REQUIRED, not optional. Each Gen2+ generation MUST include crossover candidates.
Each generation uses gen2_agents from problem analysis:
def allocate_gen2_agents(agent_count, population):
    # MANDATORY: At least 50% crossover in Gen2+
    crossover_count = max(agent_count // 2, 3)  # Minimum 3 crossover
    mutation_count = agent_count - crossover_count
    # Crossover: pair top performers (ensure diverse pairings)
    crossover_pairs = []
    for i in range(len(population)):
        for j in range(i + 1, len(population)):
            # Prefer pairing different algorithm families
            if population[i].algorithm_family != population[j].algorithm_family:
                crossover_pairs.append((population[i], population[j]))
    # Fill with same-family pairs if needed
    if len(crossover_pairs) < crossover_count:
        for i in range(len(population)):
            for j in range(i + 1, len(population)):
                if (population[i], population[j]) not in crossover_pairs:
                    crossover_pairs.append((population[i], population[j]))
    crossover_pairs = crossover_pairs[:crossover_count]
    # Mutation: apply diverse strategies to top performers
    mutation_targets = population[:mutation_count]
    return crossover_pairs, mutation_targets
Each crossover candidate MUST include:
{
  "id": "gen4_hybrid_balanced",
  "mutation_type": "crossover",
  "parent_ids": ["gen1_harmonic", "gen3_geometric_mean"],
  "parent_contributions": {
    "gen1_harmonic": ["harmonic mean scoring", "coefficient 50"],
    "gen3_geometric_mean": ["geometric mean term", "sqrt scaling"]
  },
  "novel_combination": "50/50 blend of harmonic and geometric signals"
}
Each generation:
After each generation, evaluate:
def should_stop(state):
    # 1. Budget exhausted
    if state.budget_used >= state.budget_limit:
        return True, "budget_exhausted"
    # 2. Plateau detected (no improvement for N generations)
    if state.plateau_count >= state.plateau_threshold:
        return True, "plateau"
    # 3. Target achieved (if specified)
    if state.target_fitness and state.champion_fitness >= state.target_fitness:
        return True, "target_achieved"
    # 4. Max generations (if specified)
    if state.max_generations and state.generation >= state.max_generations:
        return True, "max_generations"
    return False, None
Track improvement across generations:
def update_plateau_status(current_fitness, previous_fitness, plateau_count, min_improvement=0.005):
    improvement = current_fitness - previous_fitness
    if improvement < min_improvement:
        return plateau_count + 1  # One more generation without meaningful gain
    return 0  # Reset on improvement
Plateau threshold: 3 generations without meaningful improvement (>0.5%)
When plateau detected OR every 5 generations with unlimited budget:
Generation 8 Complete:
Champion: 185K ops/sec (+14,254% vs bubble, +10% vs std_unstable)
⚠️ Plateau detected: No improvement for 3 generations
Budget used: 34,200 / 50,000 tokens (68%)
Options:
1. Continue evolution (may find breakthrough)
2. Stop and save champion
3. Try radical mutations only (higher variance)
Use AskUserQuestion to let user decide.
If improvement after plateau:
🎉 Breakthrough in Generation 9!
Previous best: 185K ops/sec
New champion: 201K ops/sec (+8.6%)
Innovation: SIMD-friendly radix bucket distribution
Plateau reset. Continuing evolution...
Write complete state to evolution.json:
When user runs /evolve-perf --resume:
Locate evolution.json in .evolve/*/, load the saved state, and display:

Resuming evolution: sorting algorithm for integers
Status: Paused at generation 5
Champion: 185K ops/sec (gen4_crossover_radix_shell)
Budget remaining: 26,550 / 50,000 tokens
Last 3 generations:
Gen 3: 173K ops/sec (plateau 1)
Gen 4: 185K ops/sec (breakthrough!)
Gen 5: 185K ops/sec (plateau 1)
Continue evolution?
When evolution stops (any reason):
Evolution Complete!
Problem: sorting algorithm for integers
Generations: 12
Total tokens: 48,750
Stop reason: plateau (3 generations without improvement)
Evolution Trajectory:
Gen 1: 156K ops/sec radix_sort discovered
Gen 2: 172K ops/sec radix+quicksort hybrid (+10%)
Gen 3: 173K ops/sec minor refinement (+0.5%)
Gen 4: 185K ops/sec breakthrough: added shellsort presort (+7%)
Gen 5: 185K ops/sec plateau
...
Gen 12: 189K ops/sec final refinement
Baselines:
bubble: 1.3K ops/sec (naive starting point)
std: 115K ops/sec
std_unstable: 168K ops/sec
Champion: 189K ops/sec
vs bubble: +14,538% (146x faster)
vs std_unstable: +12.5%
Metrics:
TRAIN: 0.5720
VALID: 0.5834 (no regression from baseline)
TEST: 0.5756 (holdout, for reference only)
Key Innovations in Champion:
- 11-bit radix buckets (Gen 1)
- Sign-bit flip for negatives (Gen 1)
- Insertion sort for n < 32 (Gen 4)
- Nearly-sorted detection (Gen 4)
Champion saved to: .evolve/sorting/rust/src/evolved.rs
State saved to: .evolve/sorting/evolution.json
To continue evolution later: /evolve-perf --resume
You are an algorithm optimizer. Improve this Rust code for SPEED.
TRAIT TO IMPLEMENT:
<trait definition>
CURRENT CODE:
<code>
STRATEGY: <strategy>
Requirements:
- Must implement the trait exactly
- Must pass all correctness tests
- Focus purely on PERFORMANCE
- Use unsafe if it helps (with proper safety invariants)
Return ONLY the complete Rust code for evolved.rs, no explanations.
You are creating a HYBRID algorithm by combining two parent solutions.
TRAIT TO IMPLEMENT:
<trait definition>
PARENT A: <algorithm_family_a> (<ops_per_second_a> ops/sec)
Innovations: <key_innovations_a>
Strengths: <strengths_a>
Weaknesses: <weaknesses_a>
CODE A:
<code_a>
---
PARENT B: <algorithm_family_b> (<ops_per_second_b> ops/sec)
Innovations: <key_innovations_b>
Strengths: <strengths_b>
Weaknesses: <weaknesses_b>
CODE B:
<code_b>
---
Create a HYBRID solution that:
1. COMBINES key innovations from BOTH parents
2. Uses A's approach where A is strong, B's where B is strong
3. May dispatch based on input characteristics (size, pattern detection)
4. Inherits the best constants/thresholds from each
The goal is a solution FASTER than either parent by combining their strengths.
Return ONLY the complete Rust code for evolved.rs, no explanations.
Analyze this algorithm implementation and extract its key innovations.
CODE:
<code>
PERFORMANCE: <ops_per_second> ops/sec
Respond in this exact JSON format:
{
  "algorithm_family": "<e.g., radix_sort, quicksort, lookup_table, simd, etc.>",
  "key_innovations": ["<technique 1>", "<technique 2>", ...],
  "strengths": ["<fast on what>", ...],
  "weaknesses": ["<slow on what>", ...],
  "complexity": {"time": "<O(?)>", "space": "<O(?)>"}
}
When user chooses "radical mutations only" after plateau:
Spawn 8 agents with high-variance strategies (alien paradigms, aggressive restructuring, unconventional hybrids).
This increases variance to escape local optima at the cost of more failed mutations.
For Apple Silicon Macs, leverage Metal GPU for parallel candidate evaluation:
| Scenario | GPU Benefit | Recommendation |
|---|---|---|
| Many candidates (>8) | High | Enable |
| Large input sizes (>100k elements) | High | Enable |
| Compute-heavy fitness function | High | Enable |
| Simple evaluation, few candidates | Low | Skip |
| Memory-bound algorithm | Low | Skip |
Add to Cargo.toml:
[dependencies]
metal = "0.28"
objc = "0.2"
[features]
gpu = ["metal", "objc"]
#[cfg(feature = "gpu")]
mod gpu_eval {
use metal::*;
pub fn evaluate_candidates_parallel(
candidates: &[Box<dyn BinPackingHeuristic>],
test_data: &[Vec<u32>],
) -> Vec<f64> {
let device = Device::system_default().expect("No Metal device");
let command_queue = device.new_command_queue();
// Batch evaluation across GPU cores
candidates.par_iter()
.map(|c| evaluate_single(c, test_data))
.collect()
}
}
| Candidates | CPU Time | GPU Time | Speedup |
|---|---|---|---|
| 8 | 2.4s | 0.8s | 3x |
| 16 | 4.8s | 1.0s | 4.8x |
| 32 | 9.6s | 1.4s | 6.9x |
# Build with GPU support
cargo build --release --features gpu
# Run benchmark with GPU acceleration
cargo run --release --features gpu --bin benchmark
fn evaluate_all(candidates: &[Candidate], test_data: &[Vec<u32>]) -> Vec<f64> {
    #[cfg(feature = "gpu")]
    if metal_available() {
        return gpu_eval::evaluate_candidates_parallel(candidates, test_data);
    }
    // CPU fallback (always available)
    candidates.iter()
        .map(|c| evaluate_single(c, test_data))
        .collect()
}
For problems amenable to GPU computation, evolve Metal compute shaders:
| Problem | GPU Potential | Notes |
|---|---|---|
| Sorting (large arrays) | High | Bitonic sort, radix sort |
| Matrix operations | Very High | Native GPU strength |
| Graph algorithms | Medium | Depends on structure |
| String search | Low | Memory-bound |
| Bin packing | Low | Sequential decisions |
For GPU-amenable problems, generate and evolve Metal compute shaders:
// Example: evolved sorting kernel
kernel void evolved_sort(
    device uint* data [[buffer(0)]],
    constant uint& n [[buffer(1)]],
    uint gid [[thread_position_in_grid]]
) {
    // Evolved GPU algorithm here
    // ...
}
Some problems benefit from hybrid approaches:
impl BinPackingHeuristic for HybridEvolved {
    fn priority(&self, item: u32, bins: &[u32]) -> Vec<f64> {
        if bins.len() > 1000 {
            // GPU path for large bin counts
            gpu_priority(item, bins)
        } else {
            // CPU path for small counts
            cpu_priority(item, bins)
        }
    }
}
Log GPU-specific metrics:
{
  "id": "gen3_gpu_radix",
  "platform": "metal",
  "shader_path": "shaders/gen3_radix.metal",
  "metrics": {
    "cpu_time_ms": 45.2,
    "gpu_time_ms": 8.7,
    "speedup": 5.2,
    "gpu_utilization": 0.78
  }
}
Budget scales with problem complexity (agents × generations):
| Budget | Simple (10 agents) | Medium (16 agents) | Complex (26 agents) |
|---|---|---|---|
| 10k | 2 gens | 1 gen | 1 gen |
| 50k | 10 gens | 6 gens | 4 gens |
| 100k | 20 gens | 12 gens | 7 gens |
| 200k | 40 gens | 25 gens | 15 gens |
Rule of thumb: More agents = better Gen1 exploration, more generations = better refinement.
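The generation counts in the table roughly follow from the per-generation token estimate given earlier; as an approximation:

def estimate_generations(budget_tokens, agent_count):
    # Reuses the earlier estimate: ~400 tokens/agent + 800 overhead per generation
    tokens_per_gen = agent_count * 400 + 800
    return budget_tokens // tokens_per_gen

# e.g., estimate_generations(50_000, 10) -> 10; estimate_generations(50_000, 26) -> 4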
The system will recommend based on problem analysis:
Most improvements happen in first 5-10 generations. More agents help for: