Skill

evolve

Self-Evolution — improve Olympus itself through real-world testing and behavioral evaluation

npx claudepluginhub devy1540/olympus --plugin olympus

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/olympus:evolve

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

SKILL.md

291 lines · ~2.9k tokens

Similar Skills

evolve

566

Directs multi-cycle code improvements via causal hypotheses on rubric scores, scout validation, axis-parallel fleet attacks, pattern extraction, and persistent belief models across sessions. Use for sustained autonomous quality gains.

2 files

citadel

autoresearch

Runs autonomous optimization loops to iteratively improve prompts, templates, configs, or code using four-way separation of main agent, eval agent, test runner, and deterministic eval.py judge. Invoke via /autoresearch or 'optimize this prompt'.

4 files

autoresearch

simmer

Runs iterative refinement loops to improve artifacts like codebases, documents, prompts, pipelines using evidence-based judges, optional evaluators, and auto-selected single/multi-judge modes based on complexity.

simmer

Stats

Language: Shell

Stars: 1

Maintenance: Excellent

Last Commit: Apr 6, 2026


Agent	Role	Comm Targets
athena	Quality evaluation (5 dimensions)	leader
eris	Evaluation challenge + root cause	metis (cross-reference), leader
metis	Expected-actual gap analysis	eris (cross-reference), leader
prometheus	Prompt improvement implementation	leader


Step 2: Benchmark Selection

Input classification:
- User-provided: user specifies benchmark directly
- Auto-generated: AskUserQuestion to select target skill
- History: reuse from previous evolve run

Generate benchmark.md:
  ## Benchmark
  ### Target Skill: {skill}
  ### Scenario: {description}
  ### Expected Quality (5 dimensions: Specificity, Evidence Density, Role Adherence, Efficiency, Actionability)
  ### Test Input: {data}

Save to ${ARTIFACT_DIR}/benchmark.md
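The benchmark template above can be filled mechanically. The following is a minimal sketch, not the skill's actual implementation: `render_benchmark` and `save_benchmark` are hypothetical helper names, and listing the five dimensions as bullets under "Expected Quality" is an assumed layout.

```python
from pathlib import Path

# The five quality dimensions named in the benchmark template.
DIMENSIONS = [
    "Specificity",
    "Evidence Density",
    "Role Adherence",
    "Efficiency",
    "Actionability",
]


def render_benchmark(skill: str, scenario: str, test_input: str) -> str:
    """Fill the benchmark.md template with concrete values (hypothetical helper)."""
    lines = [
        "## Benchmark",
        f"### Target Skill: {skill}",
        f"### Scenario: {scenario}",
        "### Expected Quality",
        # Assumed layout: one bullet per dimension.
        *[f"- {name}" for name in DIMENSIONS],
        f"### Test Input: {test_input}",
    ]
    return "\n".join(lines) + "\n"


def save_benchmark(artifact_dir: str, content: str) -> Path:
    """Write the rendered benchmark to <artifact_dir>/benchmark.md."""
    path = Path(artifact_dir) / "benchmark.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content, encoding="utf-8")
    return path
```

Keeping rendering separate from saving lets the leader show the benchmark to the user before anything is written to the artifact directory.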

Step 3: Dogfood (Real Execution)

Execute target skill against benchmark:
- Oracle → spec.md
- Pantheon → analysis.md
- Tribunal → verdict.md
- Odyssey → full pipeline

Collect observation data:
- Each agent output
- Round counts, gate history, handoff records

Save to ${ARTIFACT_DIR}/dogfood-result.md
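One way to hold the observation data before writing dogfood-result.md is a small record type. This is an illustrative sketch only: the `DogfoodObservation` class, its field names, and the Markdown layout it emits are assumptions, since the source does not specify the file's format.

```python
from dataclasses import dataclass, field

# Skill -> primary artifact, as listed in Step 3. Odyssey runs the full
# pipeline, so no single file is assumed for it here.
PRIMARY_ARTIFACT = {
    "oracle": "spec.md",
    "pantheon": "analysis.md",
    "tribunal": "verdict.md",
    "odyssey": None,
}


@dataclass
class DogfoodObservation:
    """Hypothetical container for the data collected during a dogfood run."""

    skill: str
    agent_outputs: dict[str, str] = field(default_factory=dict)
    round_count: int = 0
    gate_history: list[str] = field(default_factory=list)
    handoffs: list[str] = field(default_factory=list)

    def to_markdown(self) -> str:
        """Render the observation in an assumed dogfood-result.md layout."""
        artifact = PRIMARY_ARTIFACT.get(self.skill.lower()) or "full pipeline"
        lines = [
            f"## Dogfood Result: {self.skill}",
            f"- Primary artifact: {artifact}",
            f"- Rounds: {self.round_count}",
            f"- Gate history: {', '.join(self.gate_history) or 'none'}",
            f"- Handoffs: {', '.join(self.handoffs) or 'none'}",
        ]
        for agent, output in self.agent_outputs.items():
            lines += [f"### {agent}", output]
        return "\n".join(lines) + "\n"
```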

Step 4: Evaluate (Athena)

athena_result = Agent(name: "athena", team_name: ${TEAM},
  subagent_type: "olympus:athena",
  prompt: "You are Athena in team ${TEAM}, quality evaluator.
    Artifact directory: ${ARTIFACT_DIR}/
    LEADER_NAME: team-lead
    IMMEDIATE TASK: DO NOT write files; you are read-only.
    Read ${ARTIFACT_DIR}/benchmark.md and dogfood-result.md.
    Evaluate across 5 dimensions (0.0~1.0):
    1. Specificity: concrete claims with file:line?
    2. Evidence Density: evidence-backed claims ratio
    3. Role Adherence: agents stayed within boundaries?
    4. Efficiency: goal reached without unnecessary rounds?
    5. Actionability: output immediately actionable?
    When done: SendMessage(to: 'team-lead', summary: 'athena evaluation complete', '{eval-matrix content}')")
olympus_register_agent_spawn(pipeline_id, "athena")
→ Write eval-matrix.md from athena SendMessage
olympus_record_execution(pipeline_id, "evolve", "athena", ...)
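Before Athena's scores are written to eval-matrix.md, it may help to check that all five dimensions are present and within the 0.0 to 1.0 range. The sketch below does that. The source does not say how the overall score is derived, so the unweighted mean in `overall_score` is purely an assumption, and both function names are hypothetical.

```python
from statistics import fmean

# Dimension keys as they appear in evolve-state.json's "scores" object.
DIMENSIONS = (
    "specificity",
    "evidence_density",
    "role_adherence",
    "efficiency",
    "actionability",
)


def validate_scores(scores: dict[str, float]) -> dict[str, float]:
    """Return the five dimension scores, rejecting missing or out-of-range values."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for d in DIMENSIONS:
        if not 0.0 <= scores[d] <= 1.0:
            raise ValueError(f"{d} must be within 0.0~1.0, got {scores[d]}")
    return {d: float(scores[d]) for d in DIMENSIONS}


def overall_score(scores: dict[str, float]) -> float:
    """ASSUMPTION: overall is the unweighted mean of the five dimensions."""
    return fmean(validate_scores(scores).values())
```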

Step 5: Diagnose (Metis + Eris in PARALLEL)

Spawn metis + eris IN PARALLEL (BACKGROUND, with cross-consultation):

Agent(name: "metis", team_name: ${TEAM},
  subagent_type: "olympus:metis",
  run_in_background: true,
  prompt: "You are Metis in team ${TEAM}, gap analyst.
    Artifact directory: ${ARTIFACT_DIR}/
    LEADER_NAME: team-lead
    IMMEDIATE TASK: DO NOT write files; you are read-only.
    Read ${ARTIFACT_DIR}/eval-matrix.md, dogfood-result.md, and agents/*.md.
    Trace quality issues to specific agent prompts:
    - Investigation_Protocol insufficient?
    - Output_Format fails to enforce specificity?
    - Constraints allow role drift?
    Derive improvement proposals.
    MANDATORY CONSULTATION: Send your draft to 'eris' via SendMessage before finalizing.
    Incorporate eris's valid challenges.
    When done: SendMessage(to: 'team-lead', summary: 'metis diagnosis complete', '{full diagnosis}')")
olympus_register_agent_spawn(pipeline_id, "metis")

Agent(name: "eris", team_name: ${TEAM},
  subagent_type: "olympus:eris",
  run_in_background: true,
  prompt: "You are Eris in team ${TEAM}, evaluation challenger.
    Artifact directory: ${ARTIFACT_DIR}/
    LEADER_NAME: team-lead
    IMMEDIATE TASK: DO NOT write files; you are read-only.
    Read ${ARTIFACT_DIR}/eval-matrix.md and dogfood-result.md.
    Verify Athena's evaluation accuracy:
    - Scoring too generous?
    - Missed problems?
    - Root causes or just symptoms?
    MANDATORY CONSULTATION: When metis sends you a draft, challenge each claim directly via SendMessage(to: 'metis').
    When done: SendMessage(to: 'team-lead', summary: 'eris verification complete', '{full evaluation}')")
olympus_register_agent_spawn(pipeline_id, "eris")

olympus_pipeline_status(pipeline_id)  # verify metis + eris are registered before waiting

DEADLOCK FALLBACK: metis sends draft to eris; eris challenges back.
If 5 minutes elapse without both completing:
  → SendMessage(to: "metis", "Cross-verification timeout. Finalize diagnosis without eris response. Note 'eris consultation pending'.")
  → SendMessage(to: "eris", "Cross-verification timeout. Finalize evaluation without metis draft. Note 'metis draft pending'.")
  → Leader synthesizes from whichever responded; flags incomplete cross-verification in diagnosis.md.

WAIT for both completion notifications → leader synthesizes into diagnosis.md
olympus_record_execution(pipeline_id, "evolve", "metis", ...)
olympus_record_execution(pipeline_id, "evolve", "eris", ...)
olympus_log_collaboration(pipeline_id, "metis", "eris", "diagnosis cross-verification: metis↔eris")
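The deadlock fallback above amounts to waiting on two concurrent results with a shared deadline and then reporting which side did not finish. A rough illustration with Python's `asyncio` follows. It models the two agents as awaitables, which is an assumption, since the real agents communicate through SendMessage; `wait_for_both` is a hypothetical name.

```python
import asyncio
from typing import Any, Awaitable


async def wait_for_both(
    metis: Awaitable[Any], eris: Awaitable[Any], timeout_s: float
) -> dict:
    """Wait up to timeout_s for both agents and report who is still pending."""
    tasks = {
        "metis": asyncio.ensure_future(metis),
        "eris": asyncio.ensure_future(eris),
    }
    # asyncio.wait does not raise on timeout; it returns (done, pending) sets.
    done, pending = await asyncio.wait(list(tasks.values()), timeout=timeout_s)
    results = {name: t.result() for name, t in tasks.items() if t in done}
    incomplete = sorted(name for name, t in tasks.items() if t in pending)
    # Pending tasks are deliberately not cancelled here: per the fallback, the
    # leader would next ask each pending agent to finalize without its peer.
    return {"results": results, "incomplete": incomplete}
```

In the skill's terms, `timeout_s` would be 300 (five minutes), and a non-empty `incomplete` list is what the leader would flag as incomplete cross-verification in diagnosis.md.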

Step 6: Refine (Prometheus)

Present diagnosis.md to user:
AskUserQuestion: "Apply these improvements?" ["Apply all", "Select", "Modify", "Skip"]

IF user approves:
  prometheus_result = Agent(name: "prometheus", team_name: ${TEAM},
    subagent_type: "olympus:prometheus",
    prompt: "You are Prometheus in team ${TEAM}, prompt improver.
      Artifact directory: ${ARTIFACT_DIR}/
      LEADER_NAME: team-lead
      IMMEDIATE TASK: Read ${ARTIFACT_DIR}/diagnosis.md and find all Improvement Proposals.
      Apply each proposal to the target files (agents/*.md and/or skills/*/SKILL.md as specified).
      Rules: implement ONLY what diagnosis.md specifies, no scope creep, no extra refactoring.
      For each change: note the exact section modified and what was changed.
      When done: SendMessage(to: 'team-lead', summary: 'prometheus refinement complete', '{change report with files modified}')")
  olympus_register_agent_spawn(pipeline_id, "prometheus")
  → Write refinement-log.md from prometheus SendMessage
  olympus_record_execution(pipeline_id, "evolve", "prometheus", ...)

Step 8: Convergence Check

Update evolve-state.json:
  { iteration, overall, scores: { specificity, evidence_density, role_adherence, efficiency, actionability }, changes, audit_result }
  # Note: 'overall' is required for gate validation (validate-gate.sh checks .overall)
  # 'scores' holds 5 dimension values (validate-gate.sh checks .scores.* >= 0.6)

Convergence:
  olympus_gate_check(pipeline_id, "semantic", overall_score)
  # Also verify per-dimension minimums (gate-thresholds.json → evolve_dimension_minimum)
  IF overall >= 0.8 AND all 5 dimensions >= 0.6:
    converged → generate final report
  ELIF overall >= 0.8 BUT any dimension < 0.6:
    not converged; address weak dimension explicitly
  ELIF iteration >= maxIterations (5):
    AskUserQuestion [Continue, Accept, Reset]
  ELIF score_delta < 0.02 for 2 iterations:
    next = olympus_next_action(pipeline_id)
    # next.action: retry_phase (stagnation: suggest persona switch or benchmark change)
    → notify user of stagnation with options
  ELSE:
    next = olympus_next_action(pipeline_id)
    # next.action: retry_phase → return to Step 3 with next.hint
    → return to Step 3 (same benchmark)
    ← Teammates REMEMBER previous iterations, so evaluation improves

Note: per-dimension minimum (0.6) prevents a weak dimension being masked by high scores elsewhere.

Generate final report: score progression, key improvements, remaining weaknesses
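The branch order of the convergence check matters, because a high overall score must not hide a weak dimension. The sketch below mirrors the branches as a pure function, using the thresholds stated above (0.8 overall, 0.6 per dimension, 5 iterations, a delta of 0.02 over 2 iterations). The function name and the returned labels are hypothetical, and the real check goes through `olympus_gate_check` and `olympus_next_action` rather than local logic.

```python
def convergence_decision(
    overall: float,
    scores: dict[str, float],
    iteration: int,
    recent_deltas: list[float],
    max_iterations: int = 5,
) -> str:
    """Return a label for the next action, following the Step 8 branch order."""
    all_dims_ok = all(value >= 0.6 for value in scores.values())
    if overall >= 0.8 and all_dims_ok:
        return "converged"
    if overall >= 0.8:
        # Overall passes but at least one dimension is below the 0.6 minimum.
        return "address_weak_dimension"
    if iteration >= max_iterations:
        return "ask_user"
    # Stagnation: score_delta < 0.02 for the last two iterations.
    if len(recent_deltas) >= 2 and all(d < 0.02 for d in recent_deltas[-2:]):
        return "stagnation"
    return "retry"
```

For example, an overall of 0.85 with efficiency at 0.5 yields `"address_weak_dimension"` rather than `"converged"`, which is the masking case the per-dimension minimum is meant to catch.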

File	Step	Writer	Readers
benchmark.md	2	Leader	All
dogfood-result.md	3	Leader	athena, eris, metis
eval-matrix.md	4	Leader (from athena)	eris, metis
diagnosis.md	5	Leader (from metis+eris)	prometheus
refinement-log.md	6	Leader	Tracking
evolve-state.json	All	Leader	Convergence
</Artifact_Contracts>





Step 0: Load MCP Tools (REQUIRED FIRST)

Step 1: Initialize

Step 2: Benchmark Selection

Step 3: Dogfood (Real Execution)

Step 4: Evaluate (Athena)

Step 5: Diagnose (Metis + Eris in PARALLEL)

Step 6: Refine (Prometheus)

Step 7: Audit (Consistency Check)

Step 8: Convergence Check

Step 9: Teardown
