From olympus
Self-Evolution — improve Olympus itself through real-world testing and behavioral evaluation
npx claudepluginhub devy1540/olympus --plugin olympus
Slash command
/olympus:evolve
<Purpose>
<Execution_Policy>
<Team_Structure>
team_name: "evolve-${CLAUDE_SESSION_ID}"
Teammates:
| Agent | Role | Comm Targets |
|---|---|---|
| athena | Quality evaluation (5 dimensions) | leader |
| eris | Evaluation challenge + root cause | metis (cross-reference), leader |
| metis | Expected-actual gap analysis | eris (cross-reference), leader |
| prometheus | Prompt improvement implementation | leader |
Direct communication: metis ↔ eris (cross-reference during diagnosis)
</Team_Structure>
Call ToolSearch("+olympus pipeline") to load MCP tools.
1. TeamCreate(team_name: "evolve-${CLAUDE_SESSION_ID}")
2. olympus_start_pipeline(skill: "evolve", pipeline_id: ...)
3. Create artifact directory: .olympus/evolve-{YYYYMMDD}-{short-uuid}/
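The directory naming in step 3 could be sketched as follows. This is a minimal illustration, not the plugin's own code: `make_artifact_dir` is a hypothetical helper, and treating "short-uuid" as the first 8 hex characters of a UUID4 is an assumption.

```python
import uuid
from datetime import date
from pathlib import Path
from typing import Optional


def make_artifact_dir(root: Path, today: Optional[date] = None) -> Path:
    """Create <root>/.olympus/evolve-{YYYYMMDD}-{short-uuid}/ and return it."""
    today = today or date.today()
    # Assumption: "short-uuid" is the first 8 hex characters of a random UUID4.
    short_id = uuid.uuid4().hex[:8]
    path = root / ".olympus" / f"evolve-{today:%Y%m%d}-{short_id}"
    # exist_ok=False so two runs never silently share one artifact directory.
    path.mkdir(parents=True, exist_ok=False)
    return path
```

Calling `make_artifact_dir(Path.cwd())` would give a fresh directory whose path can then serve as `${ARTIFACT_DIR}` for the rest of the run.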
Input classification:
- User-provided: user specifies benchmark directly
- Auto-generated: AskUserQuestion to select target skill
- History: reuse from previous evolve run
Generate benchmark.md:
## Benchmark
### Target Skill: {skill}
### Scenario: {description}
### Expected Quality (5 dimensions: Specificity, Evidence Density, Role Adherence, Efficiency, Actionability)
### Test Input: {data}
Save to ${ARTIFACT_DIR}/benchmark.md
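As an illustration of the benchmark.md template above, the following sketch renders the five headings from structured input. The `Benchmark` dataclass and `render_benchmark` function are hypothetical names introduced here; only the heading layout and the dimension names come from the template.

```python
from dataclasses import dataclass

# The five quality dimensions named in the template.
DIMENSIONS = (
    "Specificity",
    "Evidence Density",
    "Role Adherence",
    "Efficiency",
    "Actionability",
)


@dataclass
class Benchmark:
    skill: str       # target skill, e.g. "oracle"
    scenario: str    # short scenario description
    test_input: str  # raw input handed to the target skill


def render_benchmark(b: Benchmark) -> str:
    """Render a Benchmark into the benchmark.md layout."""
    lines = [
        "## Benchmark",
        f"### Target Skill: {b.skill}",
        f"### Scenario: {b.scenario}",
        f"### Expected Quality (5 dimensions: {', '.join(DIMENSIONS)})",
        f"### Test Input: {b.test_input}",
    ]
    return "\n".join(lines) + "\n"
```

The returned string is what would be written to `${ARTIFACT_DIR}/benchmark.md`.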
Execute target skill against benchmark:
- Oracle → spec.md
- Pantheon → analysis.md
- Tribunal → verdict.md
- Odyssey → full pipeline
Collect observation data:
- Each agent output
- Round counts, gate history, handoff records
Save to ${ARTIFACT_DIR}/dogfood-result.md
athena_result = Agent(name: "athena", team_name: ${TEAM},
subagent_type: "olympus:athena",
prompt: "You are Athena in team ${TEAM}, quality evaluator. Artifact directory: ${ARTIFACT_DIR}/
LEADER_NAME: team-lead
IMMEDIATE TASK: DO NOT write files — you are read-only.
Read ${ARTIFACT_DIR}/benchmark.md and dogfood-result.md.
Evaluate across 5 dimensions (each scored 0.0 to 1.0):
1. Specificity: concrete claims with file:line?
2. Evidence Density: evidence-backed claims ratio
3. Role Adherence: agents stayed within boundaries?
4. Efficiency: goal reached without unnecessary rounds?
5. Actionability: output immediately actionable?
When done: SendMessage(to: 'team-lead', summary: 'athena evaluation complete', '{eval-matrix content}')")
olympus_register_agent_spawn(pipeline_id, "athena")
→ Write eval-matrix.md from athena SendMessage
olympus_record_execution(pipeline_id, "evolve", "athena", ...)
Spawn metis + eris IN PARALLEL (BACKGROUND, with cross-consultation):
Agent(name: "metis", team_name: ${TEAM},
subagent_type: "olympus:metis",
run_in_background: true,
prompt: "You are Metis in team ${TEAM}, gap analyst. Artifact directory: ${ARTIFACT_DIR}/
LEADER_NAME: team-lead
IMMEDIATE TASK: DO NOT write files — you are read-only.
Read ${ARTIFACT_DIR}/eval-matrix.md, dogfood-result.md, and agents/*.md.
Trace quality issues to specific agent prompts:
- Investigation_Protocol insufficient?
- Output_Format fails to enforce specificity?
- Constraints allow role drift?
Derive improvement proposals.
MANDATORY CONSULTATION: Send your draft to 'eris' via SendMessage before finalizing.
Incorporate eris's valid challenges.
When done: SendMessage(to: 'team-lead', summary: 'metis diagnosis complete', '{full diagnosis}')")
olympus_register_agent_spawn(pipeline_id, "metis")
Agent(name: "eris", team_name: ${TEAM},
subagent_type: "olympus:eris",
run_in_background: true,
prompt: "You are Eris in team ${TEAM}, evaluation challenger. Artifact directory: ${ARTIFACT_DIR}/
LEADER_NAME: team-lead
IMMEDIATE TASK: DO NOT write files — you are read-only.
Read ${ARTIFACT_DIR}/eval-matrix.md and dogfood-result.md.
Verify Athena's evaluation accuracy:
- Scoring too generous?
- Missed problems?
- Root causes or just symptoms?
MANDATORY CONSULTATION: When metis sends you a draft, challenge each claim directly
via SendMessage(to: 'metis').
When done: SendMessage(to: 'team-lead', summary: 'eris verification complete', '{full evaluation}')")
olympus_register_agent_spawn(pipeline_id, "eris")
olympus_pipeline_status(pipeline_id) # verify metis + eris are registered before waiting
DEADLOCK FALLBACK: metis sends draft to eris; eris challenges back. If 5 minutes elapse without both completing:
→ SendMessage(to: "metis", "Cross-verification timeout. Finalize diagnosis without eris response. Note 'eris consultation pending'.")
→ SendMessage(to: "eris", "Cross-verification timeout. Finalize evaluation without metis draft. Note 'metis draft pending'.")
→ Leader synthesizes from whichever responded; flags incomplete cross-verification in diagnosis.md.
WAIT for both completion notifications → leader synthesizes into diagnosis.md
olympus_record_execution(pipeline_id, "evolve", "metis", ...)
olympus_record_execution(pipeline_id, "evolve", "eris", ...)
olympus_log_collaboration(pipeline_id, "metis", "eris", "Diagnosis cross-verification: metis↔eris")
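The deadlock fallback above amounts to a bounded wait followed by a nudge to whoever has not reported. A minimal sketch of that pattern with `asyncio` is shown below. `wait_with_fallback` is a hypothetical helper, and the awaitables stand in for "completion notification received" rather than for the Olympus agents themselves.

```python
import asyncio
from typing import Awaitable, Dict, Set, Tuple


async def wait_with_fallback(
    pending: Dict[str, Awaitable[str]], timeout_s: float
) -> Tuple[Dict[str, str], Set[str]]:
    """Wait up to timeout_s for every named report.

    Returns (reports that arrived in time, names that timed out).
    """
    tasks = {asyncio.ensure_future(aw): name for name, aw in pending.items()}
    done, not_done = await asyncio.wait(tasks.keys(), timeout=timeout_s)
    results = {tasks[t]: t.result() for t in done}
    timed_out = {tasks[t] for t in not_done}
    # Stop waiting locally; the caller sends the "finalize without the
    # other side" message to each name in timed_out.
    for t in not_done:
        t.cancel()
    return results, timed_out
```

With a 300-second timeout this mirrors the 5-minute rule: the leader synthesizes from `results` and flags every name in `timed_out` as incomplete cross-verification in diagnosis.md.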
Present diagnosis.md to user:
AskUserQuestion: "Apply these improvements?"
["Apply all", "Select", "Modify", "Skip"]
IF user approves:
prometheus_result = Agent(name: "prometheus", team_name: ${TEAM},
subagent_type: "olympus:prometheus",
prompt: "You are Prometheus in team ${TEAM}, prompt improver. Artifact directory: ${ARTIFACT_DIR}/
LEADER_NAME: team-lead
IMMEDIATE TASK: Read ${ARTIFACT_DIR}/diagnosis.md — find all Improvement Proposals.
Apply each proposal to the target files (agents/*.md and/or skills/*/SKILL.md as specified).
Rules: implement ONLY what diagnosis.md specifies, no scope creep, no extra refactoring.
For each change: note the exact section modified and what was changed.
When done: SendMessage(to: 'team-lead', summary: 'prometheus improvement complete', '{change report with files modified}')")
olympus_register_agent_spawn(pipeline_id, "prometheus")
→ Write refinement-log.md from prometheus SendMessage
olympus_record_execution(pipeline_id, "evolve", "prometheus", ...)
Run /olympus:audit on modified prompts:
CLEAN → Step 8
VIOLATION → return to Step 6 (modification broke structure)
WARNING → notify user, then Step 8
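The three audit outcomes above map onto a small routing function. The sketch below is illustrative only: `AuditResult` and `route_after_audit` are hypothetical names, and the step numbers are the ones the routing lines themselves cite.

```python
from enum import Enum
from typing import Tuple


class AuditResult(Enum):
    CLEAN = "CLEAN"
    WARNING = "WARNING"
    VIOLATION = "VIOLATION"


def route_after_audit(result: AuditResult) -> Tuple[int, bool]:
    """Return (next_step, notify_user) for an audit outcome."""
    if result is AuditResult.VIOLATION:
        # The modification broke prompt structure: redo the improvement step.
        return 6, False
    # CLEAN and WARNING both proceed; only WARNING notifies the user first.
    return 8, result is AuditResult.WARNING
```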
Update evolve-state.json:
{ iteration, overall, scores: { specificity, evidence_density, role_adherence, efficiency, actionability }, changes, audit_result }
# Note: 'overall' is required for gate validation (validate-gate.sh checks .overall)
# 'scores' holds 5 dimension values (validate-gate.sh checks .scores.* >= 0.6)
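A record with the shape described above could be assembled as in the following sketch. `build_state` is a hypothetical helper; the field names come from the schema line, while treating `changes` as a list of strings and `audit_result` as a plain string are assumptions.

```python
import json
from typing import Dict, List

DIMENSION_KEYS = (
    "specificity",
    "evidence_density",
    "role_adherence",
    "efficiency",
    "actionability",
)


def build_state(
    iteration: int,
    overall: float,
    scores: Dict[str, float],
    changes: List[str],
    audit_result: str,
) -> str:
    """Validate the fields and return the evolve-state.json payload."""
    missing = [k for k in DIMENSION_KEYS if k not in scores]
    if missing:
        raise ValueError(f"missing dimension scores: {missing}")
    values = [overall] + [scores[k] for k in DIMENSION_KEYS]
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("scores must lie in [0.0, 1.0]")
    state = {
        "iteration": iteration,
        "overall": overall,  # required: the gate script reads .overall
        "scores": {k: scores[k] for k in DIMENSION_KEYS},
        "changes": changes,
        "audit_result": audit_result,
    }
    return json.dumps(state, indent=2)
```

Rejecting a record with a missing dimension at write time keeps a later per-dimension gate check from passing on incomplete data.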
Convergence:
olympus_gate_check(pipeline_id, "semantic", overall_score)
# Also verify per-dimension minimums (gate-thresholds.json → evolve_dimension_minimum)
IF overall >= 0.8 AND all 5 dimensions >= 0.6: converged → generate final report
ELIF overall >= 0.8 BUT any dimension < 0.6: not converged — address weak dimension explicitly
ELIF iteration >= maxIterations (5): AskUserQuestion [Continue, Accept, Reset]
ELIF score_delta < 0.02 for 2 iterations:
next = olympus_next_action(pipeline_id)
# next.action: retry_phase (stagnation — suggest persona switch or benchmark change)
→ notify user of stagnation with options
ELSE:
next = olympus_next_action(pipeline_id)
# next.action: retry_phase → return to Step 3 with next.hint
→ return to Step 3 (same benchmark)
← Teammates REMEMBER previous iterations — evaluation improves
Note: per-dimension minimum (0.6) prevents a weak dimension being masked by high scores elsewhere.
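The convergence branches above can be read as one ordered decision. The sketch below encodes that order in a hypothetical `decide` function. The thresholds (0.8 overall, 0.6 per dimension, 0.02 delta, 5 iterations) come from the rules above; the returned labels and the `recent_deltas` list of per-iteration score deltas are assumptions made for illustration.

```python
from typing import Dict, List


def decide(
    overall: float,
    scores: Dict[str, float],
    iteration: int,
    recent_deltas: List[float],
    max_iterations: int = 5,
) -> str:
    """Return the next action label for one evolve iteration."""
    weak = [name for name, value in scores.items() if value < 0.6]
    if overall >= 0.8 and not weak:
        return "converged"
    if overall >= 0.8:
        # A high overall score must not mask a weak dimension.
        return "address_weak_dimension"
    if iteration >= max_iterations:
        return "ask_user"
    # Stagnation: the last two score deltas are both below 0.02.
    if len(recent_deltas) >= 2 and all(d < 0.02 for d in recent_deltas[-2:]):
        return "stagnation"
    return "retry"
```

Because the branches are checked in order, a run that hits the iteration cap is handed to the user before stagnation is considered, matching the ELIF sequence.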
Generate final report: score progression, key improvements, remaining weaknesses
Shutdown all teammates → TeamDelete
<Tool_Usage> MCP Tools:
Team Tools:
<Artifact_Contracts>
| File | Step | Writer | Readers |
|---|---|---|---|
| benchmark.md | 2 | Leader | All |
| dogfood-result.md | 3 | Leader | athena, metis, eris |
| eval-matrix.md | 4 | Leader (from athena) | eris, metis |
| diagnosis.md | 5 | Leader (from metis+eris) | prometheus |
| refinement-log.md | 6 | Leader (from prometheus) | Tracking |
| evolve-state.json | All | Leader | Convergence |
</Artifact_Contracts>
<Benchmark_Library>
- Oracle: "Build a login feature" → spec.md from vague input
- Pantheon: sample payment code → domain-specific perspectives
- Tribunal: intentionally flawed code → accurate detection of unmet ACs
</Benchmark_Library>