super-ralph

Autonomous agentic loop that decomposes any user query into tasks, writes tests first, implements with fresh sub-agents, self-debugs on failure, and learns over time. Use when the user says "super ralph", "ralph this", "break this down and build it", or wants autonomous multi-task execution with quality enforcement.

Install: npx claudepluginhub ashcastelinocs124/super-ralph

This skill uses the workspace's default tool permissions.
Autonomous agentic loop: decompose → test → build → debug → learn → merge.
This skill is authored in Claude-native terms and is also intended to run in Codex environments without removing any Claude behavior.
Use this mapping when running outside Claude:
| Claude primitive | Codex-compatible equivalent |
|---|---|
| AskUserQuestion | Ask a single plain-text question in chat, include the same options, and wait for a reply before proceeding |
| Agent tool dispatch | Run the same role prompt in a fresh Codex session/agent run; parallelize independent work with foreground parallel runs |
| run_in_background: true | Do not detach jobs. Keep work in foreground sessions so prompts/approvals and progress stay visible |
If this file says "use AskUserQuestion," treat that as "single interactive question gate" in Codex.
User Query
→ Mode Selection: oneshot (fully autonomous) or brainstorm (interactive) — single AskUserQuestion
→ IF oneshot: Permissions Bootstrap — auto-configure .claude/settings.json so sub-agents never prompt
→ IF brainstorm: interactive Q&A to explore intent, scope, edge cases (AskUserQuestion loop)
→ IF oneshot: auto-analyze query, infer intent/scope/constraints, write BRAINSTORM_SUMMARY silently
→ Intent Profile: 3 questions or auto-infer (based on MODE) → JUDGE_RUBRIC
→ Tooling: scan skills/agents, recommend or auto-select (based on MODE) → TOOLING_CONFIG
→ Pre-Flight: scope workspace + set MAX_RETRIES (interactive or defaults based on MODE)
→ Decompose: orchestrator breaks query into tasks directly (no separate manager/planner agent)
→ Per Task (parallel if independent — never use run_in_background):
→ ralph-tester: write tests → JUDGE: pass? → retry tester if fail
→ ralph-worker: implement → JUDGE: pass? → retry worker if fail → run tests
→ Fail MAX_RETRIES/2? → debug.md → ralph-debugger → JUDGE: pass? → fresh ralph-worker
→ Fail MAX_RETRIES? → auto-skip + log to learnings
→ Pass → clear debug.md
→ ralph-merger: combine outputs → JUDGE: pass? → retry merger if fail → deliver
| Agent | File | Role |
|---|---|---|
| ralph-tester | agents/ralph-tester.md | Writes strict tests before implementation |
| ralph-worker | agents/ralph-worker.md | Implements until tests pass, writes debug.md on attempt 3 |
| ralph-debugger | agents/ralph-debugger.md | Cold analysis of failures, writes fix plan |
| ralph-judge | agents/ralph-judge.md | Universal quality gate — evaluates every sub-agent's output against task criteria |
| ralph-merger | agents/ralph-merger.md | Combines outputs into cohesive deliverable |
Note: Planning/decomposition is handled directly by the orchestrator (this skill), not a separate manager or planner agent. The orchestrator already has the brainstorm summary, tooling config, learnings, and codebase context, so an extra control layer would just duplicate work.
Note: The ralph-judge agent evaluates every sub-agent's output before the loop continues. If the judge rejects, the same agent is retried with the judge's specific feedback. There is no retry limit on judge rejections — the agent keeps retrying until the judge passes. Each retry is a fresh agent with zero prior context.
After Phase 0, the entire loop runs without any user interaction. No AskUserQuestion calls, no confirmations, no escalations. If something fails after max retries, auto-skip it and log to learnings. The user said "go ahead" — respect that.
The first and possibly only question Super Ralph asks. Determines whether the rest of the setup is interactive or autonomous.
If MODE was already set before this phase (e.g., the user invoked the /ralph command, which pre-sets MODE = oneshot), skip this question entirely and proceed with the pre-set mode. Do NOT ask the mode selection question — the user already chose by using the oneshot command.
If MODE is not pre-set, ask one AskUserQuestion:
question: "How should I approach this?"
header: "Mode"
options:
- label: "Oneshot"
description: "I'll handle everything autonomously — no questions, just deliver"
- label: "Brainstorm"
description: "Let's explore the idea together step by step"
- label: "Chat about this"
description: "Stop — I want to discuss this before deciding"
multiSelect: false
Store the answer as MODE:
MODE = oneshot or MODE = brainstorm

| MODE | Behavior |
|---|---|
| brainstorm | All phases run interactively as documented below (existing behavior, unchanged) |
| oneshot | All phases still run, but every AskUserQuestion gate is replaced with autonomous self-decision. No further user interaction until final delivery. |
Oneshot does NOT skip phases. Every phase (Brainstorm, Intent Profile, Tooling, Pre-Flight, Decompose, Execute, Merge) still runs in full. The only difference is that Ralph makes the decisions at each gate instead of asking the user. The pipeline is identical — the decision-maker changes.
When MODE = oneshot, the orchestrator makes these decisions autonomously instead of asking:
| Phase | Default |
|---|---|
| Brainstorm | Analyze query, infer intent/scope/constraints, write BRAINSTORM_SUMMARY |
| Intent Profile | Default to middle tier: solid and correct + my team + weeks to months |
| Tooling | Auto-select recommended toolset based on BRAINSTORM_SUMMARY |
| Pre-Flight | Current directory writable, no read-only, nothing off-limits, MAX_RETRIES=6 |
This phase only runs when MODE = oneshot. If MODE = brainstorm, skip this entirely — the user keeps normal Claude Code permission prompts.
When running in oneshot mode, sub-agents need to execute without blocking on tool approval prompts. This phase auto-configures the project's .claude/settings.json so every tool Ralph's agents use is pre-approved.
- Check whether .claude/settings.json exists in the current working directory.
- Check whether permissions.allow already contains the Ralph entries.
- If any are missing, merge them in (preserve existing permissions, deduplicate). The entries to merge:

{
"permissions": {
"allow": [
"Bash(mkdir:*)",
"Bash(find:*)",
"Bash(ls:*)",
"Bash(cat:*)",
"Bash(rm:*)",
"Bash(cp:*)",
"Bash(mv:*)",
"Bash(touch:*)",
"Bash(chmod:*)",
"Bash(cd:*)",
"Bash(pwd:*)",
"Bash(echo:*)",
"Bash(head:*)",
"Bash(tail:*)",
"Bash(wc:*)",
"Bash(diff:*)",
"Bash(sort:*)",
"Bash(uniq:*)",
"Bash(grep:*)",
"Bash(sed:*)",
"Bash(awk:*)",
"Bash(git:*)",
"Bash(python*)",
"Bash(pip*)",
"Bash(node*)",
"Bash(npm*)",
"Bash(npx*)",
"Bash(yarn*)",
"Bash(pnpm*)",
"Bash(bun*)",
"Bash(cargo*)",
"Bash(go *)",
"Bash(make*)",
"Bash(pytest*)",
"Bash(jest*)",
"Bash(vitest*)",
"Bash(mocha*)",
"Bash(ruby*)",
"Bash(bundle*)",
"Bash(rspec*)",
"Bash(swift*)",
"Bash(rustc*)",
"Bash(gcc*)",
"Bash(g++*)",
"Bash(clang*)",
"Bash(java*)",
"Bash(mvn*)",
"Bash(gradle*)",
"Bash(dotnet*)",
"Bash(docker*)",
"Read",
"Edit",
"Write",
"Glob",
"Grep",
"Agent"
]
}
}
Run the setup script if available, otherwise do it inline:
# Option A: script exists in Ralph's install directory
bash "$HOME/super-ralph/scripts/setup-permissions.sh" .
# Option B: inline (if script not found)
# Read existing .claude/settings.json, merge permissions, write back
If doing it inline (no script available), use this approach:
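For illustration, a minimal Python sketch of that inline merge (a hedged sketch, not the shipped setup-permissions.sh script; RALPH_ALLOW stands in for the full allow-list above). The concrete steps are listed right below.

```python
# Sketch only: merge Ralph's allow-list into .claude/settings.json without
# clobbering whatever permissions the project already has.
import json
import pathlib

# Abbreviated here; use the full allow-list shown above.
RALPH_ALLOW = ["Bash(mkdir:*)", "Bash(git:*)", "Read", "Edit", "Write", "Glob", "Grep", "Agent"]

settings_path = pathlib.Path(".claude/settings.json")
settings_path.parent.mkdir(exist_ok=True)          # mkdir -p .claude

settings = json.loads(settings_path.read_text()) if settings_path.exists() else {}
allow = settings.setdefault("permissions", {}).setdefault("allow", [])

for entry in RALPH_ALLOW:
    if entry not in allow:                          # preserve existing, deduplicate
        allow.append(entry)

settings_path.write_text(json.dumps(settings, indent=2) + "\n")
```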
- Create the directory: mkdir -p .claude
- Read .claude/settings.json if it exists (or start with {"permissions":{"allow":[]}})
- Merge the Ralph entries into the allow array and write the file back

Before scoping the workspace or planning tasks, explore the user's idea through conversation. The goal is to deeply understand what the user actually wants — not just what they typed.
- MODE = brainstorm: run the interactive flow below as documented.
- MODE = oneshot: this phase still runs in full — Ralph makes the decisions instead of asking the user. Replace every AskUserQuestion with Ralph's own judgment: analyze the query, infer intent/scope/constraints, and write the BRAINSTORM_SUMMARY autonomously based on your analysis.

This phase is interactive. Use AskUserQuestion as prehook gates — one question at a time, each with a "Chat about this" escape hatch that fully stops the workflow. In Codex, ask the same question directly in chat and wait for the response before continuing.
Restate the query — show the user your understanding of what they're asking for in 2-3 sentences. This surfaces misunderstandings early.
Ask clarifying questions — use AskUserQuestion as prehook gates:
question: "[Specific question about the user's intent, scope, or approach]"
header: "Clarify"
options:
- label: "[Most likely answer]"
description: "[What this means for the build]"
- label: "[Alternative interpretation]"
description: "[What this means for the build]"
- label: "[Simpler/narrower version]"
description: "[What this means for the build]"
- label: "Chat about this"
description: "Stop — I want to discuss this before deciding"
multiSelect: false
| Area | Example Questions |
|---|---|
| Intent | "Is this a prototype or production-grade?" |
| Scope | "Should this include X, or is that out of scope?" |
| Edge cases | "What should happen when Y occurs?" |
| Users | "Who will use this — just you, your team, or end users?" |
| Constraints | "Any specific tech stack, libraries, or patterns to use/avoid?" |
| Existing work | "Is this building on something that already exists, or greenfield?" |
## Brainstorm Summary
**Query:** {original user query}
**Intent:** {what the user actually wants, in your words}
### Scope
- {what's in scope}
- {what's explicitly out of scope}
### Key Decisions
- {decision 1 from the Q&A}
- {decision 2 from the Q&A}
### Edge Cases Discussed
- {edge case and agreed handling}
### Constraints
- {any tech/approach constraints from the user}
Confirm the summary with an AskUserQuestion:

question: "Here's what I'll build. Does this capture it?"
header: "Confirm"
options:
- label: "Yes, go ahead"
description: "This is right — proceed to workspace setup and autonomous execution"
- label: "Almost — let me adjust"
description: "I'll clarify what needs changing"
- label: "Chat about this"
description: "Stop — I want to discuss this before deciding"
multiSelect: false
If "Almost" → incorporate feedback, update summary, re-confirm. If "Yes" → store the summary as BRAINSTORM_SUMMARY and proceed to Phase -0.75 (Intent Profile).
Batch related questions into a single AskUserQuestion when possible (up to 4 per call).

The BRAINSTORM_SUMMARY is used by the orchestrator during task decomposition (Phase 1) alongside learnings, tooling config, and workspace rules. This ensures tasks are decomposed based on the explored, confirmed intent — not just the raw query.
After confirming the brainstorm summary, capture the user's intent profile through 3 direct questions. The answers determine how strictly the judge grades every agent's output — a prototype gets lenient judging on polish, a production system gets strict on everything.
- MODE = brainstorm: ask the 3 questions below interactively.
- MODE = oneshot: this phase still runs in full — Ralph decides the intent profile instead of asking the user. Replace every AskUserQuestion with Ralph's own judgment, then build the INTENT_PROFILE and JUDGE_RUBRIC from the inferred values.

Ask these 3 questions using AskUserQuestion prehook gates. Each includes a "Chat about this" escape hatch.
Question 1 — Priority:
question: "What matters most for this build?"
header: "Priority"
options:
- label: "Just get it working"
description: "Speed over polish — I need something functional fast"
- label: "Solid and correct"
description: "Take the time to handle errors and edge cases properly"
- label: "Ship-ready quality"
description: "Production-grade — clean code, full error handling, security"
- label: "Chat about this"
description: "Stop — I want to discuss this before deciding"
multiSelect: false
Question 2 — Audience:
question: "Who will use what gets built?"
header: "Audience"
options:
- label: "Just me"
description: "Personal tool — I know the quirks, no need for polish"
- label: "My team"
description: "Others will read and maintain this code"
- label: "End users"
description: "Real people will interact with this — UX and reliability matter"
- label: "Chat about this"
description: "Stop — I want to discuss this before deciding"
multiSelect: false
Question 3 — Lifespan:
question: "How long does this need to last?"
header: "Lifespan"
options:
- label: "Throwaway / experiment"
description: "Use it once or twice, then toss it"
- label: "Weeks to months"
description: "Needs to work reliably for a while"
- label: "Long-lived"
description: "This will be maintained and extended over time"
- label: "Chat about this"
description: "Stop — I want to discuss this before deciding"
multiSelect: false
Store the answers as INTENT_PROFILE:
## Intent Profile
**Priority:** [just working | solid and correct | ship-ready]
**Audience:** [just me | my team | end users]
**Lifespan:** [throwaway | weeks to months | long-lived]
Map the intent profile to a JUDGE_RUBRIC — a per-dimension strictness matrix that tells the judge how hard to grade each quality dimension. Use this mapping:
| Dimension | Just working + Just me + Throwaway | Solid + Team + Weeks | Ship-ready + End users + Long-lived |
|---|---|---|---|
| Core functionality | strict | strict | strict |
| Error handling | skip | moderate | strict |
| Edge cases | skip | moderate | strict |
| Code readability | lenient | moderate | strict |
| Security | lenient | moderate | strict |
| Test coverage | happy path only | happy + edges | comprehensive |
| Documentation | skip | inline comments | full docs |
Strictness levels: strict, moderate, lenient, or skip, as used in the mapping table above (test coverage and documentation use their own scales).
Blended profiles: When the 3 answers don't all point to the same tier (e.g., "just get it working" + "end users" + "long-lived"), use the highest tier that any answer maps to for each dimension. User-facing and long-lived code gets strict security even if the user wants speed — that's a safety floor.
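As a sketch of the blending rule (the mapping table above is the source of truth; function and variable names below are illustrative):

```python
# Sketch: derive a per-dimension rubric by taking, for each dimension, the
# value at the highest tier any of the three answers maps to (the safety floor).
TIER_OF = {  # answer -> column of the mapping table above
    "just get it working": 0, "just me": 0, "throwaway": 0,
    "solid and correct": 1, "my team": 1, "weeks to months": 1,
    "ship-ready quality": 2, "end users": 2, "long-lived": 2,
}

RUBRIC_BY_TIER = {  # dimension -> (tier 0, tier 1, tier 2), copied from the table
    "Core functionality": ("strict", "strict", "strict"),
    "Error handling":     ("skip", "moderate", "strict"),
    "Edge cases":         ("skip", "moderate", "strict"),
    "Code readability":   ("lenient", "moderate", "strict"),
    "Security":           ("lenient", "moderate", "strict"),
    "Test coverage":      ("happy path only", "happy + edges", "comprehensive"),
    "Documentation":      ("skip", "inline comments", "full docs"),
}

def judge_rubric(priority: str, audience: str, lifespan: str) -> dict:
    top_tier = max(TIER_OF[priority], TIER_OF[audience], TIER_OF[lifespan])
    return {dim: levels[top_tier] for dim, levels in RUBRIC_BY_TIER.items()}

# "just get it working" + "end users" + "long-lived" -> strict across the board
print(judge_rubric("just get it working", "end users", "long-lived"))
```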
Store the result as JUDGE_RUBRIC:
## Judge Rubric
| Dimension | Strictness |
|-----------|------------|
| Core functionality | strict |
| Error handling | [strict/moderate/lenient/skip] |
| Edge cases | [strict/moderate/lenient/skip] |
| Code readability | [strict/moderate/lenient/skip] |
| Security | [strict/moderate/lenient/skip] |
| Test coverage | [comprehensive/happy + edges/happy path only] |
| Documentation | [full docs/inline comments/skip] |
The JUDGE_RUBRIC is injected into every ralph-judge dispatch alongside the task definition and WORKSPACE_RULES. It tells the judge what to care about and how much — adapted to this specific user's intent.
If brainstorming was skipped (dead simple query), default to the middle tier: solid and correct + my team + weeks to months.
After understanding what the user wants to build, figure out what tools will help build it. Scan available skills and agents, match them to the user's goals, and let the user confirm or adjust the toolset.
- MODE = brainstorm: scan and present recommendations interactively as documented below.
- MODE = oneshot: this phase still runs in full — Ralph selects the toolset instead of asking the user. Replace every AskUserQuestion with Ralph's own judgment: auto-select the recommended toolset based on the BRAINSTORM_SUMMARY, store it as TOOLING_CONFIG, and proceed directly to Phase 0.

Search for all available skills and agents in the environment:
# Scan for skills
find ~/.claude/skills/ -name "SKILL.md" 2>/dev/null
find .claude/skills/ -name "SKILL.md" 2>/dev/null
find ~/.codex/skills/ -name "SKILL.md" 2>/dev/null
find .codex/skills/ -name "SKILL.md" 2>/dev/null
# Scan for agents
find ~/.claude/agents/ -name "*.md" 2>/dev/null
find .claude/agents/ -name "*.md" 2>/dev/null
find ~/.codex/agents/ -name "*.md" 2>/dev/null
find .codex/agents/ -name "*.md" 2>/dev/null
# Also check for project-local skills
find . -path "*/.claude/skills/*/SKILL.md" 2>/dev/null
find . -path "*/.codex/skills/*/SKILL.md" 2>/dev/null
Read the name and description fields from each discovered skill/agent file. Build an inventory:
AVAILABLE_SKILLS:
- frontend-design: Create distinctive, production-grade frontend interfaces
- system-arch: Plan major architecture changes and evaluate patterns
- claude-api: Build apps with the Claude API or Anthropic SDK
- doc-search: Search third-party API documentation before writing code
- ...
AVAILABLE_AGENTS:
- code-reviewer: Validate work against plan and coding standards
- root-cause-hunter: Drive root-cause analysis for failures
- integration-test-validator: Comprehensive testing validation
- ...
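As a sketch of how that inventory could be assembled (assuming each discovered file carries name and description fields in YAML frontmatter; project-local paths only, for brevity):

```python
# Sketch: build the skill inventory by pulling `name` and `description`
# from each discovered SKILL.md file's frontmatter.
import glob
import re

def frontmatter_field(text: str, field: str) -> str:
    match = re.search(rf"^{field}:\s*(.+)$", text, re.MULTILINE)
    return match.group(1).strip() if match else ""

inventory = []
for pattern in (".claude/skills/*/SKILL.md", ".codex/skills/*/SKILL.md"):
    for path in glob.glob(pattern):
        text = open(path).read()
        inventory.append((frontmatter_field(text, "name") or path,
                          frontmatter_field(text, "description")))

for name, description in inventory:
    print(f"- {name}: {description}")
```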
Based on the BRAINSTORM_SUMMARY, identify which skills and agents would be useful. Consider:
| User's Goal | Relevant Tools |
|---|---|
| Building a frontend/UI | frontend-design, landing-page |
| API development | doc-search, claude-api, system-arch |
| Complex architecture | system-arch, validation, debate |
| Refactoring | code-reviewer, integration-test-validator |
| Using third-party APIs | doc-search (auto-fetch docs before coding) |
| Full-stack app | frontend-design, system-arch, doc-search |
Show the matched tools and let the user select which to activate for this run:
question: "Based on what you're building, these skills/agents could help. Which should I use during this run?"
header: "Tooling"
options:
- label: "[Recommended set]"
description: "Use {skill-1}, {skill-2}, {agent-1} — covers {reason}"
- label: "All available"
description: "Activate everything — I'll use whatever helps"
- label: "Just the defaults"
description: "Only use Ralph's 4 built-in agents, no extra skills"
- label: "Chat about this"
description: "Stop — I want to discuss this before deciding"
multiSelect: false
If the user picks "Recommended set" or "All available", also ask if there are specific tools they want to add or exclude:
question: "Anything to add or exclude from the toolset?"
header: "Adjust"
options:
- label: "Looks good — proceed"
description: "Use the selected toolset as-is"
- label: "Add a skill"
description: "I'll name a specific skill or capability to include"
- label: "Remove something"
description: "I'll say what to exclude"
- label: "Chat about this"
description: "Stop — I want to discuss this before deciding"
multiSelect: false
Store the selected toolset as TOOLING_CONFIG:
## Tooling Config
### Active Skills
- {skill-name}: {how it will be used in this run}
- {skill-name}: {how it will be used in this run}
### Active Agents (beyond Ralph defaults)
- {agent-name}: {when to invoke during the run}
### Skill Integration Rules
- {skill-name} → invoke before {phase/step} (e.g., "doc-search → invoke before ralph-worker writes API calls")
- {skill-name} → invoke during {phase/step} (e.g., "frontend-design → invoke when ralph-worker builds UI components")
The config is used by the orchestrator and injected into sub-agent prompts:
- The orchestrator reads TOOLING_CONFIG during task decomposition to tag tasks with skills_to_use (e.g., "Use doc-search to check the Stripe API before implementing payment logic")
- Sub-agents follow the Skill Integration Rules (e.g., invoke frontend-design before building a component, invoke doc-search before calling a third-party API)
- Project-local skills in .claude/skills/ take priority over global ones
- Project-local skills in .codex/skills/ take priority over global ones
- If an available skill directly matches the query (e.g., a landing-page skill), pre-select it as the recommended option

Before any agent runs, scope the workspace using AskUserQuestion prehook gates — one question at a time, each with "Chat about this." This is the last interactive phase before full autonomy.
- MODE = brainstorm: ask the 4 questions below interactively.
- MODE = oneshot: this phase still runs in full — Ralph scopes the workspace itself instead of asking the user. Replace every AskUserQuestion with Ralph's own judgment, using these defaults: current directory writable, nothing read-only, nothing off-limits, MAX_RETRIES = 6. Build WORKSPACE_RULES from these defaults and proceed directly to Phase 1.

Question 1 — Writable directories:
question: "Which files/folders should I work in? (I'll create and modify files here)"
header: "Work in"
options:
- label: "Current directory"
description: "Work in the current project root and subdirectories"
- label: "Specific paths"
description: "I'll list the exact directories/files"
- label: "Chat about this"
description: "Stop — I want to discuss this before deciding"
multiSelect: false
Question 2 — Read-only context:
question: "Any files I should read for context but NOT modify?"
header: "Read-only"
options:
- label: "None — figure it out"
description: "Explore the codebase yourself"
- label: "I'll list them"
description: "Specific files/dirs to read but not touch"
- label: "Chat about this"
description: "Stop — I want to discuss this before deciding"
multiSelect: false
Question 3 — Off-limits:
question: "Anything off-limits? (files/folders I must never touch)"
header: "Off-limits"
options:
- label: "Nothing off-limits"
description: "You can work anywhere within the writable scope"
- label: "I'll list exclusions"
description: "Specific paths to avoid"
- label: "Chat about this"
description: "Stop — I want to discuss this before deciding"
multiSelect: false
Question 4 — Retry limit per task:
question: "How many retries per task before giving up?"
header: "Max retries"
options:
- label: "6 (default)"
description: "3 normal attempts + debug analysis + 3 more attempts"
- label: "4"
description: "2 normal attempts + debug analysis + 2 more attempts"
- label: "10"
description: "5 normal attempts + debug analysis + 5 more attempts"
- label: "Chat about this"
description: "Stop — I want to discuss this before deciding"
multiSelect: false
Store the retry limit as MAX_RETRIES. The debug trigger fires at MAX_RETRIES / 2 (halfway). After MAX_RETRIES total attempts, auto-skip.
Store the answers as WORKSPACE_RULES — inject into every sub-agent prompt:
WORKSPACE RULES:
- You may READ and WRITE files in: [writable paths]
- You may READ (not modify): [read-only paths]
- Do NOT touch: [off-limits paths]
- HARD BOUNDARY: You must NEVER access files outside the paths listed above. No reading, writing, or executing commands that touch anything outside the project directory. This is non-negotiable.
- All permissions granted within these boundaries — do not ask for confirmation on any action.
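For illustration, the "injection" is plain string assembly. A hedged Python sketch of a worker dispatch prompt built from the stored artifacts (function and field names are illustrative, following the Phase 2b dispatch list later in this document):

```python
# Sketch: a sub-agent prompt is plain text assembled from the stored artifacts;
# the WORKSPACE RULES block above is pasted in verbatim, not enforced mechanically.
def build_worker_prompt(task: dict, test_dir: str, workspace_rules: str,
                        tooling_config: str, learnings: str,
                        failure_context: str = "") -> str:
    parts = [
        f"TASK: {task['title']}\n{task['description']}",
        "SUCCESS CRITERIA:\n- " + "\n- ".join(task["success_criteria"]),
        f"TESTS TO SATISFY: {test_dir}",
        tooling_config,       # active skills and when to invoke them
        workspace_rules,      # the WORKSPACE RULES block above, verbatim
        learnings,            # ralph-worker-learnings.md contents, if any
        failure_context,      # empty on attempt 1
    ]
    return "\n\n".join(p for p in parts if p)
```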
The orchestrator decomposes the query itself — no separate manager/planner agent needed. It already has everything: BRAINSTORM_SUMMARY, INTENT_PROFILE, TOOLING_CONFIG, learnings, codebase context, and WORKSPACE_RULES.
Read learnings.md — extract relevant past insights (per-task entries + run summaries). If a pattern failed before, don't repeat it.

Break the query into the smallest independent tasks possible. Apply these principles:
- Tag tasks with skills_to_use where relevant

Output a JSON task array:
[
{
"task_id": 1,
"title": "Short descriptive title",
"description": "Detailed description of what to build. Be explicit about behavior, inputs, outputs, and constraints.",
"quality_standard": "What 'excellent' looks like. Be specific. No shortcuts, no TODOs, no stubs. Production-grade or it doesn't count.",
"success_criteria": [
"Specific testable outcome — an assertion, not a wish",
"Another specific testable outcome"
],
"anti_patterns": [
"Don't stub or mock the hard parts",
"Don't skip error handling",
"Don't leave TODOs or placeholder logic"
],
"dependencies": [],
"test_strategy": "What tests to write, what to assert, what framework to use",
"skills_to_use": ["skill-name — when and why to invoke it during this task"]
}
]
"dependencies": [] — don't invent false dependenciesmkdir -p workspace/task-{id}/tests workspace/task-{id}/output
Dispatch independent tasks in parallel by making multiple Agent tool calls in a single message. Tasks with dependencies wait for their dependencies to complete first. Never use run_in_background: true — instead, dispatch multiple foreground agents concurrently. In Codex, use multiple foreground sessions/agents in parallel for independent tasks.
Never set run_in_background: true when dispatching agents via the Agent tool. Background agents cannot prompt the user for tool permission approvals (WebSearch, WebFetch, Bash, etc.), causing tools to be auto-denied and agents to fail silently.
To parallelize: dispatch multiple foreground agents in a single message (multiple Agent tool calls). They run concurrently and can each prompt for tool permissions. Use this for independent tasks with no shared dependencies.
Codex equivalent: run multiple foreground sessions/agents concurrently and avoid detached/background execution.
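To make the scheduling concrete, a hedged Python sketch of the wave logic (the print call is a stand-in for "multiple foreground Agent calls in one message"; the task fields follow the JSON array above):

```python
# Sketch: group tasks into waves by dependency depth, then dispatch each
# wave's tasks together as foreground agents (never run_in_background).
def dependency_waves(tasks: list[dict]) -> list[list[dict]]:
    done, waves, remaining = set(), [], list(tasks)
    while remaining:
        wave = [t for t in remaining if set(t["dependencies"]) <= done]
        if not wave:
            raise ValueError("circular dependency in task graph")
        waves.append(wave)
        done |= {t["task_id"] for t in wave}
        remaining = [t for t in remaining if t["task_id"] not in done]
    return waves

# Hypothetical task graph: 1 and 2 are independent, 3 depends on both.
tasks = [
    {"task_id": 1, "title": "Auth endpoint", "dependencies": []},
    {"task_id": 2, "title": "DB schema", "dependencies": []},
    {"task_id": 3, "title": "Session middleware", "dependencies": [1, 2]},
]

for wave in dependency_waves(tasks):
    # One message, multiple foreground Agent calls: the wave runs concurrently.
    print("dispatch in parallel:", [t["title"] for t in wave])
```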
For each task from the orchestrator. Parallelize independent tasks by dispatching multiple foreground agents in a single message (no shared dependencies). Never use run_in_background.
while true:
Dispatch ralph-tester with: task definition + WORKSPACE_RULES
+ ralph-tester-learnings.md (agent-specific learnings from past runs)
Tester writes tests to workspace/task-{id}/tests/ and reports the test command.
Dispatch ralph-judge with:
agent_type: "tester"
task definition + output location (workspace/task-{id}/tests/) + JUDGE_RUBRIC + WORKSPACE_RULES
if judge passes:
break → proceed to Step 2b
else:
Dispatch fresh ralph-tester with:
original prompt + "JUDGE REJECTED YOUR PREVIOUS OUTPUT:\n{judge_verdict}\nFix these issues."
(loop until judge passes — no retry limit)
attempt = 0
debug_trigger = MAX_RETRIES / 2 # e.g. 3 if MAX_RETRIES=6
while true:
attempt += 1
if attempt == 1:
failure_context = ""
else:
failure_context = "PREVIOUS ATTEMPT FAILED:\n{last_test_output}\n\nFix the root cause."
if attempt == debug_trigger:
Add to prompt: "This is attempt {debug_trigger}. You MUST write debug.md before exiting."
Dispatch fresh ralph-worker with:
task definition + test locations + failure_context + TOOLING_CONFIG + WORKSPACE_RULES
+ ralph-worker-learnings.md (agent-specific learnings from past runs)
+ PREREQUISITE_LEARNINGS (if task has dependencies — from completed prerequisite tasks)
# ── Judge gate (runs BEFORE tests) ──────────────────────────
while true:
Dispatch ralph-judge with:
agent_type: "worker"
task definition + output location (workspace/task-{id}/output/) + JUDGE_RUBRIC + WORKSPACE_RULES
if judge passes:
break → proceed to test validation
else:
Dispatch fresh ralph-worker with:
original prompt + "JUDGE REJECTED YOUR PREVIOUS OUTPUT:\n{judge_verdict}\nFix these issues."
(loop until judge passes — no retry limit)
# ── Test validation (runs AFTER judge passes) ───────────────
Run tests via Bash: {test_command}
if tests pass:
break → clear debug.md if it exists
if attempt == debug_trigger and tests still fail:
enter Phase 2c (self-debugging)
if attempt >= MAX_RETRIES:
enter Phase 2c Step 4 (auto-skip)
Run tests via Bash after each worker attempt (this step is only reached once the judge has passed).
Immediately after a task passes all tests, the orchestrator writes a learnings entry to learnings.md:
### {date} — Task {id}: {title}
- **Attempts:** {attempt_count}
- **Learnings:**
- {generalizable insight from this task — NOT task-specific details}
- {library gotcha, pattern that worked, or assumption that was wrong}
- **Debug insights:** {root cause if debug mode was used, otherwise "N/A"}
Rules for per-task learnings:
Store the per-task learnings in memory for passing to dependent tasks.
When dispatching a task that has dependencies, include the learnings from completed prerequisite tasks in the prompt:
Dispatch ralph-tester/worker with:
task definition + WORKSPACE_RULES + ...
+ PREREQUISITE_LEARNINGS:
"Task 1 (Auth endpoint) learned:
- bcrypt.hashpw() returns bytes, must decode to UTF-8 before storing
- Always use constant-time comparison for password verification"
This gives dependent tasks context from the work that came before them — without polluting independent parallel tasks with irrelevant information.
The worker at attempt MAX_RETRIES/2 writes debug.md with all attempts so far, reasoning, and pattern analysis.
while true:
Dispatch ralph-debugger with: debug.md + task definition + WORKSPACE_RULES
+ ralph-debugger-learnings.md (agent-specific learnings from past runs)
Debugger reads debug.md cold, identifies root cause, appends fix plan.
Dispatch ralph-judge with:
agent_type: "debugger"
task definition + debug.md + JUDGE_RUBRIC + WORKSPACE_RULES
if judge passes:
break → proceed to Step 3
else:
Dispatch fresh ralph-debugger with:
original prompt + "JUDGE REJECTED YOUR FIX PLAN:\n{judge_verdict}\nRevise it."
(loop until judge passes — no retry limit)
Dispatch fresh ralph-worker (attempts MAX_RETRIES/2 + 1 through MAX_RETRIES) with debug.md. Worker follows the fix plan exactly.
The worker's output goes through the same judge gate as Step 2b (judge must pass before tests run).
Run tests again. If they pass, clear debug.md and continue; if the task still fails at MAX_RETRIES total attempts, auto-skip it:
- Do NOT ask the user. The loop is fully autonomous after pre-flight.
- Log the skip to learnings.md (include the root cause from debug.md if available).

After ALL tasks complete, dispatch ralph-merger in the foreground (never background) with:
while true:
Dispatch ralph-merger with: task outputs + notes + WORKSPACE_RULES
Merger combines outputs into workspace/final/, resolves integration issues.
Dispatch ralph-judge with:
agent_type: "merger"
task definitions + output location (workspace/final/) + JUDGE_RUBRIC + WORKSPACE_RULES
if judge passes:
break → proceed to Step 3b
else:
Dispatch fresh ralph-merger with:
original prompt + "JUDGE REJECTED YOUR DELIVERABLE:\n{judge_verdict}\nFix these issues."
(loop until judge passes — no retry limit)
Per-task learnings were already written during Phase 2 (Step 2d). The merger now appends a run summary that ties them together:
## {date} — {original user query (shortened)}
**Result:** {passed}/{total} tasks passed | **Attempts:** {total_attempts} | **Time:** {elapsed}
### Run Summary
- {1-2 sentence overview of what was built and how it went}
### Cross-Task Patterns
- {pattern that emerged across multiple tasks — e.g., "all 3 tasks hit the same bcrypt gotcha"}
- {architectural insight from how the pieces fit together}
### Anti-Patterns to Avoid
- {approach that failed across tasks — only if it's a trap others would fall into}
Rules for the run summary:
## {date} — {query}
**Result:** {N}/{N} passed | **Attempts:** {N} | **Time:** {elapsed}
Clean run — no cross-task patterns to note.
If debug.md was used during the run, clear it:
_Empty — ready for next debug session._
debug.md is a scratch pad. learnings.md is the permanent record.
Produce the summary report and present it to the user. The merged output in workspace/final/ is the deliverable.