From evanflow
Orchestrates parallel implementation of 3+ independent tasks with coder-overseer pairs, reviewing for bugs, gaps, errors, and cohesion against shared contracts.
npx claudepluginhub evanklem/evanflow --plugin evanflow

This skill uses the workspace's default tool permissions.
See `evanflow` meta-skill for shared terms. New roles introduced here:
- `evanflow-tdd`. Writes tests first. Outputs code + tests + brief summary.

SKIP when:

- `evanflow-executing-plans` instead

Before spawning anyone, the orchestrator writes a contract at `.claude/orchestration/<topic>-contract.md` (or any path the user prefers). Contents:
- `<resource>.ts`, services are `<resource>-service.ts`")
- `Result<T, Error>`", "all DB writes go through the canonical write helper documented in `CLAUDE.md`")
- `CONTEXT.md` and relevant ADRs

### Coder 2: rate-limit service (example)

- test: returns full cap when no usage recorded
  assert: `getRemainingThisWeek(userId)` returns `{ remaining: 25, resetsAt: null }` for a fresh user
  surface: services/rate-limit.ts (public)
- test: counts ACTIVE and PENDING rows but excludes CANCELLED/FAILED
  assert: after seeding rows of varied statuses, `getRemainingThisWeek` returns the correct count
  surface: services/rate-limit.ts (public)
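The two contract entries above could translate into checks along the following lines. This is a minimal sketch, not the project's actual service: the `UsageRow` shape, the `makeRateLimitService` factory, and the in-memory storage are hypothetical, and the weekly cap of 25 is only inferred from the `remaining: 25` assertion. The contract does not say what `resetsAt` holds once usage exists, so the sketch leaves it `null` and does not model the weekly window.

```typescript
// Hypothetical row shape; the real schema is not given in the contract.
type UsageStatus = "ACTIVE" | "PENDING" | "CANCELLED" | "FAILED";

interface UsageRow {
  userId: string;
  status: UsageStatus;
}

interface RemainingQuota {
  remaining: number;
  resetsAt: Date | null;
}

// Assumed weekly cap, inferred from the `remaining: 25` assertion.
const WEEKLY_CAP = 25;

// Hypothetical factory backed by an in-memory array instead of a database.
function makeRateLimitService(rows: readonly UsageRow[]) {
  return {
    getRemainingThisWeek(userId: string): RemainingQuota {
      // Only ACTIVE and PENDING rows consume quota; CANCELLED and FAILED do not.
      const used = rows.filter(
        (row) =>
          row.userId === userId &&
          (row.status === "ACTIVE" || row.status === "PENDING"),
      ).length;
      // Week windowing and a non-null resetsAt are not modeled in this sketch.
      return { remaining: Math.max(0, WEEKLY_CAP - used), resetsAt: null };
    },
  };
}

// Test 1: returns full cap when no usage recorded.
const freshQuota = makeRateLimitService([]).getRemainingThisWeek("user-1");

// Test 2: counts ACTIVE and PENDING rows but excludes CANCELLED/FAILED.
const seededQuota = makeRateLimitService([
  { userId: "user-1", status: "ACTIVE" },
  { userId: "user-1", status: "PENDING" },
  { userId: "user-1", status: "CANCELLED" },
  { userId: "user-1", status: "FAILED" },
  { userId: "user-2", status: "ACTIVE" }, // another user's usage must not count
]).getRemainingThisWeek("user-1");
```

Both checks go through the public function only, which is what the `surface: ... (public)` field in each entry asks for.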
The contract is the single source of truth for everyone downstream. If it's wrong or ambiguous, fix it BEFORE spawning agents — patching the contract mid-orchestration causes drift.
Each coder task is a self-contained brief. Includes:
- `evanflow-tdd`)

Max 5 coders in parallel. More than that is unmanageable to review.
Coder dispatch happens in two phases to enforce TDD at the orchestration level.
Phase A — RED checkpoint. Single message, multiple Agent calls. Prefer subagent_type: evanflow-coder if available (tool-restricted to prevent git ops and other dangerous actions); else general-purpose. Each coder gets:
"Phase A: write ONLY the first failing test for your first behavior (per the contract's test list for your section). Run it. Confirm it fails for the right reason — not a setup error, not an import error, not a missing dependency. Report back: test file path, test name, the exact failure message, and confirmation that the failure matches expected behavior. Do NOT write any implementation yet. Do NOT touch any production source file other than minimal scaffolding (e.g., empty function stubs that exist only so the import resolves)."
After all coders return Phase A reports, the orchestrator verifies every test is RED:
If any test isn't cleanly RED, send that coder back with the specific issue. Do NOT proceed to Phase B until all RED reports check out.
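A minimal sketch of that gate, based on the criteria in the Phase A prompt (the test must fail, and not because of a setup, import, or dependency error). The `PhaseAReport` shape and the regular expressions are assumptions for illustration; a real orchestrator would read the coder's free-form report and adapt the patterns to its test runner.

```typescript
// Hypothetical shape of a coder's Phase A report; field names are assumptions.
interface PhaseAReport {
  coder: string;
  testFile: string;
  testName: string;
  failureMessage: string; // empty string if the test unexpectedly passed
}

// Assumed heuristics for failures caused by setup problems rather than behavior.
const SETUP_FAILURE_PATTERNS: readonly RegExp[] = [
  /cannot find module/i,
  /failed to resolve import/i,
  /is not defined/i,
  /syntaxerror/i,
];

type RedVerdict = { ok: true } | { ok: false; reason: string };

function checkRed(report: PhaseAReport): RedVerdict {
  const message = report.failureMessage.trim();
  if (message === "") {
    return { ok: false, reason: "test did not fail" };
  }
  if (SETUP_FAILURE_PATTERNS.some((pattern) => pattern.test(message))) {
    return { ok: false, reason: "failed on setup, not on behavior" };
  }
  return { ok: true };
}

// Phase B may start only when this returns an empty list.
function codersToSendBack(reports: readonly PhaseAReport[]): string[] {
  return reports.filter((r) => !checkRed(r).ok).map((r) => r.coder);
}
```

Pattern matching only filters out the obvious setup failures. Whether the failure message matches the expected behavior still needs a read against the contract entry.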
Phase B — vertical-slice GREEN. Re-message each coder:

"RED checkpoint confirmed. Phase B: vertical-slice TDD per `.claude/skills/evanflow-tdd/SKILL.md`. One test → minimal impl → confirm GREEN → next test → repeat. Cover ALL behaviors named in the contract for your section. Watch each test fail before writing the implementation that makes it pass. Conform to the contract exactly: if a test name in the contract is unclear or wrong, stop and report back instead of guessing. Do NOT modify files outside your task scope. Do NOT commit, stage, or run any git op. When done, report: files changed, every test name + status, integration tests touched (if any), anything deferred."
Coders run Phase B in parallel.
After each coder reports done, spawn its overseer. Prefer subagent_type: evanflow-overseer (the bundled read-only subagent — its tool restrictions enforce "report findings, never fix"). If a specialized code-reviewer subagent is available in your environment, that also works. Else general-purpose.
Each overseer gets:
- `git diff <files>` and passes it inline)

"Review the diff for:

- (a) bugs: wrong logic, off-by-ones, race conditions, missing error handling
- (b) gaps: behaviors in the contract that aren't tested or aren't implemented
- (c) errors: type unsafety, missing validation at boundaries, wrong domain language
- (d) cohesion violations: anything that diverges from the contract
- (e) TDD compliance: was each test written before the code that makes it pass? (Check Phase A report for RED, then Phase B order.) Are tests behavior-through-public-interface, or do they reach into internals? Would the tests survive a refactor that doesn't change behavior?
- (f) ASSERTION CORRECTNESS: research shows 62% of LLM-generated test assertions are wrong. For each assertion: would a one-character bug in the implementation still let it pass? If yes, the assertion is too weak. Is the assertion on the right field? Is the expected value computed correctly?
- (g) Five Failure Modes, with an explicit pass against each:
  - Hallucinated actions: invented paths, env vars, IDs, function names, library APIs not in the contract or codebase?
  - Scope creep: files or behaviors touched outside the brief?
  - Cascading errors: silent fallbacks, swallowed exceptions, suppressed failures that hide root cause?
  - Context loss: contradicts the contract, `CONTEXT.md`, ADRs, or established conventions?
  - Tool misuse: wrong tool for the job, or right tool with wrong params?

Report findings as a numbered list, each tagged severity (blocker / important / nit) and location (file:line). Do NOT propose fixes. Do NOT modify files. Do NOT commit, stage, or run git ops.

If using the `evanflow-overseer` subagent type, your tool restrictions (read-only) enforce this: you literally cannot fix, only report."
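The "one-character bug" question in item (f) of the overseer prompt can be made concrete. In this sketch, `remainingQuota` and its buggy twin are hypothetical stand-ins for code under review. The weak assertion still passes when `-` is mistyped as `+`; the strong one does not.

```typescript
// Hypothetical function under test.
function remainingQuota(cap: number, used: number): number {
  return cap - used;
}

// The same function with a one-character bug: "-" became "+".
function remainingQuotaBuggy(cap: number, used: number): number {
  return cap + used;
}

type QuotaFn = (cap: number, used: number) => number;

// Weak assertion: only checks that the result is non-negative.
function weakAssertionPasses(fn: QuotaFn): boolean {
  return fn(25, 3) >= 0;
}

// Strong assertion: pins the exact expected value.
function strongAssertionPasses(fn: QuotaFn): boolean {
  return fn(25, 3) === 22;
}

// An assertion "catches" the bug when it fails against the buggy version.
const weakCatchesBug = !weakAssertionPasses(remainingQuotaBuggy); // false
const strongCatchesBug = !strongAssertionPasses(remainingQuotaBuggy); // true
```

An overseer applying item (f) would flag the weak assertion as a finding, since the buggy implementation passes it.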
Prefer subagent_type: evanflow-overseer (tool-restricted to enforce read-only review). Else any specialized code-reviewer subagent your environment provides, or general-purpose.
Overseers run in parallel — single message, multiple Agent calls.
After all coder/overseer pairs return, spawn ONE final overseer (use evanflow-overseer again, or any specialized code-reviewer subagent your environment provides). Inputs:
- `git diff` against the working tree)

"You're checking cohesion across multiple coders' outputs. Look for:

- (a) type mismatches at boundaries: one coder produces type X, another expects type X'
- (b) naming drift: resource called `Foo` in one file, `Foos` in another, `foo_id` vs `fooId` inconsistencies
- (c) invariants applied inconsistently, e.g., one router uses `authenticatedProcedure`, another forgot
- (d) integration points that don't connect: coder A exports something coder B doesn't import, or shapes don't match
- (e) integration tests at touchpoints: for every touchpoint named in the contract, verify a passing integration test exists. Run it. Confirm it actually exercises the connection (not a stub or a mock). The integration test IS the executable contract; if it doesn't exist or doesn't verify, the cohesion guarantee is unproven.

Report findings tagged by severity and affected files. Do NOT fix."
Orchestrator collects every overseer finding:
For revisions: spawn that coder again with: original brief + the finding + their existing diff + "fix only this finding; don't expand scope." Re-run that coder's overseer afterward.
Hard cap: 3 reconciliation rounds. If still issues at round 3, the original decomposition or the contract was wrong — stop, report state, ask the user.
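The revision loop and its cap could be sketched as follows. The `OverseerFinding` shape and the `reviseCoder` / `rerunOverseer` callbacks are hypothetical stand-ins for agent calls, and they are synchronous here only to keep the example short. Accepting a finding without fixing it is not modeled.

```typescript
type Severity = "blocker" | "important" | "nit";

// Hypothetical finding shape, following the tags the overseer prompt asks for.
interface OverseerFinding {
  coder: string;
  severity: Severity;
  location: string; // "file:line"
  summary: string;
}

const MAX_RECONCILIATION_ROUNDS = 3;

interface ReconcileOutcome {
  status: "clean" | "escalate-to-user";
  roundsUsed: number;
  remaining: OverseerFinding[];
}

function reconcile(
  initial: readonly OverseerFinding[],
  reviseCoder: (finding: OverseerFinding) => void,
  rerunOverseer: (coder: string) => OverseerFinding[],
): ReconcileOutcome {
  let open = initial.slice();
  let round = 0;
  while (open.length > 0 && round < MAX_RECONCILIATION_ROUNDS) {
    round += 1;
    // One revision per finding: "fix only this finding; don't expand scope."
    for (const finding of open) reviseCoder(finding);
    // Re-run each affected coder's overseer once and collect what is still open.
    const coders = open
      .map((f) => f.coder)
      .filter((coder, index, all) => all.indexOf(coder) === index);
    let next: OverseerFinding[] = [];
    for (const coder of coders) next = next.concat(rerunOverseer(coder));
    open = next;
  }
  return {
    status: open.length === 0 ? "clean" : "escalate-to-user",
    roundsUsed: round,
    remaining: open,
  };
}
```

If findings are still open after the third round, the function stops and hands the remaining list back. It does not retry, which matches the instruction to stop, report state, and ask the user.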
When all overseers report clean (or remaining findings explicitly accepted):
- `tsc`, lint, `test:run` for affected workspaces)

A good final report:

- `<files>` — "
- `evanflow-tdd` in Phase B. Vertical slices: one test → impl → next. Watch each test fail before writing the impl that passes it.
- `evanflow-executing-plans` for that subset)
- `evanflow-executing-plans` (sequential)
- `evanflow-improve-architecture`

The split exists because you can't trust a coder to be its own reviewer. A coder optimizes for "make my task pass." An overseer optimizes for "find what's wrong." Different incentives produce different attention. Combining them means review-during-implementation, which catches less.
The integration overseer exists because per-task overseers can't see the whole. Each one has a narrow window — the contract + one diff. Boundary mismatches between two diffs are invisible from inside either. The integration overseer's job is the cross-section view.
The cohesion contract is prose. Integration tests are executable. Prose contracts drift the moment two people read them differently. An integration test cannot drift: either it passes, or someone's wrong. Forcing integration tests at every touchpoint converts cohesion from a hope into a guarantee.
The RED checkpoint catches the cheapest class of failures cheaply. A test that imports the wrong file, a test that asserts on the wrong field, a test that doesn't actually run — all of these are invisible while you're writing implementation. Catching them before any coder writes real code saves the entire coder + overseer cycle.
Vertical slices per coder prevent imagined-behavior tests. If a coder writes 7 tests up front and then 7 implementations, the tests describe what the coder thought the system would do, not what it does. One test → one impl → next test forces the tests to track what the code actually does.