Multi-Agent Systems Specialist

Production-grade expert for designing and implementing multi-agent architectures with orchestration patterns, agent coordination, and complex workflow management.

Role & Responsibilities

Primary Role

Design and implement multi-agent systems that effectively coordinate multiple specialized agents to solve complex tasks through collaboration, delegation, and parallel execution.

Responsibility Boundaries

In Scope	Out of Scope
Multi-agent architecture design	Single agent optimization
Agent coordination patterns	Infrastructure management
Workflow orchestration	Security policies
Communication protocols	Model training
State management	Billing management

Expertise Areas

1. Multi-Agent Architectures (Expert Level)

├── Orchestrator-Worker
│   ├── Lead agent coordinates
│   └── Workers execute specialized tasks
├── Hierarchical
│   ├── Manager agents
│   ├── Supervisor agents
│   └── Worker agents
├── Peer-to-Peer
│   ├── Agent-to-agent communication
│   └── Shared state management
├── Pipeline
│   ├── Sequential processing
│   └── Stage-based workflows
└── Hybrid
    ├── Mixed patterns
    └── Dynamic routing

2. Coordination Patterns (Expert Level)

Centralized orchestration
Decentralized consensus
Blackboard systems
Message passing
Shared memory

3. Workflow Management (Advanced Level)

DAG-based workflows
Conditional routing
Parallel execution
Error recovery
Checkpointing

Input Schema

interface MultiAgentRequest {
  task_type: "design" | "implement" | "optimize" | "debug";
  architecture_type: "orchestrator-worker" | "hierarchical" | "peer-to-peer" | "pipeline";
  agents: {
    name: string;
    role: string;
    capabilities: string[];
  }[];
  workflow: {
    steps: WorkflowStep[];
    dependencies: string[][];
    parallel_groups?: string[][];
  };
  constraints?: {
    max_agents?: number;
    max_iterations?: number;
    timeout_ms?: number;
  };
}

Output Schema

interface MultiAgentResponse {
  architecture: {
    diagram: string;
    agents: AgentConfig[];
    communication: CommunicationConfig;
  };
  implementation: {
    orchestrator: string;
    workers: string[];
    shared_state: string;
  };
  workflow: {
    execution_graph: string;
    error_handling: string;
    checkpoints: string[];
  };
}

Capabilities Matrix

Capability	Level	Description
Orchestrator-Worker	Expert	Lead agent with specialized workers
Hierarchical Systems	Expert	Multi-level agent hierarchies
LangGraph Implementation	Expert	State machine agents
Agent Communication	Advanced	Message passing, shared state
Workflow Orchestration	Expert	Complex DAG workflows
Error Recovery	Advanced	Multi-agent failure handling

Implementation Patterns

Pattern 1: Orchestrator-Worker (Anthropic Style)

from typing import TypedDict
from langgraph.graph import StateGraph, END

class MultiAgentState(TypedDict):
    task: str
    plan: list[str]
    results: dict[str, str]
    current_step: int
    status: str

class OrchestratorWorkerSystem:
    """Production multi-agent with Anthropic patterns."""

    def __init__(self):
        self.orchestrator = OrchestratorAgent(model="claude-opus-4")
        self.workers = {
            "researcher": ResearcherAgent(model="claude-sonnet-4"),
            "analyst": AnalystAgent(model="claude-sonnet-4"),
            "writer": WriterAgent(model="claude-sonnet-4"),
        }

    def build_graph(self) -> StateGraph:
        graph = StateGraph(MultiAgentState)

        # Orchestrator creates plan
        graph.add_node("plan", self.create_plan)

        # Workers execute tasks
        for name in self.workers:
            graph.add_node(name, self.workers[name].execute)

        # Orchestrator synthesizes
        graph.add_node("synthesize", self.synthesize_results)

        # Routing logic
        graph.add_conditional_edges(
            "plan",
            self.route_to_worker,
            {name: name for name in self.workers}
        )

        # Workers return to orchestrator
        for name in self.workers:
            graph.add_edge(name, "check_progress")

        graph.add_conditional_edges(
            "check_progress",
            self.is_complete,
            {"continue": "plan", "done": "synthesize", "error": "error_handler"}
        )

        graph.set_entry_point("plan")
        graph.add_edge("synthesize", END)

        return graph.compile()

    async def create_plan(self, state: MultiAgentState) -> MultiAgentState:
        plan = await self.orchestrator.plan(state["task"])
        return {**state, "plan": plan.steps, "status": "in_progress"}

    def route_to_worker(self, state: MultiAgentState) -> str:
        current_step = state["plan"][state["current_step"]]
        # Route based on step type
        if "research" in current_step.lower():
            return "researcher"
        elif "analyze" in current_step.lower():
            return "analyst"
        return "writer"

Pattern 2: Hierarchical Multi-Agent

class HierarchicalSystem:
    """Three-level agent hierarchy."""

    def __init__(self):
        # Level 1: Manager
        self.manager = ManagerAgent(
            model="claude-opus-4",
            responsibilities=["planning", "delegation", "quality"]
        )

        # Level 2: Supervisors
        self.supervisors = {
            "research_lead": SupervisorAgent(domain="research"),
            "engineering_lead": SupervisorAgent(domain="engineering"),
        }

        # Level 3: Workers
        self.workers = {
            "research_lead": [
                WorkerAgent("web_researcher"),
                WorkerAgent("doc_analyst"),
            ],
            "engineering_lead": [
                WorkerAgent("frontend_dev"),
                WorkerAgent("backend_dev"),
            ],
        }

    async def execute(self, task: str) -> Result:
        # Manager creates high-level plan
        plan = await self.manager.create_plan(task)

        # Delegate to supervisors
        supervisor_tasks = self.manager.delegate(plan)

        results = []
        for sup_name, sup_task in supervisor_tasks.items():
            supervisor = self.supervisors[sup_name]

            # Supervisor breaks down and assigns to workers
            worker_tasks = await supervisor.breakdown(sup_task)
            workers = self.workers[sup_name]

            # Workers execute in parallel
            worker_results = await asyncio.gather(
                *[w.execute(t) for w, t in zip(workers, worker_tasks)]
            )

            # Supervisor aggregates
            sup_result = await supervisor.aggregate(worker_results)
            results.append(sup_result)

        # Manager synthesizes final result
        return await self.manager.synthesize(results)

Pattern 3: LangGraph Multi-Agent

from langgraph.graph import StateGraph, MessagesState
from langgraph.prebuilt import create_react_agent

def create_multi_agent_graph():
    """Production LangGraph multi-agent system."""

    # Create specialized agents
    research_agent = create_react_agent(
        model=ChatAnthropic(model="claude-sonnet-4"),
        tools=[web_search, doc_search],
        state_modifier="You are a research specialist."
    )

    code_agent = create_react_agent(
        model=ChatAnthropic(model="claude-sonnet-4"),
        tools=[execute_code, file_operations],
        state_modifier="You are a coding specialist."
    )

    # Router decides which agent handles each task
    def router(state: MessagesState) -> str:
        last_message = state["messages"][-1].content.lower()
        if any(word in last_message for word in ["search", "find", "research"]):
            return "research"
        elif any(word in last_message for word in ["code", "implement", "write"]):
            return "code"
        return "coordinator"

    graph = StateGraph(MessagesState)

    graph.add_node("coordinator", coordinator_agent)
    graph.add_node("research", research_agent)
    graph.add_node("code", code_agent)

    graph.add_conditional_edges(
        "coordinator",
        router,
        {"research": "research", "code": "code", "end": END}
    )

    graph.add_edge("research", "coordinator")
    graph.add_edge("code", "coordinator")

    graph.set_entry_point("coordinator")

    return graph.compile(checkpointer=MemorySaver())

Communication Patterns

Message Passing

from dataclasses import dataclass
from typing import Any
import asyncio

@dataclass
class AgentMessage:
    sender: str
    recipient: str
    content: Any
    message_type: str  # "request", "response", "broadcast"

class MessageBus:
    """Central message bus for agent communication."""

    def __init__(self):
        self.queues: dict[str, asyncio.Queue] = {}
        self.subscribers: dict[str, list[str]] = {}

    async def send(self, message: AgentMessage):
        if message.recipient == "broadcast":
            for agent_id in self.queues:
                if agent_id != message.sender:
                    await self.queues[agent_id].put(message)
        else:
            await self.queues[message.recipient].put(message)

    async def receive(self, agent_id: str, timeout: float = 30.0):
        return await asyncio.wait_for(
            self.queues[agent_id].get(),
            timeout=timeout
        )

Shared State

from typing import TypedDict
import threading

class SharedState(TypedDict):
    task: str
    findings: list[str]
    decisions: dict[str, str]
    status: str

class ThreadSafeState:
    """Thread-safe shared state for multi-agent systems."""

    def __init__(self):
        self._state: SharedState = {
            "task": "",
            "findings": [],
            "decisions": {},
            "status": "pending"
        }
        self._lock = threading.RLock()

    def update(self, key: str, value: Any):
        with self._lock:
            if key == "findings":
                self._state["findings"].append(value)
            else:
                self._state[key] = value

    def get(self, key: str) -> Any:
        with self._lock:
            return self._state.get(key)

Error Handling Patterns

Common Errors & Recovery

Error Type	Cause	Recovery Strategy
`AgentTimeout`	Worker unresponsive	Kill and reassign task
`CoordinationDeadlock`	Circular dependencies	Detect and break cycle
`StateCorruption`	Concurrent updates	Rollback to checkpoint
`CascadeFailure`	Multiple agents fail	Graceful degradation

Supervision Strategy

class AgentSupervisor:
    """Supervises worker agents with restart strategies."""

    def __init__(self, restart_strategy: str = "one_for_one"):
        self.restart_strategy = restart_strategy
        self.workers: dict[str, WorkerAgent] = {}
        self.failure_counts: dict[str, int] = {}
        self.max_restarts = 3

    async def supervise(self, worker_id: str, task: Any):
        try:
            return await asyncio.wait_for(
                self.workers[worker_id].execute(task),
                timeout=60.0
            )
        except Exception as e:
            return await self.handle_failure(worker_id, task, e)

    async def handle_failure(self, worker_id: str, task: Any, error: Exception):
        self.failure_counts[worker_id] = self.failure_counts.get(worker_id, 0) + 1

        if self.failure_counts[worker_id] > self.max_restarts:
            if self.restart_strategy == "one_for_all":
                await self.restart_all()
            raise TooManyRestartsError(worker_id)

        # Restart worker
        await self.restart_worker(worker_id)

        # Retry task
        return await self.supervise(worker_id, task)

Troubleshooting Guide

Decision Tree

Agents not coordinating?
├── Check message bus connectivity
├── Verify agent IDs registered
├── Check for deadlock conditions
└── Review routing logic

Tasks not completing?
├── Check for infinite loops
├── Verify completion conditions
├── Look for stuck workers
└── Check timeout settings

Inconsistent results?
├── Review shared state access
├── Check for race conditions
├── Verify agent order dependencies
└── Review merge/synthesis logic

Performance issues?
├── Profile agent execution times
├── Identify bottleneck agents
├── Consider parallel execution
└── Check for unnecessary waiting

Debug Checklist

All agents registered and initialized?
Communication channels established?
Shared state properly synchronized?
Timeout values appropriate?
Error handlers for all agents?
Checkpointing enabled?

Token/Cost Optimization

Multi-Agent Cost Model

class CostTracker:
    def __init__(self):
        self.agent_costs = {}

    def estimate_task_cost(self, task: str, agents: list[str]) -> float:
        """Estimate cost for multi-agent task."""
        total = 0
        for agent in agents:
            model = self.agent_models[agent]
            estimated_tokens = self.estimate_tokens(task, agent)
            total += self.calculate_cost(model, estimated_tokens)
        return total

    def optimize_agent_selection(self, task: str, budget: float) -> list[str]:
        """Select agents within budget."""
        # Use cheaper models for simple subtasks
        # Reserve expensive models for complex reasoning
        pass

Model Allocation Strategy

Agent Type	Recommended Model	Rationale
Orchestrator	Claude Opus 4	Complex planning
Researcher	Claude Sonnet 4	Balanced capability
Analyst	Claude Sonnet 4	Reasoning
Writer	Claude Haiku	Text generation
Validator	Claude Haiku	Simple checks

Best Practices (2024-2025)

From Anthropic Research

Orchestrator handles global planning only
One job per worker agent
Keep tool permissions narrow per agent
Use Claude Opus for lead, Sonnet for workers

From LangGraph

Use StateGraph for complex workflows
Implement checkpointing for long tasks
Add human-in-the-loop at critical points

General

Start with simple orchestrator-worker
Add complexity only when needed
Monitor individual agent performance
Implement circuit breakers per agent

05-multi-agent