Use when recovering from agent failures or coordinating agent replacements. Trigger with failure events.
Install:

```shell
npx claudepluginhub emasoft/emasoft-plugins --plugin emasoft-chief-of-staff
```

This skill uses the workspace's default tool permissions.
This skill teaches the Emasoft Chief of Staff (ECOS) how to detect, classify, and recover from agent failures in a multi-agent system coordinated via AI Maestro messaging.
Reference documents:

- references/agent-replacement-protocol.md
- references/examples.md
- references/failure-classification.md
- references/failure-detection.md
- references/op-classify-failure-severity.md
- references/op-detect-agent-failure.md
- references/op-emergency-handoff.md
- references/op-execute-recovery-strategy.md
- references/op-replace-agent.md
- references/op-route-task-blocker.md
- references/recovery-operations.md
- references/recovery-strategies.md
- references/troubleshooting.md
- references/work-handoff-during-failure.md

Handles agent replacement in AI Maestro by compiling task context from GitHub issues, kanban cards, and message history, generating handoff documents, reassigning tasks, and confirming with AMCOS.
Related skills:

- Implements circuit breaker pattern for agentic tool calls: tracks health via closed/open/half-open states, reduces scope on failures, routes to alternatives, enforces failure budgets. For fault-tolerant agent workflows.
- Guides building reliable autonomous AI agents with ReAct/Plan-Execute loops, reflection patterns, goal decomposition, and frameworks like LangGraph/CrewAI. For production agent reliability.
When to use this skill:
Before using this skill, ensure:
Copy this checklist and track your progress:
## ECOS Failure Response Checklist
Agent: _______________
Failure detected: _______________
### Detection
- [ ] Heartbeat status checked
- [ ] AI Maestro agent status queried
- [ ] Message delivery verified
- [ ] Task progress reviewed
### Classification
- [ ] Failure type determined: [ ] Transient [ ] Recoverable [ ] Terminal
- [ ] Evidence documented
- [ ] Incident logged
### Response (choose path)
#### If Transient:
- [ ] Waited for auto-recovery (< 5 min)
- [ ] Verified agent responsive
- [ ] Resumed normal monitoring
#### If Recoverable:
- [ ] Manager notified
- [ ] Recovery strategy selected
- [ ] Recovery attempted
- [ ] Recovery verified OR escalated to replacement
#### If Terminal:
- [ ] Manager notified
- [ ] Replacement approval requested
- [ ] Artifacts preserved
- [ ] Replacement agent created
- [ ] Orchestrator notified
- [ ] Handoff documentation sent
- [ ] New agent acknowledged
- [ ] Incident closed
### Emergency Handoff (if deadline critical):
- [ ] Critical tasks identified
- [ ] Orchestrator notified
- [ ] Receiving agent assigned
- [ ] Handoff documentation created
- [ ] Work transferred
- [ ] Deadline met OR escalated
| Recovery Type | Output |
|---|---|
| Agent restart | Agent back online, state restored |
| Communication | Message queue cleared, connection restored |
| State | Corrupted state replaced with backup |
```
DETECT                 CLASSIFY                  RESPOND
  |                        |                        |
  v                        v                        v
Heartbeat timeout?     Transient? ---- Yes ---->  Wait & Retry
Message delivery           | No                   (auto-recover)
failed?                    v
Agent offline?         Recoverable? -- Yes ---->  Restart / Wake
                           | No                   (intervention needed)
                           v
                       Terminal -------------->   Replace Agent
                                                  (full protocol)
```
| Phase | Action | Reference Document |
|---|---|---|
| 1 | Detect failure | failure-detection.md |
| 2 | Classify severity | failure-classification.md |
| 3 | Attempt recovery | recovery-strategies.md |
| 4 | Replace if terminal | agent-replacement-protocol.md |
| 5 | Emergency handoff | work-handoff-during-failure.md |
Before responding to a failure, ECOS must first detect that a failure has occurred.
Read references/failure-detection.md for:
| Mechanism | Signal | Response Time |
|---|---|---|
| Heartbeat timeout | Missed pings | 30-60 seconds |
| Message delivery failure | API error | Immediate |
| Message acknowledgment timeout | No ACK | 5-15 minutes |
| Task completion timeout | Stalled progress | Variable |
Once detected, classify severity to determine response.
Read references/failure-classification.md for:
| Category | Severity | Recovery | Example |
|---|---|---|---|
| Transient | Low | Automatic (< 5 min) | Network hiccup, API rate limit |
| Recoverable | Medium | With intervention | Session hibernated, out of memory |
| Terminal | High | Replacement required | Host crash, disk corruption |
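The table's classification can be sketched as a set lookup. The signal names below are illustrative assumptions taken from the example column, not the skill's actual vocabulary:

```python
# Illustrative signal sets, drawn from the examples in the table above.
TRANSIENT = {"network-hiccup", "api-rate-limit"}
RECOVERABLE = {"session-hibernated", "out-of-memory"}

def classify_failure(signal: str) -> str:
    """Map a failure signal to one of the three severity categories."""
    if signal in TRANSIENT:
        return "transient"     # low severity: auto-recovers in < 5 min
    if signal in RECOVERABLE:
        return "recoverable"   # medium severity: needs intervention
    return "terminal"          # high severity: replacement required
```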
For transient and recoverable failures, attempt recovery before escalating.
Read references/recovery-strategies.md for:
| Strategy | When to Use | Time to Recover |
|---|---|---|
| Wait and Retry | Transient failures | 1-5 minutes |
| Restart | Hung/crashed agent | 5-15 minutes |
| Hibernate-Wake | Idle/suspended session | 2-5 minutes |
| Resource Adjustment | Memory/disk exhaustion | 15-60 minutes |
| Replace | All above failed | 30-120 minutes |
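The "Wait and Retry" strategy can be sketched as a bounded probe loop: try a few times with growing delays, then report failure so ECOS escalates to the next strategy. The probe callable is an assumption standing in for a real agent health check:

```python
import time
from typing import Callable

def wait_and_retry(probe: Callable[[], bool], attempts: int = 3,
                   base_delay_s: float = 1.0) -> bool:
    """Probe the agent up to `attempts` times with linear backoff."""
    for attempt in range(attempts):
        if probe():
            return True                            # transient failure cleared
        time.sleep(base_delay_s * (attempt + 1))   # back off between probes
    return False                                   # escalate to restart
```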
When recovery fails or failure is terminal, create a replacement agent.
Read references/agent-replacement-protocol.md for:
```
ECOS detects terminal failure
        |
        v
ECOS notifies EAMA (manager) --> EAMA approves
        |
        v
ECOS coordinates new agent creation
        |
        v
ECOS notifies EOA (orchestrator) to:
  - Generate handoff document
  - Update GitHub Project kanban
        |
        v
ECOS sends handoff docs to new agent
        |
        v
New agent acknowledges and begins work
```
CRITICAL: The replacement agent has NO MEMORY of the old agent.
The new agent does not know what tasks were assigned, what work was in progress, or the project context. Therefore:
ROLE BOUNDARY: ECOS creates agents and sends context. EOA owns task assignment.
Use the emergency handoff when critical work cannot wait for the full replacement protocol.
Read references/work-handoff-during-failure.md for:
| Aspect | Regular Handoff | Emergency Handoff |
|---|---|---|
| Timing | After replacement ready | Immediately |
| Completeness | Full context | Minimum viable |
| Recipient | Replacement agent | Any available agent |
| Duration | Permanent | Temporary |
ECOS handles TWO types of escalations differently:
An agent failure occurs when an agent crashes, becomes unresponsive, or repeatedly fails. ECOS handles this by:
A task blocker occurs when work cannot proceed due to missing information, access, or a decision that only the user can make. When ECOS receives a task blocker escalation from EOA:
Note: Use the `agent-messaging` skill to send messages. The JSON structure below shows the message content.
```json
{
  "from": "ecos-chief-of-staff",
  "to": "eama-assistant-manager",
  "subject": "BLOCKER: Task requires user decision",
  "priority": "high",
  "content": {
    "type": "blocker-escalation",
    "message": "A task is blocked and requires user input. EOA has escalated this after determining the blocker cannot be resolved by agents.",
    "task_uuid": "[task-uuid]",
    "issue_number": "[GitHub issue number of the blocked task]",
    "blocker_issue_number": "[GitHub issue number tracking the blocker problem]",
    "blocker_type": "user-decision",
    "blocker_description": "[What is blocking and why agents cannot resolve it]",
    "impact": "[Affected agents and tasks]",
    "options": ["[Options if available]"],
    "escalated_from": "eoa-[project-name]",
    "original_blocker_time": "[ISO8601 timestamp]"
  }
}
```
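A helper that fills this template might look like the following sketch. The function name and parameters are assumptions; actually sending the message is left to the `agent-messaging` skill and is not shown:

```python
from datetime import datetime, timezone

def blocker_escalation(task_uuid: str, issue_number: str,
                       blocker_issue_number: str, description: str,
                       impact: str, options: list[str],
                       project_name: str) -> dict:
    """Build the blocker-escalation message shown above (hypothetical helper)."""
    return {
        "from": "ecos-chief-of-staff",
        "to": "eama-assistant-manager",
        "subject": "BLOCKER: Task requires user decision",
        "priority": "high",
        "content": {
            "type": "blocker-escalation",
            "message": ("A task is blocked and requires user input. EOA has "
                        "escalated this after determining the blocker cannot "
                        "be resolved by agents."),
            "task_uuid": task_uuid,
            "issue_number": issue_number,
            "blocker_issue_number": blocker_issue_number,
            "blocker_type": "user-decision",
            "blocker_description": description,
            "impact": impact,
            "options": options,
            "escalated_from": f"eoa-{project_name}",
            "original_blocker_time": datetime.now(timezone.utc).isoformat(),
        },
    }
```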
```
ECOS receives escalation
│
├─ Is it an agent failure? (crash, unresponsive, repeated failure)
│  └─ YES → Handle via failure recovery workflow (this skill)
│
├─ Is it a task blocker that ECOS can resolve?
│  ├─ Agent reassignment → Handle directly
│  └─ Permission within authority → Handle directly
│
└─ Is it a task blocker requiring user input?
   └─ YES → Route to EAMA using blocker-escalation template above
```
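The routing above can be expressed as a small function. The escalation category names are illustrative assumptions:

```python
def route_escalation(kind: str, within_ecos_authority: bool = False) -> str:
    """Decide where an incoming escalation goes, following the tree above."""
    if kind == "agent-failure":
        return "failure-recovery-workflow"   # handled by this skill
    if kind == "task-blocker" and within_ecos_authority:
        return "handle-directly"             # reassignment / permission grant
    return "escalate-to-eama"                # requires user input
```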
Copy this checklist and track your progress:
- [ ] `blocker_issue_number` included in the message (the GitHub issue tracking the blocker problem)

Copy this checklist and track your progress:
| Data | Location |
|---|---|
| Heartbeat configuration | $CLAUDE_PROJECT_DIR/.ecos/agent-health/heartbeat-config.json |
| Task tracking | $CLAUDE_PROJECT_DIR/.ecos/agent-health/task-tracking.json |
| Incident log | $CLAUDE_PROJECT_DIR/.ecos/agent-health/incident-log.jsonl |
| Recovery log | $CLAUDE_PROJECT_DIR/.ecos/agent-health/recovery-log.jsonl |
| Handoff documents | $CLAUDE_PROJECT_DIR/thoughts/shared/handoffs/AGENT_NAME/ |
| Emergency handoffs | $CLAUDE_PROJECT_DIR/thoughts/shared/handoffs/emergency/ |
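Appending to the JSONL incident log might look like this sketch. The record fields are assumptions consistent with the classification checklist; the format is one JSON object per line:

```python
import json
import os
from datetime import datetime, timezone

def log_incident(agent: str, failure_type: str, evidence: str,
                 log_path: str) -> None:
    """Append one incident record to the JSONL incident log."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "agent": agent,
        "failure_type": failure_type,
        "evidence": evidence,
    }
    os.makedirs(os.path.dirname(log_path) or ".", exist_ok=True)
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")   # one JSON object per line
```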
| Situation | Priority | Message Type |
|---|---|---|
| Transient failure (pattern) | normal | escalation |
| Recoverable failure detected | high | failure-report |
| Recovery attempt failed | high | failure-report |
| Terminal failure detected | urgent | replacement-request |
| Emergency handoff initiated | urgent | emergency-handoff-notification |
| Replacement complete | normal | replacement-complete |
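The matrix can be held as a lookup table. The situation keys below are illustrative assumptions; the priorities and message types are copied from the table:

```python
# Situation key -> (priority, message type), mirroring the matrix above.
ESCALATION_MATRIX = {
    "transient-failure-pattern": ("normal", "escalation"),
    "recoverable-failure":       ("high",   "failure-report"),
    "recovery-attempt-failed":   ("high",   "failure-report"),
    "terminal-failure":          ("urgent", "replacement-request"),
    "emergency-handoff":         ("urgent", "emergency-handoff-notification"),
    "replacement-complete":      ("normal", "replacement-complete"),
}

def escalation_for(situation: str) -> tuple[str, str]:
    """Return (priority, message_type) for a known situation."""
    return ESCALATION_MATRIX[situation]
```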
Step-by-step runbooks for executing individual failure recovery operations. Use these when performing a specific operation within the failure recovery workflow.
Common issues when recovering from agent failures.
Read references/troubleshooting.md for:
Before sending any handoff document (regular or emergency), validate using this checklist:
### Handoff Validation Checklist
Before sending handoff:
- [ ] All required fields present (from/to/type/UUID/task)
- [ ] UUID is unique (check existing handoffs: `ls $CLAUDE_PROJECT_DIR/thoughts/shared/handoffs/`)
- [ ] Target agent exists and is alive (use the `ai-maestro-agents-management` skill to list agents and verify the target is online)
- [ ] All referenced files exist (`test -f <path> && echo "EXISTS" || echo "MISSING"`)
- [ ] No placeholder [TBD] markers (`grep -r "\[TBD\]" handoff.md`)
- [ ] Document is valid markdown (no broken links, proper formatting)
- [ ] Acceptance criteria clearly defined
- [ ] Current state accurately reflects reality
- [ ] Contact information for questions provided
Required fields for failure recovery handoffs:
| Field | Description | Example |
|---|---|---|
| `from` | Sending agent name | `ecos-chief-of-staff` |
| `to` | Target agent name | `replacement-agent-001` |
| `type` | Handoff type | `emergency-handoff`, `replacement-handoff` |
| `UUID` | Unique handoff identifier | `EH-20250204-svgbbox-001` |
| `task` | Task being handed off | Implement bounding box calculation |
| `failed_agent` | Name of failed agent | `libs-svg-svgbbox` |
| `failure_reason` | Why the agent failed | Terminal crash - disk corruption |
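A validator enforcing the required fields and the [TBD] check might look like the sketch below. The file-existence and agent-liveness checks from the checklist are omitted:

```python
REQUIRED_FIELDS = {"from", "to", "type", "UUID", "task",
                   "failed_agent", "failure_reason"}

def validate_handoff(fields: dict, body: str) -> list[str]:
    """Return a list of problems; an empty list means the handoff may be sent."""
    problems = [f"missing required field: {name}"
                for name in sorted(REQUIRED_FIELDS - fields.keys())]
    if "[TBD]" in body:
        problems.append("placeholder [TBD] marker present")
    return problems
```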
| Error | Cause | Resolution |
|---|---|---|
| Agent unresponsive | Network issue or crash | Send ping, wait 30s, then classify |
| Recovery failed | State corrupted | Escalate to terminal, request replacement |
| Handoff rejected | Target agent busy | Queue handoff, retry in 5 minutes |
| AI Maestro unavailable | Server down | Use fallback file-based communication |
Recovery scenarios with step-by-step commands.
Read references/examples.md for: