From claude-swarm
Diagnostic and recovery guidance for swarm coordination issues. Use this skill when you encounter 'spawn failed', need to 'diagnose team', 'fix swarm', resolve 'status mismatch', perform 'recovery', troubleshoot kitty/tmux issues, or deal with session crashes, multiplexer problems, or teammate failures. Covers diagnostics, spawn failures, status mismatches, recovery procedures, and common error patterns.
npx claudepluginhub joshuarweaver/cascade-ai-ml-agents-misc-1 --plugin und3rf10w-claude-litter
This skill uses the workspace's default tool permissions.
This skill provides comprehensive diagnostic and recovery procedures for swarm coordination issues.
# You try to spawn a teammate
/claude-swarm:swarm-spawn "backend-dev" "backend-developer" "sonnet" "..."
# Error: Could not find a valid kitty socket
# 1. Run diagnostics to identify the issue
/claude-swarm:swarm-diagnose my-team
# Output shows: kitty socket not found at expected location
# 2. Check kitty config
grep -E 'allow_remote_control|listen_on' ~/.config/kitty/kitty.conf
# 3. Fix: Add to kitty.conf if missing
# allow_remote_control yes
# listen_on unix:/tmp/kitty-$USER
# 4. Restart kitty completely and retry spawn
# 1. Check if teammates are actually alive
/claude-swarm:swarm-verify my-team
# Output: backend-dev: not found (session crashed)
# 2. Find status mismatches
/claude-swarm:swarm-reconcile my-team
# Output: backend-dev marked active but session missing - recommend removal
# 3. Resume the team (respawns offline members)
/claude-swarm:swarm-resume my-team
# After rebooting, team config shows active but all sessions are gone
# 1. Check current state
/claude-swarm:swarm-status my-team
# Shows: 3 members active, but multiplexer shows no sessions
# 2. Reconcile to auto-detect mismatches
/claude-swarm:swarm-reconcile my-team --auto-fix
# Automatically marks offline sessions as inactive
# 3. Resume team to respawn all members
/claude-swarm:swarm-resume my-team
Quick diagnostic rule: Always start with /claude-swarm:swarm-diagnose <team> - it runs all health checks and points you to the specific issue.
When using delegation mode (default), a spawned team-lead handles coordination. This affects how you troubleshoot.
| Issue Type | Who Should Diagnose | Commands |
|---|---|---|
| Team-lead unresponsive | You (orchestrator) | /swarm-diagnose, /swarm-status |
| Worker issues | Team-lead (first), then you | Ask team-lead to run /swarm-diagnose |
| Communication failures | Team-lead (first) | Ask team-lead to check and report |
| Task management issues | Team-lead | Team-lead manages tasks |
If team-lead is working, ask them to diagnose:
/claude-swarm:swarm-message team-lead "Please run /swarm-diagnose and report any issues"
# Or be more specific:
/claude-swarm:swarm-message team-lead "Worker backend-dev seems stuck. Can you verify they're alive and check their status?"
Why delegate diagnosis? Team-lead has full context of the team state and can both diagnose and fix issues directly.
If team-lead isn't responding, diagnose directly:
# 1. Check team status
/claude-swarm:swarm-status my-team
# 2. Is team-lead alive?
# Look for "team-lead" in status output - does window exist?
# 3. Run full diagnostics
/claude-swarm:swarm-diagnose my-team
# 4. If team-lead crashed, respawn them
/claude-swarm:swarm-reconcile my-team
/claude-swarm:swarm-spawn "team-lead" "team-lead" "sonnet" "You are the team-lead. Check /swarm-inbox for context. Resume coordination."
Intervene yourself when:
Let team-lead handle when:
# View raw team state (bypassing team-lead)
/claude-swarm:swarm-status my-team
/claude-swarm:task-list
# Diagnose directly
/claude-swarm:swarm-diagnose my-team
# Message workers directly (if team-lead down)
/claude-swarm:swarm-message backend-dev "Team-lead is unresponsive. What's your current status?"
# Broadcast to all (emergency)
/claude-swarm:swarm-broadcast "Team-lead is down. Please pause work and report status."
Swarm coordination involves multiple moving parts: multiplexers (tmux/kitty), Claude Code processes, file system state, and network communication. When issues arise, systematic diagnosis is essential.
First, identify the symptom category:
Always start with diagnostics before attempting fixes:
# Comprehensive health check - runs all diagnostics
/claude-swarm:swarm-diagnose <team-name>
# Check if teammates are actually alive
/claude-swarm:swarm-verify <team-name>
# Find and report status mismatches
/claude-swarm:swarm-reconcile <team-name>
# View current team state (members, tasks, multiplexer)
/claude-swarm:swarm-status <team-name>
What these commands check:
Issue Detected
│
├─ Can't spawn teammates?
│ └─ Run: /claude-swarm:swarm-diagnose <team>
│ ├─ "Multiplexer not found" → Install tmux/kitty
│ ├─ "Socket not found" → Check kitty config, restart kitty
│ ├─ "Duplicate name" → Use unique name or check existing teammates
│ └─ "Timeout" → Check system resources, retry
│
├─ Status shows teammates but they're not responding?
│ └─ Run: /claude-swarm:swarm-verify <team>
│ └─ Shows "not found" → Sessions crashed
│ └─ Run: /claude-swarm:swarm-reconcile <team>
│ └─ Then: /claude-swarm:swarm-resume <team>
│
├─ Messages not being received?
│ └─ Check: /claude-swarm:swarm-status <team>
│ ├─ Teammate shows "offline" → Respawn teammate
│ ├─ Wrong agent name used → Check exact names
│ └─ Teammate not checking inbox → Send reminder
│
└─ Task commands failing?
└─ Run: /claude-swarm:task-list
└─ Verify task ID exists, check status values
Spawn failures are the most common issue when creating swarm teams. Understanding the spawn process helps diagnose failures quickly.
How spawning works:
Symptoms of spawn failure:
- spawn_teammate or /claude-swarm:swarm-spawn returns error
Immediate diagnostic steps:
/claude-swarm:swarm-diagnose <team-name>
# For kitty users
kitten @ ls # Should list windows without error
# For tmux users
tmux list-sessions # Should list sessions without error
# Check Claude Code is working
claude --version # Should show version number
Troubleshooting workflow:
Spawn Command Fails
│
├─ Error mentions "multiplexer"?
│ └─ YES → See "Multiplexer Not Available" below
│
├─ Error mentions "socket"?
│ └─ YES → See "Kitty Socket Issues" below
│
├─ Error mentions "duplicate" or "already exists"?
│ └─ YES → See "Duplicate Agent Names" below
│
├─ Error mentions "timeout"?
│ └─ YES → See "Session Creation Timeout" below
│
├─ Error mentions "invalid" or "path traversal"?
│ └─ YES → See "Path Traversal Validation" below
│
└─ No clear error but spawn fails silently?
└─ Check: System resources, permissions, Claude Code installation
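The workflow above can be sketched as a small lookup that maps an error message to the section covering it. This is only an illustration: classify_spawn_error is a hypothetical helper, not part of the plugin, and the patterns are assumptions based on the sample error messages in this guide. Real error text may differ.

```shell
# Hypothetical helper: map a spawn error message to the section that covers it.
# Patterns are checked in the same order as the workflow above.
classify_spawn_error() {
    # Lowercase the message so matching is case-insensitive.
    msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$msg" in
        *multiplexer*|*"neither tmux nor kitty"*) echo "Multiplexer Not Available" ;;
        *socket*)                                 echo "Kitty Socket Issues" ;;
        *duplicate*|*"already exists"*)           echo "Duplicate Agent Names" ;;
        *timeout*)                                echo "Session Creation Timeout" ;;
        *invalid*|*"path traversal"*)             echo "Path Traversal Validation" ;;
        *) echo "Check system resources, permissions, Claude Code installation" ;;
    esac
}
```

For example, classify_spawn_error "Error: Could not find a valid kitty socket" prints Kitty Socket Issues.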
Common Causes:
Error:
Error: Neither tmux nor kitty is available
Solution:
# Install tmux (macOS)
brew install tmux
# Or install kitty
brew install --cask kitty
# Verify installation
which tmux # or: which kitty
Error:
Error: Agent name 'backend-dev' already exists in team
Solution:
# Use unique names
/claude-swarm:swarm-spawn "backend-dev-2" "backend-developer" "sonnet" "..."
# Or check existing teammates first
/claude-swarm:swarm-status <team-name>
Error (kitty):
Error: Could not find a valid kitty socket
Solution:
# 1. Verify kitty config has remote control enabled
grep -E 'allow_remote_control|listen_on' ~/.config/kitty/kitty.conf
# Should show:
# allow_remote_control yes
# listen_on unix:/tmp/kitty-$USER
# 2. Check socket exists (kitty appends -PID to path)
ls -la /tmp/kitty-$(whoami)-*
# 3. Test socket connectivity
kitten @ ls
# 4. Restart kitty completely if needed (not just reload)
# 5. Or manually set socket path
export KITTY_LISTEN_ON=unix:/tmp/kitty-$(whoami)-$KITTY_PID
Note: Kitty creates sockets at /tmp/kitty-$USER-$PID. The plugin auto-discovers the correct socket, but if you have multiple kitty instances, you may need to set KITTY_LISTEN_ON explicitly.
Deep dive on kitty socket discovery:
The spawn process tries sockets in this order:
1. $KITTY_LISTEN_ON environment variable (if set and valid)
2. /tmp/kitty-$USER-$KITTY_PID (exact match for current kitty)
3. /tmp/kitty-$USER-* sockets (newest first)
4. /tmp/kitty-$USER (fallback)
5. /tmp/mykitty and /tmp/kitty (alternative locations)
Each socket is validated with kitten @ --to $socket ls before use. If validation fails, the search continues.
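To reproduce the search by hand, the order above can be approximated in plain shell. This is a sketch under assumptions, not the plugin's actual code: the function names are hypothetical, and "newest first" is approximated with ls -t (modification time).

```shell
# Default validator: ask kitty to list windows through the given socket.
kitty_socket_ok() {
    kitten @ --to "$1" ls >/dev/null 2>&1
}

# Print candidate socket addresses in the documented priority order.
kitty_socket_candidates() (
    user=${USER:-$(id -un)}
    if [ -n "${KITTY_LISTEN_ON:-}" ]; then echo "$KITTY_LISTEN_ON"; fi
    if [ -n "${KITTY_PID:-}" ]; then echo "unix:/tmp/kitty-$user-$KITTY_PID"; fi
    # Per-PID sockets, most recently modified first.
    ls -t /tmp/kitty-"$user"-* 2>/dev/null | sed 's|^|unix:|'
    echo "unix:/tmp/kitty-$user"
    echo "unix:/tmp/mykitty"
    echo "unix:/tmp/kitty"
)

# Print the first candidate that passes validation; return 1 if none does.
# An alternative validator can be passed as the first argument.
find_kitty_socket() (
    check=${1:-kitty_socket_ok}
    for sock in $(kitty_socket_candidates); do
        if "$check" "$sock"; then
            echo "$sock"
            return 0
        fi
    done
    return 1
)
```

If find_kitty_socket prints an address, exporting it as KITTY_LISTEN_ON should make later spawns use that socket. If it prints nothing, no candidate answered kitten @ ls.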
Multiple kitty instances troubleshooting:
If you have multiple kitty windows open:
# List all kitty sockets
ls -la /tmp/kitty-$(whoami)-*
# Example output:
# /tmp/kitty-user-12345 (kitty window 1)
# /tmp/kitty-user-67890 (kitty window 2)
# Test each socket
kitten @ --to unix:/tmp/kitty-user-12345 ls
kitten @ --to unix:/tmp/kitty-user-67890 ls
# Set the correct socket for your team-lead window
export KITTY_LISTEN_ON=unix:/tmp/kitty-$(whoami)-$KITTY_PID
Configuration file location varies:
- Linux: ~/.config/kitty/kitty.conf
- macOS: ~/.config/kitty/kitty.conf or ~/Library/Preferences/kitty/kitty.conf
- To check which file is loaded: kitty --debug-config | grep "Config file"
Common kitty config issues:
- Use listen_on unix:/path, not listen_on /path
Example working kitty.conf:
# ~/.config/kitty/kitty.conf
allow_remote_control yes
listen_on unix:/tmp/kitty-$USER
# Note: $USER expands at kitty startup, then -$PID is appended automatically
Socket permission issues:
# Check socket permissions
ls -la /tmp/kitty-$(whoami)-*
# Should show: srw------- (socket, owner read-write only)
# If permissions are wrong:
# 1. Kill kitty completely
# 2. Remove old sockets: rm /tmp/kitty-$(whoami)-*
# 3. Restart kitty (will recreate with correct permissions)
Error:
Error: Invalid team name (path traversal detected)
Solution:
# Use simple team names without special characters
# Good: "auth-team", "feature-x", "bugfix_123"
# Bad: "../other-team", "team/name", "team..name"
Error:
Error: Timeout waiting for teammate session to start
Solution:
# Retry once (may be transient)
/claude-swarm:swarm-spawn "agent-name" ...
# Check system resources
top # Look for high CPU/memory usage
# Verify multiplexer is responsive
tmux list-sessions # or: kitty @ ls
Recovery Steps:
Symptoms:
Diagnosis:
/claude-swarm:swarm-reconcile <team-name>
This will report:
Common Causes:
Detection:
# Config shows active, but session doesn't exist
/claude-swarm:swarm-verify <team-name>
# Output: "Error: Session swarm-team-agent not found"
Solution:
# Run reconcile to update status
/claude-swarm:swarm-reconcile <team-name>
# Respawn the teammate
/claude-swarm:swarm-spawn "agent-name" "agent-type" "model" "prompt"
# Or resume the team (respawns all offline)
/claude-swarm:swarm-resume <team-name>
Detection: User manually killed tmux/kitty session outside of cleanup command
Solution:
# Reconcile will detect and fix
/claude-swarm:swarm-reconcile <team-name>
# Respawn if needed
/claude-swarm:swarm-spawn "agent-name" ...
Detection: Sessions killed but config files remain
Solution:
# Run cleanup properly
/claude-swarm:swarm-cleanup <team-name> --force
# Or, only if the cleanup command itself fails, manually remove config
rm ~/.claude/teams/<team-name>/config.json
Symptoms:
Diagnosis:
# Check team status
/claude-swarm:swarm-status <team-name>
# Verify teammate is alive
/claude-swarm:swarm-verify <team-name>
# Check inbox manually
cat ~/.claude/teams/<team-name>/inboxes/<agent-name>.json
Common Causes:
Solution:
- Remind teammates to check /claude-swarm:swarm-inbox regularly
Error:
Error: Agent 'backend' not found in team
Solution:
# Check exact agent names
/claude-swarm:swarm-status <team-name>
# Use exact name from status output
/claude-swarm:swarm-message "backend-dev" "message" # Not "backend"
Symptoms: Inbox command fails or shows garbled output
Solution:
# Back up current inbox
cp ~/.claude/teams/<team-name>/inboxes/<agent>.json ~/.claude/teams/<team-name>/inboxes/<agent>.json.bak
# Reset inbox
echo '[]' > ~/.claude/teams/<team-name>/inboxes/<agent>.json
# Notify sender to resend messages
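The backup-then-reset steps above can be wrapped in one function so the backup is never skipped. This is a minimal sketch: reset_inbox and the SWARM_TEAMS_DIR override are hypothetical names introduced for illustration, and the inbox path layout is taken from the commands in this guide.

```shell
# Hypothetical helper: back up an inbox file, then reset it to an empty list.
# SWARM_TEAMS_DIR is an assumed override for testing, not a plugin variable.
reset_inbox() (
    team=$1
    agent=$2
    inbox="${SWARM_TEAMS_DIR:-$HOME/.claude/teams}/$team/inboxes/$agent.json"
    if [ ! -f "$inbox" ]; then
        echo "No inbox at $inbox" >&2
        return 1
    fi
    # Timestamped backup (one-second resolution) so earlier copies are kept.
    backup="$inbox.$(date +%Y%m%d-%H%M%S).bak"
    cp "$inbox" "$backup" || return 1
    echo '[]' > "$inbox"
    # Print the backup path so the caller can inspect the old contents.
    echo "$backup"
)
```

Calling reset_inbox my-team backend-dev prints the backup path. The old messages stay in that file, which helps when asking senders what to resend.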
Symptoms:
Diagnosis:
# View current tasks
/claude-swarm:task-list
# Check task files directly
ls ~/.claude/tasks/<team-name>/*.json
Common Causes:
Error:
Error: Task #99 not found
Solution:
# List tasks to see valid IDs
/claude-swarm:task-list
# Use correct ID from list
/claude-swarm:task-update 3 --status "in_progress"
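When /claude-swarm:task-list is unavailable, task IDs can also be checked against the task files directly. The sketch below assumes one JSON file per task at ~/.claude/tasks/<team-name>/<id>.json, as the paths elsewhere in this guide suggest. The function names and the SWARM_TASKS_DIR override are hypothetical.

```shell
# Hypothetical helper: succeed only if a task file exists for the given ID.
# SWARM_TASKS_DIR is an assumed override for testing, not a plugin variable.
task_exists() {
    [ -f "${SWARM_TASKS_DIR:-$HOME/.claude/tasks}/$1/$2.json" ]
}

# Print the IDs of all task files for a team, one per line.
list_task_ids() (
    dir="${SWARM_TASKS_DIR:-$HOME/.claude/tasks}/$1"
    for f in "$dir"/*.json; do
        [ -e "$f" ] || continue   # the glob matched nothing
        basename "$f" .json
    done
)
```

For example, task_exists my-team 99 || list_task_ids my-team shows the valid IDs when task 99 does not exist.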
Error:
Error: Invalid status 'done'
Solution:
# Use valid status values:
# - pending
# - in_progress
# - blocked
# - in_review
# - completed
/claude-swarm:task-update 3 --status "completed" # Not "done"
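A script that builds task-update commands can check the status value before sending it. This is a small sketch: is_valid_status is a hypothetical helper, and its list simply mirrors the five values shown above.

```shell
# Hypothetical helper: succeed only for the status values listed above.
is_valid_status() {
    case "$1" in
        pending|in_progress|blocked|in_review|completed) return 0 ;;
        *) return 1 ;;
    esac
}
```

With this, is_valid_status done fails, while is_valid_status completed succeeds.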
Error:
Error: Agent 'frontend' not found in team
Solution:
# Check exact agent names
/claude-swarm:swarm-status <team-name>
# Use exact name
/claude-swarm:task-update 3 --assign "frontend-dev"
Symptoms:
Diagnosis:
# Check if team directory exists
ls -la ~/.claude/teams/<team-name>/
# Check permissions
ls -la ~/.claude/teams/
Common Causes:
Error:
Error: Team 'my-team' already exists
Solution:
# Choose different name
/claude-swarm:swarm-create "my-team-2" "description"
# Or cleanup old team first
/claude-swarm:swarm-cleanup "my-team" --force
Error:
Error: Permission denied creating ~/.claude/teams/my-team/
Solution:
# Fix permissions on Claude directory
chmod 700 ~/.claude/
chmod 700 ~/.claude/teams/
# Retry creation
/claude-swarm:swarm-create "my-team" "description"
Error:
Error: Invalid team name
Solution:
# Use alphanumeric with hyphens/underscores
# Good: "feature-auth", "bugfix_123", "team2"
# Bad: "../team", "team name", "team/123"
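The naming rule above can be checked before calling /claude-swarm:swarm-create. The exact rule the plugin enforces is not shown in this guide, so the sketch below assumes "letters, digits, hyphens, and underscores only". is_valid_team_name is a hypothetical helper.

```shell
# Hypothetical helper. Assumed rule: non-empty, and made up only of
# letters, digits, hyphens, and underscores.
is_valid_team_name() {
    case "$1" in
        ''|*[!A-Za-z0-9_-]*) return 1 ;;  # empty, or contains a disallowed character
        *) return 0 ;;
    esac
}
```

Under this assumption, every "Good" name above passes and every "Bad" name fails, including the path traversal cases such as "../team".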
When issues are diagnosed, choose the appropriate recovery approach. Three main strategies exist:
Soft Recovery - For minor issues (1-3 teammates offline, status mismatches):
/claude-swarm:swarm-reconcile <team-name> # Fix status mismatches
/claude-swarm:swarm-resume <team-name> # Respawn offline teammates
Partial Recovery - For specific component failures (corrupted inbox, broken task):
# Reset specific inbox
echo '[]' > ~/.claude/teams/<team-name>/inboxes/<agent>.json
# Fix specific task with jq
jq '.status = "in_progress"' ~/.claude/tasks/<team-name>/<id>.json > /tmp/task-fixed.json \
  && mv /tmp/task-fixed.json ~/.claude/tasks/<team-name>/<id>.json   # write the fix back
Hard Recovery - For complete team failure (corrupted config, non-functional team):
/claude-swarm:swarm-cleanup <team-name> --force
/claude-swarm:swarm-create <team-name> "Team description"
# Recreate tasks and respawn teammates
| Symptom | Recommended Strategy | Recovery Time |
|---|---|---|
| 1-3 teammates offline | Soft (reconcile + resume) | 30-120 seconds |
| Status mismatch only | Soft (reconcile) | 10 seconds |
| Inbox corruption | Partial (reset inbox) | 30 seconds |
| Task file corrupt | Partial (fix task) | 1-2 minutes |
| Config corrupt | Hard (recreate) | 5-10 minutes |
| Everything broken | Hard (full reset) | 10-15 minutes |
For detailed recovery procedures, use the Read tool to load references/recovery-procedures.md.
Prevention is significantly easier than recovery. Key practices:
Always verify teammates spawned successfully:
# After spawning team, ALWAYS verify
/claude-swarm:swarm-verify <team-name>
/claude-swarm:swarm-status <team-name>
Slash commands have built-in validation and error handling:
# Recommended: Use slash commands
/claude-swarm:swarm-spawn "backend-dev" "backend-developer" "sonnet" "Implement API"
# Avoid: Direct bash function calls (unless necessary)
Never retry blindly. Diagnose first, fix, then retry:
if ! /claude-swarm:swarm-spawn "agent" "worker" "sonnet" "prompt"; then
/claude-swarm:swarm-diagnose <team-name> # Diagnose the issue
# Fix the underlying problem
# Then retry once
fi
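The same diagnose, fix, retry-once pattern can be written as a reusable shell function. This is a sketch only: run_with_diagnosis is a hypothetical helper, and its three arguments are names of commands or shell functions that you supply. Slash commands run inside Claude Code, not from a plain shell.

```shell
# Hypothetical helper: run an action; on failure, diagnose, fix, and retry once.
#   $1 = action, $2 = diagnose step, $3 = fix step
run_with_diagnosis() (
    action=$1
    diagnose=$2
    fix=$3
    if "$action"; then return 0; fi
    "$diagnose" || true       # diagnostics are informational; keep going
    "$fix" || return 1        # do not retry if the fix itself failed
    "$action"                 # single retry; its status is the result
)
```

The function never retries more than once, so a persistent failure comes back to the caller instead of looping.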
For long-running teams (>1 hour), check health periodically:
# Every 15-30 minutes during active development
/claude-swarm:swarm-reconcile <team-name>
/claude-swarm:swarm-verify <team-name>
Always use cleanup commands, never manual deletion:
# Standard cleanup (preserves files for reference)
/claude-swarm:swarm-cleanup <team-name>
# Force cleanup (removes everything)
/claude-swarm:swarm-cleanup <team-name> --force
Provide comprehensive initial prompts:
/claude-swarm:swarm-spawn "backend-dev" "backend-developer" "sonnet" "You are the backend developer for team my-team. Your tasks: 1) Implement /api/users endpoint in src/api/users.ts, 2) Add database schema in migrations/. Current status: API routes defined, need implementation. Coordinate with frontend-dev for API contract. Check Task #3 for full requirements."
For detailed prevention techniques, consult references/recovery-procedures.md.
When debugging, these environment variables are set for spawned teammates:
| Variable | Description |
|---|---|
| CLAUDE_CODE_TEAM_NAME | Current team name |
| CLAUDE_CODE_AGENT_ID | Agent's unique UUID |
| CLAUDE_CODE_AGENT_NAME | Agent name (e.g., "backend-dev") |
| CLAUDE_CODE_AGENT_TYPE | Agent role type |
| CLAUDE_CODE_TEAM_LEAD_ID | Team lead's UUID |
| CLAUDE_CODE_AGENT_COLOR | Agent display color |
| KITTY_LISTEN_ON | Kitty socket path (kitty only) |
User-configurable:
| Variable | Description | Default |
|---|---|---|
| SWARM_MULTIPLEXER | Force "tmux" or "kitty" | Auto-detect |
| SWARM_KITTY_MODE | Kitty spawn mode | split |
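To see which multiplexer a session would likely use, the override and a plausible auto-detect fallback can be sketched as below. Only the SWARM_MULTIPLEXER override comes from the table above. The detection order is an assumption and may not match the plugin, and detect_multiplexer is a hypothetical helper.

```shell
# Hypothetical helper. Honors SWARM_MULTIPLEXER, then falls back to an
# ASSUMED auto-detect order (the plugin's real order is not documented here).
detect_multiplexer() {
    case "${SWARM_MULTIPLEXER:-}" in
        tmux|kitty) echo "$SWARM_MULTIPLEXER"; return 0 ;;
        '') ;;  # not set: fall through to auto-detection
        *) echo "Unsupported SWARM_MULTIPLEXER: $SWARM_MULTIPLEXER" >&2; return 1 ;;
    esac
    # Prefer the multiplexer this shell is already running inside, if any.
    if [ -n "${KITTY_PID:-}" ] && command -v kitten >/dev/null 2>&1; then
        echo kitty
    elif [ -n "${TMUX:-}" ] || command -v tmux >/dev/null 2>&1; then
        echo tmux
    elif command -v kitty >/dev/null 2>&1; then
        echo kitty
    else
        echo "Error: Neither tmux nor kitty is available" >&2
        return 1
    fi
}
```

Setting SWARM_MULTIPLEXER=tmux before calling detect_multiplexer prints tmux regardless of what is installed.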
| Issue | Quick Fix |
|---|---|
| Spawn fails | Run /claude-swarm:swarm-diagnose |
| Status mismatch | Run /claude-swarm:swarm-reconcile |
| Session crashed | Run /claude-swarm:swarm-resume |
| Messages not received | Verify agent name, check inbox |
| Invalid task ID | Run /claude-swarm:task-list to see IDs |
| Team creation fails | Check permissions, use valid name |
| Kitty socket not found | Check listen_on in kitty.conf, restart kitty |
| Cleanup incomplete | Use --force flag |
For detailed recovery and performance guidance, consult:
references/recovery-procedures.md - Comprehensive recovery strategies, performance troubleshooting, emergency procedures, and resource monitoring