Diagnostic and recovery guidance for swarm coordination issues. Use this skill when you encounter 'spawn failed', need to 'diagnose team', 'fix swarm', resolve 'status mismatch', perform 'recovery', troubleshoot kitty/tmux issues, or deal with session crashes, multiplexer problems, or teammate failures. Covers diagnostics, spawn failures, status mismatches, recovery procedures, and common error patterns.
/plugin marketplace add Und3rf10w/claude-litter
/plugin install und3rf10w-claude-swarm-plugins-claude-swarm@Und3rf10w/claude-litter
This skill inherits all available tools. When active, it can use any tool Claude has access to.
examples/spawn-failure-recovery.md
This skill provides comprehensive diagnostic and recovery procedures for swarm coordination issues.
# You try to spawn a teammate
/claude-swarm:swarm-spawn "backend-dev" "backend-developer" "sonnet" "..."
# Error: Could not find a valid kitty socket
# 1. Run diagnostics to identify the issue
/claude-swarm:swarm-diagnose my-team
# Output shows: kitty socket not found at expected location
# 2. Check kitty config
grep -E 'allow_remote_control|listen_on' ~/.config/kitty/kitty.conf
# 3. Fix: Add to kitty.conf if missing
# allow_remote_control yes
# listen_on unix:/tmp/kitty-$USER
# 4. Restart kitty completely and retry spawn
# 1. Check if teammates are actually alive
/claude-swarm:swarm-verify my-team
# Output: backend-dev: not found (session crashed)
# 2. Find status mismatches
/claude-swarm:swarm-reconcile my-team
# Output: backend-dev marked active but session missing - recommend removal
# 3. Resume the team (respawns offline members)
/claude-swarm:swarm-resume my-team
# After rebooting, team config shows active but all sessions are gone
# 1. Check current state
/claude-swarm:swarm-status my-team
# Shows: 3 members active, but multiplexer shows no sessions
# 2. Reconcile to auto-detect mismatches
/claude-swarm:swarm-reconcile my-team --auto-fix
# Automatically marks offline sessions as inactive
# 3. Resume team to respawn all members
/claude-swarm:swarm-resume my-team
Quick diagnostic rule: Always start with /claude-swarm:swarm-diagnose <team> - it runs all health checks and points you to the specific issue.
When using delegation mode (default), a spawned team-lead handles coordination. This affects how you troubleshoot.
| Issue Type | Who Should Diagnose | Commands |
|---|---|---|
| Team-lead unresponsive | You (orchestrator) | /swarm-diagnose, /swarm-status |
| Worker issues | Team-lead (first), then you | Ask team-lead to run /swarm-diagnose |
| Communication failures | Team-lead (first) | Ask team-lead to check and report |
| Task management issues | Team-lead | Team-lead manages tasks |
If team-lead is working, ask them to diagnose:
/claude-swarm:swarm-message team-lead "Please run /swarm-diagnose and report any issues"
# Or be more specific:
/claude-swarm:swarm-message team-lead "Worker backend-dev seems stuck. Can you verify they're alive and check their status?"
Why delegate diagnosis? Team-lead has full context of the team state and can both diagnose and fix issues directly.
If team-lead isn't responding, diagnose directly:
# 1. Check team status
/claude-swarm:swarm-status my-team
# 2. Is team-lead alive?
# Look for "team-lead" in status output - does window exist?
# 3. Run full diagnostics
/claude-swarm:swarm-diagnose my-team
# 4. If team-lead crashed, respawn them
/claude-swarm:swarm-reconcile my-team
/claude-swarm:swarm-spawn "team-lead" "team-lead" "sonnet" "You are the team-lead. Check /swarm-inbox for context. Resume coordination."
Intervene yourself when:
Let team-lead handle when:
# View raw team state (bypassing team-lead)
/claude-swarm:swarm-status my-team
/claude-swarm:task-list
# Diagnose directly
/claude-swarm:swarm-diagnose my-team
# Message workers directly (if team-lead down)
/claude-swarm:swarm-message backend-dev "Team-lead is unresponsive. What's your current status?"
# Broadcast to all (emergency)
/claude-swarm:swarm-broadcast "Team-lead is down. Please pause work and report status."
Swarm coordination involves multiple moving parts: multiplexers (tmux/kitty), Claude Code processes, file system state, and network communication. When issues arise, systematic diagnosis is essential.
First, identify the symptom category:
Always start with diagnostics before attempting fixes:
# Comprehensive health check - runs all diagnostics
/claude-swarm:swarm-diagnose <team-name>
# Check if teammates are actually alive
/claude-swarm:swarm-verify <team-name>
# Find and report status mismatches
/claude-swarm:swarm-reconcile <team-name>
# View current team state (members, tasks, multiplexer)
/claude-swarm:swarm-status <team-name>
What these commands check:
Issue Detected
│
├─ Can't spawn teammates?
│ └─ Run: /claude-swarm:swarm-diagnose <team>
│ ├─ "Multiplexer not found" → Install tmux/kitty
│ ├─ "Socket not found" → Check kitty config, restart kitty
│ ├─ "Duplicate name" → Use unique name or check existing teammates
│ └─ "Timeout" → Check system resources, retry
│
├─ Status shows teammates but they're not responding?
│ └─ Run: /claude-swarm:swarm-verify <team>
│ └─ Shows "not found" → Sessions crashed
│ └─ Run: /claude-swarm:swarm-reconcile <team>
│ └─ Then: /claude-swarm:swarm-resume <team>
│
├─ Messages not being received?
│ └─ Check: /claude-swarm:swarm-status <team>
│ ├─ Teammate shows "offline" → Respawn teammate
│ ├─ Wrong agent name used → Check exact names
│ └─ Teammate not checking inbox → Send reminder
│
└─ Task commands failing?
└─ Run: /claude-swarm:task-list
└─ Verify task ID exists, check status values
## Common Issues
### Spawn Failures
Spawn failures are the most common issue when creating swarm teams. Understanding the spawn process helps diagnose failures quickly.
**How spawning works**:
1. Validate team name and agent name (no path traversal, special chars)
2. Detect multiplexer (kitty or tmux)
3. For kitty: Find valid socket, create window with environment variables
4. For tmux: Create new session with environment variables
5. Launch Claude Code process with model and initial prompt
6. Register window/session and update config
7. Wait for Claude Code to become responsive
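The same flow can be sketched in shell for the tmux path. This is an illustrative approximation of the steps above, not the plugin's actual implementation; the exact claude invocation is assumed, while the session-name format and environment variables match those documented later in this guide.

```bash
# Rough sketch of the tmux spawn path (illustrative only)
TEAM="my-team"; AGENT="backend-dev"; MODEL="sonnet"; PROMPT="Implement the API"
SESSION="swarm-${TEAM}-${AGENT}"

# Steps 1-2: validate names and confirm a multiplexer exists
[[ "$AGENT" =~ ^[A-Za-z0-9_-]+$ ]] || { echo "invalid agent name"; exit 1; }
command -v tmux >/dev/null || { echo "tmux not installed"; exit 1; }

# Steps 4-5: create the session and launch Claude Code with identifying env vars
tmux new-session -d -s "$SESSION" \
  "env CLAUDE_CODE_TEAM_NAME='$TEAM' CLAUDE_CODE_AGENT_NAME='$AGENT' claude --model '$MODEL' '$PROMPT'"

# Steps 6-7: confirm the session exists before registering it in the team config
tmux has-session -t "$SESSION" 2>/dev/null && echo "spawned $SESSION"
```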
**Symptoms of spawn failure**:
- `spawn_teammate` or `/claude-swarm:swarm-spawn` returns error
- Error messages about multiplexer not found
- Session/window creation fails
- Timeout waiting for teammate to start
- Process starts but immediately crashes
**Immediate diagnostic steps**:
1. **Check error output** - The error message usually indicates root cause
2. **Run diagnostics**:
```bash
/claude-swarm:swarm-diagnose <team-name>
# For kitty users
kitten @ ls # Should list windows without error
# For tmux users
tmux list-sessions # Should list sessions without error
# Check Claude Code is working
claude --version # Should show version number
```
Troubleshooting workflow:
Spawn Command Fails
│
├─ Error mentions "multiplexer"?
│ └─ YES → See "Multiplexer Not Available" below
│
├─ Error mentions "socket"?
│ └─ YES → See "Kitty Socket Issues" below
│
├─ Error mentions "duplicate" or "already exists"?
│ └─ YES → See "Duplicate Agent Names" below
│
├─ Error mentions "timeout"?
│ └─ YES → See "Session Creation Timeout" below
│
├─ Error mentions "invalid" or "path traversal"?
│ └─ YES → See "Path Traversal Validation" below
│
└─ No clear error but spawn fails silently?
└─ Check: System resources, permissions, Claude Code installation
Common Causes:
Error:
Error: Neither tmux nor kitty is available
Solution:
# Install tmux (macOS)
brew install tmux
# Or install kitty
brew install --cask kitty
# Verify installation
which tmux # or: which kitty
Error:
Error: Agent name 'backend-dev' already exists in team
Solution:
# Use unique names
/claude-swarm:swarm-spawn "backend-dev-2" "backend-developer" "sonnet" "..."
# Or check existing teammates first
/claude-swarm:swarm-status <team-name>
Error (kitty):
Error: Could not find a valid kitty socket
Solution:
# 1. Verify kitty config has remote control enabled
grep -E 'allow_remote_control|listen_on' ~/.config/kitty/kitty.conf
# Should show:
# allow_remote_control yes
# listen_on unix:/tmp/kitty-$USER
# 2. Check socket exists (kitty appends -PID to path)
ls -la /tmp/kitty-$(whoami)-*
# 3. Test socket connectivity
kitten @ ls
# 4. Restart kitty completely if needed (not just reload)
# 5. Or manually set socket path
export KITTY_LISTEN_ON=unix:/tmp/kitty-$(whoami)-$KITTY_PID
Note: Kitty creates sockets at /tmp/kitty-$USER-$PID. The plugin auto-discovers the correct socket, but if you have multiple kitty instances, you may need to set KITTY_LISTEN_ON explicitly.
Deep dive on kitty socket discovery:
The spawn process tries sockets in this order:
1. $KITTY_LISTEN_ON environment variable (if set and valid)
2. /tmp/kitty-$USER-$KITTY_PID (exact match for the current kitty instance)
3. /tmp/kitty-$USER-* sockets (newest first)
4. /tmp/kitty-$USER (fallback)
5. /tmp/mykitty and /tmp/kitty (alternative locations)

Each socket is validated with kitten @ --to $socket ls before use. If validation fails, the search continues.
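A minimal sketch of that search order, using the same kitten validation check; this is assumed logic, not the plugin's exact code:

```bash
# Try candidate sockets in order and return the first one that answers (illustrative)
find_kitty_socket() {
  local candidates=() sock
  [[ -n "$KITTY_LISTEN_ON" ]] && candidates+=("${KITTY_LISTEN_ON#unix:}")
  [[ -n "$KITTY_PID" ]] && candidates+=("/tmp/kitty-$USER-$KITTY_PID")
  candidates+=( $(ls -t /tmp/kitty-"$USER"-* 2>/dev/null) "/tmp/kitty-$USER" /tmp/mykitty /tmp/kitty )
  for sock in "${candidates[@]}"; do
    if kitten @ --to "unix:$sock" ls >/dev/null 2>&1; then
      echo "unix:$sock"
      return 0
    fi
  done
  return 1
}

find_kitty_socket || echo "No valid kitty socket found"
```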
Multiple kitty instances troubleshooting:
If you have multiple kitty windows open:
# List all kitty sockets
ls -la /tmp/kitty-$(whoami)-*
# Example output:
# /tmp/kitty-user-12345 (kitty window 1)
# /tmp/kitty-user-67890 (kitty window 2)
# Test each socket
kitten @ --to unix:/tmp/kitty-user-12345 ls
kitten @ --to unix:/tmp/kitty-user-67890 ls
# Set the correct socket for your team-lead window
export KITTY_LISTEN_ON=unix:/tmp/kitty-$(whoami)-$KITTY_PID
Configuration file location varies:
- Linux: ~/.config/kitty/kitty.conf
- macOS: ~/.config/kitty/kitty.conf or ~/Library/Preferences/kitty/kitty.conf
- To locate the active config: kitty --debug-config | grep "Config file"

Common kitty config issues:
- listen_on must include the unix: prefix: listen_on unix:/path, not listen_on /path

Example working kitty.conf:
# ~/.config/kitty/kitty.conf
allow_remote_control yes
listen_on unix:/tmp/kitty-$USER
# Note: $USER expands at kitty startup, then -$PID is appended automatically
Socket permission issues:
# Check socket permissions
ls -la /tmp/kitty-$(whoami)-*
# Should show: srw------- (socket, owner read-write only)
# If permissions are wrong:
# 1. Kill kitty completely
# 2. Remove old sockets: rm /tmp/kitty-$(whoami)-*
# 3. Restart kitty (will recreate with correct permissions)
Error:
Error: Invalid team name (path traversal detected)
Solution:
# Use simple team names without special characters
# Good: "auth-team", "feature-x", "bugfix_123"
# Bad: "../other-team", "team/name", "team..name"
Error:
Error: Timeout waiting for teammate session to start
Solution:
# Retry once (may be transient)
/claude-swarm:swarm-spawn "agent-name" ...
# Check system resources
top # Look for high CPU/memory usage
# Verify multiplexer is responsive
tmux list-sessions # or: kitty @ ls
Recovery Steps:
Symptoms:
Diagnosis:
/claude-swarm:swarm-reconcile <team-name>
This will report:
Common Causes:
Detection:
# Config shows active, but session doesn't exist
/claude-swarm:swarm-verify <team-name>
# Output: "Error: Session swarm-team-agent not found"
Solution:
# Run reconcile to update status
/claude-swarm:swarm-reconcile <team-name>
# Respawn the teammate
/claude-swarm:swarm-spawn "agent-name" "agent-type" "model" "prompt"
# Or resume the team (respawns all offline)
/claude-swarm:swarm-resume <team-name>
Detection: User manually killed tmux/kitty session outside of cleanup command
Solution:
# Reconcile will detect and fix
/claude-swarm:swarm-reconcile <team-name>
# Respawn if needed
/claude-swarm:swarm-spawn "agent-name" ...
Detection: Sessions killed but config files remain
Solution:
# Run cleanup properly
/claude-swarm:swarm-cleanup <team-name> --force
# Or manually remove config
rm ~/.claude/teams/<team-name>/config.json
Symptoms:
Diagnosis:
# Check team status
/claude-swarm:swarm-status <team-name>
# Verify teammate is alive
/claude-swarm:swarm-verify <team-name>
# Check inbox manually
cat ~/.claude/teams/<team-name>/inboxes/<agent-name>.json
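The inbox file is a JSON array (an empty inbox is []), so jq gives a quick read on whether messages are actually queued; no assumptions are made here about the individual message fields.

```bash
# How many messages are waiting in the inbox?
jq 'length' ~/.claude/teams/<team-name>/inboxes/<agent-name>.json

# Show the most recent message object, whatever fields it has
jq '.[-1]' ~/.claude/teams/<team-name>/inboxes/<agent-name>.json
```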
Common Causes:
Solution:
- Remind teammates to check /claude-swarm:swarm-inbox regularly

Error:
Error: Agent 'backend' not found in team
Solution:
# Check exact agent names
/claude-swarm:swarm-status <team-name>
# Use exact name from status output
/claude-swarm:swarm-message "backend-dev" "message" # Not "backend"
Symptoms: Inbox command fails or shows garbled output
Solution:
# Back up current inbox
cp ~/.claude/teams/<team-name>/inboxes/<agent>.json ~/.claude/teams/<team-name>/inboxes/<agent>.json.bak
# Reset inbox
echo '[]' > ~/.claude/teams/<team-name>/inboxes/<agent>.json
# Notify sender to resend messages
Symptoms:
Diagnosis:
# View current tasks
/claude-swarm:task-list
# Check task file directly
cat ~/.claude/tasks/<team-name>/tasks.json
Common Causes:
Error:
Error: Task #99 not found
Solution:
# List tasks to see valid IDs
/claude-swarm:task-list
# Use correct ID from list
/claude-swarm:task-update 3 --status "in-progress"
Error:
Error: Invalid status 'done'
Solution:
# Use valid status values:
# - pending
# - in-progress
# - blocked
# - in-review
# - completed
/claude-swarm:task-update 3 --status "completed" # Not "done"
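If you script task updates, a small guard against invalid status values avoids this error entirely; the check below is illustrative, not part of the plugin:

```bash
# Only call task-update when the status is one of the valid values
STATUS="completed"
case "$STATUS" in
  pending|in-progress|blocked|in-review|completed)
    /claude-swarm:task-update 3 --status "$STATUS"
    ;;
  *)
    echo "Invalid status: $STATUS (use pending, in-progress, blocked, in-review, or completed)"
    ;;
esac
```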
Error:
Error: Agent 'frontend' not found in team
Solution:
# Check exact agent names
/claude-swarm:swarm-status <team-name>
# Use exact name
/claude-swarm:task-update 3 --assign "frontend-dev"
Symptoms:
Diagnosis:
# Check if team directory exists
ls -la ~/.claude/teams/<team-name>/
# Check permissions
ls -la ~/.claude/teams/
Common Causes:
Error:
Error: Team 'my-team' already exists
Solution:
# Choose different name
/claude-swarm:swarm-create "my-team-2" "description"
# Or cleanup old team first
/claude-swarm:swarm-cleanup "my-team" --force
Error:
Error: Permission denied creating ~/.claude/teams/my-team/
Solution:
# Fix permissions on Claude directory
chmod 700 ~/.claude/
chmod 700 ~/.claude/teams/
# Retry creation
/claude-swarm:swarm-create "my-team" "description"
Error:
Error: Invalid team name
Solution:
# Use alphanumeric with hyphens/underscores
# Good: "feature-auth", "bugfix_123", "team2"
# Bad: "../team", "team name", "team/123"
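If team names come from a script or user input, a pre-check mirroring these rules catches bad names before swarm-create fails; this is an illustrative guard, not the plugin's own validation:

```bash
# Accept only alphanumeric characters, hyphens, and underscores
TEAM="feature-auth"
if [[ "$TEAM" =~ ^[A-Za-z0-9_-]+$ ]]; then
  /claude-swarm:swarm-create "$TEAM" "description"
else
  echo "Invalid team name: $TEAM"
fi
```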
Choosing the right recovery strategy depends on the severity of the issue, how much work would be lost, and whether the team can continue working. This section provides decision-making guidance for recovery scenarios.
Problem Diagnosed
│
├─ Are teammates still working successfully?
│ └─ YES → Use Soft Recovery (minimal disruption)
│ ├─ 1-2 teammates offline → Respawn just those teammates
│ ├─ Status mismatch only → Run reconcile
│ └─ Communication issue → Fix inbox, notify teammates
│
├─ Is critical work in progress?
│ └─ YES → Evaluate data loss risk
│ ├─ Work saved to files/commits? → Safe to use Hard Recovery
│ ├─ Work only in memory/history? → Try Partial Recovery first
│ └─ Uncertain? → Ask teammates to save work first
│
├─ Is the team completely non-functional?
│ └─ YES → Assess what can be salvaged
│ ├─ Tasks/config readable? → Use Partial Recovery
│ ├─ Files corrupted? → Use Hard Recovery
│ └─ Everything broken? → Nuclear option (full reset)
│
└─ Is this a persistent/recurring issue?
└─ YES → After recovery, investigate root cause
├─ Check system resources (disk, memory, CPU)
├─ Review multiplexer logs
└─ Consider reducing team size
When to use:
What's preserved:
What's affected:
Step-by-step soft recovery:
/claude-swarm:swarm-status <team-name>
# Look for members showing "no window" with config "active"
/claude-swarm:swarm-reconcile <team-name>
# This marks offline sessions as offline in config
# Option A: Respawn specific teammate
/claude-swarm:swarm-spawn "agent-name" "agent-type" "model" "Continue where you left off: [context]"
# Option B: Resume entire team (respawns all offline)
/claude-swarm:swarm-resume <team-name>
/claude-swarm:swarm-verify <team-name>
# All teammates should show as active
# Via bash function
source "${CLAUDE_PLUGIN_ROOT}/lib/swarm-utils.sh" 1>/dev/null
broadcast_message "<team-name>" "Recovery complete. Team member [name] has been respawned. Continue your work."
Example soft recovery scenario:
Situation: 5-teammate team, 2 teammates crashed mid-work
1. $ /claude-swarm:swarm-status my-team
Output shows:
- team-lead: active (you)
- frontend-dev: active ✓
- backend-dev: active ✗ (no window)
- tester: active ✗ (no window)
- reviewer: active ✓
2. $ /claude-swarm:swarm-reconcile my-team
Output:
- Marked backend-dev as offline
- Marked tester as offline
3. $ /claude-swarm:swarm-resume my-team
Output:
- Respawning: backend-dev
- Respawning: tester
- Both spawned successfully
4. $ /claude-swarm:swarm-verify my-team
Output: All teammates active ✓
5. Message team: "backend-dev and tester were respawned after crash. Please continue your assigned tasks."
Result: Team back to full capacity in ~60 seconds, no data lost
When to use:
What's lost:
What's preserved:
Before hard recovery checklist:
# 1. Save task list for reference
/claude-swarm:task-list > tasks-backup.txt
# 2. Check for uncommitted work
git status
# 3. Ask teammates to commit their work (if any are responsive)
/claude-swarm:swarm-message "backend-dev" "Commit your work immediately, team restart needed"
# 4. Back up configs (optional)
cp ~/.claude/teams/<team-name>/config.json ~/config-backup.json
# 5. Document current state
/claude-swarm:swarm-status <team-name> > status-backup.txt
Step-by-step hard recovery:
/claude-swarm:swarm-cleanup <team-name> --force
# Check no sessions remain
tmux list-sessions | grep <team-name> # Should be empty
# or for kitty:
kitten @ ls | grep swarm-<team-name> # Should be empty
# Check team directory
ls ~/.claude/teams/<team-name>/
# Should not exist if --force was used
/claude-swarm:swarm-create <team-name> "Team description"
# Recreate each task manually
/claude-swarm:task-create "Implement API endpoints" "Full description..."
/claude-swarm:task-create "Write unit tests" "Test coverage for..."
# ... repeat for all tasks
/claude-swarm:swarm-spawn "backend-dev" "backend-developer" "sonnet" "You are the backend developer. Focus on: [task details]"
/claude-swarm:swarm-spawn "frontend-dev" "frontend-developer" "sonnet" "You are the frontend developer. Focus on: [task details]"
# ... repeat for all teammates
/claude-swarm:task-update 1 --assign "backend-dev"
/claude-swarm:task-update 2 --assign "frontend-dev"
/claude-swarm:swarm-verify <team-name>
/claude-swarm:swarm-status <team-name>
Timeline: Hard recovery typically takes 5-10 minutes for a 5-teammate team.
When to use:
Techniques:
When: Inbox file corrupted, messages malformed, inbox command errors
# Back up current inbox first
cp ~/.claude/teams/<team-name>/inboxes/<agent>.json ~/.claude/teams/<team-name>/inboxes/<agent>.json.bak
# Reset to empty inbox
echo '[]' > ~/.claude/teams/<team-name>/inboxes/<agent>.json
# Verify format
cat ~/.claude/teams/<team-name>/inboxes/<agent>.json
# Should output: []
# Notify affected teammate
/claude-swarm:swarm-message "<agent>" "Your inbox was reset due to corruption. Please check your backup if you need message history."
When: Task file has invalid status, corrupted JSON, missing fields
# Back up task file
cp ~/.claude/tasks/<team-name>/<id>.json ~/.claude/tasks/<team-name>/<id>.json.bak
# Fix manually with jq
jq '.status = "in-progress"' ~/.claude/tasks/<team-name>/<id>.json > /tmp/task-fixed.json
mv /tmp/task-fixed.json ~/.claude/tasks/<team-name>/<id>.json
# Or edit directly
# Edit the JSON file to fix the issue
# Verify task is valid
cat ~/.claude/tasks/<team-name>/<id>.json | jq '.'
# Should output valid JSON
When: One teammate crashed, others working fine
# 1. Check teammate is really offline
/claude-swarm:swarm-verify <team-name>
# 2. Update their status
/claude-swarm:swarm-reconcile <team-name>
# 3. Check their assigned tasks
/claude-swarm:task-list
# Note which tasks were assigned to this teammate
# 4. Respawn with context
/claude-swarm:swarm-spawn "<agent-name>" "<agent-type>" "<model>" "You crashed mid-work. Resume: [describe what they were doing, which files they were editing, what tasks to continue]"
# 5. Reassign their tasks
/claude-swarm:task-update <task-id> --assign "<agent-name>"
/claude-swarm:task-update <task-id> --comment "Teammate respawned, resuming work"
# 6. Notify teammate of their context
/claude-swarm:swarm-message "<agent-name>" "You were working on: [specific context]. Check Task #<id> for details."
When: Config shows wrong status, but files and sessions are fine
# Use reconcile for automatic fixing
/claude-swarm:swarm-reconcile <team-name> --auto-fix
# Or manual fix if you know the issue
# Edit config.json directly:
# 1. Back up: cp ~/.claude/teams/<team-name>/config.json ~/config-backup.json
# 2. Edit: jq '(.members[] | select(.name == "agent-name")) |= (.status = "active")' ~/.claude/teams/<team-name>/config.json > /tmp/config-fixed.json
# 3. Replace: mv /tmp/config-fixed.json ~/.claude/teams/<team-name>/config.json
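After any manual edit, confirm the file still parses before relying on it, then let reconcile re-check it against live sessions:

```bash
# Confirm the edited config is still valid JSON, then reconcile
jq '.' ~/.claude/teams/<team-name>/config.json >/dev/null && echo "config parses OK"
/claude-swarm:swarm-reconcile <team-name>
```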
| Symptom | Data Loss Risk | Recommended Strategy | Recovery Time |
|---|---|---|---|
| 1 teammate offline | None | Soft (respawn one) | 30 seconds |
| Multiple offline | None | Soft (resume team) | 1-2 minutes |
| Status mismatch only | None | Soft (reconcile) | 10 seconds |
| Inbox corruption | Messages lost | Partial (reset inbox) | 30 seconds |
| Task file corrupt | Comments lost | Partial (fix task) | 1-2 minutes |
| Config corrupt | History lost | Hard (recreate) | 5-10 minutes |
| Everything broken | All lost | Hard (full reset) | 10-15 minutes |
| Persistent failures | Depends | Diagnose root cause first | Varies |
Some issues require more than recovery:
Signs you need to investigate deeper:
Investigation steps:
# Check system resources
top
# Look for: high CPU usage, low free memory, swap usage
# Check disk space
df -h ~/.claude
# Ensure adequate free space (>1GB recommended)
# Check file descriptor limits
ulimit -n
# Should be >=256, ideally >=1024
# Check for zombie processes
ps aux | grep claude
# Kill any orphaned Claude Code processes
# Review system logs
# macOS: Console.app, filter for "claude" or "kitty"
# Linux: journalctl --user | grep claude
Prevention is significantly easier than recovery. Following these practices reduces issues by 80-90%.
Why this matters: Spawn failures may not be immediately obvious. A teammate might appear to spawn successfully but crash seconds later, or spawn without proper environment variables set.
Verification workflow:
# After spawning team, ALWAYS verify
/claude-swarm:swarm-verify <team-name>
# Expected output for healthy team:
# Verifying team 'my-team'...
# ✓ team-lead (team-lead) - session active
# ✓ backend-dev (backend-developer) - session active
# ✓ frontend-dev (frontend-developer) - session active
# All teammates verified successfully!
# Check detailed status
/claude-swarm:swarm-status <team-name>
What to look for:
If verification fails immediately after spawn:
# Wait 5-10 seconds for Claude Code to fully initialize
sleep 10
/claude-swarm:swarm-verify <team-name>
# If still failing, check what's wrong
/claude-swarm:swarm-diagnose <team-name>
Slash commands have built-in validation, error handling, and safer parameter parsing compared to direct bash function calls.
Comparison:
# Slash command (RECOMMENDED)
/claude-swarm:swarm-spawn "backend-dev" "backend-developer" "sonnet" "Implement API"
# Direct bash function (AVOID unless necessary)
source "${CLAUDE_PLUGIN_ROOT}/lib/swarm-utils.sh" 1>/dev/null
spawn_teammate "team" "backend-dev" "backend-developer" "sonnet" "Implement API"
Slash command advantages:
When bash functions are acceptable:
Never retry blindly - understand why it failed first:
# BAD: Blind retry loop
for i in {1..5}; do
/claude-swarm:swarm-spawn "agent" "worker" "sonnet" "prompt" && break
done
# GOOD: Diagnose then fix
if ! /claude-swarm:swarm-spawn "agent" "worker" "sonnet" "prompt"; then
echo "Spawn failed, diagnosing..."
/claude-swarm:swarm-diagnose <team-name>
# Read diagnostic output, fix the issue, then retry once
# Example: Install missing multiplexer, fix socket, etc.
# Retry after fix
/claude-swarm:swarm-spawn "agent" "worker" "sonnet" "prompt"
fi
Error handling best practices:
if ! /claude-swarm:swarm-spawn "agent" "worker" "sonnet" "prompt" 2> spawn-error.log; then
echo "Spawn failed. Error log:"
cat spawn-error.log
# Now you have error details for debugging
fi
# Don't wait forever for unresponsive operations
timeout 30s /claude-swarm:swarm-verify <team-name>
# Before spawning a team, check prerequisites
source "${CLAUDE_PLUGIN_ROOT}/lib/swarm-utils.sh" 1>/dev/null  # provides detect_multiplexer
if [[ "$(detect_multiplexer)" == "none" ]]; then
echo "Error: No multiplexer available. Install tmux or kitty first."
exit 1
fi
For long-running teams (multiple hours or days), periodic health checks prevent gradual degradation.
Recommended check frequency:
Health check script:
#!/bin/bash
# save as: health-check.sh
TEAM="$1"
echo "=== Health Check: $TEAM ==="
echo ""
# Check for status drift
echo "Checking for status mismatches..."
/claude-swarm:swarm-reconcile "$TEAM"
# Verify all teammates
echo ""
echo "Verifying teammate sessions..."
/claude-swarm:swarm-verify "$TEAM"
# Check task progress
echo ""
echo "Task summary..."
/claude-swarm:task-list | grep -E "in-progress|blocked"
# Done
echo ""
echo "Health check complete!"
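Run it by making the script executable and passing the team name as the first argument:

```bash
chmod +x health-check.sh
./health-check.sh my-team
```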
Automated monitoring (for critical/long-running teams):
# Add to cron or run in background
while true; do
/claude-swarm:swarm-verify <team-name> || {
echo "Health check failed at $(date)"
/claude-swarm:swarm-diagnose <team-name>
# Send notification, page on-call, etc.
}
sleep 900 # Check every 15 minutes
done
Why proper cleanup matters:
Cleanup best practices:
# Standard cleanup (safe, preserves files for reference)
/claude-swarm:swarm-cleanup <team-name>
# This kills sessions but leaves:
# - Config files
# - Task files
# - Inbox files
# - Logs
# Force cleanup (removes everything)
/claude-swarm:swarm-cleanup <team-name> --force
# This kills sessions AND removes:
# - ~/.claude/teams/<team-name>/
# - ~/.claude/tasks/<team-name>/
When to use each:
What NOT to do:
# NEVER manually delete while sessions are running
rm -rf ~/.claude/teams/<team-name>/ # Leaves orphaned sessions!
# NEVER kill sessions without cleanup
tmux kill-session -t swarm-<team>-<agent> # Leaves config!
# ALWAYS use cleanup commands
/claude-swarm:swarm-cleanup <team-name>
Cleanup verification:
# After cleanup, verify nothing remains
tmux list-sessions | grep <team-name> # Should be empty
ls ~/.claude/teams/<team-name>/ # Should not exist (if --force used)
Why monitoring matters: Large teams (5+ teammates) can consume significant resources. Each Claude Code process uses:
Resource monitoring:
# Check total Claude Code memory usage
ps aux | grep claude | awk '{sum+=$4} END {print "Total memory: " sum "%"}'
# Count active Claude processes
ps aux | grep claude | wc -l
# Check file descriptor usage
lsof -p "$(pgrep -d, claude)" | wc -l
# Monitor system load
uptime
# Load average should be below CPU core count
Resource limits:
| Team Size | RAM Needed | Recommended System |
|---|---|---|
| 2-3 teammates | 2-3 GB | 8GB RAM minimum |
| 4-6 teammates | 3-5 GB | 16GB RAM recommended |
| 7-10 teammates | 6-8 GB | 32GB RAM recommended |
| 10+ teammates | 10+ GB | Not recommended without testing |
When to scale back:
# Reduce team size gracefully
# 1. Finish critical tasks
# 2. Have teammates commit work
# 3. Kill non-essential teammates
/claude-swarm:swarm-cleanup <team-name> # Only kills sessions for specific agents
# 4. Consolidate work across fewer teammates
Problem: Respawned teammates don't know what they were doing
Solution: Provide comprehensive initial prompts
Bad initial prompt:
/claude-swarm:swarm-spawn "backend-dev" "backend-developer" "sonnet" "Work on the backend"
Good initial prompt:
/claude-swarm:swarm-spawn "backend-dev" "backend-developer" "sonnet" "You are the backend developer for team my-team. Your tasks: 1) Implement /api/users endpoint in src/api/users.ts, 2) Add database schema in migrations/. Current status: API routes defined, need implementation. Coordinate with frontend-dev for API contract. Check Task #3 for full requirements."
Initial prompt template:
You are the [ROLE] for team [TEAM_NAME].
Your assigned tasks:
1. [TASK_1] - [STATUS]
2. [TASK_2] - [STATUS]
Current state:
- [What's done]
- [What's in progress]
- [What's blocked/dependencies]
Key files:
- [FILE_1]: [Description]
- [FILE_2]: [Description]
Coordinate with:
- [TEAMMATE_1]: [for what]
- [TEAMMATE_2]: [for what]
First action: [Specific next step]
For teams lasting >1 hour, document the architecture:
# Create team docs
cat > ~/.claude/teams/<team-name>/README.md <<EOF
# Team: <team-name>
## Purpose
[What this team is building]
## Members
- team-lead: Orchestration, task assignment
- backend-dev: API implementation, database
- frontend-dev: UI components, styling
- tester: Test coverage, QA
## Task Breakdown
- Task #1: [Description] - assigned to backend-dev
- Task #2: [Description] - assigned to frontend-dev
- Task #3: [Description] - assigned to tester
## Dependencies
- Task #2 depends on Task #1 (API contract)
- Task #3 depends on Task #1, #2 (working features)
## Recovery Notes
- If backend-dev crashes: They were editing src/api/, check git status
- If frontend-dev crashes: They were in src/components/, state in localStorage
EOF
This documentation is invaluable for recovery scenarios.
Symptoms:
Diagnosis:
# Check Claude Code process resource usage
ps aux | grep claude | sort -k3 -r # Sort by CPU%
ps aux | grep claude | sort -k4 -r # Sort by memory%
# Check individual teammate resource usage
# Find PID of specific teammate:
ps aux | grep "CLAUDE_CODE_AGENT_NAME=backend-dev"
# Monitor live resource usage
top -pid $(pgrep -f "CLAUDE_CODE_AGENT_NAME=backend-dev")
Common causes and solutions:
# Solution: Reduce team size, use lighter models
# Replace opus with sonnet, sonnet with haiku for non-critical tasks
/claude-swarm:swarm-spawn "tester" "tester" "haiku" "Run existing tests"
# Solution: Periodic restarts for long-lived teammates (>4 hours)
# 1. Ask teammate to commit work
# 2. Kill and respawn
# 3. Reassign tasks
# Check disk I/O
iostat -x 1 5 # Run 5 samples, 1 second apart
# Look for high %util on disk with ~/.claude
# Solution: Move ~/.claude to faster disk (SSD)
# Or reduce concurrent file operations
Kitty slowness:
# Check kitty window count
kitten @ ls | jq '[.[].tabs[].windows[]] | length'
# If >50 windows total, kitty may slow down
# Solution: Use SWARM_KITTY_MODE=os-window for separate processes
export SWARM_KITTY_MODE=os-window
/claude-swarm:swarm-spawn ...
Tmux slowness:
# Check tmux session count
tmux list-sessions | wc -l
# If >20 sessions, consider cleanup
# Solution: Clean up old swarm sessions
for session in $(tmux list-sessions -F '#{session_name}' | grep swarm-); do
# Check if session is active in a team
# If not, kill it
tmux kill-session -t "$session"
done
Symptoms:
Solutions:
# 1. Reduce team size to stay under rate limits
# 2. Stagger teammate spawning (wait 10s between spawns)
for agent in backend frontend tester; do
/claude-swarm:swarm-spawn "$agent" ...
sleep 10
done
# 3. Use haiku model for lightweight tasks (lower API load)
/claude-swarm:swarm-spawn "tester" "tester" "haiku" "Run unit tests"
Teammate completely frozen:
# 1. Find the teammate's process
ps aux | grep "CLAUDE_CODE_AGENT_NAME=backend-dev"
# 2. Send SIGTERM (graceful shutdown)
kill <PID>
# 3. If still frozen after 30s, force kill
kill -9 <PID>
# 4. Clean up and respawn
/claude-swarm:swarm-reconcile <team-name>
/claude-swarm:swarm-spawn "backend-dev" ...
Multiplexer frozen:
# Kitty frozen
# 1. Try sending command
kitten @ ls
# If hangs, kill kitty: killall kitty
# Tmux frozen
# 1. Try listing sessions
tmux list-sessions
# If hangs, kill tmux server: tmux kill-server
When to use: Everything is completely broken, no recovery methods work, starting over is the only option.
WARNING: This destroys ALL team data across ALL teams. Only use as absolute last resort.
What gets destroyed:
Before nuking:
# 1. Save what you can
tar -czf ~/swarm-backup-$(date +%Y%m%d-%H%M%S).tar.gz ~/.claude/teams/ ~/.claude/tasks/
# 2. Document current state
/claude-swarm:swarm-list-teams > ~/teams-backup.txt
for team in $(cat ~/teams-backup.txt); do
/claude-swarm:swarm-status "$team" > ~/${team}-status.txt
/claude-swarm:task-list >> ~/${team}-tasks.txt
done
# 3. Notify any responsive teammates
# (They'll lose their work context)
Full reset procedure:
# 1. Kill all swarm sessions
tmux kill-server # Kills ALL tmux sessions
# or for kitty:
for window in $(kitten @ ls | jq -r '.[].tabs[].windows[] | select(.user_vars | keys[] | startswith("swarm_")) | .id'); do
kitten @ close-window --match "id:$window"
done
# 2. Remove all swarm data
rm -rf ~/.claude/teams/
rm -rf ~/.claude/tasks/
# 3. Verify cleanup
ls ~/.claude/teams/ # Should not exist
ls ~/.claude/tasks/ # Should not exist
# 4. Recreate directories with proper permissions
mkdir -p ~/.claude/teams/
mkdir -p ~/.claude/tasks/
chmod 700 ~/.claude/teams/
chmod 700 ~/.claude/tasks/
# 5. Start fresh with new team
/claude-swarm:swarm-create "new-team" "Fresh start after full reset"
# 6. Verify clean state
/claude-swarm:swarm-status "new-team"
After nuclear reset:
Recovery timeline: 15-30 minutes to rebuild team from scratch.
For deep investigation:
# List all tmux sessions
tmux list-sessions
# Attach to specific teammate session (view their work)
tmux attach-session -t swarm-<team>-<agent>
# Check socket status
ls -la ~/.claude/sockets/
# View raw config
cat ~/.claude/teams/<team-name>/config.json
# View raw tasks
cat ~/.claude/tasks/<team-name>/tasks.json
# View raw inbox
cat ~/.claude/teams/<team-name>/inboxes/<agent>.json
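A compact jq view of the raw config makes member status easier to scan. The .members, .name, and .status fields match the reconcile example earlier in this guide, but treat them as assumptions about the config layout:

```bash
# One line per member: name and recorded status
jq -r '.members[] | "\(.name): \(.status)"' ~/.claude/teams/<team-name>/config.json
```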
When debugging, these environment variables are set for spawned teammates:
| Variable | Description |
|---|---|
| CLAUDE_CODE_TEAM_NAME | Current team name |
| CLAUDE_CODE_AGENT_ID | Agent's unique UUID |
| CLAUDE_CODE_AGENT_NAME | Agent name (e.g., "backend-dev") |
| CLAUDE_CODE_AGENT_TYPE | Agent role type |
| CLAUDE_CODE_TEAM_LEAD_ID | Team lead's UUID |
| CLAUDE_CODE_AGENT_COLOR | Agent display color |
| KITTY_LISTEN_ON | Kitty socket path (kitty only) |
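From inside a spawned teammate's session, these variables can be checked directly, which is a quick way to confirm the spawn set them:

```bash
# Print swarm-related environment variables in the current session
env | grep -E '^(CLAUDE_CODE_|SWARM_|KITTY_LISTEN_ON)'
echo "Agent: ${CLAUDE_CODE_AGENT_NAME:-unset}, Team: ${CLAUDE_CODE_TEAM_NAME:-unset}"
```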
User-configurable:
| Variable | Description | Default |
|---|---|---|
| SWARM_MULTIPLEXER | Force "tmux" or "kitty" | Auto-detect |
| SWARM_KITTY_MODE | Kitty spawn mode | split |
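For example, to force the tmux backend even when kitty is available, export the override before spawning:

```bash
# Force tmux for all subsequent spawns in this session
export SWARM_MULTIPLEXER=tmux
/claude-swarm:swarm-spawn "backend-dev" "backend-developer" "sonnet" "..."
```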
| Issue | Quick Fix |
|---|---|
| Spawn fails | Run /claude-swarm:swarm-diagnose |
| Status mismatch | Run /claude-swarm:swarm-reconcile |
| Session crashed | Run /claude-swarm:swarm-resume |
| Messages not received | Verify agent name, check inbox |
| Invalid task ID | Run /claude-swarm:task-list to see IDs |
| Team creation fails | Check permissions, use valid name |
| Kitty socket not found | Check listen_on in kitty.conf, restart kitty |
| Cleanup incomplete | Use --force flag |
For more detailed information, see the error-handling reference documentation.