From agent-dashboard
Root cause analysis for process crashes, server deaths, and unexplained system failures using macOS logs, session forensics, and code tracing
npx claudepluginhub bjornjee/agent-dashboard --plugin agent-dashboardThis skill uses the workspace's default tool permissions.
Root cause analysis for a system-level failure. **Gather ALL evidence before reasoning about the cause.**
Creates isolated Git worktrees for feature branches with prioritized directory selection, gitignore safety checks, auto project setup for Node/Python/Rust/Go, and baseline verification.
Executes implementation plans in current session by dispatching fresh subagents per independent task, with two-stage reviews: spec compliance then code quality.
Dispatches parallel agents to independently tackle 2+ tasks like separate test failures or subsystems without shared state or dependencies.
Root cause analysis for a system-level failure. Gather ALL evidence before reasoning about the cause.
Incident description: $ARGUMENTS
Follow these phases strictly in order. Do NOT speculate or reason about root cause until Phase 5. Every phase has a gate.
Parse the incident description — what died? (process, tmux server, container, service, etc.)
Establish the time window: ask the user or derive from context when the failure was noticed and when things last worked.
Identify what was running at the time — check:
tmux list-sessions / tmux list-panes -a (if tmux is back up)tail -500 ~/.zsh_history | while IFS= read -r line; do
if echo "$line" | grep -q "^: [0-9]"; then
ts=$(echo "$line" | sed 's/^: \([0-9]*\):.*/\1/')
cmd=$(echo "$line" | sed 's/^: [0-9]*:[0-9]*;//')
dt=$(date -r "$ts" "+%Y-%m-%d %H:%M:%S" 2>/dev/null)
echo "$dt $cmd"
fi
done
Identify all active agents/processes — check Claude Code session logs:
ls -lt ~/.claude/projects/<project-dir>/*.jsonl | head -15
Cross-reference modification times with the incident window.
Gate: Time window is established. List of active processes/sessions at the time is known.
Run ALL of the following. Do not skip any — even if you expect empty results, the absence of evidence is evidence.
log show --start "<start>" --end "<end>" --style compact \
--predicate 'process == "<target-process>"'
Replace <target-process> with the crashed process (e.g., "tmux", "tmux-server", "agent-dashboard"). Run for each relevant process name.
log show --start "<start>" --end "<end>" --style compact \
--predicate 'subsystem == "com.apple.AMFI" OR eventMessage CONTAINS "AMFI" OR eventMessage CONTAINS "code signature" OR eventMessage CONTAINS "CMS blob"'
log show --start "<start>" --end "<end>" --style compact \
--predicate 'eventMessage CONTAINS "SIGKILL" OR eventMessage CONTAINS "SIGTERM" OR eventMessage CONTAINS "SIGHUP" OR eventMessage CONTAINS "taskgated"'
log show --start "<start>" --end "<end>" --style compact \
--predicate 'category == "jetsam" OR eventMessage CONTAINS "memorystatus" OR eventMessage CONTAINS "jetsam"'
log show --start "<start>" --end "<end>" --style compact \
--predicate 'subsystem == "com.apple.powerd" OR eventMessage CONTAINS "Wake reason" OR eventMessage CONTAINS "DarkWake" OR eventMessage CONTAINS "caffeinate"'
log show --start "<start>" --end "<end>" --style compact \
--predicate 'sender == "kernel"' | head -50
find ~/Library/Logs/DiagnosticReports /Library/Logs/DiagnosticReports \
-name "*<process-name>*" -newer <reference-file> 2>/dev/null
Check for any crash/diagnostic files the application writes (e.g., crash.log, debug-keys.log, state files).
Gate: All 8 log categories checked. Results recorded with exact timestamps. Note which categories returned empty.
For each Claude Code session active during the incident window:
Extract all Bash tool calls — these are the commands Claude actually executed:
python3 << 'PYEOF'
import json, os
fpath = "<session-jsonl-path>"
with open(fpath) as f:
for line in f:
try:
obj = json.loads(line)
if obj.get('type') == 'assistant':
content = obj.get('message', {}).get('content', [])
if isinstance(content, list):
for block in content:
if isinstance(block, dict) and block.get('type') == 'tool_use' and block.get('name') == 'Bash':
cmd = block.get('input', {}).get('command', '')
print(f'CMD: {cmd[:400]}')
print()
except:
pass
PYEOF
Flag dangerous commands — search for:
kill, pkill, killall (process termination)tmux kill-* (tmux destruction)rm -rf, git clean, git reset --hard (destructive ops)signal, SIGKILL, SIGTERM (signal sending)Extract subagent launches — check for Agent tool calls, especially background agents:
# Same pattern but filter for block.get('name') == 'Agent'
# Check run_in_background, prompt content
Identify the LAST command before the crash — cross-reference the session's final tool call timestamp with the system log timestamps from Phase 2.
Gate: Every active session's commands are extracted. The last command before the crash is identified with timestamp.
Trace the code paths that were active at the time of the crash:
What was the last command doing? — if it was a build/test/run command, check:
exec.Command, go test, npm test)Check the crashed application's code for:
grep -rn "tmux" internal/ cmd/)Instrument production runners to trace leaked subprocess calls — when tests crash an external system (tmux, databases, etc.), add debug logging to every production runner that spawns subprocesses. Log to a persistent file so the output survives the crash:
// Add to every production runner method that calls exec.Command
func debugLogExec(caller, name string, args []string) {
f, _ := os.OpenFile("/tmp/test-exec.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
if f != nil {
defer f.Close()
fmt.Fprintf(f, "[%s] %s %v\n", caller, name, args)
}
}
Run tests, then inspect the log to identify which packages and code paths are leaking real subprocess calls. An empty log file (or no file created) confirms zero leaks. This is especially useful when subprocess calls are indirect — e.g., an HTTP handler calls a function that calls a package function that eventually hits exec.Command through 3+ layers of indirection.
Check configuration that affects crash behavior:
exit-empty, exit-unattached, destroy-unattachedCheck recent code changes in the area:
git log --oneline --since="1 week ago" -- <relevant-paths>
Gate: Code paths from the last command to the crash point are traced. Configuration that affects cascading failures is documented.
Only now may you reason about the cause. Build the argument from evidence, not speculation.
Construct the event chain — a timestamped sequence from the trigger to the final failure. Every link must cite evidence from Phases 2-4:
HH:MM:SS.mmm — [source: unified log / session log / crash log] — event descriptionIdentify the root cause — the earliest event in the chain that, if prevented, would have avoided the failure. Distinguish:
exit-empty on)Verify the chain is complete — are there gaps? If yes, state them explicitly as unknowns.
Rule out alternatives — for each plausible alternative cause, cite the evidence that eliminates it:
| Alternative | Ruled out by |
|---|---|
| Example: manual kill-server | No kill-server in any session log |
| Example: sleep/wake SIGKILL | No powerd sleep events in window |
Present a structured report:
A table with timestamp, event, and evidence source for every link in the chain.
One paragraph explaining what happened and why, with evidence citations.
Bullet list of conditions that enabled or worsened the failure.
| Category | Result |
|---|---|
| AMFI/codesign | Found / Not found |
| Jetsam | Found / Not found |
| Sleep/wake | Found / Not found |
| Signals | Found / Not found |
| Session commands | Suspicious / Clean |
| Crash reports | Found / Not found |
Table from Phase 5.
Concrete, minimal actions to prevent recurrence. Reference specific files and line numbers.
Gate: Report delivered. Every claim cites evidence. Unknowns are stated explicitly.