From agent-squad-hr
Recover gracefully when an Anthropic credit/billing failure stalls multiple in-flight parallel subagents mid-orchestration, then later resolves. Use when: (1) you've dispatched 2+ parallel subagents (Task tool with `run_in_background: true`) and a credit/billing issue, MCP outage, or other transient harness failure has frozen them, (2) the user reports the issue is now resolved and asks you to continue, (3) you're about to relaunch "resume" agents to pick up where stalled originals left off. Defends against the silent-revive trap: when API access is restored, ORIGINAL stalled subagents can auto-resume from where they froze and continue working IN PARALLEL with any "resume" agents you dispatch — leading to two agents racing on the same branch / PR / files, duplicate review comments, or one agent overwriting the other's work. Captures the diagnostic recipe (JSONL mtime + worktree-state inspection, NOT just agent status) and the safe resume pattern (state-aware briefs that detect current state and pivot to value-add if originals already recovered). Sister skill to `subagent-reports-complete-but-pr-unmerged` (different gap: completed status vs unmerged PR) and `anthropic-credit-balance-error-vs-app-bug` (different layer: diagnosing fake vs real credit errors, not recovering from real ones).
npx claudepluginhub wan-huiyan/agent-squad-hr
How this skill is triggered (by the user, by Claude, or both)
Slash command
/agent-squad-hr:credit-stall-mid-orchestration-revive-collision
The summary Claude sees in its skill listing, used to decide when to auto-load this skill
You're orchestrating N parallel subagents (Task tool with `run_in_background: true`),
You're orchestrating N parallel subagents (Task tool with run_in_background: true),
each working on independent tracks (e.g., one PR per track). Mid-run, an Anthropic
billing/credit failure or other transient harness outage hits — all in-flight subagents
freeze simultaneously. The user fixes billing, reports back, and asks you to continue.
The trap: stalled subagents are NOT dead; they're suspended at the harness layer. When API access is restored, the harness can auto-resume them from the exact tool-call boundary they froze on. If you naively relaunch "resume" agents to pick up where you think the originals stopped, you end up with TWO agents racing on the same branch, PR, or files.
Symptoms when this happens: PR has two "no issues found" review comments, force-push storms as both agents rebase, agent A reports "PR already merged" while agent B is still polling CI for that same PR.
Severity ladder (observed in an earlier session, 2026-05-08, all three originals revived ~2hr post-stall):
- […] `git diff origin/main HEAD -- <files>`, finds zero diff on feature files, recommends "do not file PR / delete branch", terminates. No PR debris, just a status-report task-notification.

The original agent's behavior on revive is non-uniform because it depends on which tool-call boundary the harness froze it at. If frozen pre-implementation, it does the safe thing. If frozen post-push-pre-PR, it files a duplicate PR before any sanity check fires.
ALL of the following:
- `<task-notification>` storm of `completed` events stopped or stayed silent for significantly longer than each agent's typical heartbeat, OR the user reports billing / credit / connection errors mid-session

The probability of revive-collision rises with the number of parallel agents and the duration of the stall. Single-agent sessions usually don't hit this: the harness either resumes the one agent or doesn't.
Don't trust "no completion notification" alone. Check three signals per stalled agent:
A. JSONL transcript mtime — proves whether the agent is making any moves at all:
```shell
# The output_file in the task-notification is a symlink. Use -L to follow.
# BSD/macOS syntax shown; on GNU/Linux use: stat -L -c "%y %s bytes %n" <path>
stat -f "%Sm %z bytes %N" -L /private/tmp/claude-PID/...task-id.output
```
If mtime hasn't advanced in >5 min AND the file size hasn't grown, the agent is stalled at the harness layer (no API responses landing). If mtime is fresh (<2 min ago), the agent is alive and you should NOT launch a resume — wait for it.
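The thresholds above can be sketched as a small shell helper. This is a minimal sketch, not part of the skill itself: `mtime_epoch` and `transcript_verdict` are hypothetical names, the two size samples are assumed to be taken a few minutes apart, and ages between 2 and 5 minutes are labeled `recheck` because the text gives no rule for that range.

```shell
#!/usr/bin/env bash
# Hypothetical helper: modification time in epoch seconds, following symlinks.
# Tries GNU stat first, then falls back to BSD/macOS stat.
mtime_epoch() {
  stat -L -c %Y "$1" 2>/dev/null || stat -L -f %m "$1"
}

# Hypothetical helper: classify a transcript from its mtime age and two size samples.
# Args: <age_seconds> <size_at_first_sample> <size_now>
transcript_verdict() {
  local age=$1 size_then=$2 size_now=$3
  if [ "$age" -lt 120 ]; then
    echo "alive"      # fresh mtime (<2 min): wait, do not launch a resume
  elif [ "$age" -gt 300 ] && [ "$size_now" -le "$size_then" ]; then
    echo "stalled"    # >5 min without writes and no growth
  else
    echo "recheck"    # ambiguous: sample again before deciding
  fi
}

# Usage sketch ($out is the task's output_file path):
#   now=$(date +%s); age=$(( now - $(mtime_epoch "$out") ))
#   transcript_verdict "$age" "$size_then" "$size_now"
```

Keeping the verdict separate from the `stat` call means the decision rule can be checked without a live transcript.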
B. Worktree state — proves what the agent had finished before stalling:
```shell
cd <worktree-path>
git log --oneline origin/main..HEAD   # any commits made?
git status -sb                        # any uncommitted modifications?
git ls-remote origin <branch-name>    # was the branch pushed?
```
This tells you the last visible side effect the agent achieved. Possibilities:
| Worktree state | Agent had reached | Resume strategy |
|---|---|---|
| Clean, no commits, no remote branch | Just started reading | Restart from scratch |
| Modified files, no commits | Mid-implementation | Resume with WIP-state brief |
| Committed locally, not pushed | Code done, no PR yet | Resume with "push + open PR" |
| Committed + pushed, no PR | Pushed but no PR | Resume with "open PR + merge" |
| PR opened, OPEN state | PR opened, in CI/review | Don't relaunch — verify yourself |
| PR MERGED | All done | Don't relaunch — declare victory |
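The table can be read as a decision function that checks the furthest-along state first. The sketch below assumes that ordering; `resume_strategy` and its output labels are hypothetical, and a CLOSED (unmerged) PR is treated like "no PR", which the table does not cover.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map observed signals onto the table's strategy column.
# Args: <commits_ahead> <dirty: yes|no> <pushed: yes|no> <pr_state: NONE|OPEN|MERGED>
resume_strategy() {
  local commits=$1 dirty=$2 pushed=$3 pr_state=$4
  if [ "$pr_state" = "MERGED" ]; then
    echo "declare-victory"            # all done, do not relaunch
  elif [ "$pr_state" = "OPEN" ]; then
    echo "verify-yourself"            # PR in CI/review, do not relaunch
  elif [ "$pushed" = "yes" ]; then
    echo "resume-open-pr-and-merge"   # pushed, no PR yet
  elif [ "$commits" -gt 0 ]; then
    echo "resume-push-and-open-pr"    # committed locally, not pushed
  elif [ "$dirty" = "yes" ]; then
    echo "resume-with-wip-brief"      # modified files, no commits
  else
    echo "restart-from-scratch"       # clean worktree, nothing visible
  fi
}

# Gathering the inputs inside the worktree (sketch):
#   commits=$(git rev-list --count origin/main..HEAD)
#   dirty=$([ -n "$(git status --porcelain)" ] && echo yes || echo no)
#   pushed=$([ -n "$(git ls-remote origin <branch-name>)" ] && echo yes || echo no)
```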
C. PR state via gh CLI — proves the externally-visible end state:
```shell
gh pr list --repo <repo> --state all --search "head:<branch>" \
  --json number,state,mergedAt,mergeCommit
```
A MERGED state trumps everything else: the original agent finished. Never launch a resume on a MERGED branch — at best you'll do redundant work; at worst you'll add a fixup commit to a deleted branch.
If you decide to relaunch, the resume brief MUST:
When you launch a resume agent, ALSO assume the original will revive in parallel. Concretely:
- […] `resume/<original-branch>` if the work needs to be done out-of-band.
- […] (`gh pr view` or `git log origin/main`) and gracefully pivot; that's why Step 2's pivot authorization matters.
- […] `git push`, files a fresh duplicate PR, THEN runs its diff check and self-reports "close as superseded". Orchestrator must:

```shell
gh pr close <duplicate#> --comment "Closing as superseded: PR #<canonical> shipped this. <link to credit-stall skill>."
gh api -X DELETE repos/<owner>/<repo>/git/refs/heads/<branch-name>
```
If only 1-2 tracks remain and the work is simple (e.g., merge an already-green PR,
verify a deploy), do it in the main orchestrator thread. Don't burn a fresh subagent
budget waiting on gh pr checks --watch polling.
After dispatching resume briefs:
```shell
# Confirm only one agent is making progress on each branch
for branch in feat/track-A feat/track-B feat/track-C; do
  echo "=== $branch ==="
  gh pr list --repo <repo> --state all --search "head:$branch" \
    --json number,state,headRefName,updatedAt --jq '.[] | {number, state, updatedAt}'
done
# Check for force-push storms (revive collision symptom)
gh pr view <PR#> --repo <repo> --json commits --jq '.commits | length'
# This prints the PR's commit count, not a force-push count. Poll it: a count
# that resets or jumps repeatedly within a few minutes is a collision smell.
```
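To count force-pushes directly rather than inferring them from the commit count, one option is the PR's timeline. This is a sketch under the assumption that the GitHub REST timeline endpoint records each force-push as a `head_ref_force_pushed` event; `count_force_pushes` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count force-push events from a list of timeline event
# names supplied one per line on stdin.
count_force_pushes() {
  # grep -c prints 0 but exits non-zero when nothing matches; keep the exit clean.
  grep -cx 'head_ref_force_pushed' || true
}

# Usage sketch (placeholders as elsewhere in this document):
#   gh api --paginate "repos/<owner>/<repo>/issues/<PR#>/timeline" \
#     --jq '.[].event' | count_force_pushes
```

More than 5 of these events within a few minutes matches the collision smell described above.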
After all PRs merge, check for the dual-review-comment fingerprint:
```shell
gh pr view <PR#> --repo <repo> --comments | grep -ci "no issues found"
# 2 or more = revive collision happened on review pass
```
Also check for late-filed duplicate PRs ~1-3hr after the canonical PR merged:
```shell
gh pr list --repo <repo> --state open \
  --json number,title,headRefName,createdAt --jq '.[] | select(.title | contains("<keyword from canonical PR>"))'
# Any open PR with the same title pattern as a recently-merged canonical = duplicate
```
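The "late-filed" part of this check can be made explicit by comparing each open PR's creation time with the canonical PR's merge time. A minimal sketch, assuming both timestamps are in the same ISO 8601 UTC format (so plain string comparison orders them correctly); `late_duplicates` is a hypothetical helper that reads tab-separated `number`, `createdAt`, `title` rows on stdin.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print PR numbers whose title contains the keyword and
# whose createdAt is later than the canonical PR's mergedAt.
# Args: <keyword> <merged_at>   Stdin: number<TAB>createdAt<TAB>title
late_duplicates() {
  local keyword=$1 merged_at=$2
  awk -F'\t' -v kw="$keyword" -v m="$merged_at" \
    'index($3, kw) > 0 && $2 > m { print $1 }'
}

# Usage sketch:
#   gh pr list --repo <repo> --state open --json number,createdAt,title \
#     --jq '.[] | [.number, .createdAt, .title] | @tsv' \
#     | late_duplicates "<keyword from canonical PR>" "<mergedAt of canonical PR>"
```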
From an earlier session: a 4-track orchestration on the project repo (2026-05-08):
- […] `stat -f "%Sm" -L`). User flagged the issue.
- […] `git log origin/main` showed merge), pivoted to value-add work: re-enabled `test_monitor_template.py` (closes #473), filed tracker finalization PR #571, manually closed #473.

The `<task-notification>` system can deliver a flood of `completed` events from a zombie agent doing zero-tool-use polling cycles after its main work is done. Don't interpret these as forward progress; check JSONL mtime + git state.

[…] `UU` (both-modified, unmerged) markers is a normal step in the `barryu-pr-conflict-site-regen` flow. Don't mistake it for a stall; the agent is mid-rebase.

Related skills:
- `subagent-reports-complete-but-pr-unmerged` (parent receives `status: completed` but PR isn't merged; different gap, complementary recovery)
- `anthropic-credit-balance-error-vs-app-bug` (diagnose fake-vs-real credit errors at write-time; this skill picks up after the credit issue is real and resolved)
- `barryu-pr-conflict-site-regen` (rebase conflict resolution after parallel-track collisions on `docs/site/` + `generate_tracker.py`)
- `using-git-worktrees` (the substrate that makes parallel tracks isolatable in the first place)
- `dispatching-parallel-agents` (when to choose parallel dispatch in the first place; informs how many tracks to risk a stall on)