From agent-atelier
Health check and mechanical recovery — detect stale leases, expired candidates, stuck reviews, and blocked work items, then take safe recovery actions. Also enforces operating budgets and replays interrupted transactions. Use for routine health checks, when something seems stuck, or when the user says 'watchdog', 'health check', 'check for stale items', 'anything stuck?', 'run maintenance', 'check budgets', 'recovery sweep', or 'clean up orphaned leases'. Also appropriate after a session crash or long idle period.
npx claudepluginhub ether-moon/agent-atelier --plugin agent-atelierThis skill uses the workspace's default tool permissions.
Agent sessions can crash, time out, or disappear. Without cleanup, work items stay in `implementing` forever with expired leases, blocking progress. The watchdog detects these situations and takes safe, mechanical recovery actions.
Searches, retrieves, and installs Agent Skills from prompts.chat registry using MCP tools like search_skills and get_skill. Activates for finding skills, browsing catalogs, or extending Claude.
Checks Next.js compilation errors using a running Turbopack dev server after code edits. Fixes actionable issues before reporting complete. Replaces `next build`.
Guides code writing, review, and refactoring with Karpathy-inspired rules to avoid overcomplication, ensure simplicity, surgical changes, and verifiable success criteria.
Share bugs, ideas, or general feedback.
Agent sessions can crash, time out, or disappear. Without cleanup, work items stay in implementing forever with expired leases, blocking progress. The watchdog detects these situations and takes safe, mechanical recovery actions.
In the long-running loop, watchdog tick is the mechanical half of a 15-minute recovery pulse. Teammate respawn, owner reachability checks, and work re-dispatch are Orchestrator responsibilities after the tick completes.
/agent-atelier:run.agent-atelier/ state files exist)The watchdog performs ONLY mechanical, reversible recovery. It never edits product code, merges branches, resolves human gates, invents validation results, or makes product decisions. Promoting the next candidate from candidate_queue is mechanical (FIFO order was already decided). If something requires judgment, the watchdog escalates to the orchestrator.
The watchdog does not assess whether a lease holder is still reachable. If a WI remains implementing with an unexpired lease, the watchdog leaves it alone; the Orchestrator's recovery pulse handles reachability.
All state mutations go through state-commit. The watchdog reads all three state files, computes recovery actions, and commits them in a single transaction with expected_revision per file. On stale_revision, re-read state and retry.
tickThe only subcommand. Runs the full health check and recovery sweep described in Execution Steps below.
Each step is summarized below. See reference/execution-details.md for field-level specifics.
.agent-atelier/.pending-tx.json exists, replay the interrupted transaction with state-commit --replay before any other checks.loop-state.json, work-items.json, watchdog-jobs.json. Note timeout thresholds: implementing (90 min), candidate (30 min), review (30 min), gate warning (24 hr).implementing WI with an expired lease: requeue to ready, clear lease fields, increment stale_requeue_count.reviewing WI past review_timeout_minutes: requeue to ready.active_candidate_set has timed out: requeue all WIs, clear promotion metadata, advance queue (FIFO) if non-empty.human-gates/open/ for HDR files older than gate_warn_after_hours. Create warning alerts.budget_exceeded alerts (no auto-cancel).state-commit transaction with expected_revision per file. On stale revision, re-read and retry.After the report, the Orchestrator may respawn teammates, re-message owners, requeue WIs with unreachable owners, and re-dispatch recovered work.
Invocation (typical):
/agent-atelier:watchdog tick
Clean tick (no issues):
{"request_id": "REQ-WD-001", "accepted": true, "changed": false, "recovered": [], "alerts_created": [], "escalations": []}
Tick with recovery:
{"request_id": "REQ-WD-002", "accepted": true, "changed": true, "tick_at": "2026-04-08T12:00:00Z",
"recovered": [{"work_item_id": "WI-014", "action": "requeued", "reason": "lease expired", "stale_requeue_count": 2}],
"alerts_created": [{"id": "WDA-003", "type": "long_open_gate", "work_item_id": "WI-012"}],
"escalations": []}
| Code | Meaning |
|---|---|
0 | Tick completed (with or without recovery actions) |
1 | Usage error (invalid arguments) |
2 | Stale revision (another writer changed state between read and commit) |
3 | State files not found (not initialized) |
4 | Runtime or environment failure |
The tick subcommand takes no payload. Optional flag:
--request-id <id> — audit trail identifier (optional since tick is inherently idempotent)Returns JSON to stdout. See reference/execution-details.md for the full schema. When presenting to a human, also render a readable summary showing recovered items, alerts, escalations, and healthy items.
Watchdog tick is inherently idempotent. Running it multiple times with the same state produces the same recovery actions. Already-recovered items will not be in the triggering state on re-run.
| Condition | Exit Code | Action |
|---|---|---|
| State files missing | 3 | Suggest running /agent-atelier:init |
| WI references non-existent gate | 0 | Log inconsistency, suggest manual cleanup |
| Runtime error during tick | 4 | Report and stop — never make partial writes |
| Stale revision on commit | 2 | Re-read all state files and retry the entire tick |