Error Recovery Protocol | agent-guardrails | ClaudePluginHub

Skill

Error Recovery Protocol

From agent-guardrails

Guides structured recovery from errors and failures by assessing, logging root cause, and applying safe fixes with rollback plans. Prevents cascading mistakes via forbidden patterns and escalation rules.

developer-tools

$ npx claudepluginhub thearchitectit/agent-guardrails-template

Popularity

Stars

38

Forks

2

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/agent-guardrails:error-recovery

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

How to recover from failures, errors, and unexpected states without making things worse.

SKILL.md

111 lines · ~978 tokens

Stats

Language: Go
Stars: 38
Forks: 2
Maintenance: Excellent
Last Commit: May 20, 2026

Tags

failure-handling

incident-response

Error Recovery Protocol

How to recover from failures, errors, and unexpected states without making things worse.

Recovery Principles

Stop the bleeding first
Understand before fixing
Never make two changes at once
Have a rollback plan

Recovery Steps

Step 1: Stop and Assess

Do NOT immediately try another fix
Note the exact error message and context
Determine if the failure is isolated or cascading
Check if any data was corrupted or lost

Step 2: Log the Failure

Document what was attempted
Document the exact error
Document the state of the system before and after
These records feed the failure registry, which is used to prevent regressions
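
A failure log entry could be captured along these lines. This is a minimal sketch, not the registry format used by agent-guardrails: the `FailureRecord` class, its field names, and the JSON-lines serialization are all assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class FailureRecord:
    """One entry in a hypothetical failure registry (field names are assumed)."""

    attempted: str     # what was attempted
    error: str         # the exact error text, copied verbatim
    state_before: str  # system state before the attempt
    state_after: str   # system state after the failure
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_registry_line(record: FailureRecord) -> str:
    """Serialize a record as a single JSON line, ready to append to a log file."""
    return json.dumps(asdict(record), sort_keys=True)


# Illustrative values only.
example = FailureRecord(
    attempted="apply config change",
    error="KeyError: 'port'",
    state_before="service running",
    state_after="service stopped",
)
print(to_registry_line(example))
```

Keeping one record per line makes the log easy to append to and to search later.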

Step 3: Identify Root Cause

Read the error carefully (stack traces, logs, output)
Look for the FIRST error, not the last (later errors are often consequences that obscure the root cause)
Check for recent changes that may have triggered it
Verify environment state (files, configs, dependencies)
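
Finding the first error in a log can be sketched as follows. The marker list in `ERROR_PATTERN` and the helper name `first_error_line` are assumptions; real logs may need a different pattern.

```python
import re
from typing import Optional, Tuple

# Assumed error markers; adjust to the log format actually in use.
ERROR_PATTERN = re.compile(r"\b(ERROR|FATAL|Traceback|Exception|panic)\b")


def first_error_line(log_text: str) -> Optional[Tuple[int, str]]:
    """Return (1-based line number, line) of the earliest error-like line."""
    for number, line in enumerate(log_text.splitlines(), start=1):
        if ERROR_PATTERN.search(line):
            return number, line
    return None


log = "INFO starting\nERROR config file missing\nERROR cannot connect\n"
print(first_error_line(log))
```

In this sample the second error is plausibly a consequence of the first, which is why the search stops at the earliest match.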

Step 4: Choose Recovery Path

Situation	Action
Clear cause, safe fix	Apply targeted fix
Unclear cause	HALT and ask user for guidance
Data corruption	Restore from backup or rollback
Environment issue	Rebuild/reset environment
Dependency issue	Update, downgrade, or pin dependency
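
The table above can be expressed as a simple lookup. This is an illustrative sketch: the `Situation` enum and the `choose_recovery_path` function are hypothetical names, and the action strings are copied from the table.

```python
from enum import Enum


class Situation(Enum):
    CLEAR_CAUSE_SAFE_FIX = "Clear cause, safe fix"
    UNCLEAR_CAUSE = "Unclear cause"
    DATA_CORRUPTION = "Data corruption"
    ENVIRONMENT_ISSUE = "Environment issue"
    DEPENDENCY_ISSUE = "Dependency issue"


# One action per situation, mirroring the table row by row.
RECOVERY_ACTIONS = {
    Situation.CLEAR_CAUSE_SAFE_FIX: "Apply targeted fix",
    Situation.UNCLEAR_CAUSE: "HALT and ask user for guidance",
    Situation.DATA_CORRUPTION: "Restore from backup or rollback",
    Situation.ENVIRONMENT_ISSUE: "Rebuild/reset environment",
    Situation.DEPENDENCY_ISSUE: "Update, downgrade, or pin dependency",
}


def choose_recovery_path(situation: Situation) -> str:
    """Look up the action for a diagnosed situation."""
    return RECOVERY_ACTIONS[situation]


print(choose_recovery_path(Situation.UNCLEAR_CAUSE))
```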

Step 5: Verify Recovery

Reproduce the original scenario
Confirm the fix resolves the issue
Check no new issues were introduced
Run relevant tests
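
The verification step might be wired together like this. It is a sketch under stated assumptions: `verify_recovery` is a hypothetical helper, `reproduce` is assumed to return True when the original failure still occurs, and each entry in `checks` is assumed to return True on success.

```python
from typing import Callable, Dict, List


def verify_recovery(
    reproduce: Callable[[], bool],
    checks: Dict[str, Callable[[], bool]],
) -> List[str]:
    """Return a list of problems; an empty list means the recovery is verified."""
    problems = []
    # Re-run the original scenario first: the fix must resolve it.
    if reproduce():
        problems.append("original failure still reproduces")
    # Then run the relevant tests to catch newly introduced issues.
    for name, check in checks.items():
        if not check():
            problems.append(f"check failed: {name}")
    return problems


print(verify_recovery(lambda: False, {"unit tests": lambda: True}))
```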

Forbidden Recovery Patterns

NEVER do these when recovering from errors:

Blind retry — Trying the same thing again without understanding why it failed
Shotgun debugging — Changing multiple things at once hoping one works
Comment-and-forget — Commenting out broken code instead of fixing it
Production hotfix without testing — Pushing fixes directly to production
Ignoring test failures — "It's just a flaky test" without investigation
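
A guard against blind retries could look roughly like the following. The `RetryGuard` class and its method names are hypothetical; the idea is only that an action that has already failed should not be repeated until its failure has been explained.

```python
from typing import Dict


class RetryGuard:
    """Remembers failed actions so that an identical retry can be flagged."""

    def __init__(self) -> None:
        self._failed: Dict[str, str] = {}  # action -> last error seen

    def record_failure(self, action: str, error: str) -> None:
        self._failed[action] = error

    def record_understanding(self, action: str) -> None:
        """Call once the cause is understood and something has changed."""
        self._failed.pop(action, None)

    def is_blind_retry(self, action: str) -> bool:
        """True if this action failed before and nothing has been learned since."""
        return action in self._failed


guard = RetryGuard()
guard.record_failure("run build", "exit code 1")
print(guard.is_blind_retry("run build"))
```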

Rollback Checklist

Before any significant change, know your rollback plan:

Can you revert the change with a single git command?
Is the previous version known to work?
Can you restore data if needed?
Is there a database migration to undo?
Will rollback affect other users/systems?
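
One way to make the checklist explicit is to record the answers and list whatever remains a concern. This is an assumed interpretation of the questions above, not part of the skill itself; `RollbackPlan` and `rollback_concerns` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RollbackPlan:
    single_git_command: bool      # can the change be reverted with one git command?
    previous_version_works: bool  # is the previous version known to work?
    data_restorable: bool         # can data be restored if needed?
    migration_to_undo: bool       # is there a database migration to undo?
    affects_others: bool          # will rollback affect other users/systems?


def rollback_concerns(plan: RollbackPlan) -> List[str]:
    """List the checklist answers that should be resolved before changing anything."""
    concerns = []
    if not plan.single_git_command:
        concerns.append("revert needs more than one git command")
    if not plan.previous_version_works:
        concerns.append("previous version is not known to work")
    if not plan.data_restorable:
        concerns.append("data cannot be restored")
    if plan.migration_to_undo:
        concerns.append("a database migration must be undone")
    if plan.affects_others:
        concerns.append("rollback affects other users or systems")
    return concerns


print(rollback_concerns(RollbackPlan(True, True, True, False, False)))
```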

Escalation Criteria

Escalate to user when:

The root cause cannot be determined after 2 attempts
Recovery requires destructive operations (delete, reset, restore from backup)
Error affects production data
You're on your 3rd recovery attempt (Three Strikes Rule)
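
The escalation criteria can be sketched as a single check. The function name `escalation_reasons` and its parameters are assumptions for illustration; the thresholds (2 diagnosis attempts, 3rd recovery attempt) come from the list above.

```python
from typing import List


def escalation_reasons(
    root_cause_known: bool,
    diagnosis_attempts: int,
    needs_destructive_op: bool,
    affects_production_data: bool,
    recovery_attempt: int,
) -> List[str]:
    """Return every reason to escalate; an empty list means keep going."""
    reasons = []
    if not root_cause_known and diagnosis_attempts >= 2:
        reasons.append("root cause not determined after 2 attempts")
    if needs_destructive_op:
        reasons.append("recovery requires a destructive operation")
    if affects_production_data:
        reasons.append("error affects production data")
    if recovery_attempt >= 3:
        reasons.append("third recovery attempt (Three Strikes Rule)")
    return reasons


print(escalation_reasons(False, 2, False, False, 1))
```

Returning all matching reasons, rather than a bare boolean, gives the user the context needed to decide what to do next.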

Pi Enforcement

When running in pi, error recovery is supported by the @architectit/pi-guardrails extension:

Three Strikes: guardrail_record_attempt tracks failures automatically — at 3 strikes, halting is recommended
Violation logging: guardrail_log_violation records failure context for post-mortem analysis
Sandbox: guardrail_mcp with action sandbox_run provides isolated execution for testing recovery steps
Bash safety: Prevents destructive commands during recovery (no rm -rf, sudo, etc.)

See [[sandbox-isolation]] and [[guardrails-core]] for details.

Task

Apply the Error Recovery Protocol to the current failure or error situation. Guide the user through stopping, assessing, understanding, and fixing the problem without making it worse. Follow the recovery steps above and escalate when the escalation criteria are met.

References

docs/workflows/ROLLBACK_PROCEDURES.md — Detailed rollback procedures
docs/workflows/AGENT_ESCALATION.md — When and how to escalate
docs/standards/TEST_PRODUCTION_SEPARATION.md — Environment isolation