System resilience -- circuit breakers, retries, bulkheads, graceful degradation, health checks.
From godmodenpx claudepluginhub arbazkhan971/godmodeThis skill uses the workspace's default tool permissions.
references/resilience-patterns.mdDesigns and optimizes AI agent action spaces, tool definitions, observation formats, error recovery, and context for higher task completion rates.
Enables AI agents to execute x402 payments with per-task budgets, spending controls, and non-custodial wallets via MCP tools. Use when agents pay for APIs, services, or other agents.
Compares coding agents like Claude Code and Aider on custom YAML-defined codebase tasks using git worktrees, measuring pass rate, cost, time, and consistency.
/godmode:resilience, "circuit breaker", "retry"grep -r "circuitBreaker\|CircuitBreaker\|opossum\|gobreaker" \
--include="*.ts" --include="*.js" --include="*.go" \
-l 2>/dev/null
grep -r "retry\|backoff\|Retry" \
--include="*.ts" --include="*.go" -l 2>/dev/null
Evaluate: circuit breakers, retries, timeouts, bulkheads, rate limits, fallbacks, health checks.
States: CLOSED -> OPEN (on threshold) -> HALF-OPEN (after cooldown) -> test one request -> CLOSED/OPEN.
Config per dependency:
Failure threshold: 5 failures in 30s
Half-open timeout: 30s
Success threshold: 3 (to close from half-open)
Monitored exceptions: 5xx, timeout, connection error
Fallback: cached data | default | error
IF no circuit breaker on external call: add one. IF circuit opens too often: tune threshold up.
Formula: min(base_ms * 2^attempt + jitter, max_ms)
Retryable: network timeout, 5xx, DB connection, 429
Non-retryable: 4xx validation, auth, deserialization
Max attempts: 3 (default), 5 (critical operations)
NEVER retry non-idempotent operations without idempotency key. ALWAYS add jitter to prevent thundering herd.
Isolate failure domains. Separate connection pools or semaphores per dependency. One slow dependency must not exhaust all resources.
Config: max concurrent per dependency
Payment API: 20 concurrent, queue 10
Email service: 10 concurrent, queue 5
Search: 30 concurrent, queue 20
Token bucket for burst tolerance. Sliding window
counter for steady limits. See /godmode:ratelimit
for full implementation.
Define fallback per dependency:
Recommendation engine down -> show popular items
Search down -> show cached results + apology
Payment down -> queue order, process later
Analytics down -> skip tracking silently
Levels: NORMAL -> DEGRADED -> MINIMAL -> MAINTENANCE.
livenessProbe:
httpGet: { path: /health/live, port: 8080 }
periodSeconds: 10
failureThreshold: 3
readinessProbe:
httpGet: { path: /health/ready, port: 8080 }
periodSeconds: 5
failureThreshold: 2
Client request: 30s total
-> API gateway: 25s
-> Service call: 5s (connect 1s + read 4s)
-> DB query: 3s
-> External API: 10s
Each layer timeout < parent timeout. Use timeout budget: pass remaining time downstream.
[ ] Circuit breaker on all external deps
[ ] Retry with exponential backoff + jitter
[ ] Bulkhead isolation between deps
[ ] Fallback for every circuit breaker
[ ] Health checks (liveness + readiness)
[ ] Timeouts on all calls (connect + read)
[ ] Rate limiting at API boundary
Append .godmode/resilience-results.tsv:
timestamp pattern target dependency config test_result status
KEEP if: dependency failure triggers graceful
degradation AND circuit opens AND fallback works.
DISCARD if: failure cascades OR no fallback
OR timeout not configured.
STOP when ALL of:
- All critical deps have circuit breakers
- All retries use backoff + jitter
- All calls have timeouts
- All circuit breakers have fallbacks
- Verification checklist passes
On failure: git reset --hard HEAD~1. Never pause.
| Failure | Action |
|---|---|
| Circuit never opens | Check threshold, include timeouts |
| Retry storm | Add backoff+jitter, circuit breaker first |
| Stale fallback data | Set max staleness, alert on primary |
| Bulkhead rejects too many | Tune pool size, check downstream |