Design systems that fail gracefully and recover automatically. Use when defining SLAs, designing for fault tolerance, or improving uptime.
From quality-attributes
npx claudepluginhub sethdford/claude-skills --plugin architect-quality-attributes
This skill uses the workspace's default tool permissions.
Build systems that anticipate failures, degrade gracefully, and recover automatically. Design for MTBF and MTTR trade-offs.
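The MTBF/MTTR trade-off can be made concrete with the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The sketch below is illustrative only; the hour values are hypothetical and not taken from any real system.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical baseline: one failure every 500 hours, 2 hours to recover.
baseline = availability(mtbf_hours=500, mttr_hours=2)

# Two ways to improve it: fail half as often, or recover twice as fast.
fewer_failures = availability(mtbf_hours=1000, mttr_hours=2)
faster_recovery = availability(mtbf_hours=500, mttr_hours=1)

print(f"baseline:        {baseline:.5f}")
print(f"fewer failures:  {fewer_failures:.5f}")
print(f"faster recovery: {faster_recovery:.5f}")
```

Under this formula, halving MTTR yields the same availability as doubling MTBF, which is one reason faster recovery is often the cheaper lever.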
You are designing for reliability. The user faces uptime requirements, wants to reduce MTTR, or needs to design disaster recovery. Read their current SLAs and failure modes.
Based on Nygard's Release It! and Google's SRE practices:
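Define SLA/SLO/SLI: For each critical service, state the SLA (the commitment made to users), the SLO (the internal target) and the SLI (the metric actually measured).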
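One way to make an SLO actionable is to convert it into an error budget. The following is a minimal sketch, assuming an availability-style SLO measured over a 30-day window; the function names and the example numbers are hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent; negative means the SLO is violated."""
    allowed_bad = (1.0 - slo) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        # A 100% SLO (or zero traffic) leaves no budget to spend.
        return 0.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


# A 99.9% SLO over 30 days allows roughly 43 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))

# Hypothetical SLI data: 500 failed requests out of 1,000,000.
print(round(budget_remaining(0.999, 999_500, 1_000_000), 3))
```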
Map Failure Modes: For each critical component, ask: "What happens if this fails?" Example: database down → query service fails → frontend shows error.
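The "what happens if this fails?" question can be answered mechanically once dependencies are written down. The sketch below walks a hypothetical dependency map (the service names mirror the example above; `auth` is added only to show an unaffected service) and lists everything that is transitively impacted.

```python
from collections import deque

# Hypothetical dependency map: service -> services it depends on.
DEPENDS_ON = {
    "frontend": ["query-service"],
    "query-service": ["database"],
    "database": [],
    "auth": [],
}


def impacted_by(failed: str, depends_on: dict[str, list[str]]) -> list[str]:
    """Return services that transitively depend on `failed`, nearest first."""
    # Invert the edges so we can walk from a dependency to its dependents.
    dependents: dict[str, list[str]] = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    seen = {failed}
    order: list[str] = []
    queue = deque([failed])
    while queue:  # breadth-first walk over dependents
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in seen:
                seen.add(service)
                order.append(service)
                queue.append(service)
    return order


print(impacted_by("database", DEPENDS_ON))
```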
Design Fault Isolation: Use bulkheads (thread pools per dependency), timeouts, and circuit breakers. Ensure one service failure doesn't bring down others.
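As an illustration of the circuit breaker idea, here is a minimal single-threaded sketch. The class name, thresholds, and the injectable `clock` parameter are assumptions made for the example, not an existing library's API; a production version would also need locking and a timeout applied inside the wrapped call.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: fail fast instead of piling load on a sick dependency.
                raise CircuitOpenError("dependency unavailable; failing fast")
            # Reset timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Giving each downstream dependency its own breaker instance keeps one failing dependency from consuming the caller's capacity, which is the same goal bulkheads serve for thread pools.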
Plan Recovery: For each failure mode, specify the recovery mechanism: automated database replica failover, a service restart, or manual intervention.
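Automated failover can be sketched at the client level as "try the primary, then each replica in order". The helper below is a simplified illustration; `query_with_failover`, the endpoint names, and the choice of `ConnectionError` as the failure signal are all assumptions for the example.

```python
def query_with_failover(endpoints, run_query):
    """Try each endpoint in order and return (endpoint, result) from the first success.

    `run_query` is a caller-supplied function that takes an endpoint and
    raises ConnectionError when that endpoint is unreachable.
    """
    failed = []
    for endpoint in endpoints:
        try:
            return endpoint, run_query(endpoint)
        except ConnectionError:
            failed.append(endpoint)  # remember it and move on to the next one
    # No automated path is left; this is where manual intervention begins.
    raise RuntimeError(f"all endpoints failed: {failed}")


# Usage with a fake query function standing in for a real database client.
def fake_query(endpoint):
    if endpoint == "db-primary":
        raise ConnectionError("primary is down")
    return f"rows from {endpoint}"


print(query_with_failover(["db-primary", "db-replica-1"], fake_query))
```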
Establish Monitoring: Instrument critical paths with metrics (request latency, success rate, queue depth). Alert when an SLI approaches its threshold, before the SLO is breached.
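Alerting before the threshold is crossed can be sketched with a sliding-window success rate and a warning margin. The class name, the default thresholds, and the three status strings below are hypothetical; a real setup would normally express this as rules in a metrics system rather than in application code.

```python
from collections import deque


class SuccessRateMonitor:
    def __init__(self, slo=0.99, warn_margin=0.005, window=1000):
        self.slo = slo                  # minimum acceptable success rate
        self.warn_margin = warn_margin  # how close to the SLO counts as "approaching"
        self._outcomes = deque(maxlen=window)  # only the most recent requests

    def record(self, success: bool) -> None:
        self._outcomes.append(success)

    def success_rate(self) -> float:
        if not self._outcomes:
            return 1.0  # no traffic yet, nothing has failed
        return sum(self._outcomes) / len(self._outcomes)

    def status(self) -> str:
        rate = self.success_rate()
        if rate < self.slo:
            return "breach"
        if rate < self.slo + self.warn_margin:
            return "warning"  # still within the SLO, but close to the line
        return "ok"
```

The warning state is what gives on-call staff time to act while error budget remains.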