From quality-attributes
Design systems that fail gracefully and recover automatically. Use when defining SLAs, designing for fault tolerance, or improving uptime.
npx claudepluginhub sethdford/claude-skills --plugin architect-quality-attributes
How this skill is triggered (by the user, by Claude, or both)
Slash command
/quality-attributes:reliability-design
The summary Claude sees in its skill listing (used to decide when to auto-load this skill)
Build systems that anticipate failures, degrade gracefully, and recover automatically. Design for MTBF and MTTR trade-offs.
You are designing for reliability. The user faces uptime requirements, wants to reduce MTTR, or needs to design disaster recovery. Read their current SLAs and failure modes.
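The MTBF and MTTR trade-off mentioned above can be made concrete with the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The following is a minimal sketch, not part of the skill itself; the function names `availability` and `allowed_downtime_minutes` and the 30-day window are assumptions chosen for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def allowed_downtime_minutes(target: float, window_days: int = 30) -> float:
    """Downtime permitted by an availability target over a window.

    The 30-day default window is an assumption, not something the skill prescribes.
    """
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    return (1.0 - target) * window_days * 24 * 60


if __name__ == "__main__":
    baseline = availability(mtbf_hours=720, mttr_hours=1.0)
    faster_recovery = availability(mtbf_hours=720, mttr_hours=0.5)
    fewer_failures = availability(mtbf_hours=1440, mttr_hours=1.0)
    print(f"baseline:        {baseline:.6f}")
    print(f"halve MTTR:      {faster_recovery:.6f}")
    print(f"double MTBF:     {fewer_failures:.6f}")
    print(f"99.9% over 30 d: {allowed_downtime_minutes(0.999):.1f} min of downtime")
```

Under this formula, halving MTTR yields the same availability as doubling MTBF, which is why investing in faster recovery is often the cheaper lever.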
Based on Nygard's Release It! and Google's SRE practices:
Define SLA/SLO/SLI:
Map Failure Modes: For each critical component, ask: "What happens if this fails?" Example: database down → query service fails → frontend shows error.
Design Fault Isolation: Use bulkheads (thread pools per dependency), timeouts, and circuit breakers. Ensure one service failure doesn't bring down others.
Plan Recovery: For each failure mode, specify the recovery mechanism. Is it automated database replica failover? A service restart? Manual intervention?
Establish Monitoring: Instrument critical paths with metrics (request latency, success rate, queue depth). Alert when an SLI approaches its SLO threshold, before the objective is actually breached.
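The fault-isolation step above names circuit breakers as one way to keep a failing dependency from dragging down its callers. The following is a minimal single-threaded sketch of that pattern, not the skill's own implementation; the names `CircuitBreaker` and `CircuitOpenError`, the default threshold and timeout values, and the injectable `clock` parameter are all assumptions made for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Hypothetical minimal breaker: closed -> open -> half-open -> closed."""

    def __init__(
        self,
        failure_threshold: int = 3,     # assumed default; tune per dependency
        reset_timeout_s: float = 30.0,  # assumed default; tune per dependency
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout_s:
            return "half-open"  # allow a trial call through
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "open":
            # Fail fast instead of waiting on a dependency known to be down.
            raise CircuitOpenError("dependency unavailable, failing fast")
        try:
            result = fn()
        except Exception:
            self._record_failure()
            raise
        # Any success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result

    def _record_failure(self) -> None:
        self._failures += 1
        # A failed trial call in half-open re-opens the breaker immediately.
        if self._opened_at is not None or self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

A caller would typically catch `CircuitOpenError` and return a degraded response (cached data, a default value) rather than propagate the error. A production version would also need a lock around the state changes and a per-call timeout, both omitted here for brevity.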