Evaluates and builds production-ready systems using stability patterns like circuit breakers, bulkheads, timeouts, retries, health checks, and anti-pattern avoidance. Scores resilience 0-10.
Bundled references: references/anti-patterns.md, references/capacity-planning.md, references/chaos-engineering.md, references/deployment-strategies.md, references/observability.md, references/stability-patterns.md.
Framework for designing, deploying, and operating production-ready software systems. Based on a fundamental truth: the software that passes QA is not the software that survives production. Production is a hostile environment -- and your system must be built to expect and handle failure at every level.
Every system will eventually be pushed beyond its design limits. The question is not whether failures will happen, but whether your system degrades gracefully or collapses catastrophically. Production-ready software is not just correct -- it is resilient, observable, and designed to operate through partial failures without human intervention.
Goal: 10/10. When reviewing or creating production systems, rate them 0-10 based on adherence to the principles below. A 10/10 means full alignment with all guidelines; lower scores indicate gaps to address. Always provide the current score and specific improvements needed to reach 10/10.
Six areas determine whether software survives contact with production: anti-pattern recognition, stability patterns, capacity planning, deployment and release, observability, and chaos engineering.
Core concept: Failures propagate through integration points, cascading across system boundaries. The most dangerous patterns are not bugs in your code -- they are emergent behaviors that arise when systems interact under stress.
Why it works: Recognizing anti-patterns lets you identify and eliminate the cracks before production traffic finds them. Every production outage traces back to one or more of these patterns. They are predictable, recurring, and preventable.
Code applications:
| Context | Pattern | Example |
|---|---|---|
| HTTP calls | Assume every remote call can fail, hang, or return garbage | Wrap all external calls with timeout + circuit breaker |
| Database queries | Enforce result set limits on every query | Add LIMIT clause; paginate all list endpoints |
| Thread pools | Isolate pools per dependency to prevent cross-contamination | Separate thread pool for payment gateway vs. search |
| Load testing | Simulate realistic traffic including spikes and abuse patterns | Use production traffic replays, not synthetic happy-path scripts |
| Marketing events | Coordinate launches with capacity planning | Pre-scale before Black Friday; add queue for coupon redemption |
See: references/anti-patterns.md for detailed analysis of each anti-pattern with failure scenarios and detection strategies.
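The first table row above ("assume every remote call can fail, hang, or return garbage") can be sketched in a few lines. This is an illustrative sketch using only the Python standard library; the URL, field names, and fallback shape are assumptions, not a prescribed API.

```python
# Sketch: treat every remote call as something that can fail, hang, or
# return garbage. Uses only the standard library; the URL and payload
# fields are illustrative assumptions.
import json
import urllib.request

def fetch_inventory(item_id: str) -> dict:
    url = f"https://inventory.internal/items/{item_id}"
    try:
        # The timeout bounds connect + read; without it this thread can hang forever
        with urllib.request.urlopen(url, timeout=5.0) as resp:
            data = json.loads(resp.read())
        if "quantity" not in data:  # validate shape; do not trust the payload
            raise ValueError("malformed inventory response")
        return data
    except (OSError, ValueError):
        # OSError covers URLError, connection failures, and socket timeouts.
        # Fail fast with a safe degraded default instead of propagating the hang.
        return {"item_id": item_id, "quantity": 0, "degraded": True}
```

In a real system this call would also sit behind a circuit breaker, as described in the stability patterns section.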
Core concept: Counter each anti-pattern with a stability pattern. Circuit breakers stop cascading failures. Bulkheads isolate blast radius. Timeouts reclaim stuck resources. Together they create a system that bends under load but does not break.
Why it works: These patterns work because they accept failure as inevitable and design the system's response to failure, rather than trying to prevent all failures. A circuit breaker that trips is the system working correctly -- it is protecting itself from a downstream failure.
Code applications:
| Context | Pattern | Example |
|---|---|---|
| Service calls | Circuit Breaker with threshold and recovery timeout | Open after 5 failures in 60s; half-open after 30s |
| Resource isolation | Bulkhead with dedicated pools per dependency | Separate connection pools for critical vs. non-critical services |
| Network calls | Timeout with propagation | Connect: 1s, read: 5s; propagate deadline to downstream calls |
| Retries | Exponential backoff + jitter + retry budget | Base 100ms, max 3 retries, 20% retry budget across fleet |
| Data cleanup | Steady State with automated purging | Delete sessions older than 24h; rotate logs at 500MB |
See: references/stability-patterns.md for implementation details, state machines, threshold tuning, and pattern combinations.
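A minimal circuit breaker matching the thresholds in the table above (open after N failures, half-open after a recovery timeout) can be sketched as follows. This is a simplified, single-threaded illustration; production implementations add locking, per-state metrics, and configurable failure classification.

```python
# Minimal circuit-breaker sketch: closed -> open after N failures,
# half-open after the recovery timeout elapses. Thresholds are illustrative.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.recovery_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Recovery window elapsed: half-open, allow one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        else:
            self.failures = 0  # a success closes the circuit again
            self.opened_at = None
            return result
```

Note that a tripped breaker raising immediately is the desired behavior: callers fail in milliseconds instead of stacking up threads behind a dead dependency.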
Core concept: Capacity is not a single number -- it is a multi-dimensional function of CPU, memory, network, disk I/O, connection pools, and thread counts. Capacity planning means understanding which resource becomes the bottleneck first and at what load level.
Why it works: Systems that are not capacity-tested fail in production at the worst possible moment -- during peak load. Understanding your system's actual limits (not theoretical limits) lets you set realistic SLAs and plan scaling before users hit the wall.
Code applications:
| Context | Pattern | Example |
|---|---|---|
| Load testing | Ramp to expected peak, then 2x, observe degradation curve | Gradually increase RPS until latency exceeds SLO |
| Connection pools | Size based on measured concurrency, not defaults | Measure active connections under load; set pool to P99 + 20% headroom |
| Auto-scaling | Define scaling triggers with appropriate cooldown | Scale on CPU > 70% sustained 3 min; cooldown 5 min |
| Soak testing | Run at 80% capacity for 24-72 hours | Catch memory leaks, connection leaks, file handle exhaustion |
| Capacity model | Document resource bottleneck per service | "Service X is memory-bound at 2000 RPS; needs 4GB per instance" |
See: references/capacity-planning.md for testing methodologies, resource pool management, and scalability modeling.
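The "measured concurrency, not defaults" rule from the table reduces to a small calculation: size the pool from observed concurrent-connection samples at P99 plus headroom. A sketch, with the 20% headroom figure and percentile method as illustrative assumptions:

```python
# Sketch: derive a connection-pool size from concurrency measured under load,
# at P99 plus ~20% headroom. The headroom figure is an illustrative default.
import math

def pool_size(concurrency_samples: list[int], headroom: float = 0.20) -> int:
    ordered = sorted(concurrency_samples)
    # Nearest-rank P99: index of the sample at the 99th percentile
    p99_index = min(len(ordered) - 1, math.ceil(0.99 * len(ordered)) - 1)
    p99 = ordered[p99_index]
    return math.ceil(p99 * (1 + headroom))
```

For example, samples ranging uniformly from 1 to 100 concurrent connections give a P99 of 99 and a recommended pool size of 119.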
Core concept: Deployment (putting code on servers) and release (exposing code to users) are separate operations that should be decoupled. Separating them gives you the ability to deploy without risk and release with confidence.
Why it works: Most outages are caused by changes -- deployments, configuration updates, database migrations. Decoupling deployment from release means you can deploy code to production, verify it works, and only then route traffic to it. If something goes wrong, you roll back the release, not the deployment.
Code applications:
| Context | Pattern | Example |
|---|---|---|
| Deploys | Blue-green with health check gate | Deploy to green; run smoke tests; swap router |
| Progressive rollout | Canary with automated rollback | Route 5% traffic to canary; auto-rollback if error rate > 1% |
| Feature launch | Feature flags with emergency off switch | Ship code behind flag; enable for 10% of users; monitor; ramp |
| Schema changes | Expand-contract migration pattern | Add new column; deploy code that writes both; backfill; drop old column |
| Rollback | Instant rollback via traffic routing | Keep previous version running; rollback = switch load balancer target |
See: references/deployment-strategies.md for deployment patterns, migration strategies, and infrastructure-as-code practices.
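The feature-flag row above (ship dark, enable for 10%, ramp) hinges on a stable bucketing function, so each user stays in or out consistently as the percentage ramps. A minimal sketch, where the flag name and in-memory rollout store are illustrative assumptions (real systems back this with a flag service):

```python
# Sketch: percentage-based feature flag that decouples release from deploy.
# A stable hash of flag + user keeps bucketing consistent across processes.
# The flag name and in-memory rollout table are illustrative assumptions.
import hashlib

ROLLOUT_PERCENT = {"new_checkout": 10}  # 10% of users; set to 0 as kill switch

def flag_enabled(flag: str, user_id: str) -> bool:
    percent = ROLLOUT_PERCENT.get(flag, 0)
    if percent <= 0:
        return False  # emergency off switch: unknown or disabled flags are off
    # Stable bucket in 0-99, independent of process restarts and hosts
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent
```

Ramping the rollout is then a configuration change (10 → 50 → 100), not a deployment, and rollback is setting the percentage back to zero.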
Core concept: You cannot operate what you cannot observe. Observability is not an afterthought -- it is a first-class design concern. Health checks, metrics, logs, and traces are the sensory organs of your system in production.
Why it works: Production systems fail in ways that are invisible without proper instrumentation. A health check that only returns "OK" tells you nothing. Metrics without context are noise. Observability done right gives you the ability to answer questions about your system that you did not anticipate at design time.
Code applications:
| Context | Pattern | Example |
|---|---|---|
| Health endpoints | Deep health check with dependency status | /health returns status of DB, cache, queue, and disk space |
| Service metrics | RED method instrumentation | Track request rate, error rate, and p50/p95/p99 latency per endpoint |
| Resource metrics | USE method for infrastructure | Track utilization, saturation (e.g., run-queue depth), and errors per resource |
| Distributed tracing | Propagate trace context across service boundaries | Inject trace ID in headers; correlate logs across services |
| Alerting | Alert on SLO burn rate, not raw thresholds | "Error budget burning 10x normal rate" vs. "CPU > 80%" |
See: references/observability.md for health check design, metrics instrumentation, SLO frameworks, and alerting strategies.
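The deep health check from the table can be sketched as a function that probes each dependency and reports per-dependency status, rather than a bare "OK". The probe callables here are stand-ins for real connectivity checks (DB ping, cache ping, queue depth, disk space); wiring into an HTTP framework is left out.

```python
# Sketch: deep health check that reports per-dependency status. The probes
# are stand-ins for real checks (DB ping, cache ping, queue depth, disk).
def deep_health(probes: dict) -> dict:
    """Run each named probe; degrade overall status if any fails."""
    results = {}
    for name, probe in probes.items():
        try:
            results[name] = "ok" if probe() else "fail"
        except Exception:
            results[name] = "fail"  # a probe that throws is a failing dependency
    overall = "ok" if all(v == "ok" for v in results.values()) else "degraded"
    return {"status": overall, "dependencies": results}
```

A load balancer or orchestrator can then pull instances whose status is degraded, and an on-call engineer sees immediately which dependency broke rather than just that something did.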
Safety note: Chaos engineering experiments are design-time planning activities. The patterns below describe what to test and what to verify, not actions for an AI agent to execute autonomously. All failure injection must be performed by authorized engineers using dedicated tooling (e.g., Gremlin, Litmus, AWS FIS) with proper approvals, rollback plans, and blast radius controls in place.
Core concept: Confidence in your system's resilience comes from testing it under realistic failure conditions. Chaos engineering is the discipline of experimenting on a system in a controlled environment to build confidence in its ability to withstand turbulent conditions.
Why it works: You cannot know how your system handles failure until it actually fails. Waiting for production incidents to discover weaknesses is reactive and expensive. Chaos engineering proactively injects failures in a controlled way, turning unknown-unknowns into known-knowns before they cause real outages.
Code applications:
| Context | Pattern | Example |
|---|---|---|
| Process failure | Controlled instance termination (via chaos tooling) | Terminate one pod using Gremlin/Litmus; verify service recovers within SLO |
| Network failure | Inject latency or partition between services (via chaos tooling) | Add 500ms latency to DB calls; verify circuit breaker trips |
| Dependency failure | Simulate downstream service outage (via chaos tooling) | Return 503 from payment API; verify graceful degradation |
| Resource exhaustion | Simulate resource pressure (via chaos tooling) | Stress-test memory limits; verify process restarts cleanly |
| GameDay | Scheduled team exercise with realistic failure scenario | "Primary database goes read-only at 2pm" -- practice response |
See: references/chaos-engineering.md for experiment design, blast radius management, and building a chaos engineering practice.
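Consistent with the safety note above, a chaos experiment is best expressed as a design-time artifact: data that engineers and chaos tooling consume, not code that injects failure itself. A sketch with illustrative field names; real tooling (Gremlin, Litmus, AWS FIS) has its own schemas:

```python
# Design-time sketch: a chaos experiment expressed as data. This describes a
# plan for engineers and chaos tooling -- it injects nothing itself. All field
# names and thresholds are illustrative assumptions.
experiment = {
    "name": "db-latency-trips-circuit-breaker",
    "hypothesis": "500ms added DB latency trips the breaker within 60s "
                  "and checkout degrades gracefully",
    "blast_radius": {"environment": "staging", "max_instances": 1},
    "steady_state": {"checkout_error_rate": "< 1%", "p99_latency_ms": 800},
    "injection": {"type": "latency", "target": "db", "delay_ms": 500},
    "abort_conditions": ["checkout_error_rate > 5%", "manual stop"],
    "rollback": "remove latency rule; verify steady state restored",
}

def is_safe_to_run(exp: dict) -> bool:
    # Gate: never run without a hypothesis, blast radius, abort conditions,
    # and a rollback plan -- the controls the safety note requires.
    required = ("hypothesis", "blast_radius", "abort_conditions", "rollback")
    return all(exp.get(key) for key in required)
```

A gate like `is_safe_to_run` belongs in the review step before any experiment is handed to tooling: an experiment missing its rollback plan should never reach execution.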
| Mistake | Why It Fails | Fix |
|---|---|---|
| No timeouts on outbound calls | One slow dependency freezes the entire system | Set connect and read timeouts on every external call |
| Unbounded retries | Retry storms amplify failures instead of recovering from them | Use exponential backoff, jitter, and fleet-wide retry budgets |
| Shared thread/connection pools | One failing dependency drains resources from all features | Bulkhead: isolate pools per dependency or feature |
| Shallow health checks only | Load balancer routes traffic to instances with broken dependencies | Implement deep health checks that verify downstream connectivity |
| Testing only the happy path | System works perfectly until the first real failure | Load test, soak test, and chaos test before every major release |
| Coupling deploy and release | Every deployment is a high-risk event with all-or-nothing rollout | Use feature flags, canary releases, and blue-green deployments |
| Alerting on causes, not symptoms | High CPU alerts fire but users are fine; errors spike but no alert fires | Alert on user-facing SLIs: error rate, latency, availability |
| No capacity model | System falls over at 2x load during an event nobody planned for | Model bottleneck resources; load test to 2-3x expected peak |
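The "unbounded retries" fix from the table (exponential backoff with jitter and a hard attempt cap) can be sketched in a few lines. Base delay, cap, and attempt count are illustrative defaults; a fleet-wide retry budget would sit above this in a real client.

```python
# Sketch: bounded retries with exponential backoff and full jitter, the
# counter to retry storms. Delays and the attempt cap are illustrative.
import random
import time

def call_with_retries(fn, max_attempts=3, base_delay=0.1, max_delay=2.0):
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # attempt budget exhausted: surface the error
            # Full jitter: sleep a random time in [0, min(cap, base * 2^attempt)]
            # so a fleet of clients does not retry in lockstep.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            time.sleep(delay)
```

The jitter is the crucial part: without it, every client that saw the same failure retries at the same instant, hammering the recovering dependency in synchronized waves.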
Audit any production system:
| Question | If No | Action |
|---|---|---|
| Does every outbound call have a timeout? | Calls can hang indefinitely, blocking threads | Add connect and read timeouts to all external calls |
| Are circuit breakers in place for critical dependencies? | One dependency failure takes down the whole system | Add circuit breakers with appropriate thresholds |
| Are thread/connection pools isolated per dependency? | Shared pools allow cross-contamination of failures | Implement bulkhead pattern with dedicated pools |
| Can you deploy without downtime? | Deployments cause user-visible outages | Implement rolling, blue-green, or canary deployment |
| Do health checks verify dependency connectivity? | Dead instances receive traffic; partial failures go undetected | Add deep health checks that test DB, cache, queue |
| Are logs, metrics, and traces correlated? | Debugging requires manual log searching across services | Implement distributed tracing with correlated IDs |
| Have you load-tested beyond expected peak? | Unknown failure mode under real load | Load test to 2-3x expected peak; document breaking point |
| Do you practice failure injection? | Resilience is theoretical, not verified | Start chaos engineering with low-risk experiments |
This skill is based on Michael Nygard's Release It!: Design and Deploy Production-Ready Software (2nd edition, Pragmatic Bookshelf, 2018). For the complete methodology, war stories, and implementation details, consult the book itself.
Michael T. Nygard is a software architect and author with over 30 years of experience building and operating large-scale production systems. He has worked across industries including finance, retail, and government, and has been responsible for systems handling millions of transactions per day. Nygard is known for bridging the gap between development and operations, advocating that architects must be responsible for the systems they design long after the code is written. The first edition of Release It! (2007) became a foundational text in the DevOps and site reliability engineering movements. The second edition (2018) expands coverage to cloud-native architectures, containerization, and modern deployment practices. Nygard is a frequent conference speaker and has contributed to the broader conversation about resilience engineering, sociotechnical systems, and the human factors that influence production stability.