npx claudepluginhub intense-visions/harness-engineering --plugin harness-claude

This skill uses the workspace's default tool permissions.
> Circuit breakers, rate limiting, bulkheads, retry patterns, and fault tolerance analysis. Detects missing resilience patterns, evaluates failure modes, and recommends concrete configurations for production-grade fault tolerance.
Assesses and guides implementation of system resilience patterns: circuit breakers, retries, bulkheads, graceful degradation, health checks, timeouts. Prevents cascading failures in distributed services.
Assists with implementing circuit breakers, retries, bulkheads, and related resilience patterns for fault-tolerant distributed systems.
Implements circuit breaker patterns for fault tolerance, failure detection, and fallbacks. Use for external API calls, microservices, databases, and preventing cascading failures. Includes TypeScript, Node.js, Python, Java guides.
Inventory external dependencies. Scan the codebase for outbound connections:
  axios, fetch, got, HttpClient, RestTemplate, reqwest
Map existing resilience patterns. For each dependency found, check for:
  opossum, cockatiel, Polly, resilience4j, hystrix usage
Detect anti-patterns. Flag common resilience mistakes:
Build the dependency map. Produce a structured inventory:
Classify failure modes per dependency. For each external dependency:
Assess blast radius. For each failure mode:
Evaluate current coverage. Score each dependency on resilience coverage:
Prioritize gaps by risk. Combine criticality and coverage:
Check observability. For existing patterns, verify they emit metrics:
Select patterns per dependency. Based on the failure mode analysis:
Provide concrete configurations. For each recommended pattern, specify:
Design fallback strategies. For each critical dependency:
Generate implementation templates. Produce code snippets for:
Define health check contracts. Specify how each dependency should be health-checked (see the sketch after these steps):
Check pattern correctness. For each implemented pattern:
Verify test coverage. Check that resilience patterns are tested:
Verify observability. Confirm that metrics are emitted:
Produce the resilience report. Output a summary:
Run integration verification. If integration tests exist:
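
As an illustration of the health check contracts step, a contract typically means one probe per dependency, each with its own hard timeout and an explicit up/down verdict. A minimal sketch assuming an Express app and node-postgres; the endpoint path, probe names, and timeout values are illustrative, not something the skill prescribes:

```typescript
import express from "express";
import { Pool } from "pg";

const app = express();
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Each dependency gets its own probe with a hard timeout, so a slow
// dependency cannot make the health endpoint itself hang.
async function probe(name: string, check: () => Promise<void>, timeoutMs: number) {
  const deadline = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error("health probe timed out")), timeoutMs),
  );
  try {
    await Promise.race([check(), deadline]);
    return { name, status: "up" as const };
  } catch (err) {
    return { name, status: "down" as const, error: (err as Error).message };
  }
}

app.get("/health", async (_req, res) => {
  const results = await Promise.all([
    // Critical dependency: a failure turns the whole check red.
    probe("postgres", async () => { await pool.query("SELECT 1"); }, 1000),
    // Optional dependencies (e.g. email) would be reported here without failing the check.
  ]);
  const anyDown = results.some((r) => r.status === "down");
  res.status(anyDown ? 503 : 200).json({ dependencies: results });
});

app.listen(3000);
```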
harness skill run harness-resilience -- Primary CLI entry point. Runs all four phases.
harness validate -- Run after implementing recommended patterns to verify project integrity.
harness check-deps -- Verify that new resilience libraries are properly declared and within boundary rules.
emit_interaction -- Used at pattern selection (checkpoint:decision) when multiple valid patterns exist and trade-offs require human judgment.
Glob -- Discover HTTP clients, middleware chains, and existing resilience pattern files.
Grep -- Search for timeout configurations, retry logic, circuit breaker initialization, and anti-patterns.
Write -- Generate implementation templates and resilience configuration files.
Edit -- Add resilience wrappers to existing service clients.
Phase 1: DETECT
Dependencies found:
- Stripe API (HTTP, critical): axios client in src/payments/stripe-client.ts
Resilience: timeout=5000ms, no retry, no circuit breaker, no fallback
- PostgreSQL (database, critical): pg pool in src/db/pool.ts
Resilience: pool max=20, no query timeout, no read replica fallback
- SendGrid (HTTP, optional): @sendgrid/mail in src/notifications/email.ts
Resilience: none
Anti-patterns:
- src/payments/stripe-client.ts:45 — retry on POST /charges without idempotency key
- src/db/pool.ts — no statement_timeout configured
Phase 2: ANALYZE
Stripe failure modes:
- Timeout: Payment page hangs, user retries, duplicate charges possible
- Outage: All payments fail, revenue impact immediate
- Blast radius: checkout flow, subscription renewal, refund processing
Risk: P0 (critical + partial coverage + anti-pattern)
Phase 3: DESIGN
Stripe recommendations:
- Add opossum circuit breaker: failureThreshold=50%, resetTimeout=30s
- Add idempotency key to all Stripe charge requests
- Set timeout to 8000ms (Stripe p99 is ~3s, 2.5x headroom)
- Fallback: queue payment for async retry via Bull queue
PostgreSQL recommendations:
- Set statement_timeout=5000 in pool config
- Add pg-pool error handler with connection retry
- Configure read replica for GET endpoints via pgBouncer
Phase 4: VALIDATE
Resilience coverage: 33% -> 100% (3/3 dependencies covered)
Anti-patterns resolved: 2/2
Tests needed: circuit breaker state transitions, idempotency key generation
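
A sketch of how the Stripe recommendations above might be wired together with opossum and Bull. The chargeCard helper, the queue name, and the use of the Charges API here are illustrative assumptions; the skill would generate the template against the real src/payments/stripe-client.ts.

```typescript
import CircuitBreaker from "opossum";
import Stripe from "stripe";
import Queue from "bull";

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY ?? "");
const paymentRetryQueue = new Queue("payment-retry", process.env.REDIS_URL ?? ""); // hypothetical queue name

// An idempotency key makes a retried POST /charges safe (fixes the flagged anti-pattern).
async function createCharge(params: Stripe.ChargeCreateParams, idempotencyKey: string) {
  return stripe.charges.create(params, { idempotencyKey });
}

const breaker = new CircuitBreaker(createCharge, {
  timeout: 8000,                // Stripe p99 ~3s, ~2.5x headroom
  errorThresholdPercentage: 50, // open once 50% of requests in the window fail
  resetTimeout: 30_000,         // allow a half-open probe after 30s
});

// Fallback: when the circuit is open (or a call fails), queue the payment
// for async retry instead of surfacing an error to the user.
breaker.fallback(async (params: Stripe.ChargeCreateParams, idempotencyKey: string) => {
  await paymentRetryQueue.add({ params, idempotencyKey });
  return { queued: true };
});

// Derive the idempotency key from the order so retries of the same order
// can never produce a duplicate charge.
export function chargeCard(orderId: string, params: Stripe.ChargeCreateParams) {
  return breaker.fire(params, `charge:${orderId}`);
}
```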
Phase 1: DETECT
Dependencies found:
- user-service (gRPC, critical): @grpc/grpc-js in src/clients/user.client.ts
Resilience: deadline=5s, no retry, no circuit breaker
- inventory-service (gRPC, critical): no resilience configured
- Redis (cache, degraded): ioredis in src/cache/redis.ts
Resilience: reconnectOnError, no bulkhead, no fallback
Phase 2: ANALYZE
inventory-service outage:
- Product pages return 503, search results empty
- Blast radius: catalog, search, cart validation
- Risk: P0 (critical + no coverage)
Phase 3: DESIGN
inventory-service recommendations:
- Add cockatiel circuit breaker with ConsecutiveBreaker(5)
- Add retry with exponentialBackoff(1000, 2) maxAttempts=3
- Add deadline propagation from gateway timeout
- Fallback: serve cached inventory from Redis with staleness header
Redis recommendations:
- Add bulkhead: maxPoolSize=50, separate pools for cache vs sessions
- Add fallback: in-memory LRU cache (lru-cache, max 1000 items)
- Monitor: emit redis.command.duration histogram
Phase 4: VALIDATE
Coverage: 33% -> 100%
Tests verified: gRPC circuit breaker opens after 5 failures,
Redis fallback serves from LRU when Redis is down
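
Similarly, a minimal sketch of the inventory-service recommendations using cockatiel and ioredis. The import path, cache key format, and TTL are assumptions for illustration, not part of the skill's output.

```typescript
import { circuitBreaker, ConsecutiveBreaker, ExponentialBackoff, handleAll, retry, wrap } from "cockatiel";
import Redis from "ioredis";
// Hypothetical existing gRPC client wrapper for inventory-service.
import { fetchInventory } from "../clients/inventory.client";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

// Retry transient failures: exponential backoff from 1s, factor 2, max 3 attempts.
const retryPolicy = retry(handleAll, {
  maxAttempts: 3,
  backoff: new ExponentialBackoff({ initialDelay: 1000, exponent: 2 }),
});

// Open after 5 consecutive failures; allow a half-open probe after 30s.
const breakerPolicy = circuitBreaker(handleAll, {
  halfOpenAfter: 30_000,
  breaker: new ConsecutiveBreaker(5),
});

// Retry wraps the breaker, so every attempt is checked against (and counted by) the circuit.
const resilient = wrap(retryPolicy, breakerPolicy);

export async function getInventory(sku: string) {
  try {
    const fresh = await resilient.execute(() => fetchInventory(sku));
    await redis.set(`inventory:${sku}`, JSON.stringify(fresh), "EX", 300); // refresh cache on success
    return { ...fresh, stale: false };
  } catch {
    // Fallback: serve last-known inventory from Redis and mark it stale,
    // so callers can attach a staleness header to the response.
    const cached = await redis.get(`inventory:${sku}`);
    if (cached) return { ...JSON.parse(cached), stale: true };
    throw new Error("inventory-service unavailable and no cached value");
  }
}
```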
| Rationalization | Reality |
|---|---|
| "That third-party API has 99.99% uptime — we don't need a circuit breaker" | 99.99% uptime means 52 minutes of downtime per year. That downtime will not occur as one predictable window — it will happen as degraded responses and timeouts during a traffic spike. Without a circuit breaker, every caller blocks for the full timeout duration, exhausting thread pools and cascading across the system. |
| "We have retry logic, so failures are handled" | Retry logic without a circuit breaker amplifies failures. When the downstream service is degraded, retries multiply the load on an already struggling system. Circuit breakers and retries are complementary controls, not alternatives. |
| "The fallback adds complexity — we'll add it if the circuit breaker actually opens" | A circuit breaker without a fallback is a different kind of failure mode, not resilience. When the circuit opens, users see an error instead of a degraded-but-functional experience. Fallbacks must be designed and tested before the circuit ever opens in production. |
| "Our database connection pool is 100 connections — that's plenty" | Connection pool size without query timeouts means slow queries hold connections indefinitely. A single slow query spike can exhaust the pool, causing every subsequent request to wait. Pool sizing and query timeouts are both required. |
| "The service is internal — it doesn't need rate limiting" | Internal services are often called by automated processes, CI pipelines, and batch jobs that can spike traffic in ways user-facing services do not. Missing rate limiting on internal services is a common cause of self-inflicted outages during deployments and data migrations. |