Design a deployment strategy with zero downtime, rollback capability, and verification gates
Design a production deployment strategy with zero downtime, automatic rollback, and verification gates. Use this when planning releases for critical systems that need instant rollback and gradual traffic rollout.
/plugin marketplace add tachyon-beep/skillpacks
/plugin install axiom-devops-engineering@foundryside-marketplace

Arguments: [application_or_service_name]

You are designing a deployment strategy that ensures zero downtime, automatic rollback, and production safety.
"Deploy to production" is not a single step - it's a sequence of gates, health checks, gradual rollouts, and automated rollback triggers.
Before designing, determine the infrastructure (Kubernetes, containers, VMs, serverless), traffic volume, risk tolerance, and the current deployment process, then compare the strategies:
| Factor | Blue-Green | Canary | Rolling |
|---|---|---|---|
| Best for | Critical systems | High-traffic | Cost-sensitive |
| Infrastructure cost | 2x during deploy | 1.1x during deploy | 1x |
| Rollback speed | Instant | Fast | Gradual |
| Complexity | Medium | High | Low |
| Traffic control | All-or-nothing | Percentage-based | Instance-based |
| Risk exposure | Low | Very low | Medium |
## Blue-Green Deployment

How it works:
```
Old (Blue)  ← 100% traffic
New (Green) ← deployed, verified, 0% traffic
  → Health check Green
  → Switch traffic to Green (instant)
  → Keep Blue running 1 hour
  → If issues: switch back to Blue (instant rollback)
  → Terminate Blue when stable
```
Use when:
- The system is critical and needs instant rollback
- You can afford roughly 2x infrastructure during the deploy window
- An all-or-nothing traffic switch is acceptable
Implementation:

```yaml
# Kubernetes example
deploy_green:
  - "Deploy new pods with label: version=green"
  - Wait for health checks to pass
  - Run smoke tests against green (internal)
cutover:
  - Update service selector to version=green
  - Monitor for 15 minutes
rollback:
  - Update service selector to version=blue
  - Investigate issues
cleanup:
  - After 1 hour stable, delete blue pods
```
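In Kubernetes, the cutover and the rollback are both a one-field change to the Service selector. A minimal sketch, assuming a Service and pods named `myapp` (the name, port, and labels are illustrative):

```yaml
# Hypothetical Service for blue-green cutover: traffic follows the selector.
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    version: green   # flip to "blue" for an instant rollback
  ports:
    - port: 80
      targetPort: 8080
```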
## Canary Deployment

How it works:
```
Old          ← 95% traffic
New (Canary) ← 5% traffic
  → Monitor error rates, latency for 15 min
  → If healthy: shift to 25%
  → If healthy: shift to 50%
  → If healthy: shift to 100%
  → If unhealthy at any stage: immediate 100% to old
```
Use when:
- Traffic is high enough that a small percentage still produces a meaningful signal
- You have percentage-based traffic control (load balancer or service mesh) and reliable error-rate and latency metrics
- You want the lowest risk exposure and can accept the added complexity
Implementation:

```yaml
canary_stages:
  - percentage: 5
    duration: 15m
    success_criteria:
      error_rate: "< 1%"
      p99_latency: "< 500ms"
  - percentage: 25
    duration: 15m
    success_criteria:
      error_rate: "< 1%"
      p99_latency: "< 500ms"
  - percentage: 50
    duration: 15m
  - percentage: 100

rollback_triggers:
  - error_rate: "> 5%"
  - p99_latency: "> 1000ms"
  - health_check_failures: "> 2"
```
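If you run on Kubernetes, a progressive-delivery controller can drive these stages. A sketch using Argo Rollouts (assumed to be installed; the app name, image, and replica count are placeholders), mirroring the percentages above:

```yaml
# Hypothetical Argo Rollouts manifest: weights and pauses mirror canary_stages.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  replicas: 5
  selector:
    matchLabels:
      app: myapp
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: {duration: 15m}
        - setWeight: 25
        - pause: {duration: 15m}
        - setWeight: 50
        - pause: {duration: 15m}
        - setWeight: 100
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: registry.example.com/myapp:1.2.3   # placeholder image
```

Metric-based promotion and abort (the success_criteria and rollback_triggers above) would be layered on top via the controller's analysis features or external alerting.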
## Rolling Deployment

How it works:
```
Instances: [A, B, C, D, E]
  → Deploy to A, health check
  → Deploy to B, health check
  → Deploy to C, D, E sequentially
  → If any fails: stop, rollback deployed instances
```
Use when:
- Cost is the constraint: no capacity beyond the existing instances
- The service tolerates running mixed versions for the duration of the rollout
- Gradual rollback is acceptable and you want the simplest strategy to operate
Implementation:

```yaml
rolling:
  max_unavailable: 1   # One instance at a time
  max_surge: 0         # No extra instances
per_instance:
  - Drain connections (30s)
  - Deploy new version
  - Health check (retry 3x)
  - Resume traffic
on_failure:
  - Stop rollout
  - Rollback completed instances
```
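On Kubernetes this maps directly onto a Deployment's RollingUpdate strategy plus a readiness probe. A minimal sketch (the app name, port, and image are placeholders):

```yaml
# Hypothetical Deployment using the rolling parameters above.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # one instance at a time
      maxSurge: 0         # no extra instances
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      terminationGracePeriodSeconds: 30   # connection-draining window
      containers:
        - name: myapp
          image: registry.example.com/myapp:1.2.3   # placeholder image
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            failureThreshold: 3
```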
## Deployment Pipeline

```yaml
stages:
  - build
  - test
  - deploy_staging
  - verify_staging
  - deploy_production
  - verify_production
  - monitor

build:
  steps:
    - Compile/build
    - Create container image
    - Tag with $COMMIT_SHA
    - Push to registry

test:
  parallel:
    - unit_tests
    - integration_tests
    - security_scan

deploy_staging:
  environment: staging
  strategy: [chosen_strategy]

verify_staging:
  automated:
    - health_check
    - smoke_tests
    - integration_tests
  gate: block_on_failure

deploy_production:
  environment: production
  strategy: [chosen_strategy]
  requires:
    - verify_staging: passed
    - manual_approval: optional

verify_production:
  automated:
    - health_check
    - smoke_tests
    - error_rate_check
    - latency_check
  auto_rollback:
    - on: health_check_failure
    - on: error_rate > 5%
    - on: latency > 2x_baseline

monitor:
  duration: 1_hour
  dashboard: deployment_metrics
  alerts: on_anomaly
```
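How the verify_staging gate and the optional manual approval are wired up depends on the CI system. A hedged sketch in GitHub Actions (job names, scripts, and the `production` environment are assumptions): `needs:` enforces the gate, and an environment configured with required reviewers supplies the manual approval.

```yaml
# Hypothetical GitHub Actions excerpt: production deploy is gated on staging verification.
jobs:
  verify_staging:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/smoke_tests.sh https://staging.example.com   # assumed script
  deploy_production:
    needs: verify_staging        # gate: runs only if staging verification passed
    environment: production      # manual approval via required reviewers, if configured
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/deploy.sh production   # assumed script
```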
## Health Checks

```yaml
health_endpoint: /health
checks:
  - name: application_ready
    path: /health/ready
    success: status == 200
  - name: database_connected
    path: /health/db
    success: status == 200
  - name: dependencies_healthy
    path: /health/dependencies
    success: status == 200

probe_config:
  initial_delay: 10s
  interval: 5s
  timeout: 3s
  failure_threshold: 3
  success_threshold: 1
```
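On Kubernetes, probe_config translates into readiness and liveness probes on the container; a sketch assuming the app serves these endpoints on port 8080:

```yaml
# Hypothetical container probes mirroring probe_config above.
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
  successThreshold: 1
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
```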
## Rollback Triggers

```yaml
rollback_triggers:
  immediate:
    - health_check_failures >= 3
    - error_rate > 10%
  gradual:
    - error_rate > 5% for 3 minutes
    - p99_latency > 2x baseline for 5 minutes
    - pod_restarts > 3 in 5 minutes

rollback_action:
  blue_green: switch_to_previous_version
  canary: route_100%_to_stable
  rolling: redeploy_previous_version
```
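The gradual triggers are usually implemented as alerting rules that the deployment tooling (or an on-call operator) acts on. A sketch as a Prometheus Operator PrometheusRule; the metric names `http_requests_total` and `http_request_duration_seconds_bucket`, and the fixed 1s latency threshold standing in for "2x baseline", are assumptions about your instrumentation:

```yaml
# Hypothetical alerting rules for the "gradual" rollback triggers.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: deployment-rollback-triggers
spec:
  groups:
    - name: deployment
      rules:
        - alert: HighErrorRate          # error_rate > 5% for 3 minutes
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[3m]))
              / sum(rate(http_requests_total[3m])) > 0.05
          for: 3m
          labels:
            action: rollback
        - alert: HighP99Latency         # p99 latency above assumed 1s threshold for 5 minutes
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
          for: 5m
          labels:
            action: rollback
```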
## Database Migrations

```yaml
migration_approach: three_phase

phase_1:
  description: Add new schema (backward compatible)
  changes:
    - Add nullable columns
    - Create new tables
    - Add indexes with CONCURRENTLY
  rollback: drop_new_elements

phase_2:
  description: Deploy code using both schemas
  code_changes:
    - Write to both old and new
    - Read from new with fallback
  rollback: revert_code

phase_3:
  description: Remove old schema
  changes:
    - Drop old columns
    - Remove old tables
  timing: after_phase_2_stable_1_week
```
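As an illustration of phase 1, here is a sketch in a hypothetical YAML migration-file format (the table and column names are made up; the embedded statements are standard PostgreSQL DDL):

```yaml
# Hypothetical migration file: phase 1 only adds backward-compatible schema.
migration: add_customers_email_v2
up:
  - ALTER TABLE customers ADD COLUMN email_v2 text   # nullable by default, so old code is unaffected
  - CREATE INDEX CONCURRENTLY idx_customers_email_v2 ON customers (email_v2)   # avoids locking; runs outside a transaction
down:
  - DROP INDEX CONCURRENTLY IF EXISTS idx_customers_email_v2
  - ALTER TABLE customers DROP COLUMN IF EXISTS email_v2
```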
## Output Format

# Deployment Design: [Application Name]
## Context
**Infrastructure**: [K8s/Containers/VMs/Serverless]
**Traffic**: [Requests/second]
**Risk Tolerance**: [High/Medium/Low]
**Current State**: [Existing process]
## Chosen Strategy
**Strategy**: [Blue-Green/Canary/Rolling]
**Rationale**: [Why this strategy fits]
## Pipeline Design
### Stages
```yaml
[Complete pipeline YAML]
```

### Strategy Configuration
[Strategy-specific configuration]

### Health Checks
[Health check configuration]

### Rollback Triggers
[Rollback trigger configuration]

## Database Migration
**Approach**: [Three-phase/Zero-downtime/N/A]
[Migration steps if applicable]

## Rollback Plan
**Trigger**: [Conditions]
**Action**: [Steps]
**Verification**: [How to confirm rollback worked]
## Cross-Pack Discovery
```python
import glob

# For monitoring setup
quality_pack = glob.glob("plugins/ordis-quality-engineering/plugin.json")
if quality_pack:
    print("Available: ordis-quality-engineering for observability patterns")

# For API deployment
web_pack = glob.glob("plugins/axiom-web-backend/plugin.json")
if web_pack:
    print("Available: axiom-web-backend for API deployment patterns")
```
This command covers:
- Choosing between blue-green, canary, and rolling strategies
- Pipeline design with staging and production verification gates
- Health check, rollback trigger, and automatic rollback configuration
- Three-phase database migrations for zero-downtime schema changes
Not covered: