Defines CI/CD pipeline stages for build, parallel testing, staging/production deployment with verification, monitoring, rollbacks, and zero-downtime strategies to ensure safe continuous delivery.
npx claudepluginhub tachyon-beep/skillpacks --plugin axiom-devops-engineering
This skill uses the workspace's default tool permissions.
**Design CI/CD pipelines with deployment verification, rollback capabilities, and zero-downtime strategies from day one.**
Core principle: "Deploy to production" is not a single step - it's a sequence of gates, health checks, gradual rollouts, and automated rollback triggers. Skipping these "for speed" causes production incidents.
Use this skill when:
Do NOT skip this for:
Every production pipeline MUST include:
1. Build → 2. Test → 3. Deploy to Staging → 4. Verify Staging → 5. Deploy to Production → 6. Verify Production → 7. Monitor
Missing any stage = production incidents waiting to happen.
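The stages above behave as a chain of gates: a stage runs only if every earlier stage passed. A minimal Python sketch of that idea follows, assuming each stage is a callable that returns `True` on success. `run_pipeline` and the stage callables are hypothetical names for illustration, not part of any CI system.

```python
from typing import Callable, List, Tuple

# A stage is a (name, step) pair; the step returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that completed successfully.
    """
    completed: List[str] = []
    for name, step in stages:
        if not step():
            # A failed gate blocks every stage after it.
            break
        completed.append(name)
    return completed


if __name__ == "__main__":
    stages = [
        ("build", lambda: True),
        ("test", lambda: True),
        ("deploy_staging", lambda: True),
        ("verify_staging", lambda: False),  # simulated failure
        ("deploy_production", lambda: True),
    ]
    # deploy_production never runs because verify_staging failed.
    print(run_pipeline(stages))
```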
Purpose: Compile, package, create artifacts
build:
- Compile code (if applicable)
- Run linters and formatters
- Build container image
- Tag with commit SHA (NOT "latest")
- Push to registry
- Create immutable artifact
Key principle: Build once, deploy everywhere. Same artifact to staging and production.
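As an illustration of tagging with the commit SHA, the sketch below builds an image reference and rejects anything that is not a full SHA, which also rules out mutable tags such as "latest". The function name, the registry path, and the choice to require a full 40-character SHA are assumptions made for this example.

```python
import re

# Assumption: a full 40-character lowercase Git commit SHA identifies the build.
_SHA_RE = re.compile(r"[0-9a-f]{40}")


def image_reference(repository: str, commit_sha: str) -> str:
    """Return an image reference tagged with the commit SHA.

    Raises ValueError for mutable tags such as "latest".
    """
    if not _SHA_RE.fullmatch(commit_sha):
        raise ValueError(f"not a full commit SHA: {commit_sha!r}")
    return f"{repository}:{commit_sha}"


if __name__ == "__main__":
    # Hypothetical registry path and SHA, used only for illustration.
    print(image_reference("registry.example.com/app", "0123456789abcdef" * 2 + "01234567"))
```

Staging and production would then both deploy the reference this function returns, rather than each rebuilding the image.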
Test Pyramid in CI:
          /\
         /E2\        ← Few, critical paths only (5-10 tests)
        /----\
       / Intg \      ← API contracts, DB integration (50-100 tests)
      /--------\
     /   Unit   \    ← Fast, isolated, thorough (100s-1000s)
    /____________\
Optimization strategies:
Anti-pattern: "Tests are slow, let's skip some" → Optimize execution, don't remove coverage
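One way to optimize execution without removing coverage is to split the suite across parallel runners. The sketch below assigns each test to a shard by hashing its ID, so every test runs exactly once across the shards. The function names and the hashing scheme are assumptions for illustration; real test runners may offer their own sharding options.

```python
import hashlib
from typing import List


def shard_for(test_id: str, total_shards: int) -> int:
    """Map a test ID to a stable shard index in [0, total_shards)."""
    digest = hashlib.sha256(test_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % total_shards


def select_tests(test_ids: List[str], shard_index: int, total_shards: int) -> List[str]:
    """Return the tests that the runner with this shard index should execute."""
    return [t for t in test_ids if shard_for(t, total_shards) == shard_index]


if __name__ == "__main__":
    tests = [f"tests/test_module_{i}.py" for i in range(10)]
    for index in range(3):
        print(index, select_tests(tests, index, 3))
```

Because the mapping depends only on the test ID, the same test lands on the same shard on every run.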
Staging MUST match production:
Deployment process:
1. Run database migrations (with rollback tested)
2. Deploy new version alongside old (blue-green)
3. Run smoke tests
4. Cutover traffic
5. Keep old version running for quick rollback
Automated verification (not manual testing):
verify_staging:
- Health check endpoint returns 200
- Critical API endpoints respond correctly
- Database migrations applied successfully
- Background jobs processing
- External integrations functional
Failure = stop pipeline, do NOT proceed to production.
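The verification gate can be sketched as a set of named checks, any of which fails the stage. This is a minimal example using only the standard library. The check names and the `http_ok` helper are hypothetical, and a check that raises an exception is treated as failed.

```python
import urllib.request
from typing import Callable, Dict, List


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with HTTP 200.

    urlopen raises for connection errors and 4xx/5xx responses;
    verify() below converts those exceptions into failures.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status == 200


def verify(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check and return the names of the ones that failed."""
    failures: List[str] = []
    for name, check in checks.items():
        try:
            passed = check()
        except Exception:
            passed = False
        if not passed:
            failures.append(name)
    return failures
```

A pipeline step would call `verify(...)` with checks such as `lambda: http_ok("https://staging.example.com/health")` (a placeholder URL) and exit non-zero when the returned list is not empty, which stops the run before production.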
Deployment Strategies (choose one):
Blue-green:
Old (Blue) ← 100% traffic
New (Green) ← deployed, health checked, 0% traffic
→ Switch traffic to Green
→ Keep Blue running for 1 hour for rollback
→ Terminate Blue after monitoring shows Green is stable
Pros: Instant rollback, zero downtime
Cons: Double infrastructure cost during deployment
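The blue-green flow can be modelled as a router that points at one of two environments. In this sketch, cutover happens only after the idle environment passes its health check, and rollback is the same swap in reverse. `Router`, `cutover`, and `rollback` are hypothetical names, and a real setup would change a load balancer or DNS record instead of a Python object.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Router:
    live: str  # environment receiving 100% of traffic
    idle: str  # environment receiving 0% of traffic


def cutover(router: Router, is_healthy: Callable[[str], bool]) -> bool:
    """Switch traffic to the idle environment only if it is healthy."""
    if not is_healthy(router.idle):
        return False  # traffic stays on the current live environment
    router.live, router.idle = router.idle, router.live
    return True


def rollback(router: Router) -> None:
    """Send traffic back to the previous environment, which is still running."""
    router.live, router.idle = router.idle, router.live
```

Rollback is instant in this model because the old environment is never stopped during the observation window.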
Canary:
Old ← 95% traffic
New ← 5% traffic (canary)
→ Monitor error rates, latency for 15 min
→ If healthy: 50% traffic
→ If healthy: 100% traffic
→ If unhealthy: immediate rollback to 100% old
Pros: Gradual risk, early warning
Cons: More complex monitoring
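The canary progression reduces to a loop over traffic percentages with a health gate at each step. The sketch below uses the 5% → 50% → 100% steps from the text. `run_canary` and the `is_healthy` callback are hypothetical, and the actual traffic shift and 15-minute observation window are left to the caller.

```python
from typing import Callable, Sequence


def run_canary(
    is_healthy: Callable[[int], bool],
    steps: Sequence[int] = (5, 50, 100),
) -> int:
    """Walk through the traffic steps; return the final share for the new version.

    Returns the last step (100 by default) if every step stays healthy,
    or 0 if any step is unhealthy and the rollout is rolled back.
    """
    for percent in steps:
        # A real rollout would shift `percent` of traffic here and then
        # watch error rates and latency before evaluating health.
        if not is_healthy(percent):
            return 0  # immediate rollback to 100% old
    return steps[-1]
```

For example, `run_canary(lambda percent: percent < 50)` returns 0 because the health check fails once the canary reaches 50% of traffic.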
Rolling:
Instances: [A, B, C, D, E]
→ Deploy to A, health check
→ Deploy to B, health check
→ Deploy to C, D, E sequentially
If any fails → stop, roll back deployed instances
Pros: No extra infrastructure
Cons: Mixed versions during deployment
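A rolling deployment can be sketched as a loop that updates one instance at a time and undoes its work on the first failed health check. The `deploy`, `is_healthy`, and `rollback` callbacks are hypothetical placeholders for whatever the platform provides.

```python
from typing import Callable, List, Sequence


def rolling_deploy(
    instances: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Deploy to instances one at a time; roll back on the first failure."""
    updated: List[str] = []
    for instance in instances:
        deploy(instance)
        updated.append(instance)
        if not is_healthy(instance):
            # Stop immediately and restore every instance touched so far,
            # newest first, including the one that just failed.
            for done in reversed(updated):
                rollback(done)
            return False
    return True
```

Instances that were never reached keep running the old version throughout, which is the mixed-version window listed under the cons.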
Choose based on:
NEVER: Direct deployment with restart (causes downtime)
Automated post-deployment verification:
verify_production:
- HTTP 200 from health endpoint
- Response time < baseline + 20%
- Error rate < 1%
- Critical user flows functional (synthetic tests)
- Database connections healthy
- Cache hit rates normal
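The numeric thresholds above can be turned into a single rollback decision. This sketch covers only the three checks with explicit numbers (health status, response time, error rate). The `Metrics` fields, the use of p95 latency as the "response time", and treating any violated check as a rollback trigger are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    health_status: int     # HTTP status returned by the health endpoint
    p95_latency_ms: float  # assumption: p95 is the latency being compared
    error_rate: float      # fraction of failed requests, 0.0 to 1.0


def should_rollback(current: Metrics, baseline_latency_ms: float) -> bool:
    """Return True if any post-deployment check is violated."""
    return (
        current.health_status != 200
        or current.p95_latency_ms >= baseline_latency_ms * 1.20  # baseline + 20%
        or current.error_rate >= 0.01                            # 1% error rate
    )
```

With a 100 ms baseline, `should_rollback(Metrics(200, 110.0, 0.005), 100.0)` is `False`, while a latency of 125 ms or an error rate of 2% would return `True`.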
Auto-rollback triggers:
Observe for 1 hour post-deployment:
Dashboard must show:
1. Write backward-compatible migrations
- Add columns as nullable first
- Create new tables before dropping old
- Add indexes with CONCURRENTLY (Postgres)
2. Deploy application code that works with old AND new schema
3. Run migration
4. Deploy code that uses new schema exclusively
5. Clean up old schema (separate deployment)
This takes 3 deployments, not 1. That's correct.
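The reason nullable columns come first can be seen in a small in-memory SQLite example: after the column is added, code written against the old schema still inserts rows without change. The table and column names are made up for this sketch, and SQLite stands in for the production database.

```python
import sqlite3
from typing import List, Optional, Tuple


def expand_then_write() -> List[Tuple[str, Optional[str]]]:
    """Add a nullable column, then write with both old and new code paths."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute("INSERT INTO users (name) VALUES ('ada')")

    # Expand step: the new column is nullable, so existing rows stay valid.
    conn.execute("ALTER TABLE users ADD COLUMN email TEXT")

    # Old application code does not know about `email` and still works.
    conn.execute("INSERT INTO users (name) VALUES ('grace')")
    # New application code starts populating the column.
    conn.execute(
        "INSERT INTO users (name, email) VALUES ('linus', 'linus@example.com')"
    )

    rows = conn.execute("SELECT name, email FROM users ORDER BY id").fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(expand_then_write())
```

Had the column been added as `NOT NULL` without a default, the old insert would fail as soon as the migration ran, which is why the cleanup is deferred to a later deployment.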
test_migrations:
- Apply migration to test DB
- Run application tests against migrated schema
- Test rollback (down migration)
- Verify data integrity
Never skip migration rollback testing. You'll need it in production.
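A rollback test can be as small as applying the up migration, applying the down migration, and checking that both the schema and the existing data match the starting state. The sketch below does this against in-memory SQLite. The `UP` and `DOWN` statements and the table names are hypothetical, and a real project would run its migration tool's own up and down commands.

```python
import sqlite3
from typing import List

# Hypothetical migration pair used only for this example.
UP = "CREATE TABLE user_emails (user_id INTEGER NOT NULL, email TEXT NOT NULL)"
DOWN = "DROP TABLE user_emails"


def table_names(conn: sqlite3.Connection) -> List[str]:
    """List the tables currently present in the database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


def migration_round_trip_ok() -> bool:
    """Apply UP then DOWN and confirm schema and data are unchanged."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("ada",), ("grace",)])
    tables_before = table_names(conn)
    rows_before = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()

    conn.execute(UP)
    applied = "user_emails" in table_names(conn)
    conn.execute(DOWN)

    tables_after = table_names(conn)
    rows_after = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()
    conn.close()
    return applied and tables_after == tables_before and rows_after == rows_before
```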
Anti-patterns from baseline:
❌ Hardcoded in workflow:
env:
  DATABASE_URL: postgresql://user:pass@localhost/db
✅ Correct:
env:
  DATABASE_URL: ${{ secrets.DATABASE_URL }}
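On the application side, the injected secret is read from the environment at startup. This is a minimal sketch of a fail-fast lookup. `require_secret` is a hypothetical helper, and the error message deliberately names only the variable so the secret value is never logged.

```python
import os
from typing import Mapping


def require_secret(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the named secret from the environment, or fail fast.

    An empty value is treated the same as a missing one.
    """
    value = env.get(name, "")
    if not value:
        raise RuntimeError(f"missing required secret: {name}")
    return value
```

Calling `require_secret("DATABASE_URL")` during startup surfaces a misconfigured pipeline immediately, instead of at the first database query.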
Secrets checklist:
Progression:
Developer → CI Tests → Staging → Production
Gates between environments:
Before deploying to production, verify:
| Mistake | Why It's Wrong | Fix |
|---|---|---|
| "Just restart the service" | Causes downtime, no rollback | Use blue-green or canary deployment |
| "Tests are slow, skip some" | Removes safety net | Parallel execution, smart caching |
| "We'll add staging later" | Production becomes your staging | Create staging first, before production pipeline |
| "Migrations in deployment script" | Can't roll back safely | Backward-compatible migrations, 3-step deployment |
| "Manual verification after deploy" | Slow, error-prone, doesn't scale | Automated health checks and smoke tests |
| "Deploy on main merge" | No gate, broken main can deploy | Require staging verification first |
| Hardcoded database credentials | Security risk, can't rotate | Use secret manager |
| "Single server is fine for now" | Downtime during deployment | Use multiple instances from day one |
| Excuse | Reality |
|---|---|
| "This is just an MVP/demo" | MVP pipelines become production pipelines. Build it right once. |
| "Staging is expensive" | Production incidents are more expensive. Staging prevents them. |
| "Blue-green doubles our costs" | Downtime and incidents cost more than temporary double infrastructure. |
| "We'll add rollback later" | You need rollback when a deployment fails. Later = too late. |
| "Health checks are overkill" | Silent failures in production are worse than no deployment. |
| "Migrations always work" | They don't. Test rollbacks before you need them. |
| "Our app is too simple for this" | Deployment complexity isn't about code complexity. |
If you catch yourself thinking:
All of these mean: Your pipeline will cause production incidents.
Related skills:
"Deploy to production" is not one step. It's:
Skipping steps to "move fast" causes incidents. Running every step IS moving fast.