Use when setting up CI/CD pipelines, experiencing deployment failures, slow feedback loops, or production incidents after deployment - provides deployment strategies, test gates, rollback mechanisms, and environment promotion patterns to prevent downtime and enable safe continuous delivery
Provides CI/CD pipeline architecture with deployment strategies, test gates, and rollback mechanisms to prevent downtime. Use when setting up pipelines, experiencing deployment failures, or slow feedback loops.
/plugin marketplace add tachyon-beep/skillpacks
/plugin install axiom-devops-engineering@foundryside-marketplace
This skill inherits all available tools. When active, it can use any tool Claude has access to.
Design CI/CD pipelines with deployment verification, rollback capabilities, and zero-downtime strategies from day one.
Core principle: "Deploy to production" is not a single step - it's a sequence of gates, health checks, gradual rollouts, and automated rollback triggers. Skipping these "for speed" causes production incidents.
Use this skill when:
Do NOT skip this for:
Every production pipeline MUST include:
1. Build → 2. Test → 3. Deploy to Staging → 4. Verify Staging → 5. Deploy to Production → 6. Verify Production → 7. Monitor
Missing any stage = production incidents waiting to happen.
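A minimal sketch of the "every stage, in order, stop on failure" rule. The stage names mirror the seven stages above; the `steps` callables and the `run_pipeline` helper are hypothetical stand-ins for whatever your CI system actually runs.

```python
from typing import Callable, Dict, List, Tuple

# Stage names taken from the seven-stage pipeline above.
STAGES = [
    "build", "test", "deploy_staging", "verify_staging",
    "deploy_production", "verify_production", "monitor",
]

def run_pipeline(steps: Dict[str, Callable[[], bool]]) -> Tuple[bool, List[str]]:
    """Run stages in order; stop at the first failing or missing stage."""
    completed: List[str] = []
    for name in STAGES:
        step = steps.get(name)
        # A missing stage is treated the same as a failed one.
        if step is None or not step():
            return False, completed
        completed.append(name)
    return True, completed
```

Treating a missing stage as a failure is a design choice in this sketch: it makes "we'll add staging later" fail loudly instead of silently skipping a gate.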
Purpose: Compile, package, create artifacts
build:
- Compile code (if applicable)
- Run linters and formatters
- Build container image
- Tag with commit SHA (NOT "latest")
- Push to registry
- Create immutable artifact
Key principle: Build once, deploy everywhere. Same artifact to staging and production.
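One way to enforce SHA tagging is to build the image reference in a single helper that refuses anything else. This is a sketch only: `image_ref` and the registry name are hypothetical, and it assumes 40-character SHA-1 commit IDs.

```python
import re

def image_ref(registry: str, name: str, commit_sha: str) -> str:
    """Build an image reference tagged with a full commit SHA.

    Rejects mutable tags such as "latest" so staging and production
    always pull the exact artifact that was built and tested.
    """
    # Assumption: commits are identified by 40-char lowercase hex SHA-1.
    if not re.fullmatch(r"[0-9a-f]{40}", commit_sha):
        raise ValueError(f"expected a full commit SHA, got {commit_sha!r}")
    return f"{registry}/{name}:{commit_sha}"
```

Staging and production would both deploy the string returned here, which is what "build once, deploy everywhere" means in practice.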
Test Pyramid in CI:
          /\
         /E2\          ← Few, critical paths only (5-10 tests)
        /----\
       / Intg \        ← API contracts, DB integration (50-100 tests)
      /--------\
     /   Unit   \      ← Fast, isolated, thorough (100s-1000s)
    /____________\
Optimization strategies:
Anti-pattern: "Tests are slow, let's skip some" → Optimize execution, don't remove coverage
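Parallel execution is one way to speed up a suite without dropping tests. The sketch below splits test IDs across CI workers deterministically; `shard_tests` and the idea of passing a shard index per worker are assumptions, not a feature of any particular test runner.

```python
import hashlib
from typing import List

def shard_tests(test_ids: List[str], shard_index: int, shard_count: int) -> List[str]:
    """Return the tests this shard should run.

    Every test lands in exactly one shard, so coverage is unchanged.
    """
    def bucket(test_id: str) -> int:
        # A stable hash keeps the assignment identical across machines and runs.
        digest = hashlib.sha256(test_id.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % shard_count

    return [t for t in test_ids if bucket(t) == shard_index]
```

Each worker calls `shard_tests(all_tests, its_index, total_workers)`; the union of all shards is the full suite.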
Staging MUST match production:
Deployment process:
1. Run database migrations (with rollback tested)
2. Deploy new version alongside old (blue-green)
3. Run smoke tests
4. Cutover traffic
5. Keep old version running for quick rollback
Automated verification (not manual testing):
verify_staging:
- Health check endpoint returns 200
- Critical API endpoints respond correctly
- Database migrations applied successfully
- Background jobs processing
- External integrations functional
Failure = stop pipeline, do NOT proceed to production.
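The verification list above can be sketched as a set of named checks that must all pass. The check functions themselves (HTTP probes, job-queue queries, and so on) are left as hypothetical callables supplied by the caller.

```python
from typing import Callable, Dict, List

def verify_staging(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check and return the names of those that failed."""
    failures: List[str] = []
    for name, check in checks.items():
        try:
            passed = check()
        except Exception:
            passed = False  # a crashing check counts as a failure
        if not passed:
            failures.append(name)
    return failures
```

A pipeline step would exit non-zero whenever the returned list is non-empty, which stops promotion to production.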
Deployment Strategies (choose one):
Blue-green deployment:
Old (Blue) ← 100% traffic
New (Green) ← deployed, health checked, 0% traffic
→ Switch traffic to Green
→ Keep Blue running for 1 hour for rollback
→ Terminate Blue after monitoring shows Green is stable
Pros: Instant rollback, zero downtime
Cons: Double infrastructure cost during deployment
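A sketch of the blue-green cutover logic under simplifying assumptions: `Router` is a hypothetical stand-in for a load balancer or DNS switch, and `green_healthy` stands in for whatever health check you run.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Router:
    """Hypothetical traffic router; `live` is the environment receiving 100% of traffic."""
    live: str = "blue"

def blue_green_cutover(router: Router, green_healthy: Callable[[], bool]) -> str:
    """Cut over to green only if it is healthy; switch back if it degrades afterwards."""
    if not green_healthy():
        return router.live      # green never received traffic
    router.live = "green"
    if not green_healthy():
        router.live = "blue"    # instant rollback: blue is still running
    return router.live
```

Rollback is a single assignment here because Blue is kept running, which is the property that justifies the temporary double infrastructure cost.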
Canary deployment:
Old ← 95% traffic
New ← 5% traffic (canary)
→ Monitor error rates, latency for 15 min
→ If healthy: 50% traffic
→ If healthy: 100% traffic
→ If unhealthy: immediate rollback to 100% old
Pros: Gradual risk, early warning
Cons: More complex monitoring
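The canary progression above, sketched as a loop. `set_weight` and `is_healthy` are hypothetical hooks into your traffic splitter and monitoring; in a real rollout `is_healthy` would block for the observation window (15 minutes in the text) before returning.

```python
from typing import Callable, List, Sequence

def canary_rollout(
    set_weight: Callable[[int], None],
    is_healthy: Callable[[], bool],
    steps: Sequence[int] = (5, 50, 100),
) -> List[int]:
    """Shift traffic to the new version step by step; return the weights applied."""
    applied: List[int] = []
    for percent in steps:
        set_weight(percent)
        applied.append(percent)
        if not is_healthy():
            set_weight(0)       # immediate rollback: 100% of traffic back on old
            applied.append(0)
            break
    return applied
```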
Rolling deployment:
Instances: [A, B, C, D, E]
→ Deploy to A, health check
→ Deploy to B, health check
→ Deploy to C, D, E sequentially
If any fails → stop, rollback deployed instances
Pros: No extra infrastructure
Cons: Mixed versions during deployment
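A sketch of the rolling strategy with rollback of already-updated instances. The `deploy`, `health_check`, and `rollback` callables are hypothetical hooks into your orchestration tooling.

```python
from typing import Callable, List, Sequence

def rolling_deploy(
    instances: Sequence[str],
    deploy: Callable[[str], None],
    health_check: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Update one instance at a time; on the first failure, roll back everything updated."""
    updated: List[str] = []
    for instance in instances:
        deploy(instance)
        updated.append(instance)
        if not health_check(instance):
            # Undo in reverse order, including the instance that just failed.
            for done in reversed(updated):
                rollback(done)
            return False
    return True
```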
Choose based on:
NEVER: Direct deployment with restart (causes downtime)
Automated post-deployment verification:
verify_production:
- HTTP 200 from health endpoint
- Response time < baseline + 20%
- Error rate < 1%
- Critical user flows functional (synthetic tests)
- Database connections healthy
- Cache hit rates normal
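The numeric thresholds in this list can be expressed as a single predicate. This sketch covers only the first three checks; the function name and parameters are hypothetical, and the metric values would come from your monitoring system.

```python
def production_healthy(
    status_code: int,
    latency_ms: float,
    baseline_latency_ms: float,
    error_rate: float,
) -> bool:
    """Apply the thresholds above: HTTP 200, latency under baseline + 20%, errors under 1%."""
    return (
        status_code == 200
        and latency_ms < baseline_latency_ms * 1.20
        and error_rate < 0.01   # error_rate is a fraction, so 0.01 means 1%
    )
```

A `False` result is the kind of signal an automated rollback trigger would act on.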
Auto-rollback triggers:
Observe for 1 hour post-deployment:
Dashboard must show:
1. Write backward-compatible migrations
- Add columns as nullable first
- Create new tables before dropping old
- Add indexes with CONCURRENTLY (Postgres)
2. Deploy application code that works with old AND new schema
3. Run migration
4. Deploy code that uses new schema exclusively
5. Clean up old schema (separate deployment)
This takes three deployments (steps 2, 4, and 5), not one. That is by design.
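To illustrate why "add columns as nullable first" keeps old code working, here is a small sketch using an in-memory SQLite database. The `users` table and `email` column are hypothetical, and SQLite is used only so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (name) VALUES ('ada')")

# Expand step: add the new column as nullable so existing rows stay valid.
conn.execute("ALTER TABLE users ADD COLUMN email TEXT")

# Old application code, unaware of `email`, still works against the new schema.
conn.execute("INSERT INTO users (name) VALUES ('grace')")

# New application code starts writing the new column.
conn.execute("INSERT INTO users (name, email) VALUES ('linus', 'linus@example.com')")

rows = conn.execute("SELECT name, email FROM users ORDER BY id").fetchall()
```

Had the column been added as `NOT NULL` without a default, the migration or the old code's insert would have failed, which is exactly the breakage the multi-step sequence avoids.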
test_migrations:
- Apply migration to test DB
- Run application tests against migrated schema
- Test rollback (down migration)
- Verify data integrity
Never skip migration rollback testing. You'll need it in production.
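A minimal sketch of a migration round-trip test, again using in-memory SQLite for self-containment. The `UP`/`DOWN` statements, the `users` fixture, and the helper names are all hypothetical; a real project would drive its migration tool instead.

```python
import sqlite3

UP = "CREATE TABLE audit_log (id INTEGER PRIMARY KEY, message TEXT)"
DOWN = "DROP TABLE audit_log"

def make_test_db() -> sqlite3.Connection:
    """Create a throwaway database with some pre-existing data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute("INSERT INTO users (name) VALUES ('ada')")
    return conn

def table_names(conn: sqlite3.Connection) -> list:
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [r[0] for r in rows]

def migration_round_trip(conn: sqlite3.Connection, up: str, down: str) -> bool:
    """Apply up, then down; True if the schema changed and was fully restored."""
    schema_before = table_names(conn)
    data_before = conn.execute("SELECT * FROM users ORDER BY id").fetchall()
    conn.execute(up)
    applied = table_names(conn) != schema_before
    conn.execute(down)
    data_after = conn.execute("SELECT * FROM users ORDER BY id").fetchall()
    return applied and table_names(conn) == schema_before and data_after == data_before
```

Running this in CI for every migration catches a down migration that does not actually undo its up migration.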
Anti-patterns from baseline:
❌ Hardcoded in workflow:
env:
  DATABASE_URL: postgresql://user:pass@localhost/db
✅ Correct:
env:
  DATABASE_URL: ${{ secrets.DATABASE_URL }}
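On the application side, the same principle means reading the value from the environment and failing fast if it is absent. A sketch, where the `database_url` helper is hypothetical:

```python
import os
from typing import Mapping

def database_url(env: Mapping[str, str] = os.environ) -> str:
    """Read the connection string injected by the pipeline; never fall back to a literal."""
    url = env.get("DATABASE_URL")
    if not url:
        raise RuntimeError("DATABASE_URL is not set; inject it from the secret store")
    return url
```

Raising instead of defaulting to a development URL avoids a misconfigured deployment quietly connecting to the wrong database.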
Secrets checklist:
Progression:
Developer → CI Tests → Staging → Production
Gates between environments:
Before deploying to production, verify:
| Mistake | Why It's Wrong | Fix |
|---|---|---|
| "Just restart the service" | Causes downtime, no rollback | Use blue-green or canary deployment |
| "Tests are slow, skip some" | Removes safety net | Parallel execution, smart caching |
| "We'll add staging later" | Production becomes your staging | Create staging first, before production pipeline |
| "Migrations in deployment script" | Can't roll back safely | Backward-compatible migrations, 3-step deployment |
| "Manual verification after deploy" | Slow, error-prone, doesn't scale | Automated health checks and smoke tests |
| "Deploy on main merge" | No gate, broken main can deploy | Require staging verification first |
| Hardcoded database credentials | Security risk, can't rotate | Use secret manager |
| "Single server is fine for now" | Downtime during deployment | Use multiple instances from day one |
| Excuse | Reality |
|---|---|
| "This is just an MVP/demo" | MVP pipelines become production pipelines. Build it right once. |
| "Staging is expensive" | Production incidents are more expensive. Staging prevents them. |
| "Blue-green doubles our costs" | Downtime and incidents cost more than temporary double infrastructure. |
| "We'll add rollback later" | You need rollback when a deployment fails. Later = too late. |
| "Health checks are overkill" | Silent failures in production are worse than no deployment. |
| "Migrations always work" | They don't. Test rollbacks before you need them. |
| "Our app is too simple for this" | Deployment complexity isn't about code complexity. |
If you catch yourself thinking:
All of these mean: Your pipeline will cause production incidents.
Related skills:
"Deploy to production" is not one step. It's:
Skipping steps to "move fast" causes incidents. This IS moving fast.