From vfm-agent-company
DevOps and Release Engineering from Netflix (8000+ deployments/day, 260M+ subscribers). Use when setting up CI/CD pipelines (GitHub Actions), containerizing with Docker, deploying to Kubernetes, configuring monitoring/alerting (Datadog, Prometheus), implementing canary/blue-green deployments, writing operations runbooks, or managing production releases. Triggers on CI/CD, deployment, pipeline, Docker, release, monitoring, alerting, rollback, or infrastructure automation.
npx claudepluginhub duylinhdang1998/claude-template-agent --plugin vfm-agent-company
This skill uses the workspace's default tool permissions.
**Purpose**: Complete DevOps and release management procedures from Netflix
**Agent**: Netflix DevOps Engineer
**Use When**: Phase 5 (Packaging) and Phase 6 (Deployment), when you need CI/CD, infrastructure, and deployment automation
This skill module provides comprehensive DevOps procedures used at Netflix to handle 8,000+ deployments/day serving 260M+ subscribers with 99.999% uptime.
Core Philosophy:
File: references/ci-cd-pipelines.md
Covers:
When to Use: Phase 3 (Development) - setup early
Example Pipeline:
Jobs:
1. Quality Gates (lint, type-check, security scan)
2. Test (unit, integration, E2E)
3. Build (Docker image)
4. Deploy Staging (auto)
5. Performance Test (auto)
6. Deploy Production (manual approval + canary)
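The job order above can be sketched as a small gated runner. This is a minimal illustration in Python, not the contents of references/ci-cd-pipelines.md: the `Stage` class, the `run_pipeline` function, and the stage names are hypothetical, and each stage is stubbed with a function that simply succeeds.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stage:
    name: str
    run: Callable[[], bool]        # returns True when the stage succeeds
    needs_approval: bool = False   # manual gate checked before the stage runs


def run_pipeline(stages: List[Stage], approved: bool) -> Tuple[List[str], str]:
    """Run stages in order; stop at the first failure or unapproved gate.

    Returns (names of completed stages, final status).
    """
    completed: List[str] = []
    for stage in stages:
        if stage.needs_approval and not approved:
            return completed, f"waiting for approval: {stage.name}"
        if not stage.run():
            return completed, f"failed: {stage.name}"
        completed.append(stage.name)
    return completed, "succeeded"


def ok() -> bool:
    """Stub stage body (assumption: a real stage would shell out to tools)."""
    return True


# Hypothetical stage names mirroring the six jobs listed above.
PIPELINE = [
    Stage("quality-gates", ok),
    Stage("test", ok),
    Stage("build", ok),
    Stage("deploy-staging", ok),
    Stage("performance-test", ok),
    Stage("deploy-production", ok, needs_approval=True),
]
```

Calling `run_pipeline(PIPELINE, approved=False)` completes the first five stages and then stops at the production gate, which is the behavior the "manual approval" note describes.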
File: references/infrastructure-as-code.md
Covers:
When to Use:
- Phase 2 (Design) - infrastructure planning
- Phase 5 (Packaging) - provision infrastructure
Example Terraform Structure:
infrastructure/
├── modules/
│ ├── vpc/
│ ├── eks/
│ ├── rds/
│ └── redis/
├── environments/
│ ├── dev/
│ ├── staging/
│ └── production/
└── main.tf
File: references/kubernetes-deployment.md
Covers:
When to Use: Phase 6 (Deployment)
Example Manifests:
File: references/monitoring-alerting.md
Covers:
When to Use: Phase 6 (Deployment) - before production
Key Alerts:
File: references/release-management.md
Covers:
When to Use: Phase 7 (Release & Distribution)
Netflix Canary Process:
1. Deploy canary (10% traffic)
2. Monitor 5-10 min (error rate, latency)
3. If healthy: Promote to 50%
4. Monitor 5-10 min
5. If healthy: Promote to 100%
6. If issues: Instant rollback
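The "if healthy" decision in the steps above depends on error rate and latency observed during the monitoring window. A minimal Python sketch of that check follows. The `Sample` type and `canary_is_healthy` function are hypothetical; the 5% error-rate limit is borrowed from the incident template later in this document, and the 500 ms p99 limit is purely an assumed placeholder.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One scrape of canary metrics (hypothetical shape)."""
    requests: int
    errors: int
    p99_latency_ms: float


def canary_is_healthy(
    samples: List[Sample],
    max_error_rate: float = 0.05,   # 5%, as in the incident template
    max_p99_ms: float = 500.0,      # assumed latency budget, not from the source
) -> bool:
    """Return True when the whole monitoring window stays within both limits."""
    total_requests = sum(s.requests for s in samples)
    if total_requests == 0:
        # No traffic observed: the canary has not been proven healthy yet.
        return False
    error_rate = sum(s.errors for s in samples) / total_requests
    worst_p99 = max(s.p99_latency_ms for s in samples)
    return error_rate <= max_error_rate and worst_p99 <= max_p99_ms
```

Treating an empty window as unhealthy is a deliberate choice in this sketch: promoting a canary that received no traffic would skip the monitoring step entirely.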
File: references/operations-runbooks.md
Covers:
When to Use: Phase 6 (Handover) - training ops team
Runbook Sections:
# Phase 3: CI/CD Setup
Read: references/ci-cd-pipelines.md
Create: .github/workflows/ci-cd.yml
Test: Run pipeline on test branch
# Phase 2-5: Infrastructure
Read: references/infrastructure-as-code.md
Write: Terraform for AWS/GCP/Azure
Provision: dev, staging, prod environments
# Phase 5: Kubernetes Config
Read: references/kubernetes-deployment.md
Write: k8s manifests (deployment, service, HPA, ingress)
Test: Deploy to staging
# Phase 6: Monitoring
Read: references/monitoring-alerting.md
Set up: Datadog/New Relic
Create: Dashboards and alerts
# Phase 6: Deployment
Read: references/release-management.md
Deploy: Canary to production
Monitor: Post-deployment metrics
# Phase 6: Handover
Read: references/operations-runbooks.md
Create: Operations runbook
Train: Operations team
- PM: Review deployment timeline, coordinate releases
- QA: Use staging environment for testing
- Developers: Understand CI/CD pipeline, fix build failures
- BA: Use staging for UAT
**Strategy**: Canary deployment
Advantages:
✅ Gradual rollout (10% → 50% → 100%)
✅ Early issue detection (affects only 10% initially)
✅ Easy rollback (revert traffic routing)
Process:
1. Deploy new version alongside old
2. Route 10% traffic to new version
3. Monitor metrics (5-10 min)
4. If healthy: Increase to 50%
5. If healthy: Increase to 100%
6. If issues: Route 100% back to old version
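The process above amounts to a loop over traffic percentages with an early exit. The sketch below is one way to express it in Python; `canary_rollout` is a hypothetical helper, and the `is_healthy` callback stands in for the 5-10 minute monitoring window, which the caller is assumed to implement.

```python
from typing import Callable, List, Sequence, Tuple


def canary_rollout(
    is_healthy: Callable[[int], bool],
    steps: Sequence[int] = (10, 50, 100),
) -> Tuple[List[int], int]:
    """Walk the new version through the traffic steps.

    Returns (history of percentages routed to the new version, final percentage).
    A final percentage of 0 means all traffic went back to the old version.
    """
    history: List[int] = []
    for percent in steps:
        history.append(percent)          # route `percent` of traffic to the new version
        if not is_healthy(percent):      # monitoring window, supplied by the caller
            history.append(0)            # route 100% back to the old version
            return history, 0
    return history, steps[-1]
```

For example, `canary_rollout(lambda p: p < 50)` models a release that looks fine at 10% but degrades at 50%; it returns `([10, 50, 0], 0)`.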
**Strategy**: Blue-green deployment
Advantages:
✅ Instant switchover
✅ Easy rollback (switch back to blue)
✅ Zero downtime
Process:
1. Blue (v1.0) = 100% traffic
2. Deploy Green (v1.1) = 0% traffic
3. Test Green thoroughly
4. Switch 100% traffic to Green
5. Keep Blue for rollback (24h)
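The switch-and-keep-for-rollback behavior above can be modeled as swapping two slots. This is a hedged Python sketch: `BlueGreenRouter` and its methods are hypothetical names, the smoke test is a caller-supplied stand-in for "Test Green thoroughly", and the 24h retention of the old version is not modeled.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class BlueGreenRouter:
    live: str                        # version receiving 100% of traffic
    standby: Optional[str] = None    # version deployed with 0% traffic

    def deploy_standby(self, version: str) -> None:
        """Deploy a new version next to the live one without routing traffic to it."""
        self.standby = version

    def switch(self, smoke_test: Callable[[str], bool]) -> bool:
        """Send all traffic to the standby version if it passes the smoke test."""
        if self.standby is None or not smoke_test(self.standby):
            return False
        # The previous live version becomes the standby, kept for rollback.
        self.live, self.standby = self.standby, self.live
        return True

    def rollback(self) -> bool:
        """Switch traffic back to the version that was live before the last switch."""
        if self.standby is None:
            return False
        self.live, self.standby = self.standby, self.live
        return True
```

Because a switch only swaps which slot is live, rollback is the same operation in reverse, which is why this strategy lists "switch back to blue" as an advantage.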
**Strategy**: Rolling update
Advantages:
✅ Simple, built-in
✅ Gradual replacement
Process:
1. Kubernetes replaces pods one-by-one
2. Waits for new pod to be ready
3. Then replaces next pod
4. maxSurge: 1, maxUnavailable: 0 (zero downtime)
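To see why `maxSurge: 1, maxUnavailable: 0` gives zero downtime, the following Python sketch simulates the pod counts step by step. It is a simplified model, not Kubernetes' actual controller logic: `rolling_update` is a hypothetical function, and it assumes every new pod becomes ready immediately.

```python
from typing import List, Tuple


def rolling_update(
    replicas: int, max_surge: int = 1, max_unavailable: int = 0
) -> List[Tuple[int, int]]:
    """Simulate a rolling update; return (old_pods, new_pods) after each step.

    Simplification: new pods are treated as ready as soon as they are created.
    """
    if max_surge == 0 and max_unavailable == 0:
        # Kubernetes rejects this combination: no pod could ever be replaced.
        raise ValueError("max_surge and max_unavailable cannot both be 0")
    old, new = replicas, 0
    timeline = [(old, new)]
    while new < replicas or old > 0:
        # Surge: add new pods while the total stays within replicas + max_surge.
        can_add = min(replicas - new, replicas + max_surge - (old + new))
        if can_add > 0:
            new += can_add
            timeline.append((old, new))
        # Scale down old pods while the total stays >= replicas - max_unavailable.
        can_remove = min(old, (old + new) - (replicas - max_unavailable))
        if can_remove > 0:
            old -= can_remove
            timeline.append((old, new))
    return timeline
```

With three replicas and the settings above, the total pod count alternates between 4 and 3 and never drops below 3, so capacity is preserved throughout the update.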
**Platform**: AWS
Frontend:
- CloudFront CDN (global distribution)
- S3 (static assets)
Application:
- EKS (Kubernetes cluster)
- ALB (Application Load Balancer)
- Auto Scaling Groups
Data:
- RDS PostgreSQL (primary database)
- ElastiCache Redis (caching)
- S3 (file storage)
Monitoring:
- CloudWatch (logs, metrics)
- Datadog (APM)
**Platform**: GCP
Frontend:
- Cloud CDN
- Cloud Storage
Application:
- GKE (Kubernetes cluster)
- Cloud Load Balancing
Data:
- Cloud SQL PostgreSQL
- Memorystore Redis
- Cloud Storage
Monitoring:
- Cloud Monitoring
- Cloud Logging
Before production deployment:
## Build & Tests ✅
- [ ] All tests passing in CI/CD
- [ ] Code coverage ≥ 80%
- [ ] No linting errors
- [ ] Type check passed
## Security ✅
- [ ] Security scan passed (no critical)
- [ ] Dependencies updated
- [ ] Secrets in vault (not code)
- [ ] TLS configured
## Performance ✅
- [ ] Load test passed
- [ ] API latency benchmarks met
- [ ] Database queries optimized
## Deployment ✅
- [ ] Staging deployment successful
- [ ] Smoke tests passed
- [ ] Rollback tested
- [ ] Runbook updated
## Monitoring ✅
- [ ] Application metrics configured
- [ ] Alert rules active
- [ ] Dashboards created
- [ ] On-call rotation set up
**DevOps Sign-Off**: _______________
**Date**: _______________
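Several checklist items are measurable, so part of the sign-off can be automated as a release gate. The Python sketch below covers only a subset of the items; `ReleaseCandidate` and `blocking_items` are hypothetical names, and the inputs are assumed to come from CI/CD and scan reports.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReleaseCandidate:
    """Facts about a build, assumed to be collected from CI/CD reports."""
    tests_passed: bool
    coverage_percent: float
    lint_errors: int
    critical_vulnerabilities: int
    staging_deployed: bool
    smoke_tests_passed: bool
    rollback_tested: bool


def blocking_items(rc: ReleaseCandidate) -> List[str]:
    """Return the checklist items that still block a production deployment."""
    checks = [
        (rc.tests_passed, "tests failing in CI/CD"),
        (rc.coverage_percent >= 80.0, "code coverage below 80%"),
        (rc.lint_errors == 0, "linting errors present"),
        (rc.critical_vulnerabilities == 0, "critical security findings"),
        (rc.staging_deployed, "staging deployment not successful"),
        (rc.smoke_tests_passed, "smoke tests not passed"),
        (rc.rollback_tested, "rollback not tested"),
    ]
    return [message for passed, message in checks if not passed]
```

An empty list means the automated portion of the checklist is satisfied; items such as runbook updates and on-call rotation still need a human sign-off.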
- Automation: 100% - no manual deployments
- Deployment Frequency: Multiple per day (small changes)
- Lead Time: < 1 hour (commit to production)
- MTTR: < 30 min (mean time to recovery)
- Change Failure Rate: < 5%
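Lead time, MTTR, and change failure rate can be computed from deployment records. The following Python sketch shows one possible calculation; the `Deployment` record and `delivery_metrics` function are hypothetical, and MTTR is measured here from deployment time to recovery time, which is an assumption about where the clock starts.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    recovered_at: Optional[datetime] = None   # set once a failed deploy is fixed


def delivery_metrics(deployments: List[Deployment]) -> Dict[str, object]:
    """Compute mean lead time, change failure rate, and MTTR."""
    if not deployments:
        raise ValueError("at least one deployment is required")
    count = len(deployments)
    lead_time = sum(
        (d.deployed_at - d.committed_at for d in deployments), timedelta()
    ) / count
    failures = [d for d in deployments if d.failed]
    recovered = [d for d in failures if d.recovered_at is not None]
    mttr = None   # undefined until at least one failure has been recovered
    if recovered:
        mttr = sum(
            (d.recovered_at - d.deployed_at for d in recovered), timedelta()
        ) / len(recovered)
    return {
        "lead_time": lead_time,
        "change_failure_rate": len(failures) / count,
        "mttr": mttr,
    }
```

The results can then be compared against the targets listed above, for example `metrics["lead_time"] < timedelta(hours=1)`.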
Key Principles:
🚀 *PRODUCTION DEPLOYMENT STARTED*
• Version: v1.2.3
• Commit: abc123def
• Strategy: Canary (10% → 50% → 100%)
• ETA: 30 minutes
• Deployed by: Marcus Chen
• Monitoring: https://datadog.com/dashboard/abc
*Canary Status*:
[⏳] 10% deployed, monitoring...
🚨 *INCIDENT: High Error Rate*
• Service: todo-app-production
• Error rate: 12% (threshold: 5%)
• Started: 2026-02-04 14:23 UTC
• Impact: 12% of users affected
• Assigned: On-call DevOps
*Actions*:
1. Rollback to v1.2.2 (in progress)
2. Investigate error logs
3. Post-mortem scheduled
- 🚀 Automate everything - Manual processes don't scale to 8000 deploys/day
- 📊 Monitor proactively - Know about issues before customers do
- 🔄 Deploy often - Small changes = lower risk, easier to debug
- ⏪ Roll back easily - Always have an escape hatch (< 30 seconds)
- 📖 Document everything - Runbooks save hours during incidents
For detailed procedures, read the reference guides in the references/ folder.
Created: 2026-02-04
Maintained By: Netflix DevOps Engineer
Review Cycle: After each major deployment