DevOps Patterns

This skill provides comprehensive guidance for implementing DevOps practices, automation, and deployment strategies.

CI/CD Pipeline Design

Pipeline Stages

# Complete CI/CD Pipeline
stages:
  - lint          # Code quality checks
  - test          # Run test suite
  - build         # Build artifacts
  - scan          # Security scanning
  - deploy-dev    # Deploy to development
  - deploy-staging # Deploy to staging
  - deploy-prod   # Deploy to production

Pipeline Best Practices

1. Fast Feedback: Run fastest checks first

jobs:
  # Quick checks first (1-2 minutes)
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: npm ci
      - run: npm run lint

  type-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: npm ci
      - run: npm run type-check

  # Longer tests after (5-10 minutes)
  test:
    needs: [lint, type-check]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3  # Each job starts on a fresh runner
      - run: npm ci
      - run: npm test

2. Fail Fast: Stop pipeline on first failure

3. Idempotent: Running twice produces same result

4. Versioned: Pipeline config in version control
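
The fail-fast rule can be illustrated with a small sketch. The `Stage` type and `runPipeline` function are hypothetical names introduced only for this illustration; a real CI system applies the same rule through job dependencies.

```typescript
// Hypothetical stage model; the stage names mirror the pipeline stages listed earlier.
type Stage = { name: string; run: () => boolean };

// Runs stages in order and stops at the first failure (fail fast).
function runPipeline(stages: Stage[]): { executed: string[]; failedAt: string | null } {
  const executed: string[] = [];
  for (const stage of stages) {
    executed.push(stage.name);
    if (!stage.run()) {
      return { executed, failedAt: stage.name };
    }
  }
  return { executed, failedAt: null };
}

const result = runPipeline([
  { name: "lint", run: () => true },
  { name: "test", run: () => false }, // simulated failing test suite
  { name: "build", run: () => true },
  { name: "deploy-dev", run: () => true },
]);

console.log(result.executed); // lint and test ran; build and deploy-dev were skipped
console.log(result.failedAt); // test
```

Because later stages never start, a failing test suite cannot produce an artifact or reach a deployment stage.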

GitHub Actions Patterns

Basic Workflow Structure

# .github/workflows/ci.yml
name: CI

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

env:
  NODE_VERSION: '18'

jobs:
  test:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Setup Node.js
        uses: actions/setup-node@v3
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Run tests
        run: npm test

      - name: Upload coverage
        uses: codecov/codecov-action@v3
        with:
          files: ./coverage/coverage-final.json

Reusable Workflows

# .github/workflows/reusable-test.yml
name: Reusable Test Workflow

on:
  workflow_call:
    inputs:
      node-version:
        required: true
        type: string
    secrets:
      DATABASE_URL:
        required: true

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: ${{ inputs.node-version }}
      - run: npm ci
      - run: npm test
        env:
          DATABASE_URL: ${{ secrets.DATABASE_URL }}

# Use in another workflow
# .github/workflows/main.yml
jobs:
  call-test:
    uses: ./.github/workflows/reusable-test.yml
    with:
      node-version: '18'
    secrets:
      DATABASE_URL: ${{ secrets.DATABASE_URL }}

Matrix Strategy

# Test across multiple versions
jobs:
  test:
    strategy:
      matrix:
        node-version: [16, 18, 20]
        os: [ubuntu-latest, windows-latest, macos-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: ${{ matrix.node-version }}
      - run: npm ci
      - run: npm test
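
A matrix produces one job per combination of its values. The sketch below shows that expansion as a Cartesian product; `expandMatrix` is a hypothetical helper written for illustration, not part of GitHub Actions.

```typescript
// Expands a matrix definition into one entry per combination (Cartesian product).
type Matrix = Record<string, (string | number)[]>;
type Combination = Record<string, string | number>;

function expandMatrix(matrix: Matrix): Combination[] {
  let combos: Combination[] = [{}];
  for (const key of Object.keys(matrix)) {
    const next: Combination[] = [];
    for (const combo of combos) {
      for (const value of matrix[key]) {
        next.push({ ...combo, [key]: value });
      }
    }
    combos = next;
  }
  return combos;
}

const jobs = expandMatrix({
  "node-version": [16, 18, 20],
  os: ["ubuntu-latest", "windows-latest", "macos-latest"],
});

console.log(jobs.length); // 9
console.log(jobs[0]);     // first combination: Node 16 on ubuntu-latest
```

Three Node versions on three operating systems already yield nine jobs, so each added dimension multiplies runner time.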

Custom Actions

# .github/actions/deploy/action.yml
name: 'Deploy Application'
description: 'Deploy to specified environment'
inputs:
  environment:
    description: 'Target environment'
    required: true
  api-key:
    description: 'Deployment API key'
    required: true

runs:
  using: 'composite'
  steps:
    - run: |
        echo "Deploying to ${ENVIRONMENT}"
        ./deploy.sh "${ENVIRONMENT}"
      env:
        # Pass inputs through env rather than interpolating them into the script
        ENVIRONMENT: ${{ inputs.environment }}
        API_KEY: ${{ inputs.api-key }}
      shell: bash

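# Usage (a local action requires the repository to be checked out first)
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: ./.github/actions/deploy
        with:
          environment: production
          api-key: ${{ secrets.DEPLOY_KEY }}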

Conditional Execution

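jobs:
  deploy:
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Deploy to production
        run: ./deploy.sh production

  notify:
    # failure() is true only when a job listed in `needs` has failed
    needs: [deploy]
    if: failure()
    runs-on: ubuntu-latest
    steps:
      - name: Send failure notification
        # Placeholder action name; substitute the notification action you actually use
        uses: slack/notify@v2
        with:
          message: 'Build failed!'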

Infrastructure as Code (Terraform)

Project Structure

terraform/
├── modules/
│   ├── vpc/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   └── eks/
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   └── terraform.tfvars
│   ├── staging/
│   │   ├── main.tf
│   │   └── terraform.tfvars
│   └── prod/
│       ├── main.tf
│       └── terraform.tfvars
└── global/
    └── s3/
        └── main.tf

VPC Module Example

# modules/vpc/main.tf
resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name        = "${var.environment}-vpc"
    Environment = var.environment
  }
}

resource "aws_subnet" "public" {
  count             = length(var.public_subnet_cidrs)
  vpc_id            = aws_vpc.main.id
  cidr_block        = var.public_subnet_cidrs[count.index]
  availability_zone = var.availability_zones[count.index]

  tags = {
    Name = "${var.environment}-public-${count.index + 1}"
  }
}

# modules/vpc/variables.tf
variable "environment" {
  description = "Environment name"
  type        = string
}

variable "vpc_cidr" {
  description = "CIDR block for VPC"
  type        = string
}

variable "public_subnet_cidrs" {
  description = "CIDR blocks for public subnets"
  type        = list(string)
}

variable "availability_zones" {
  description = "Availability zones"
  type        = list(string)
}

# modules/vpc/outputs.tf
output "vpc_id" {
  value = aws_vpc.main.id
}

output "public_subnet_ids" {
  value = aws_subnet.public[*].id
}

Using Modules

# environments/prod/main.tf
terraform {
  required_version = ">= 1.0"

  backend "s3" {
    bucket = "my-terraform-state"
    key    = "prod/terraform.tfstate"
    region = "us-east-1"
  }
}

provider "aws" {
  region = "us-east-1"
}

module "vpc" {
  source = "../../modules/vpc"

  environment         = "prod"
  vpc_cidr            = "10.0.0.0/16"
  public_subnet_cidrs = ["10.0.1.0/24", "10.0.2.0/24"]
  availability_zones  = ["us-east-1a", "us-east-1b"]
}

module "eks" {
  source = "../../modules/eks"

  cluster_name       = "prod-cluster"
  vpc_id             = module.vpc.vpc_id
  subnet_ids         = module.vpc.public_subnet_ids
  node_count         = 3
  node_instance_type = "t3.large"
}
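
Subnet CIDR blocks passed to the VPC module must fall inside the VPC CIDR, or the subnet resources will be rejected at apply time. The following sketch checks that containment before running Terraform; `parseCidr` and `cidrContains` are hypothetical helpers for illustration and cover IPv4 only.

```typescript
// Parses "a.b.c.d/n" into an unsigned 32-bit network base and a prefix length.
function parseCidr(cidr: string): { base: number; prefix: number } {
  const parts = cidr.split("/");
  const octets = parts[0].split(".").map(Number);
  const prefix = Number(parts[1]);
  const validOctets = octets.length === 4 && !octets.some((o) => !(o >= 0 && o <= 255));
  if (!validOctets || !(prefix >= 0 && prefix <= 32)) {
    throw new Error(`Invalid CIDR: ${cidr}`);
  }
  // ">>> 0" keeps the value as an unsigned 32-bit integer.
  const ip = ((octets[0] << 24) | (octets[1] << 16) | (octets[2] << 8) | octets[3]) >>> 0;
  return { base: (ip & maskFor(prefix)) >>> 0, prefix };
}

// Builds the netmask for a prefix length; a shift by 32 is special-cased.
function maskFor(prefix: number): number {
  return prefix === 0 ? 0 : (0xffffffff << (32 - prefix)) >>> 0;
}

// True when every address in `inner` also belongs to `outer`.
function cidrContains(outer: string, inner: string): boolean {
  const o = parseCidr(outer);
  const i = parseCidr(inner);
  if (i.prefix < o.prefix) return false;
  return ((i.base & maskFor(o.prefix)) >>> 0) === o.base;
}

const vpcCidr = "10.0.0.0/16";
const subnetCidrs = ["10.0.1.0/24", "10.0.2.0/24"];
const outside = subnetCidrs.filter((s) => !cidrContains(vpcCidr, s));
console.log(outside.length === 0 ? "all subnets fit inside the VPC" : outside);
```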

Docker Best Practices

Multi-Stage Builds

# Build stage
FROM node:18-alpine AS builder

WORKDIR /app

# Copy package files
COPY package*.json ./

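# Install all dependencies (the build step needs devDependencies)
RUN npm ci

# Copy source code
COPY . .

# Build application, then drop devDependencies before node_modules is copied out
RUN npm run build && npm prune --omit=dev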

# Production stage
FROM node:18-alpine AS production

WORKDIR /app

# Copy only necessary files from builder
COPY --from=builder /app/package*.json ./
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/dist ./dist

# Create non-root user
RUN addgroup -g 1001 -S nodejs && \
    adduser -S nodejs -u 1001

USER nodejs

EXPOSE 3000

CMD ["node", "dist/index.js"]

Layer Optimization

# ✅ GOOD - Dependencies cached separately
FROM node:18-alpine

WORKDIR /app

# Copy package files first (rarely change)
COPY package*.json ./
RUN npm ci

# Copy source code (changes frequently)
COPY . .
RUN npm run build

# ❌ BAD - Everything in one layer
FROM node:18-alpine
WORKDIR /app
COPY . .
RUN npm ci && npm run build
# Cache invalidated on every source change
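
The effect of instruction order on caching can be modelled in a few lines. This is a deliberately simplified sketch: real builders compare content checksums, while the hypothetical `rebuiltLayers` function below only tracks which named inputs each instruction depends on.

```typescript
// Simplified cache model: a layer is rebuilt if one of its inputs changed
// or if any earlier layer was rebuilt.
type Layer = { instruction: string; inputs: string[] };

function rebuiltLayers(layers: Layer[], changed: string[]): string[] {
  const rebuilt: string[] = [];
  let invalidated = false;
  for (const layer of layers) {
    if (!invalidated && layer.inputs.some((input) => changed.indexOf(input) !== -1)) {
      invalidated = true;
    }
    if (invalidated) rebuilt.push(layer.instruction);
  }
  return rebuilt;
}

const goodOrder: Layer[] = [
  { instruction: "COPY package*.json ./", inputs: ["package.json"] },
  { instruction: "RUN npm ci", inputs: [] },
  { instruction: "COPY . .", inputs: ["package.json", "src"] },
  { instruction: "RUN npm run build", inputs: [] },
];

const badOrder: Layer[] = [
  { instruction: "COPY . .", inputs: ["package.json", "src"] },
  { instruction: "RUN npm ci && npm run build", inputs: [] },
];

// Only source files changed:
console.log(rebuiltLayers(goodOrder, ["src"])); // "COPY . ." and "RUN npm run build"
console.log(rebuiltLayers(badOrder, ["src"]));  // both layers, so npm ci runs again
```

With the good ordering, a source-only change leaves the `npm ci` layer cached, which is usually the slowest step.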

Security Best Practices

# ✅ Use specific versions
FROM node:18.17.1-alpine

# ✅ Run as non-root user
RUN addgroup -g 1001 nodejs && \
    adduser -S nodejs -u 1001
USER nodejs

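# ✅ Use .dockerignore (a separate file next to the Dockerfile)
# .dockerignore:
#   node_modules
#   .git
#   .env
#   *.md
#   .github

# ✅ Scan for vulnerabilities
# docker scout cves myapp:latest  (replaces the deprecated `docker scan`)

# ✅ Use minimal base images
# Prefer node:18-alpine over the full node:18 image
FROM node:18-alpine

# ✅ Don't include secrets
# ARG and ENV values are baked into the image and visible in `docker history`.
# Supply secrets at runtime (for example `docker run -e API_KEY myapp:latest`),
# or use a BuildKit secret mount for build-time secrets:
# RUN --mount=type=secret,id=api_key <command that reads /run/secrets/api_key>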

Docker Compose for Development

# docker-compose.yml (the top-level `version` key is obsolete in Compose V2 and can be omitted)

services:
  app:
    build:
      context: .
      dockerfile: Dockerfile.dev
    ports:
      - '3000:3000'
    volumes:
      - .:/app
      - /app/node_modules
    environment:
      - NODE_ENV=development
      - DATABASE_URL=postgresql://user:pass@db:5432/mydb
    depends_on:
      - db
      - redis

  db:
    image: postgres:15-alpine
    ports:
      - '5432:5432'
    environment:
      - POSTGRES_USER=user
      - POSTGRES_PASSWORD=pass
      - POSTGRES_DB=mydb
    volumes:
      - postgres_data:/var/lib/postgresql/data

  redis:
    image: redis:7-alpine
    ports:
      - '6379:6379'

volumes:
  postgres_data:

Kubernetes Patterns

Deployment

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: myapp:1.0.0
        ports:
        - containerPort: 3000
        env:
        - name: NODE_ENV
          value: production
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: myapp-secrets
              key: database-url
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /health
            port: 3000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 3000
          initialDelaySeconds: 5
          periodSeconds: 5

Service

# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp-service
spec:
  selector:
    app: myapp
  ports:
  - protocol: TCP
    port: 80
    targetPort: 3000
  type: LoadBalancer

ConfigMap and Secrets

# configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: myapp-config
data:
  LOG_LEVEL: info
  MAX_CONNECTIONS: "100"

# secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: myapp-secrets
type: Opaque
data:
  # Values are base64-encoded, not encrypted; keep real secrets out of version control
  database-url: cG9zdGdyZXNxbDovL3VzZXI6cGFzc0BkYjo1NDMyL215ZGI=
  api-key: c2tfbGl2ZV9hYmMxMjN4eXo=
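
Values under `data` in a Secret are base64-encoded, which is an encoding and not encryption. The sketch below shows the round trip; `encodeSecretData` is a hypothetical helper, and it assumes ASCII values because `btoa` and `atob` operate on Latin-1 strings.

```typescript
// btoa/atob are globals in browsers and in recent Node.js versions.
function encodeSecretData(values: Record<string, string>): Record<string, string> {
  const encoded: Record<string, string> = {};
  for (const key of Object.keys(values)) {
    encoded[key] = btoa(values[key]);
  }
  return encoded;
}

const secretData = encodeSecretData({ "api-key": "sk_live_abc123xyz" });

console.log(secretData["api-key"]);       // c2tfbGl2ZV9hYmMxMjN4eXo=
console.log(atob(secretData["api-key"])); // sk_live_abc123xyz
```

Anyone who can read the manifest can decode the value, so access to Secret objects and to the files that define them should be restricted accordingly.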

Ingress

# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-ingress
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  # ingressClassName replaces the deprecated kubernetes.io/ingress.class annotation
  ingressClassName: nginx
  tls:
  - hosts:
    - myapp.example.com
    secretName: myapp-tls
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp-service
            port:
              number: 80

Deployment Strategies

Blue-Green Deployment

# Blue deployment (current production)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: blue
  template:
    metadata:
      labels:
        app: myapp
        version: blue
    spec:
      containers:
      - name: myapp
        image: myapp:1.0.0

---
# Green deployment (new version)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: green
  template:
    metadata:
      labels:
        app: myapp
        version: green
    spec:
      containers:
      - name: myapp
        image: myapp:2.0.0

---
# Service (switch by changing selector)
apiVersion: v1
kind: Service
metadata:
  name: myapp-service
spec:
  selector:
    app: myapp
    version: blue  # Change to 'green' to switch
  ports:
  - port: 80
    targetPort: 3000
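
The switch works because a Service sends traffic only to pods whose labels contain every key/value pair in its selector. A minimal sketch of that matching rule follows; the pod names and the `matches` and `endpoints` helpers are hypothetical.

```typescript
type Labels = Record<string, string>;
type Pod = { name: string; labels: Labels };

// A pod matches when it carries every key/value pair in the selector.
function matches(selector: Labels, pod: Pod): boolean {
  return Object.keys(selector).every((key) => pod.labels[key] === selector[key]);
}

const pods: Pod[] = [
  { name: "myapp-blue-1", labels: { app: "myapp", version: "blue" } },
  { name: "myapp-blue-2", labels: { app: "myapp", version: "blue" } },
  { name: "myapp-green-1", labels: { app: "myapp", version: "green" } },
];

function endpoints(selector: Labels): string[] {
  return pods.filter((pod) => matches(selector, pod)).map((pod) => pod.name);
}

console.log(endpoints({ app: "myapp", version: "blue" }));  // blue pods only
console.log(endpoints({ app: "myapp", version: "green" })); // after the switch: green pods only
```

Rolling back is the same operation in reverse: set the selector back to `blue` while the blue Deployment is still running.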

Canary Deployment

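# Stable deployment (~90% of traffic, by replica count)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-stable
spec:
  replicas: 9
  selector:
    matchLabels:
      app: myapp
      track: stable
  template:
    metadata:
      labels:
        app: myapp
        track: stable
    spec:
      containers:
      - name: myapp
        image: myapp:1.0.0

---
# Canary deployment (~10% of traffic, by replica count)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-canary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapp
      track: canary
  template:
    metadata:
      labels:
        app: myapp
        track: canary
    spec:
      containers:
      - name: myapp
        image: myapp:2.0.0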

---
# Service routes to both
apiVersion: v1
kind: Service
metadata:
  name: myapp-service
spec:
  selector:
    app: myapp  # Matches both stable and canary
  ports:
  - port: 80
    targetPort: 3000
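
With this replica-based approach, the canary's share of traffic follows the ratio of ready pods. The sketch below computes that ratio, assuming the Service spreads requests roughly evenly across all matched pods; long-lived connections can skew the real split. `trafficShare` is a hypothetical helper.

```typescript
// Approximate share of requests per track, assuming an even spread across pods.
function trafficShare(replicas: Record<string, number>): Record<string, number> {
  let total = 0;
  for (const track of Object.keys(replicas)) {
    total += replicas[track];
  }
  const share: Record<string, number> = {};
  for (const track of Object.keys(replicas)) {
    share[track] = total === 0 ? 0 : replicas[track] / total;
  }
  return share;
}

const share = trafficShare({ stable: 9, canary: 1 });
console.log(share.stable, share.canary); // 0.9 0.1
```

The granularity is limited by pod count: a 1% canary would need 99 stable replicas, which is where weighted routing at the ingress or service-mesh layer is typically used instead.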

Rolling Update

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2        # Max 2 extra pods during update
      maxUnavailable: 1  # Max 1 pod unavailable during update
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp  # Must match spec.selector.matchLabels
    spec:
      containers:
      - name: myapp
        image: myapp:2.0.0
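
The two settings bound how many pods exist and how many stay available during the rollout. The sketch below works through the arithmetic, including percentage values, where Kubernetes rounds `maxSurge` up and `maxUnavailable` down. `resolveValue` and `rollingBounds` are hypothetical helper names.

```typescript
// Resolves an absolute number or a percentage string such as "25%".
function resolveValue(value: number | string, replicas: number, roundUp: boolean): number {
  if (typeof value === "number") return value;
  const raw = (parseFloat(value) / 100) * replicas; // parseFloat("25%") is 25
  return roundUp ? Math.ceil(raw) : Math.floor(raw);
}

function rollingBounds(
  replicas: number,
  maxSurge: number | string,
  maxUnavailable: number | string
): { maxPods: number; minAvailable: number } {
  const surge = resolveValue(maxSurge, replicas, true);              // percentages round up
  const unavailable = resolveValue(maxUnavailable, replicas, false); // percentages round down
  return { maxPods: replicas + surge, minAvailable: replicas - unavailable };
}

console.log(rollingBounds(10, 2, 1));         // { maxPods: 12, minAvailable: 9 }
console.log(rollingBounds(10, "25%", "25%")); // { maxPods: 13, minAvailable: 8 }
```

For the manifest above, the cluster needs headroom for 12 pods at peak, and at least 9 pods keep serving traffic throughout the update.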

Database Migration Strategies

Forward-Only Migrations

// ✅ GOOD - Backwards compatible
// Step 1: Add new column (nullable)
await db.schema.alterTable('users', (table) => {
  table.string('phone_number').nullable();
});

// Step 2: Populate data
await db('users').update({
  phone_number: db.raw('contact_info'),
});

// Step 3: Make non-nullable (separate deployment)
await db.schema.alterTable('users', (table) => {
  table.string('phone_number').notNullable().alter();
});

// Step 4: Drop old column (separate deployment)
await db.schema.alterTable('users', (table) => {
  table.dropColumn('contact_info');
});

Zero-Downtime Migrations

// Rename column without downtime

// Migration 1: Add new column
await db.schema.alterTable('users', (table) => {
  table.string('email_address').nullable();
});

// Update application code to write to both columns
class User {
  async save() {
    await db('users')
      .where({ id: this.id }) // Scope the update to this user's row
      .update({
        email: this.email,
        email_address: this.email, // Write to both
      });
  }
}

// Migration 2: Backfill data
await db.raw(`
  UPDATE users
  SET email_address = email
  WHERE email_address IS NULL
`);

// Migration 3: Update app to read from new column
class User {
  get email() {
    return this.email_address; // Read from new column
  }
}

// Migration 4: Drop old column
await db.schema.alterTable('users', (table) => {
  table.dropColumn('email');
});
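
The ordering of these steps can be checked with an in-memory stand-in for the `users` table. This is a sketch of the sequence only, with no database involved; the `Row` type and `saveEmail` function are hypothetical.

```typescript
// In-memory stand-in for the users table; column names follow the example above.
type Row = { id: number; email?: string; email_address?: string | null };

const users: Row[] = [
  { id: 1, email: "a@example.com" },
  { id: 2, email: "b@example.com" },
];

// Migration 1: add the new nullable column.
for (const row of users) row.email_address = null;

// Application change: write both columns, scoped to a single row.
function saveEmail(id: number, email: string): void {
  for (const row of users) {
    if (row.id === id) {
      row.email = email;
      row.email_address = email;
    }
  }
}
saveEmail(1, "a2@example.com");

// Migration 2: backfill rows the application has not written yet.
for (const row of users) {
  if (row.email_address == null) row.email_address = row.email;
}

// Safety check before switching reads and dropping the old column.
const mismatched = users.filter((row) => row.email_address !== row.email);
console.log(mismatched.length); // 0

// Migration 4: drop the old column once nothing reads it.
for (const row of users) delete row.email;
console.log(users.map((row) => row.email_address));
```

Running the mismatch check against the real table before the final migration gives a cheap confirmation that dual writes and the backfill covered every row.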

Environment Management

Environment Configuration

// config/environments.ts
interface EnvironmentConfig {
  database: {
    host: string;
    port: number;
    name: string;
  };
  api: {
    baseUrl: string;
    timeout: number;
  };
  features: {
    enableNewFeature: boolean;
  };
}

const environments: Record<string, EnvironmentConfig> = {
  development: {
    database: {
      host: 'localhost',
      port: 5432,
      name: 'myapp_dev',
    },
    api: {
      baseUrl: 'http://localhost:3000',
      timeout: 30000,
    },
    features: {
      enableNewFeature: true,
    },
  },
  staging: {
    database: {
      host: 'staging-db.example.com',
      port: 5432,
      name: 'myapp_staging',
    },
    api: {
      baseUrl: 'https://staging-api.example.com',
      timeout: 10000,
    },
    features: {
      enableNewFeature: true,
    },
  },
  production: {
    database: {
      host: process.env.DB_HOST!,
      port: parseInt(process.env.DB_PORT!, 10),
      name: 'myapp_prod',
    },
    api: {
      baseUrl: 'https://api.example.com',
      timeout: 5000,
    },
    features: {
      enableNewFeature: false,
    },
  },
};

export const config = environments[process.env.NODE_ENV || 'development'];
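
The lookup above returns `undefined` for an unrecognised `NODE_ENV`, and the non-null assertions hide missing variables until first use. One possible guard is to validate at startup, sketched below. All function names are hypothetical, and a plain object stands in for `process.env` so the example has no Node-specific dependencies.

```typescript
type Env = Record<string, string | undefined>;

// Reads a required variable or throws with the variable name in the message.
function requireEnv(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Reads a required variable and checks that it is a valid TCP port.
function requirePort(env: Env, name: string): number {
  const port = Number(requireEnv(env, name));
  if (!(port >= 1 && port <= 65535) || Math.floor(port) !== port) {
    throw new Error(`Invalid port in ${name}`);
  }
  return port;
}

const knownEnvironments = ["development", "staging", "production"];

// Falls back to development when NODE_ENV is unset, rejects unknown names.
function resolveEnvironment(env: Env): string {
  const name = env["NODE_ENV"] || "development";
  if (knownEnvironments.indexOf(name) === -1) {
    throw new Error(`Unknown NODE_ENV: ${name}`);
  }
  return name;
}

// "db.internal" is a made-up host used only for this example.
const sampleEnv: Env = { NODE_ENV: "production", DB_HOST: "db.internal", DB_PORT: "5432" };
console.log(resolveEnvironment(sampleEnv), requireEnv(sampleEnv, "DB_HOST"), requirePort(sampleEnv, "DB_PORT"));
```

Failing at startup with the variable name in the message is usually easier to diagnose than a connection error caused by a `NaN` port.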

Monitoring

Prometheus Metrics

import express from 'express';
import prometheus from 'prom-client';

const app = express();

// Create metrics
const httpRequestDuration = new prometheus.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status'],
});

const httpRequestTotal = new prometheus.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status'],
});

// Middleware to track metrics
app.use((req, res, next) => {
  const start = Date.now();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;

    httpRequestDuration
      .labels(req.method, req.route?.path || req.path, res.statusCode.toString())
      .observe(duration);

    httpRequestTotal
      .labels(req.method, req.route?.path || req.path, res.statusCode.toString())
      .inc();
  });

  next();
});

// Expose metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', prometheus.register.contentType);
  res.end(await prometheus.register.metrics());
});

Grafana Dashboard

{
  "dashboard": {
    "title": "Application Metrics",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])"
          }
        ]
      },
      {
        "title": "Response Time (p95)",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
          }
        ]
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total{status=~\"5..\"}[5m])"
          }
        ]
      }
    ]
  }
}
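
The p95 panel relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below reproduces that estimate for a single set of buckets. The bucket boundaries and counts are made-up sample data, and `estimateQuantile` is a hypothetical function, not the Prometheus implementation.

```typescript
// Cumulative bucket: number of observations with a value <= le.
type Bucket = { le: number; count: number };

// Buckets must be sorted by `le` and end with an +Infinity bucket.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  for (let i = 0; i < buckets.length; i++) {
    if (buckets[i].count >= rank) {
      if (buckets[i].le === Infinity) {
        // Rank falls in the overflow bucket: report the highest finite bound.
        return i > 0 ? buckets[i - 1].le : NaN;
      }
      const lowerBound = i === 0 ? 0 : buckets[i - 1].le;
      const below = i === 0 ? 0 : buckets[i - 1].count;
      const inBucket = buckets[i].count - below;
      // Linear interpolation between the bucket's lower and upper bound.
      return lowerBound + (buckets[i].le - lowerBound) * ((rank - below) / inBucket);
    }
  }
  return NaN;
}

const sampleBuckets: Bucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 100 },
  { le: Infinity, count: 100 },
];

console.log(estimateQuantile(0.95, sampleBuckets)); // 0.75
```

The estimate can never be more precise than the bucket boundaries, so buckets should be chosen around the latency thresholds that matter for alerting.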

Log Aggregation

// Winston logger with JSON format
import winston from 'winston';

const logger = winston.createLogger({
  level: 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  defaultMeta: {
    service: 'myapp',
    environment: process.env.NODE_ENV,
  },
  transports: [
    new winston.transports.File({ filename: 'error.log', level: 'error' }),
    new winston.transports.File({ filename: 'combined.log' }),
  ],
});

// Structured logging
logger.info('User logged in', {
  userId: user.id,
  email: user.email,
  ip: req.ip,
});

Disaster Recovery

Backup Strategy

#!/bin/bash
# backup-database.sh
set -euo pipefail  # Abort on errors, unset variables, and failed pipeline stages

# Configuration (DB_HOST and DB_NAME must be set in the environment)
DB_HOST="${DB_HOST:?DB_HOST is required}"
DB_NAME="${DB_NAME:?DB_NAME is required}"
BACKUP_DIR="/backups"
S3_BUCKET="s3://my-backups"
RETENTION_DAYS=30

# Create backup
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.sql.gz"

# Dump database
pg_dump -h "${DB_HOST}" -U postgres "${DB_NAME}" | gzip > "${BACKUP_FILE}"

# Upload to S3
aws s3 cp "${BACKUP_FILE}" "${S3_BUCKET}/"

# Remove local backup
rm "${BACKUP_FILE}"

# Delete old backups from S3 (date -d requires GNU date)
CURRENT_EPOCH=$(date +%s)
aws s3 ls "${S3_BUCKET}/" | while read -r FILE_DATE _ _ FILE_NAME; do
  # Skip prefix ("PRE") rows and any row without a file name
  [ -n "${FILE_NAME}" ] || continue

  FILE_EPOCH=$(date -d "${FILE_DATE}" +%s)
  DAYS_OLD=$(( (CURRENT_EPOCH - FILE_EPOCH) / 86400 ))

  if [ "${DAYS_OLD}" -gt "${RETENTION_DAYS}" ]; then
    aws s3 rm "${S3_BUCKET}/${FILE_NAME}"
  fi
done
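
The retention rule in the script can be isolated and tested without touching S3. The sketch below applies the same whole-day age calculation to a list of backups; the file names, dates, and the `expiredBackups` function are made up for illustration.

```typescript
type Backup = { name: string; createdAt: Date };

const MS_PER_DAY = 86400 * 1000;

// Returns the backups older than the retention window (deletion candidates).
function expiredBackups(backups: Backup[], now: Date, retentionDays: number): Backup[] {
  return backups.filter((backup) => {
    const daysOld = Math.floor((now.getTime() - backup.createdAt.getTime()) / MS_PER_DAY);
    return daysOld > retentionDays;
  });
}

const now = new Date("2024-03-01T00:00:00Z");
const backups: Backup[] = [
  { name: "myapp_20240101_020000.sql.gz", createdAt: new Date("2024-01-01T02:00:00Z") },
  { name: "myapp_20240215_020000.sql.gz", createdAt: new Date("2024-02-15T02:00:00Z") },
];

// Only the January backup is older than 30 days.
console.log(expiredBackups(backups, now, 30).map((backup) => backup.name));
```

Keeping the selection logic separate from the delete call makes it possible to log or dry-run the candidates before anything is removed.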

Recovery Plan

## Disaster Recovery Plan

### RTO (Recovery Time Objective): 4 hours
### RPO (Recovery Point Objective): 1 hour

### Recovery Steps:

1. **Assess the situation**
   - Identify scope of failure
   - Notify stakeholders

2. **Restore database**
   ```bash
   # Download latest backup
   aws s3 cp s3://my-backups/latest.sql.gz /tmp/

   # Restore database
   gunzip -c /tmp/latest.sql.gz | psql -h new-db -U postgres myapp
   ```

3. **Deploy application**
   ```bash
   # Deploy to new infrastructure
   kubectl apply -f k8s/production/

   # Update DNS
   aws route53 change-resource-record-sets ...
   ```

4. **Verify recovery**
   - Run smoke tests
   - Check monitoring dashboards
   - Verify critical features

5. **Post-mortem**
   - Document incident
   - Identify root cause
   - Create action items


## When to Use This Skill

Use this skill when:
- Setting up CI/CD pipelines
- Deploying applications
- Managing infrastructure
- Implementing deployment strategies
- Configuring monitoring
- Planning disaster recovery
- Containerizing applications
- Orchestrating with Kubernetes
- Automating workflows
- Scaling infrastructure

---

**Remember**: DevOps is about automation, reliability, and continuous improvement. Invest in your infrastructure and deployment processes to enable faster, safer releases.
