Self-Service Infrastructure

Patterns for enabling developers to provision infrastructure without tickets, while maintaining governance and control.

When to Use This Skill

Designing infrastructure self-service capabilities
Creating reusable Terraform/Pulumi modules
Building environment provisioning systems
Implementing infrastructure guardrails
Reducing infrastructure request bottlenecks
Balancing developer autonomy with governance

Self-Service Fundamentals

What is Self-Service Infrastructure?

Self-Service Infrastructure:
Enabling developers to provision and manage infrastructure
directly, without filing tickets or waiting for ops teams.

Traditional Model:
┌─────────────────────────────────────────────────────────────┐
│ Developer → Ticket → Ops Review → Manual Provision → Done  │
│                                                              │
│ Timeline: Days to weeks                                      │
│ Bottleneck: Ops team capacity                               │
│ Result: Shadow IT, workarounds, frustration                 │
└─────────────────────────────────────────────────────────────┘

Self-Service Model:
┌─────────────────────────────────────────────────────────────┐
│ Developer → Portal/API → Automatic Provision → Done         │
│                                                              │
│ Timeline: Minutes to hours                                  │
│ Bottleneck: Approvals only where policy requires            │
│ Result: Speed, consistency, compliance                      │
└─────────────────────────────────────────────────────────────┘

Self-Service Spectrum:
├── Fully Managed: Click a button, get a database
├── Template-Based: Customize from approved templates
├── Policy-Constrained: Write IaC within guardrails
└── Full Freedom: Any infrastructure (risky)

Sweet Spot: Template-Based with Policy Guardrails

Key Benefits

Self-Service Benefits:

For Developers:
├── Speed: Minutes instead of days
├── Autonomy: Provision when needed
├── Consistency: Same infrastructure every time
├── Learning: Understand infrastructure better
└── Ownership: More responsibility, more control

For Operations:
├── Scale: Handle more requests without more people
├── Consistency: Enforce standards automatically
├── Focus: Work on platform, not tickets
├── Audit: Clear trail of who provisioned what
└── Compliance: Built-in policy enforcement

For Organization:
├── Velocity: Faster time to market
├── Cost: Reduced ops overhead
├── Governance: Better compliance posture
├── Security: Consistent security controls
└── Efficiency: Resources provisioned when needed

Self-Service Architecture

Component Architecture

Self-Service Infrastructure Architecture:

┌─────────────────────────────────────────────────────────────┐
│                     USER INTERFACE                           │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │   Portal    │  │    CLI      │  │    API      │         │
│  │   (Web UI)  │  │ (Terraform) │  │  (REST/gRPC)│         │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘         │
│         └────────────────┼────────────────┘                 │
│                          │                                   │
├──────────────────────────┼───────────────────────────────────┤
│                          ▼                                   │
│  ┌─────────────────────────────────────────────────────┐    │
│  │               ORCHESTRATION LAYER                    │    │
│  │  ├── Request validation                              │    │
│  │  ├── Policy evaluation (OPA/Sentinel)               │    │
│  │  ├── Cost estimation                                 │    │
│  │  ├── Approval workflow (if needed)                  │    │
│  │  └── Execution orchestration                        │    │
│  └─────────────────────────────────────────────────────┘    │
│                          │                                   │
├──────────────────────────┼───────────────────────────────────┤
│                          ▼                                   │
│  ┌─────────────────────────────────────────────────────┐    │
│  │               TEMPLATE LIBRARY                       │    │
│  │  ├── Database modules (RDS, Cloud SQL)              │    │
│  │  ├── Compute modules (EKS, GKE, VMs)               │    │
│  │  ├── Storage modules (S3, GCS)                      │    │
│  │  ├── Network modules (VPC, subnets)                 │    │
│  │  └── Composite modules (full environments)          │    │
│  └─────────────────────────────────────────────────────┘    │
│                          │                                   │
├──────────────────────────┼───────────────────────────────────┤
│                          ▼                                   │
│  ┌─────────────────────────────────────────────────────┐    │
│  │               EXECUTION ENGINE                       │    │
│  │  ├── Terraform Cloud/Enterprise                     │    │
│  │  ├── Pulumi Cloud                                   │    │
│  │  ├── Crossplane                                     │    │
│  │  └── Cloud-native (CDK, ARM, Deployment Manager)   │    │
│  └─────────────────────────────────────────────────────┘    │
│                          │                                   │
├──────────────────────────┼───────────────────────────────────┤
│                          ▼                                   │
│  ┌─────────────────────────────────────────────────────┐    │
│  │               CLOUD PROVIDERS                        │    │
│  │  AWS  │  GCP  │  Azure  │  Kubernetes  │  Others    │    │
│  └─────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘

Request Flow

Self-Service Request Flow:

┌─────────────────────────────────────────────────────────────┐
│ 1. REQUEST                                                   │
│    Developer: "I need a PostgreSQL database for staging"    │
│    └── Via portal, CLI, or API                              │
└─────────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────┐
│ 2. VALIDATION                                                │
│    ├── User has permission?          ✓ Team member          │
│    ├── Request well-formed?          ✓ Valid config         │
│    ├── Within quotas?                ✓ Under team limit     │
│    └── Meets policy?                 ✓ Allowed instance type│
└─────────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────┐
│ 3. ENRICHMENT                                                │
│    ├── Apply defaults                 db.t3.medium          │
│    ├── Generate names                 myapp-staging-db      │
│    ├── Assign network                 staging-vpc           │
│    ├── Configure monitoring           Datadog integration   │
│    └── Estimate cost                  ~$50/month            │
└─────────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────┐
│ 4. APPROVAL (if required)                                    │
│    ├── Auto-approve: staging, dev     ✓ Auto-approved       │
│    ├── Manual approve: production     (Would need approval) │
│    └── Cost threshold: >$500/month    (Would need approval) │
└─────────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────┐
│ 5. EXECUTION                                                 │
│    ├── Generate Terraform             Based on template     │
│    ├── Plan                           Preview changes       │
│    ├── Apply                          Create resources      │
│    └── Verify                         Health checks         │
└─────────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────┐
│ 6. DELIVERY                                                  │
│    ├── Connection string → Vault                            │
│    ├── Notification → Slack/email                           │
│    ├── Documentation → Auto-generated                       │
│    └── Registration → Service catalog                       │
└─────────────────────────────────────────────────────────────┘
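
Steps 2 to 4 of the flow above can be sketched as three small functions. This is a minimal illustration under stated assumptions, not a reference implementation: the `Request` and `TeamContext` types, the function names, and the app-name rule are hypothetical, while the approved instance families, the default instance class, the naming scheme, and the approval rules are taken from this document.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Policy values taken from this document; how they are stored is an assumption.
ALLOWED_INSTANCE_PREFIXES = ("db.t3.", "db.r5.", "db.m5.")
AUTO_APPROVE_ENVIRONMENTS = {"dev", "staging"}
COST_APPROVAL_THRESHOLD = 500.0  # USD per month


@dataclass
class Request:
    user: str
    app: str
    environment: str
    resource: str = "db"
    instance_class: Optional[str] = None  # None means "use the default"


@dataclass
class TeamContext:
    members: set
    quota: int   # maximum number of resources the team may own
    in_use: int  # resources the team owns today


def validate(req: Request, team: TeamContext) -> list:
    """Step 2: return a list of problems; an empty list means the request passes."""
    problems = []
    if req.user not in team.members:
        problems.append("user is not a member of the team")
    if not re.fullmatch(r"[a-z][a-z0-9-]*", req.app):
        problems.append("app name must use lowercase letters, digits, and hyphens")
    if team.in_use >= team.quota:
        problems.append("team quota exhausted")
    if req.instance_class and not req.instance_class.startswith(ALLOWED_INSTANCE_PREFIXES):
        problems.append("instance type is not in an approved family")
    return problems


def enrich(req: Request) -> dict:
    """Step 3: fill in defaults and derived values."""
    return {
        "name": f"{req.app}-{req.environment}-{req.resource}",
        "instance_class": req.instance_class or "db.t3.medium",
        "network": f"{req.environment}-vpc",
    }


def needs_approval(environment: str, monthly_cost: float) -> bool:
    """Step 4: production and expensive requests go to a human reviewer."""
    return (
        environment not in AUTO_APPROVE_ENVIRONMENTS
        or monthly_cost > COST_APPROVAL_THRESHOLD
    )
```

Returning a list of problems from `validate`, rather than raising on the first one, lets the portal show every issue in a single round trip.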

IaC Module Design

Terraform Module Patterns

Terraform Module Structure:

Organization-Wide Module Library:
terraform-modules/
├── databases/
│   ├── rds-postgres/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── versions.tf
│   │   ├── README.md
│   │   └── examples/
│   │       ├── simple/
│   │       └── production/
│   └── elasticache-redis/
├── compute/
│   ├── eks-cluster/
│   └── ecs-service/
├── storage/
│   └── s3-bucket/
└── network/
    └── vpc/

Module Design Principles:

1. Opinionated Defaults
   # variables.tf
   variable "instance_class" {
     type        = string
     default     = "db.t3.medium"  # Sensible default
     description = "RDS instance type"

     validation {
       condition = can(regex("^db\\.(t3|r5|m5)\\.", var.instance_class))
       error_message = "Only approved instance families allowed."
     }
   }

2. Minimal Required Inputs
   # Only require what can't be defaulted
   variable "name" {
     type        = string
     description = "Database identifier"
   }

   variable "environment" {
     type        = string
     description = "Environment (dev, staging, prod)"
   }

3. Complete Outputs
   # outputs.tf
   output "endpoint" {
     description = "Database connection endpoint"
     value       = aws_db_instance.main.endpoint
   }

   output "connection_secret_arn" {
     description = "ARN of secret with credentials"
     value       = aws_secretsmanager_secret.db_credentials.arn
   }

4. Built-in Best Practices
   # Security hardened by default
   resource "aws_db_instance" "main" {
     # Encryption always on
     storage_encrypted = true

     # No public access
     publicly_accessible = false

     # Automated backups
     backup_retention_period = var.environment == "prod" ? 30 : 7

     # Enhanced monitoring (also requires monitoring_role_arn)
     monitoring_interval = 60
   }
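
A portal or CLI wrapper can mirror the module's input contract so that developers get feedback before Terraform runs. The sketch below is one possible pre-check, not part of the module itself: `build_module_inputs` and `backup_retention_days` are hypothetical helper names, and the pattern covers the same approved families (t3, r5, m5) as the module's validation rule.

```python
import re

# Approved instance families from the module's validation rule.
APPROVED_INSTANCE_CLASS = re.compile(r"^db\.(t3|r5|m5)\.")


def build_module_inputs(name, environment, instance_class="db.t3.medium"):
    """Only name and environment are required; everything else has a default."""
    if not APPROVED_INSTANCE_CLASS.search(instance_class):
        raise ValueError("Only approved instance families allowed.")
    return {
        "name": name,
        "environment": environment,
        "instance_class": instance_class,
    }


def backup_retention_days(environment):
    """Same rule as the module: 30 days for prod, 7 days elsewhere."""
    return 30 if environment == "prod" else 7
```

Keeping the error message identical to the Terraform one means developers see the same wording whether the check fails in the portal or in the plan.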

Module Versioning

Module Versioning Strategy:

Semantic Versioning:
├── MAJOR: Breaking changes (new required inputs, removed outputs)
├── MINOR: New features (new optional inputs, new outputs)
└── PATCH: Bug fixes (no interface changes)

Version Constraints:
# Allow patch updates automatically
# (private registry sources take the form <HOSTNAME>/<NAMESPACE>/<NAME>/<PROVIDER>)
module "database" {
  source  = "terraform.company.com/modules/rds-postgres/aws"
  version = "~> 2.1.0"  # >=2.1.0, <2.2.0
}

# Alternative: pin to an exact version (production)
module "database" {
  source  = "terraform.company.com/modules/rds-postgres/aws"
  version = "= 2.1.3"
}
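
The `~>` operator lets only the rightmost version component increase. A small sketch of that rule, useful for example when a dashboard needs to check which published versions a constraint would accept. It assumes plain numeric versions such as `2.1.3` with no pre-release suffix, and `satisfies_pessimistic` is a hypothetical helper, not a Terraform API.

```python
def parse_version(text):
    """Turn '2.1.3' into (2, 1, 3)."""
    return tuple(int(part) for part in text.split("."))


def satisfies_pessimistic(version, constraint):
    """Return True if `version` satisfies `~> constraint`.

    '~> 2.1.0' means >= 2.1.0 and < 2.2.0.
    '~> 2.1'   means >= 2.1   and < 3.0.
    """
    lower = parse_version(constraint)
    if len(lower) < 2:
        raise ValueError("constraint needs at least two components")
    candidate = parse_version(version)
    # Upper bound: drop the last component and bump the one before it.
    upper = lower[:-2] + (lower[-2] + 1,)
    return lower <= candidate and candidate[: len(upper)] < upper
```

For instance, `satisfies_pessimistic("2.1.3", "2.1.0")` is true while `satisfies_pessimistic("2.2.0", "2.1.0")` is false, which matches the range in the comment above.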

Deprecation Policy:
┌─────────────────────────────────────────────────────────────┐
│ Module Version Lifecycle                                     │
├─────────────────────────────────────────────────────────────┤
│ Current (v2.x):     Supported, new features                 │
│ Previous (v1.x):    Supported, security fixes only          │
│ Deprecated (v0.x):  Warning on use, no support              │
│ Removed:            Will not work                           │
│                                                              │
│ Notification:                                                │
│ ├── Slack announcement when version deprecated              │
│ ├── Warning in terraform plan output                        │
│ ├── Dashboard showing deprecated module usage               │
│ └── Migration guide provided                                │
└─────────────────────────────────────────────────────────────┘
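
The lifecycle table above maps naturally onto a small classifier that a usage dashboard or CI warning could call. This is a sketch under assumptions: the function names are hypothetical, and the rule "one major behind is previous, anything older is deprecated" is inferred from the v2/v1/v0 example rather than stated as policy.

```python
def lifecycle_status(version, current_major=2, removed_majors=()):
    """Classify a module version as current, previous, deprecated, or removed."""
    major = int(version.lstrip("v").split(".")[0])
    if major in removed_majors:
        return "removed"
    if major > current_major:
        raise ValueError(f"unknown major version: {version}")
    if major == current_major:
        return "current"
    if major == current_major - 1:
        return "previous"
    return "deprecated"


def deprecated_usage(usages, **kwargs):
    """Filter (team, module, version) tuples down to the ones needing migration."""
    return [
        usage
        for usage in usages
        if lifecycle_status(usage[2], **kwargs) in ("deprecated", "removed")
    ]
```

Feeding the result of `deprecated_usage` into the dashboard gives each team a concrete list of modules to migrate.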

Policy and Guardrails

Policy as Code

Policy as Code Options:

1. HashiCorp Sentinel (Terraform Enterprise)
   # Require encryption for all storage
   import "tfplan/v2" as tfplan

   s3_buckets = filter tfplan.resource_changes as _, rc {
     rc.type is "aws_s3_bucket" and
     rc.mode is "managed" and
     (rc.change.actions contains "create" or
      rc.change.actions contains "update")
   }

   encryption_enabled = rule {
     all s3_buckets as _, bucket {
       bucket.change.after.server_side_encryption_configuration
         is not null
     }
   }

   main = rule { encryption_enabled }

2. Open Policy Agent (OPA)
   # Rego policy for Kubernetes
   package kubernetes.admission

   deny[msg] {
     input.request.kind.kind == "Pod"
     container := input.request.object.spec.containers[_]
     not container.securityContext.runAsNonRoot
     msg := "Containers must run as non-root"
   }

3. Cloud-Native Policies
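   # AWS Service Control Policy: deny uploads that do not send the SSE-S3 header
   # (s3:x-amz-server-side-encryption is a condition key for s3:PutObject,
   #  not for s3:CreateBucket)
   {
     "Version": "2012-10-17",
     "Statement": [{
       "Sid": "RequireEncryption",
       "Effect": "Deny",
       "Action": ["s3:PutObject"],
       "Resource": "*",
       "Condition": {
         "StringNotEquals": {
           "s3:x-amz-server-side-encryption": "AES256"
         }
       }
     }]
   }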
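
Teams without Sentinel or OPA can run the same kind of check against the JSON form of a plan (produced with `terraform show -json`). The sketch below mirrors the Sentinel rule above using the plan's `resource_changes` list; `unencrypted_buckets` is a hypothetical helper. It is slightly stricter than the Sentinel version because it also treats an empty encryption block as missing, and it assumes encryption is declared inline on the bucket, which newer AWS provider versions replace with a separate resource.

```python
def unencrypted_buckets(plan):
    """Return addresses of managed S3 buckets created or updated without encryption.

    `plan` is the parsed output of `terraform show -json <planfile>`.
    """
    offenders = []
    for change in plan.get("resource_changes", []):
        if change["type"] != "aws_s3_bucket" or change["mode"] != "managed":
            continue
        actions = change["change"]["actions"]
        if "create" not in actions and "update" not in actions:
            continue
        after = change["change"].get("after") or {}
        # Missing, null, and empty all count as "not configured".
        if not after.get("server_side_encryption_configuration"):
            offenders.append(change["address"])
    return offenders
```

A CI step can fail the pipeline when the returned list is not empty and print the addresses so the developer knows which resources to fix.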

Guardrail Categories

Infrastructure Guardrails:

1. Security Guardrails
   ├── Encryption required (at-rest, in-transit)
   ├── No public access by default
   ├── Required security groups
   ├── IAM role requirements
   └── Vulnerability scanning

2. Cost Guardrails
   ├── Instance type restrictions
   ├── Storage size limits
   ├── Required cost tags
   ├── Budget thresholds
   └── Approval for large resources

3. Compliance Guardrails
   ├── Allowed regions (data residency)
   ├── Required logging
   ├── Backup requirements
   ├── Retention policies
   └── Audit trail requirements

4. Operational Guardrails
   ├── Naming conventions
   ├── Required tags (owner, cost-center)
   ├── Resource quotas per team
   ├── Monitoring requirements
   └── Deletion protection

Guardrail Implementation:
┌─────────────────────────────────────────────────────────────┐
│                    Guardrail Timing                          │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Pre-Plan (fastest feedback):                               │
│  ├── Validate terraform files                               │
│  ├── Static analysis (tfsec, checkov)                      │
│  └── Module version checks                                  │
│                                                              │
│  Post-Plan (resource-aware):                                │
│  ├── OPA/Sentinel policy evaluation                        │
│  ├── Cost estimation                                        │
│  └── Blast radius assessment                                │
│                                                              │
│  Post-Apply (verification):                                 │
│  ├── Configuration validation                               │
│  ├── Security scanning                                      │
│  └── Compliance audit                                       │
│                                                              │
└─────────────────────────────────────────────────────────────┘
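
Two of the operational guardrails listed above, naming conventions and required tags, are simple enough to check in a few lines at the post-plan stage. The sketch below is illustrative only: the `<app>-<environment>-<resource>` pattern is an assumption modeled on the `myapp-staging-db` example earlier in this document, and `check_resource` is a hypothetical helper.

```python
import re

REQUIRED_TAGS = {"owner", "cost-center"}

# Assumed convention: <app>-<environment>-<resource>, e.g. myapp-staging-db.
NAME_PATTERN = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*-(dev|staging|prod)-[a-z0-9]+$")


def check_resource(name, tags):
    """Return a list of guardrail violations for one resource."""
    violations = []
    if not NAME_PATTERN.match(name):
        violations.append(f"name '{name}' does not follow <app>-<env>-<resource>")
    for tag in sorted(REQUIRED_TAGS - set(tags)):
        violations.append(f"missing required tag '{tag}'")
    return violations
```

Each message names the exact rule that failed, which supports the "clear error messages" goal: developers can fix the request without asking the platform team what went wrong.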

Environment Provisioning

Environment Templates

Environment Provisioning:

Environment Types:
┌─────────────────────────────────────────────────────────────┐
│ Development Environment                                      │
│ ├── Purpose: Individual developer testing                   │
│ ├── Lifetime: Hours to days                                 │
│ ├── Resources: Minimal (smallest instances)                 │
│ ├── Data: Synthetic or anonymized                           │
│ └── Approval: None (within quota)                           │
├─────────────────────────────────────────────────────────────┤
│ Staging Environment                                          │
│ ├── Purpose: Integration testing, QA                        │
│ ├── Lifetime: Persistent per service                        │
│ ├── Resources: Production-like (scaled down)                │
│ ├── Data: Sanitized production subset                       │
│ └── Approval: None (within quota)                           │
├─────────────────────────────────────────────────────────────┤
│ Production Environment                                       │
│ ├── Purpose: Live customer traffic                          │
│ ├── Lifetime: Permanent                                      │
│ ├── Resources: Full capacity                                │
│ ├── Data: Real customer data                                │
│ └── Approval: Required (security review)                    │
└─────────────────────────────────────────────────────────────┘

Environment Template:
# environment/main.tf
module "network" {
  source      = "../modules/vpc"
  environment = var.environment
  cidr_block  = var.network_cidr
}

module "kubernetes" {
  source      = "../modules/eks"
  environment = var.environment
  vpc_id      = module.network.vpc_id
  node_count  = var.environment == "prod" ? 5 : 2
}

module "database" {
  source         = "../modules/rds"
  environment    = var.environment
  vpc_id         = module.network.vpc_id
  instance_class = var.environment == "prod" ? "db.r5.xlarge" : "db.t3.medium"
  multi_az       = var.environment == "prod"
}

module "cache" {
  source      = "../modules/elasticache"
  environment = var.environment
  vpc_id      = module.network.vpc_id
  node_type   = var.environment == "prod" ? "cache.r5.large" : "cache.t3.micro"
}
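
As the number of environment-dependent settings grows, the repeated `var.environment == "prod"` conditionals can be moved into a sizing table that the orchestration layer renders into a variables file. This is one possible arrangement, not the template's actual interface: the values come from the template above, but the variable names (`db_instance_class`, `cache_node_type`, and so on) are hypothetical and would need matching `variable` blocks.

```python
import json

# Sizes copied from the environment template; variable names are assumptions.
PROFILES = {
    "prod": {
        "node_count": 5,
        "db_instance_class": "db.r5.xlarge",
        "db_multi_az": True,
        "cache_node_type": "cache.r5.large",
    },
    "default": {
        "node_count": 2,
        "db_instance_class": "db.t3.medium",
        "db_multi_az": False,
        "cache_node_type": "cache.t3.micro",
    },
}


def sizing_for(environment):
    """Pick the prod profile for prod and the default profile for everything else."""
    profile = PROFILES.get(environment, PROFILES["default"])
    return {"environment": environment, **profile}


def render_tfvars(environment):
    """Render the sizing as JSON suitable for a *.tfvars.json file."""
    return json.dumps(sizing_for(environment), indent=2, sort_keys=True)
```

The output of `render_tfvars` could be written to a `terraform.tfvars.json` file next to the template, keeping sizing decisions in one reviewable place.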

Ephemeral Environments

Ephemeral/Preview Environments:

Use Cases:
├── PR preview environments
├── Feature branch testing
├── Demo environments
├── Load testing environments
└── Incident reproduction

Lifecycle:
┌─────────────────────────────────────────────────────────────┐
│                                                              │
│  PR Created ──► Environment Created ──► Tests Run           │
│       │              │                      │               │
│       │              ▼                      ▼               │
│       │         Preview URL            PR Updated           │
│       │         Posted to PR              │                 │
│       │                                   │                 │
│       ▼                                   ▼                 │
│  PR Merged ───────────────────────► Environment Destroyed   │
│                                                              │
│  Timeout: Auto-destroy after 7 days of inactivity          │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Implementation:
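# .github/workflows/preview.yml
# Cloud credentials and backend configuration are assumed to be set up separately.
name: Preview Environment

on:
  pull_request:
    types: [opened, synchronize]

permissions:
  contents: read
  pull-requests: write  # needed to comment on the PR

jobs:
  deploy-preview:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3

      - name: Create/Update Environment
        run: |
          terraform init -input=false
          terraform workspace select pr-${{ github.event.pull_request.number }} || \
          terraform workspace new pr-${{ github.event.pull_request.number }}
          terraform apply -auto-approve -input=false

      - name: Comment Preview URL
        uses: actions/github-script@v6
        with:
          script: |
            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: '🚀 Preview: https://pr-${{ github.event.pull_request.number }}.preview.company.com'
            })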
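
The seven-day inactivity timeout from the lifecycle diagram needs a scheduled cleanup job. A minimal sketch of the selection logic follows, assuming the job already has a mapping from workspace name to last-activity timestamp (how that mapping is collected is left open) and that preview workspaces use the `pr-` prefix from the workflow; `expired_workspaces` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

INACTIVITY_LIMIT = timedelta(days=7)


def expired_workspaces(last_activity, now=None):
    """Return preview workspaces with no activity for more than seven days.

    `last_activity` maps workspace name -> timezone-aware datetime.
    Only names starting with 'pr-' are considered, so long-lived
    workspaces are never selected for destruction.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return sorted(
        name
        for name, seen in last_activity.items()
        if name.startswith("pr-") and now - seen > INACTIVITY_LIMIT
    )
```

The cleanup job would then run a destroy for each returned workspace. Filtering on the prefix first is a deliberate safety choice: a bug in the timestamp data cannot cause a persistent environment to be torn down.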

Technology Options

Self-Service Platforms

Platform Comparison:

1. Terraform Cloud/Enterprise
   ├── Native Terraform experience
   ├── Policy as Code (Sentinel)
   ├── Private module registry
   ├── Cost estimation
   └── Enterprise features (SSO, audit)

2. Pulumi
   ├── Real programming languages
   ├── Strong typing and IDE support
   ├── Policy as Code (CrossGuard)
   └── Automation API

3. Crossplane
   ├── Kubernetes-native
   ├── GitOps workflow
   ├── Composition for modules
   └── Multi-cloud abstraction

4. Backstage + Terraform
   ├── Unified developer portal
   ├── Software templates
   ├── Plugin ecosystem
   └── Service catalog integration

5. Port/Cortex/OpsLevel
   ├── Commercial developer portals
   ├── Quick to implement
   ├── Built-in integrations
   └── Self-service workflows

Selection Criteria:
┌────────────────────────────────────────────────────────────┐
│ Factor               │ Best Fit                            │
├──────────────────────┼─────────────────────────────────────┤
│ Existing Terraform   │ Terraform Cloud/Enterprise         │
│ Kubernetes-first     │ Crossplane                         │
│ Developer portal     │ Backstage or commercial            │
│ Programming language │ Pulumi                             │
│ Quick start          │ Commercial (Port, OpsLevel)        │
│ Maximum control      │ Build custom                       │
└────────────────────────────────────────────────────────────┘

Cost Management

Cost Controls

Cost Management in Self-Service:

1. Cost Visibility
   ├── Estimated cost shown before provisioning
   ├── Cost tags automatically applied
   ├── Per-team/project dashboards
   └── Anomaly detection and alerts

2. Cost Guardrails
   ├── Instance type restrictions
   ├── Budget thresholds by team
   ├── Approval required above threshold
   └── Auto-shutdown of unused resources

3. Cost Optimization
   ├── Right-sizing recommendations
   ├── Reserved instance suggestions
   ├── Spot instance for non-production
   └── Scheduled scaling

Cost Estimation Flow:
┌─────────────────────────────────────────────────────────────┐
│ Request: PostgreSQL database for staging                    │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Cost Estimate:                                             │
│  ├── Compute (db.t3.medium):        $30/month              │
│  ├── Storage (100GB gp3):           $10/month              │
│  ├── Backup storage:                ~$5/month              │
│  └── Data transfer:                 ~$5/month              │
│                                     ─────────               │
│  Estimated Total:                   ~$50/month             │
│                                                              │
│  ✓ Within team budget ($500/month quota)                   │
│  ✓ No approval required                                     │
│                                                              │
│  [Proceed] [Modify] [Cancel]                                │
└─────────────────────────────────────────────────────────────┘
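
The arithmetic behind the estimate above is a sum of line items followed by two checks. The sketch below uses the figures from the example; `summarize_cost` and the `committed` parameter (the team's existing monthly spend) are hypothetical, and the $500 approval threshold is the one shown in the request flow.

```python
APPROVAL_THRESHOLD = 500.0  # USD per month, from the approval step of the request flow


def summarize_cost(line_items, team_budget, committed=0.0):
    """Total the monthly line items and evaluate budget and approval rules."""
    total = sum(line_items.values())
    return {
        "total": total,
        "within_budget": committed + total <= team_budget,
        "needs_approval": total > APPROVAL_THRESHOLD,
    }


# Figures from the staging PostgreSQL example above (USD per month).
staging_db = {
    "compute": 30.0,
    "storage": 10.0,
    "backup": 5.0,
    "data_transfer": 5.0,
}
summary = summarize_cost(staging_db, team_budget=500.0)
```

With these inputs the total is 50.0, the request is within budget, and no approval is needed, matching the two check marks in the diagram.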

Best Practices

Self-Service Infrastructure Best Practices:

1. Start Small, Expand Gradually
   ├── Begin with 2-3 common resources
   ├── Add based on demand
   ├── Iterate on feedback
   └── Don't try to cover everything day 1

2. Balance Autonomy and Governance
   ├── Guardrails not gates
   ├── Automate approvals where safe
   ├── Clear escalation paths
   └── Trust but verify

3. Optimize for Developer Experience
   ├── Minimal required inputs
   ├── Sensible defaults
   ├── Clear error messages
   └── Fast feedback loops

4. Maintain Module Quality
   ├── Automated testing
   ├── Documentation requirements
   ├── Versioning strategy
   └── Deprecation process

5. Monitor and Improve
   ├── Track provisioning success rate
   ├── Measure time to provision
   ├── Gather user feedback
   └── Identify automation opportunities

6. Handle Edge Cases
   ├── What if provisioning fails?
   ├── How to handle orphaned resources?
   ├── What about existing resources?
   └── How to migrate between versions?
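
The two metrics named under "Monitor and Improve", provisioning success rate and time to provision, can be computed from a simple run log. A sketch, assuming each run is recorded as a `(succeeded, seconds)` pair; the record shape and the `provisioning_metrics` name are illustrative assumptions.

```python
from statistics import median


def provisioning_metrics(runs):
    """Summarize a list of (succeeded, seconds) provisioning runs.

    The median is taken over successful runs only, so failed runs
    that abort early do not make provisioning look faster than it is.
    """
    if not runs:
        return {"success_rate": None, "median_seconds": None}
    durations = [seconds for succeeded, seconds in runs if succeeded]
    return {
        "success_rate": len(durations) / len(runs),
        "median_seconds": median(durations) if durations else None,
    }
```

Using the median rather than the mean keeps a single stuck run from dominating the reported time to provision.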

Anti-Patterns

Self-Service Anti-Patterns:

1. "Self-Service Everything"
   ❌ Every possible configuration option
   ✓ Curated set of approved patterns

2. "Security Theater"
   ❌ Manual approvals that don't add value
   ✓ Automated policy enforcement

3. "Configuration Explosion"
   ❌ 50 parameters per resource
   ✓ Sensible defaults with few overrides

4. "Ignore Cost"
   ❌ No visibility into provisioned cost
   ✓ Cost estimation and budgets

5. "Build vs Buy Wrong"
   ❌ Building everything from scratch
   ✓ Use existing tools where appropriate

6. "No Escape Hatch"
   ❌ Blocking legitimate exceptions
   ✓ Process for justified deviations

Related Skills

internal-developer-platform - Platform engineering overview
golden-paths - Standardized workflows
container-orchestration - Kubernetes infrastructure
serverless-patterns - Serverless infrastructure
