From cloud-foundation-principles
This skill should be used when the user is structuring Terraform repositories, deciding between mono-repo and multi-repo strategies, organizing infrastructure into layers, designing state management architecture, setting up cross-layer dependencies, or evaluating blast radius of infrastructure changes. Covers multi-repository strategy, numbered layer architecture, state-per-layer-per-environment isolation, cross-layer remote state references, and deployment ordering.
npx claudepluginhub oborchers/fractional-cto --plugin cloud-foundation-principlesThis skill uses the workspace's default tool permissions.
A single Terraform state file that contains your entire infrastructure is a liability disguised as simplicity. One bad `terraform apply` can destroy your network, databases, and compute in a single operation. One state file corruption locks out every team. One slow plan blocks every deployment. The blast radius is everything, and the recovery plan is "restore from backup and pray."
Generates design tokens/docs from CSS/Tailwind/styled-components codebases, audits visual consistency across 10 dimensions, detects AI slop in UI.
Records polished WebM UI demo videos of web apps using Playwright with cursor overlay, natural pacing, and three-phase scripting. Activates for demo, walkthrough, screen recording, or tutorial requests.
Delivers idiomatic Kotlin patterns for null safety, immutability, sealed classes, coroutines, Flows, extensions, DSL builders, and Gradle DSL. Use when writing, reviewing, refactoring, or designing Kotlin code.
A single Terraform state file that contains your entire infrastructure is a liability disguised as simplicity. One bad terraform apply can destroy your network, databases, and compute in a single operation. One state file corruption locks out every team. One slow plan blocks every deployment. The blast radius is everything, and the recovery plan is "restore from backup and pray."
Production infrastructure demands intentional separation -- separate repositories for separate concerns, separate state files for separate layers, and numbered directories that encode dependency order at a glance. This is not premature optimization. It is the difference between an outage that takes down one monitoring dashboard and an outage that takes down your entire platform.
Infrastructure repositories should be split by change cadence and ownership. Organization-level IAM changes happen monthly. Network changes happen quarterly. Service deployments happen daily. Forcing all three through the same repository, the same review process, and the same CI pipeline creates friction where none should exist.
REPOSITORIES
|
+-- tf-root <-- Organization & IAM management
| Scope: SSO, permissions, accounts, security delegation
| Changes: Monthly (new users, permission updates)
|
+-- tf-global-infrastructure <-- Shared infrastructure per environment
| Scope: VPCs, security groups, databases, compute clusters, monitoring
| Structure: Numbered layers (00-90) with env subdirectories
| Changes: Weekly (new resources, configuration updates)
|
+-- tf-module-labels <-- Foundational naming/tagging module
+-- tf-module-alerts <-- Monitoring/alerting module
+-- tf-module-container-service <-- Container orchestration module
|
+-- [per-service repos] <-- App-specific infrastructure
Each service manages its own Terraform alongside application code
Changes: Daily (deployments, scaling, feature flags)
Why separate repos, not directories in a mono-repo?
Within the global infrastructure repository, directories are numbered to encode dependency order. Lower numbers are prerequisites for higher numbers. Numbering in steps of ten (00, 10, 20 ... not 1, 2, 3) reserves space to insert or split layers without renumbering -- if databases grow complex, split 30_databases into 30_relational and 35_caches without touching anything else.
tf-global-infrastructure/
+-- 00_network/ <-- VPCs, subnets, Route53, VPN, VPC endpoints
| +-- dev/
| +-- prod/
+-- 10_security/ <-- Security groups, IAM roles, KMS, WAF, certificates
| +-- dev/
| +-- prod/
+-- 20_storage/ <-- Object storage, file systems
| +-- dev/
| +-- prod/
+-- 30_databases/ <-- Relational databases, caches, warehouses
| +-- dev/
| +-- prod/
+-- 40_compute/ <-- Container clusters, auto-scaling, GPU instances
| +-- dev/
| +-- prod/
+-- 50_edge/ <-- CDN, load balancers, API gateways
| +-- prod/
+-- 60_messaging/ <-- Message brokers, event buses, queues
| +-- dev/
| +-- prod/
+-- 70_monitoring/ <-- Metrics, dashboards, log aggregation
| +-- dev/
| +-- prod/
+-- 80_ci_cd/ <-- Build runners, pipeline infrastructure
+-- 90_shared_services/ <-- Bastion hosts, service discovery
+-- dev/
+-- prod/
Not every layer needs per-environment subdirectories. Layers like 80_ci_cd (e.g., self-hosted GitHub runners) are shared infrastructure — there is no reason to duplicate build runners per environment in a startup. Similarly, 50_edge may only exist in production if there is no dev CDN or load balancer. Only create environment subdirectories where the resources are actually environment-specific.
| Property | Benefit |
|---|---|
| Dependency encoding | Layer 40 (compute) cannot exist without layer 00 (network). The numbering makes this obvious. |
| Independent state | Each layer has its own state file. A bad apply in monitoring cannot destroy your network. |
| Independent CI/CD | Each layer can have its own pipeline. Network changes do not block compute deployments. |
| Clear mental model | New engineers understand the dependency graph in seconds, not hours. |
| Insert and split | Need to split databases into relational and caches? Insert 35_caches between 30 and 40 without renumbering anything. |
Layers deploy in numerical order. This is the full dependency chain:
tf-root (organization setup, SSO, security delegation)
|
v
00_network (VPCs, subnets, DNS, VPC endpoints)
|
v
10_security (security groups, KMS keys, certificates, WAF)
|
v
20_storage (object storage, file systems)
|
v
30_databases (relational databases, caches, warehouses)
|
v
40_compute (container clusters, auto-scaling groups)
|
v
50_edge (CDN distributions, load balancers)
|
v
60_messaging (message brokers, event buses, queues)
|
v
70_monitoring (metrics collection, dashboards, alerting)
|
v
80_ci_cd (build runners)
|
v
90_shared_services (bastion hosts, service discovery)
Dependencies are strictly forward: a layer may reference any lower-numbered layer via remote state, but never a higher-numbered one. Layer 50 can read from layers 00, 10, or 40 -- but layer 50 cannot depend on layer 60. This ensures the deployment chain is always acyclic and any layer can be planned or applied without waiting for higher layers to exist.
The cardinal rule of Terraform state management: every layer in every environment gets its own state file. No exceptions. No "we will split it later." Split it now.
One state bucket per cloud account (state buckets use <org>-<env>-tfstate as an exception
to the labels module naming -- they are account-global and need globally unique names):
myorg-root-tfstate <-- Root/management account
myorg-security-tfstate <-- Security account
myorg-log-archive-tfstate <-- Log archive account
myorg-dev-tfstate <-- Development account
myorg-prod-tfstate <-- Production account
Within each bucket, one key per layer or service:
myorg-dev-tfstate/
network <-- 00_network/dev state
security <-- 10_security/dev state
storage <-- 20_storage/dev state
databases <-- 30_databases/dev state
compute <-- 40_compute/dev state
messaging <-- 60_messaging/dev state
monitoring <-- 70_monitoring/dev state
shared_services <-- 90_shared_services/dev state
myapp-api <-- Service-owned state (separate repo)
billing-service <-- Service-owned state (separate repo)
Every state bucket must have all four:
| Property | Setting | Why |
|---|---|---|
| Encryption | AES-256 server-side | State contains secrets (database passwords, API keys) |
| Versioning | Enabled | Recover from accidental state corruption or deletion |
| Locking | Enabled | Prevent concurrent applies that corrupt state |
| Public access | Blocked | State files are the keys to your kingdom |
Higher layers read outputs from lower layers using remote state data sources. This creates explicit, auditable dependency chains.
# 40_compute/dev/main.tf -- Compute reads from network and security
data "terraform_remote_state" "network" {
backend = "s3"
config = {
bucket = "myorg-dev-tfstate"
key = "network"
region = "eu-west-1"
}
}
data "terraform_remote_state" "security" {
backend = "s3"
config = {
bucket = "myorg-dev-tfstate"
key = "security"
region = "eu-west-1"
}
}
locals {
vpc_id = data.terraform_remote_state.network.outputs.vpc_id
private_subnets = data.terraform_remote_state.network.outputs.private_subnets
base_sg_ids = data.terraform_remote_state.security.outputs.base_security_group_ids
}
Dependency chain in practice:
compute reads from network + securitydatabases reads from network + securitymonitoring reads from compute + networkshared_services reads from networkedge reads from compute + securityBad: Monolithic state file
myorg-dev-tfstate/
everything <-- One state file for ALL infrastructure
Problems: blast radius is everything. One bad apply can destroy networking, databases, and compute simultaneously. Plans take minutes as Terraform refreshes hundreds of resources. Two engineers cannot work on different layers in parallel.
Good: State per layer per environment
myorg-dev-tfstate/
network <-- 42 resources, 15-second plan
security <-- 28 resources, 10-second plan
databases <-- 15 resources, 8-second plan
compute <-- 35 resources, 12-second plan
Benefits: blast radius limited to one layer. Plans are fast. Engineers work on different layers in parallel. Recovery from corruption affects only one layer.
Bad: Environment state mixed together
# One state file contains both dev and prod resources
resource "aws_vpc" "dev" { cidr_block = "10.0.0.0/16" }
resource "aws_vpc" "prod" { cidr_block = "10.1.0.0/16" }
Problems: a mistake in dev configuration can destroy prod resources. No way to restrict who can modify prod without restricting dev.
Good: Separate directories, separate state, separate permissions
00_network/
dev/ -> myorg-dev-tfstate/network
prod/ -> myorg-prod-tfstate/network
| Concept | AWS | GCP | Azure |
|---|---|---|---|
| State backend | S3 bucket (use_lockfile) | GCS bucket (native locking) | Azure Blob Storage + lease locking |
| State encryption | AES-256 SSE-S3 or SSE-KMS | Default encryption (Google-managed or CMEK) | Storage Service Encryption (Microsoft-managed or CMK) |
| State locking | S3 native locking (use_lockfile = true) | GCS native locking | Blob lease locking |
| Remote state reference | terraform_remote_state with S3 backend | terraform_remote_state with GCS backend | terraform_remote_state with azurerm backend |
| Account isolation | AWS accounts via Organizations | GCP projects via folders | Azure subscriptions via Management Groups |
| State bucket per account | One S3 bucket per AWS account | One GCS bucket per GCP project | One Storage Account per Azure subscription |
Working implementations in examples/:
examples/numbered-layer-layout.md -- Complete directory structure for a global infrastructure repository with numbered layers, environment subdirectories, and backend configuration for each layerexamples/cross-layer-state-references.md -- Terraform configurations showing how the compute layer reads outputs from network and security layers via remote state, including the backend configuration and output definitionsWhen designing or reviewing repository and state architecture:
dev/, prod/)terraform_remote_state data sources, not hardcoded values<org>-<env>-tfstate)