From cloud-foundation-principles
This skill should be used when the user is addressing cloud resource sprawl, implementing cost attribution and tagging enforcement, setting up monitoring and alerting defaults, configuring drift detection for Terraform, designing lifecycle policies for storage and artifacts, or cleaning up after migrations. Covers resource cleanup discipline, cost center enforcement, monitoring with sensible defaults, scheduled drift detection, and lifecycle automation.
npx claudepluginhub oborchers/fractional-cto --plugin cloud-foundation-principles

This skill uses the workspace's default tool permissions.
Cloud infrastructure degrades through entropy. Every manual change, every forgotten resource, every untagged instance, every disabled alarm that was never re-enabled -- these are small acts of disorder that compound into large, expensive, ungovernable messes. Nobody wakes up one morning with a $40,000 surprise bill. They get there through twelve months of "we'll clean that up later."
Operational hygiene is not a project. It is a daily practice. It is the cloud infrastructure equivalent of washing dishes after every meal instead of letting them pile up for a week. The five pillars -- clean as you go, cost attribution, monitoring, drift detection, and lifecycle policies -- form a system where each reinforces the others.
The most expensive cloud resources are the ones nobody remembers creating. After every migration, every experiment, every proof-of-concept, clean up immediately. Not next sprint. Not after the launch. Now.
If a resource has served its purpose, delete it in the same week. Temporary resources that survive longer than one sprint become permanent. Permanent resources that nobody owns become liabilities.
Bad: "We'll clean up after the migration"
Week 1: Migrate service-A from old cluster to new cluster
Week 4: Migrate service-B
Week 8: Migrate service-C
Week 12: "We should clean up the old cluster"
Week 20: Old cluster is still running, costing $2,400/month
Week 52: Nobody remembers what the old cluster does. Too risky to delete.
Good: Clean up is part of the migration ticket
Ticket: Migrate service-A to new cluster
Subtask 1: Deploy service-A on new cluster
Subtask 2: Reroute traffic to new cluster
Subtask 3: Verify new deployment (48h monitoring)
Subtask 4: Delete old service-A resources <-- same ticket
Subtask 5: Verify old resources are gone <-- same ticket
| Resource Type | Typical Waste Pattern | Action |
|---|---|---|
| Old compute instances | Pre-migration servers still running | Terminate after migration verified |
| Unused load balancers | Created for testing, never deleted | Delete if no targets registered |
| Orphaned storage volumes | Detached from terminated instances | Snapshot (if needed) then delete |
| Stale DNS records | Point to decommissioned services | Remove or update |
| Unused security groups | Created per-service, service deleted | Delete if no attached resources |
| Old container images | Registry bloat from months of builds | Lifecycle policy (see Pillar 5) |
| Expired certificates | Renewed but old cert not cleaned up | Delete after renewal confirmed |
| Test/sandbox resources | "Temporary" resources from experiments | Weekly audit, auto-delete policy |
Watch for names like test-ec2-instance or temp-bucket-2 -- they are almost always leftovers from experiments that should have been deleted.

Unattributable costs are uncontrollable costs. Every resource must have an owner and a cost center. This is not optional tagging -- it is enforced at the infrastructure-as-code layer.
The canonical required tags list (owner, environment, project, cost_center, iac_managed) is defined in the naming-and-labeling-as-code skill. The labels module produces them automatically -- engineers never type them manually.
Cost centers are validated at terraform plan time using a closed list defined in the labels module. The canonical cost center list and the pattern for defining company-specific domains live in the naming-and-labeling-as-code skill. Freeform values are rejected before any resource is created -- a developer cannot accidentally provision resources with cost_center = "test" or cost_center = "misc".
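A minimal sketch of what plan-time validation can look like, assuming a hypothetical cost_center variable inside the labels module (the real closed list lives in the naming-and-labeling-as-code skill; the values below are illustrative):

```hcl
# Hypothetical variable inside the labels module.
# The actual closed list is defined in naming-and-labeling-as-code.
variable "cost_center" {
  type        = string
  description = "Cost center for billing attribution (closed list)"

  validation {
    condition     = contains(["platform", "data", "product", "security"], var.cost_center)
    error_message = "cost_center must be one of: platform, data, product, security."
  }
}
```

With this in place, terraform plan fails immediately on cost_center = "misc", before any provider API call is made.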
| Frequency | Action |
|---|---|
| Weekly | Review cost anomaly alerts (>20% increase from baseline) |
| Monthly | Review cost by cost center and team, identify top 5 cost drivers |
| Quarterly | Full cost optimization review: right-sizing, reserved instances, unused resources |
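The weekly anomaly review above can be fed by automated detection. A sketch using AWS Cost Explorer anomaly detection via the hashicorp/aws provider -- resource names and the email address are illustrative, and the 20% threshold mirrors the baseline in the table:

```hcl
# Monitor per-service spend for anomalies.
resource "aws_ce_anomaly_monitor" "services" {
  name              = "service-spend"
  monitor_type      = "DIMENSIONAL"
  monitor_dimension = "SERVICE"
}

# Weekly digest, only for anomalies >= 20% above baseline.
resource "aws_ce_anomaly_subscription" "weekly" {
  name             = "weekly-anomaly-digest"
  frequency        = "WEEKLY"
  monitor_arn_list = [aws_ce_anomaly_monitor.services.arn]

  subscriber {
    type    = "EMAIL"
    address = "finops@myorg.com" # illustrative
  }

  threshold_expression {
    dimension {
      key           = "ANOMALY_TOTAL_IMPACT_PERCENTAGE"
      values        = ["20"]
      match_options = ["GREATER_THAN_OR_EQUAL"]
    }
  }
}
```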
Every service gets monitoring from the moment it is deployed. Not after the first incident. Not after someone asks "do we have alerting?" The monitoring module provides sensible defaults that work out of the box, with the ability to override thresholds per-service.
| Metric | Default Threshold | Rationale |
|---|---|---|
| CPU utilization | 80% | Leaves headroom for traffic spikes |
| Memory utilization | 85% | OOM kills are catastrophic; catch early |
| Disk/storage free | 10GB or 10% | Disk-full crashes databases and logging |
| HTTP 5xx error rate | > 1% of requests | Backend errors visible to users |
| Response latency (p95) | Service-defined | Varies by service; must be explicitly set |
| Health check failures | 2 consecutive | Avoid alerting on transient network blips |
| Database connections | 80% of max | Connection exhaustion cascades to all clients |
| Read latency | 100ms | Slow reads indicate query or index issues |
| Write latency | 1s | Slow writes indicate lock contention or disk issues |
Not every alert makes sense for every service. Use a threshold sentinel value of -1 to disable specific alarms without removing the monitoring module.
module "alerts" {
source = "git::https://github.com/myorg/tf-module-alerts.git?ref=v1.3.0"
service_name = "myapp-api"
alarm_email = "myapp-team@myorg.com"
# Use defaults for most thresholds
cpu_utilization_threshold = 80 # default
storage_free_threshold = 10 # default (GB)
# Disable network alerting (not relevant for this service)
network_in_threshold = -1
network_out_threshold = -1
# Custom threshold for this specific service
http_5xx_threshold = 0.5 # Stricter than default: alert at 0.5% error rate
}
Inside the module, the -1 sentinel disables alarm creation:
locals {
create_cpu_alarm = var.cpu_utilization_threshold >= 0
create_network_alarm = var.network_in_threshold >= 0
}
resource "aws_cloudwatch_metric_alarm" "cpu" {
count = local.create_cpu_alarm ? 1 : 0
# ... alarm configuration
}
Configure alarms to treat missing data as "not breaching." Services that scale to zero (serverless, spot instances) should not trigger alarms when no data is reported. This prevents alarm storms during expected idle periods.
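On AWS, this maps to the treat_missing_data argument of aws_cloudwatch_metric_alarm. A hedged sketch (names and namespace are illustrative, not part of the module above):

```hcl
resource "aws_cloudwatch_metric_alarm" "cpu_high" {
  alarm_name          = "myapp-api-cpu-high" # illustrative
  namespace           = "AWS/ECS"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  comparison_operator = "GreaterThanThreshold"
  threshold           = 80
  evaluation_periods  = 3
  period              = 60

  # Services that scale to zero report no data while idle;
  # "notBreaching" keeps that from triggering an alarm storm.
  treat_missing_data = "notBreaching"
}
```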
Infrastructure drift occurs when someone modifies a resource outside of Terraform -- through the cloud console, a CLI command, or another automation tool. Drift is silent, invisible, and dangerous. The infrastructure your code describes and the infrastructure that actually exists diverge without anyone knowing.
Run terraform plan on a schedule (daily for production, weekly for development). Any planned changes on a clean state indicate drift -- someone changed something outside of Terraform. The terraform plan -detailed-exitcode flag is critical: exit code 0 means no changes (clean), exit code 2 means drift detected. Alert on exit code 2.
For a complete drift detection pipeline implementation (GitHub Actions workflow with matrix strategy across layers, alerting, and scheduling), see the unified-cicd-platform skill.
| Drift Type | Cause | Action |
|---|---|---|
| Security group rule added | Console change during incident | Import into Terraform or revert |
| Instance type changed | Manual right-sizing | Update Terraform to match or revert |
| Tag missing | Resource modified outside IaC | Re-apply Terraform to restore tags |
| Resource deleted | Manual cleanup without IaC update | Remove from Terraform state or recreate |
| New resource exists | Console-created, not in Terraform | Import into Terraform or delete |
Infrastructure not in code is a liability. Console-created resources will be deleted when discovered. If an emergency required a console change, the change must be imported into Terraform within 48 hours and documented in an ADR or incident report.
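Since Terraform 1.5, adopting an emergency console change can be done with a config-driven import block rather than the CLI. A sketch assuming a hypothetical security group rule added during an incident (the IDs are illustrative):

```hcl
# Adopt the console-created rule instead of deleting it.
# Import ID format: <sg-id>_<type>_<protocol>_<from>_<to>_<cidr>
import {
  to = aws_security_group_rule.incident_allow_bastion
  id = "sg-0123456789abcdef0_ingress_tcp_22_22_10.0.0.0/8"
}

resource "aws_security_group_rule" "incident_allow_bastion" {
  security_group_id = "sg-0123456789abcdef0" # illustrative
  type              = "ingress"
  protocol          = "tcp"
  from_port         = 22
  to_port           = 22
  cidr_blocks       = ["10.0.0.0/8"]
}
```

The next terraform plan shows the import; after apply, the resource is state-managed and the import block can be removed.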
Storage, logs, artifacts, and snapshots accumulate silently. Without lifecycle policies, a $5/month logging bill becomes a $500/month logging bill within a year.
# S3 lifecycle policy for data ingestion buckets
resource "aws_s3_bucket_lifecycle_configuration" "data" {
bucket = aws_s3_bucket.data.id
rule {
id = "archive-old-data"
status = "Enabled"
transition {
days = 30
storage_class = "STANDARD_IA" # Infrequent access after 30 days
}
transition {
days = 90
storage_class = "GLACIER_IR" # Archive after 90 days
}
expiration {
days = 365 # Delete after 1 year
}
}
}
# Log group with explicit retention
resource "aws_cloudwatch_log_group" "service" {
name = "/ecs/${module.labels.prefix}myapp-api"
retention_in_days = 90 # Production logs: 90 days
# Dev logs: 14 days is sufficient
# retention_in_days = 14
}
Never create log groups without a retention policy. The default in most cloud providers is "retain forever," which means unbounded cost growth.
| Artifact Type | Retention Policy | Rationale |
|---|---|---|
| Container images | See container-image-tagging skill | Retention policy defined with full Terraform example |
| Database snapshots | 14 days automated, manual snapshots reviewed monthly | Compliance + cost control |
| Build artifacts | 30 days | Rarely needed after deployment verified |
| Terraform plan files | 7 days | Only needed during review cycle |
| Temporary uploads | 24 hours | Processing should be complete; auto-expire |
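The automated snapshot window in the table above translates directly to backup_retention_period on the database resource. A sketch with illustrative names:

```hcl
resource "aws_db_instance" "main" {
  identifier        = "myapp-db" # illustrative
  engine            = "postgres"
  instance_class    = "db.t4g.medium"
  allocated_storage = 20

  # 14 days of automated snapshots, per the retention table above.
  backup_retention_period = 14

  # Manual snapshots are NOT covered by this setting -- they persist
  # until deleted, hence the monthly review in the table.
  skip_final_snapshot       = false
  final_snapshot_identifier = "myapp-db-final"
}
```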
| Concept | AWS | GCP | Azure |
|---|---|---|---|
| Cost attribution | Cost Explorer + Cost Categories | Billing Reports + Labels | Cost Management + Tags |
| Cost anomaly detection | Cost Anomaly Detection | Budget Alerts | Cost Alerts |
| Monitoring alarms | CloudWatch Alarms | Cloud Monitoring Alerting Policies | Azure Monitor Alerts |
| Log retention | CloudWatch Logs retention_in_days | Cloud Logging retention settings | Log Analytics retention |
| Storage lifecycle | S3 Lifecycle Configuration | GCS Lifecycle Rules | Blob Lifecycle Management |
| Drift detection | terraform plan -detailed-exitcode | terraform plan -detailed-exitcode | terraform plan -detailed-exitcode |
| Compliance scanning | AWS Config Rules | Organization Policy Constraints | Azure Policy |
| Resource inventory | AWS Config Recorder | Cloud Asset Inventory | Azure Resource Graph |
Working implementations in examples/:
- examples/monitoring-and-alerting-module.md -- Terraform monitoring module with sensible defaults, the -1 disable pattern, and missing-data-safe alarm configurations across compute, database, and HTTP services
- examples/drift-detection-pipeline.md -- Scheduled CI/CD pipeline that runs terraform plan daily, detects drift via exit codes, and alerts the team with actionable context

When designing or reviewing operational hygiene practices:
- Every resource carries the required tags (see the naming-and-labeling-as-code skill for the canonical list)
- Cost centers are validated at terraform plan time via a closed list in the labels module
- Alarms that do not apply to a service are disabled with the -1 sentinel, not by removing the monitoring module
- Scheduled terraform plan runs detect drift daily in production, weekly in development
- Container images and other artifacts have lifecycle policies (see the container-image-tagging skill for retention rules)