From cloud-foundation-principles
This skill should be used when the user is addressing cloud resource sprawl, implementing cost attribution and tagging enforcement, setting up monitoring and alerting defaults, configuring drift detection for Terraform, designing lifecycle policies for storage and artifacts, or cleaning up after migrations. Covers resource cleanup discipline, cost center enforcement, monitoring with sensible defaults, scheduled drift detection, and lifecycle automation.
npx claudepluginhub oborchers/fractional-cto --plugin cloud-foundation-principles

This skill uses the workspace's default tool permissions.
Cloud infrastructure degrades through entropy. Every manual change, every forgotten resource, every untagged instance, every disabled alarm that was never re-enabled -- these are small acts of disorder that compound into large, expensive, ungovernable messes. Nobody wakes up one morning with a $40,000 surprise bill. They get there through twelve months of "we'll clean that up later."
Operational hygiene is not a project. It is a daily practice. It is the cloud infrastructure equivalent of washing dishes after every meal instead of letting them pile up for a week. The five pillars -- clean as you go, cost attribution, monitoring, drift detection, and lifecycle policies -- form a system where each reinforces the others.
The most expensive cloud resources are the ones nobody remembers creating. After every migration, every experiment, every proof-of-concept, clean up immediately. Not next sprint. Not after the launch. Now.
If a resource has served its purpose, delete it in the same week. Temporary resources that survive longer than one sprint become permanent. Permanent resources that nobody owns become liabilities.
Bad: "We'll clean up after the migration"
Week 1: Migrate service-A from old cluster to new cluster
Week 4: Migrate service-B
Week 8: Migrate service-C
Week 12: "We should clean up the old cluster"
Week 20: Old cluster is still running, costing $2,400/month
Week 52: Nobody remembers what the old cluster does. Too risky to delete.
Good: Clean up is part of the migration ticket
Ticket: Migrate service-A to new cluster
Subtask 1: Deploy service-A on new cluster
Subtask 2: Reroute traffic to new cluster
Subtask 3: Verify new deployment (48h monitoring)
Subtask 4: Delete old service-A resources <-- same ticket
Subtask 5: Verify old resources are gone <-- same ticket
| Resource Type | Typical Waste Pattern | Action |
|---|---|---|
| Old compute instances | Pre-migration servers still running | Terminate after migration verified |
| Unused load balancers | Created for testing, never deleted | Delete if no targets registered |
| Orphaned storage volumes | Detached from terminated instances | Snapshot (if needed) then delete |
| Stale DNS records | Point to decommissioned services | Remove or update |
| Unused security groups | Created per-service, service deleted | Delete if no attached resources |
| Old container images | Registry bloat from months of builds | Lifecycle policy (see Pillar 5) |
| Expired certificates | Renewed but old cert not cleaned up | Delete after renewal confirmed |
| Test/sandbox resources | "Temporary" resources from experiments | Weekly audit, auto-delete policy |
Watch for names like test-ec2-instance or temp-bucket-2 -- they are almost always leftovers from experiments that should have been deleted.

Unattributable costs are uncontrollable costs. Every resource must have an owner and a cost center. This is not optional tagging -- it is enforced at the infrastructure-as-code layer.
The canonical required tags list (owner, environment, project, cost_center, iac_managed) is defined in the naming-and-labeling-as-code skill. The labels module produces them automatically -- engineers never type them manually.
Cost centers are validated at terraform plan time using a closed list defined in the labels module. The canonical cost center list and the pattern for defining company-specific domains live in the naming-and-labeling-as-code skill. Freeform values are rejected before any resource is created -- a developer cannot accidentally provision resources with cost_center = "test" or cost_center = "misc".
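A minimal sketch of what plan-time validation can look like, assuming a hypothetical cost_center variable inside the labels module (the real closed list lives in the naming-and-labeling-as-code skill; the values below are illustrative):

```hcl
# Hypothetical variable inside the labels module.
# The actual closed list is defined in naming-and-labeling-as-code.
variable "cost_center" {
  type        = string
  description = "Cost center for billing attribution (closed list)"

  validation {
    condition     = contains(["platform", "data", "product", "security"], var.cost_center)
    error_message = "cost_center must be one of: platform, data, product, security."
  }
}
```

With this in place, terraform plan fails immediately on cost_center = "misc", before any provider API call is made.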
| Frequency | Action |
|---|---|
| Weekly | Review cost anomaly alerts (>20% increase from baseline) |
| Monthly | Review cost by cost center and team, identify top 5 cost drivers |
| Quarterly | Full cost optimization review: right-sizing, reserved instances, unused resources |
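The weekly anomaly review above can be fed by automated detection. A sketch using AWS Cost Explorer anomaly detection via the hashicorp/aws provider -- resource names and the email address are illustrative, and the 20% threshold mirrors the baseline in the table:

```hcl
# Monitor per-service spend for anomalies.
resource "aws_ce_anomaly_monitor" "services" {
  name              = "service-spend"
  monitor_type      = "DIMENSIONAL"
  monitor_dimension = "SERVICE"
}

# Weekly digest, only for anomalies >= 20% above baseline.
resource "aws_ce_anomaly_subscription" "weekly" {
  name             = "weekly-anomaly-digest"
  frequency        = "WEEKLY"
  monitor_arn_list = [aws_ce_anomaly_monitor.services.arn]

  subscriber {
    type    = "EMAIL"
    address = "finops@myorg.com" # illustrative
  }

  threshold_expression {
    dimension {
      key           = "ANOMALY_TOTAL_IMPACT_PERCENTAGE"
      values        = ["20"]
      match_options = ["GREATER_THAN_OR_EQUAL"]
    }
  }
}
```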
Every service gets monitoring from the moment it is deployed. Not after the first incident. Not after someone asks "do we have alerting?" The monitoring module provides sensible defaults that work out of the box, with the ability to override thresholds per-service.
| Metric | Default Threshold | Rationale |
|---|---|---|
| CPU utilization | 80% | Leaves headroom for traffic spikes |
| Memory utilization | 85% | OOM kills are catastrophic; catch early |
| Disk/storage free | 10GB or 10% | Disk-full crashes databases and logging |
| HTTP 5xx error rate | > 1% of requests | Backend errors visible to users |
| Response latency (p95) | Service-defined | Varies by service; must be explicitly set |
| Health check failures | 2 consecutive | Avoid alerting on transient network blips |
| Database connections | 80% of max | Connection exhaustion cascades to all clients |
| Read latency | 100ms | Slow reads indicate query or index issues |
| Write latency | 1s | Slow writes indicate lock contention or disk issues |
Not every alert makes sense for every service. Use a threshold sentinel value of -1 to disable specific alarms without removing the monitoring module.
module "alerts" {
source = "git::https://github.com/myorg/tf-module-alerts.git?ref=v1.3.0"
service_name = "myapp-api"
alarm_email = "myapp-team@myorg.com"
# Use defaults for most thresholds
cpu_utilization_threshold = 80 # default
storage_free_threshold = 10 # default (GB)
# Disable network alerting (not relevant for this service)
network_in_threshold = -1
network_out_threshold = -1
# Custom threshold for this specific service
http_5xx_threshold = 0.5 # Stricter than default: alert at 0.5% error rate
}
Inside the module, the -1 sentinel disables alarm creation:
locals {
create_cpu_alarm = var.cpu_utilization_threshold >= 0
create_network_alarm = var.network_in_threshold >= 0
}
resource "aws_cloudwatch_metric_alarm" "cpu" {
count = local.create_cpu_alarm ? 1 : 0
# ... alarm configuration
}
Configure alarms to treat missing data as "not breaching." Services that scale to zero (serverless, spot instances) should not trigger alarms when no data is reported. This prevents alarm storms during expected idle periods.
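On AWS, this maps to the treat_missing_data argument of aws_cloudwatch_metric_alarm. A hedged sketch (names and namespace are illustrative, not part of the module above):

```hcl
resource "aws_cloudwatch_metric_alarm" "cpu_high" {
  alarm_name          = "myapp-api-cpu-high" # illustrative
  namespace           = "AWS/ECS"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  comparison_operator = "GreaterThanThreshold"
  threshold           = 80
  evaluation_periods  = 3
  period              = 60

  # Services that scale to zero report no data while idle;
  # "notBreaching" keeps that from triggering an alarm storm.
  treat_missing_data = "notBreaching"
}
```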
Infrastructure drift occurs when someone modifies a resource outside of Terraform -- through the cloud console, a CLI command, or another automation tool. Drift is silent, invisible, and dangerous. The infrastructure your code describes and the infrastructure that actually exists diverge without anyone knowing.
Run terraform plan on a schedule (daily for production, weekly for development). Any planned changes on a clean state indicate drift -- someone changed something outside of Terraform. The terraform plan -detailed-exitcode flag is critical: exit code 0 means no changes (clean), exit code 2 means drift detected. Alert on exit code 2.
For a complete drift detection pipeline implementation (GitHub Actions workflow with matrix strategy across layers, alerting, and scheduling), see the unified-cicd-platform skill.
| Drift Type | Cause | Action |
|---|---|---|
| Security group rule added | Console change during incident | Import into Terraform or revert |
| Instance type changed | Manual right-sizing | Update Terraform to match or revert |
| Tag missing | Resource modified outside IaC | Re-apply Terraform to restore tags |
| Resource deleted | Manual cleanup without IaC update | Remove from Terraform state or recreate |
| New resource exists | Console-created, not in Terraform | Import into Terraform or delete |
Infrastructure not in code is a liability. Console-created resources will be deleted when discovered. If an emergency required a console change, the change must be imported into Terraform within 48 hours and documented in an ADR or incident report.
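Since Terraform 1.5, adopting an emergency console change can be done with a config-driven import block rather than the CLI. A sketch assuming a hypothetical security group rule added during an incident (the IDs are illustrative):

```hcl
# Adopt the console-created rule instead of deleting it.
# Import ID format: <sg-id>_<type>_<protocol>_<from>_<to>_<cidr>
import {
  to = aws_security_group_rule.incident_allow_bastion
  id = "sg-0123456789abcdef0_ingress_tcp_22_22_10.0.0.0/8"
}

resource "aws_security_group_rule" "incident_allow_bastion" {
  security_group_id = "sg-0123456789abcdef0" # illustrative
  type              = "ingress"
  protocol          = "tcp"
  from_port         = 22
  to_port           = 22
  cidr_blocks       = ["10.0.0.0/8"]
}
```

The next terraform plan shows the import; after apply, the resource is state-managed and the import block can be removed.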
Storage, logs, artifacts, and snapshots accumulate silently. Without lifecycle policies, a $5/month logging bill becomes a $500/month logging bill within a year.
# S3 lifecycle policy for data ingestion buckets
resource "aws_s3_bucket_lifecycle_configuration" "data" {
bucket = aws_s3_bucket.data.id
rule {
id = "archive-old-data"
status = "Enabled"
transition {
days = 30
storage_class = "STANDARD_IA" # Infrequent access after 30 days
}
transition {
days = 90
storage_class = "GLACIER_IR" # Archive after 90 days
}
expiration {
days = 365 # Delete after 1 year
}
}
}
# Log group with explicit retention
resource "aws_cloudwatch_log_group" "service" {
name = "/ecs/${module.labels.prefix}myapp-api"
retention_in_days = 90 # Production logs: 90 days
# Dev logs: 14 days is sufficient
# retention_in_days = 14
}
Never create log groups without a retention policy. The default in most cloud providers is "retain forever," which means unbounded cost growth.
| Artifact Type | Retention Policy | Rationale |
|---|---|---|
| Container images | See container-image-tagging skill | Retention policy defined with full Terraform example |
| Database snapshots | 14 days automated, manual snapshots reviewed monthly | Compliance + cost control |
| Build artifacts | 30 days | Rarely needed after deployment verified |
| Terraform plan files | 7 days | Only needed during review cycle |
| Temporary uploads | 24 hours | Processing should be complete; auto-expire |
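The automated snapshot window in the table above translates directly to backup_retention_period on the database resource. A sketch with illustrative names:

```hcl
resource "aws_db_instance" "main" {
  identifier        = "myapp-db" # illustrative
  engine            = "postgres"
  instance_class    = "db.t4g.medium"
  allocated_storage = 20

  # 14 days of automated snapshots, per the retention table above.
  backup_retention_period = 14

  # Manual snapshots are NOT covered by this setting -- they persist
  # until deleted, hence the monthly review in the table.
  skip_final_snapshot       = false
  final_snapshot_identifier = "myapp-db-final"
}
```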
| Concept | AWS | GCP | Azure |
|---|---|---|---|
| Cost attribution | Cost Explorer + Cost Categories | Billing Reports + Labels | Cost Management + Tags |
| Cost anomaly detection | Cost Anomaly Detection | Budget Alerts | Cost Alerts |
| Monitoring alarms | CloudWatch Alarms | Cloud Monitoring Alerting Policies | Azure Monitor Alerts |
| Log retention | CloudWatch Logs retention_in_days | Cloud Logging retention settings | Log Analytics retention |
| Storage lifecycle | S3 Lifecycle Configuration | GCS Lifecycle Rules | Blob Lifecycle Management |
| Drift detection | terraform plan -detailed-exitcode | terraform plan -detailed-exitcode | terraform plan -detailed-exitcode |
| Compliance scanning | AWS Config Rules | Organization Policy Constraints | Azure Policy |
| Resource inventory | AWS Config Recorder | Cloud Asset Inventory | Azure Resource Graph |
Working implementations in examples/:
- examples/monitoring-and-alerting-module.md -- Terraform monitoring module with sensible defaults, the -1 disable pattern, and missing-data-safe alarm configurations across compute, database, and HTTP services
- examples/drift-detection-pipeline.md -- Scheduled CI/CD pipeline that runs terraform plan daily, detects drift via exit codes, and alerts the team with actionable context

When designing or reviewing operational hygiene practices:
- Every resource carries the required tags (see the naming-and-labeling-as-code skill for the canonical list)
- Cost centers are validated at terraform plan time via a closed list in the labels module
- Alarms that do not apply to a service are disabled with the -1 sentinel, not by removing the monitoring module
- Scheduled terraform plan runs detect drift daily in production, weekly in development
- Container images and other artifacts have lifecycle policies (see the container-image-tagging skill for retention rules)