From cloud-foundation-principles
This skill should be used when the user is choosing between managed and self-hosted services, deciding whether to run Kubernetes or use managed containers, evaluating self-hosted databases vs managed databases, considering self-hosted monitoring or caches, designing for a small team (under 50 engineers), or justifying a self-hosted exception. Covers the operations tax of self-hosting, managed container orchestration over Kubernetes for small teams, managed workflow engines, managed caches and databases, managed monitoring, and the decision framework for when self-hosting is genuinely justified.
npx claudepluginhub oborchers/fractional-cto --plugin cloud-foundation-principles

This skill uses the workspace's default tool permissions.
Self-hosting a database, a cache, a workflow engine, or a Kubernetes cluster is not free. It costs patching, backup verification, incident response at 3 AM, capacity planning, version upgrades, security hardening, and monitoring of the monitor. Each self-hosted service is an invisible full-time job. For a team of five engineers shipping a SaaS product, running your own PostgreSQL is the equivalent of hiring a sixth engineer whose entire job is keeping PostgreSQL alive -- except you do not hire that person, so the work falls on everyone, and nobody does it well.
Managed services trade money for engineering time. For startups and small teams (under 50 engineers), this trade is almost always correct. The cloud bill goes up by hundreds of dollars per month; the engineering team gets back thousands of dollars in reclaimed time. Self-host only when the managed service genuinely cannot meet your requirements -- and document the justification in an ADR.
Every self-hosted service carries a recurring operations cost that is invisible until something breaks.
| Operations Task | Managed Service | Self-Hosted |
|---|---|---|
| OS/kernel patching | Provider handles it | You schedule downtime, test, apply |
| Version upgrades | One-click or automatic | You test, migrate, rollback-plan, execute |
| Backup & restore | Automated, point-in-time | You configure, verify, test restores quarterly |
| Scaling | Auto-scaling or single API call | You monitor, forecast, provision, rebalance |
| High availability | Built-in multi-AZ/region | You design, implement, test failover |
| Security hardening | Provider hardens, you configure | You harden OS, network, application, and runtime |
| Monitoring | Built-in metrics and logs | You deploy exporters, configure dashboards, set alerts |
| Incident response | Provider's SRE team + your config | Your team, 24/7, for infrastructure AND application |
| Compliance | Provider certifications (SOC2, HIPAA) | You certify the infrastructure yourself |
The compound effect: one self-hosted service is manageable. Three self-hosted services (database + cache + monitoring stack) can consume 30-50% of a small team's operational capacity. With five self-hosted services, you are an infrastructure company that happens to also build a product.
Kubernetes is the most frequently self-hosted service that teams do not need. For teams under 50 engineers running fewer than 20 services, managed container platforms provide the same deployment model (containers, health checks, scaling, load balancing) without the operational overhead of cluster management, node pool sizing, ingress controller configuration, CNI plugin selection, and etcd maintenance.
| Criterion | Use Managed Containers | Use Kubernetes |
|---|---|---|
| Team size | Under 50 engineers | 50+ engineers with dedicated platform team |
| Service count | Under 20 services | 20+ services with complex networking |
| GPU workloads | No, or minimal | Heavy GPU scheduling requirements |
| Custom scheduling | Not needed | Custom schedulers, operators, CRDs required |
| Multi-cloud | Not required | Required for portability |
| Service mesh | Not needed | Istio/Linkerd required |
| Compliance | Standard | Requires specific K8s-level audit controls |
```hcl
# Good: managed container service for a team of 8 engineers
resource "aws_ecs_service" "myapp" {
  name            = "myapp"
  cluster         = data.terraform_remote_state.compute.outputs.cluster_arn
  task_definition = aws_ecs_task_definition.myapp.arn
  desired_count   = 2

  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    weight            = 50 # Increase to 100 for production-critical services
  }

  capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 50 # Spot can be interrupted; suitable for dev, use cautiously in prod
  }

  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }

  # Zero-downtime rolling update
  deployment_maximum_percent         = 200
  deployment_minimum_healthy_percent = 100
}

# Result: no nodes to patch, no cluster upgrades, no CNI plugins,
# no ingress controllers, no etcd backups. Deploy and forget.
```
```hcl
# Bad: running your own Kubernetes cluster for the same team of 8
# (EKS manages the control plane; nodes, add-ons, and upgrades remain yours)
resource "aws_eks_cluster" "main" {
  name     = "myapp-cluster"
  role_arn = aws_iam_role.eks.arn
  version  = "1.28" # New minor versions land roughly every 4 months; you must upgrade before support ends

  vpc_config {
    subnet_ids = var.private_subnet_ids
  }
}

resource "aws_eks_node_group" "workers" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "workers"
  node_role_arn   = aws_iam_role.eks_nodes.arn # Assumed node role, defined elsewhere
  subnet_ids      = var.private_subnet_ids
  instance_types  = ["m5.large"]

  scaling_config {
    desired_size = 3
    max_size     = 6
    min_size     = 2
  }

  # Now you also need: ingress-nginx, cert-manager, external-dns,
  # metrics-server, cluster-autoscaler, aws-load-balancer-controller,
  # and someone to upgrade all of them every quarter.
}
```
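On the managed side, scaling is a declarative policy rather than a node-provisioning exercise. The following is a minimal sketch, assuming the ECS service from the "Good" example and AWS Application Auto Scaling; `var.cluster_name`, the capacity bounds, and the 60% CPU target are illustrative assumptions, not recommendations from the original examples.

```hcl
# Sketch only: var.cluster_name is a hypothetical variable holding the ECS cluster name.
resource "aws_appautoscaling_target" "myapp" {
  service_namespace  = "ecs"
  scalable_dimension = "ecs:service:DesiredCount"
  resource_id        = "service/${var.cluster_name}/${aws_ecs_service.myapp.name}"
  min_capacity       = 2
  max_capacity       = 10
}

resource "aws_appautoscaling_policy" "myapp_cpu" {
  name               = "myapp-cpu-target-tracking"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.myapp.service_namespace
  scalable_dimension = aws_appautoscaling_target.myapp.scalable_dimension
  resource_id        = aws_appautoscaling_target.myapp.resource_id

  target_tracking_scaling_policy_configuration {
    # Add or remove tasks to hold average service CPU near the target (assumed value).
    target_value = 60

    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
  }
}
```

If a policy like this manages the task count, the service's `desired_count` is usually added to `lifecycle { ignore_changes = [...] }` so Terraform does not reset it on every apply.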
Self-hosted workflow engines (Airflow on EC2/K8s, Temporal self-hosted, Prefect server) require database backends, worker scaling, scheduler high availability, log aggregation, and web UI hosting. Managed workflow services handle all of this.
| Approach | What You Manage | What the Provider Manages |
|---|---|---|
| Managed Airflow | DAG code, connections, variables | Scheduler HA, worker scaling, web UI, database, upgrades |
| Self-hosted Airflow | DAG code, connections, variables, scheduler HA, worker scaling, web UI, metadata DB, Redis/Celery, upgrades, monitoring | Nothing |
| Managed step functions | Workflow definitions | Execution, scaling, retry, logging, state persistence |
| Self-hosted Temporal | Workflow code, namespace management, history DB, visibility DB, upgrades, monitoring | Nothing |
The breaking point: self-hosted Airflow is three services (scheduler, webserver, workers), a metadata database, a message broker, and a log storage backend. That is six components to keep alive for a workflow engine that is supposed to keep your other workflows alive.
Do your research first: managed workflow services vary significantly in quality. Sometimes your cloud provider's offering (e.g., MWAA) is the right choice; sometimes a specialized third-party provider (e.g., Astronomer for Airflow) offers a materially better experience. Evaluate both before committing.
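To make the "managed step functions" row concrete, here is a minimal sketch assuming AWS Step Functions. The state machine name, `aws_iam_role.sfn`, and `aws_lambda_function.export` are hypothetical placeholders defined elsewhere; the retry values are illustrative.

```hcl
# Sketch only: the IAM role and Lambda function referenced here are hypothetical.
resource "aws_sfn_state_machine" "nightly_export" {
  name     = "nightly-export"
  role_arn = aws_iam_role.sfn.arn

  # You own the workflow definition; the service handles execution,
  # retries, logging, and state persistence.
  definition = jsonencode({
    StartAt = "Export"
    States = {
      Export = {
        Type     = "Task"
        Resource = aws_lambda_function.export.arn
        Retry = [{
          ErrorEquals     = ["States.TaskFailed"]
          IntervalSeconds = 30
          MaxAttempts     = 3
          BackoffRate     = 2.0
        }]
        End = true
      }
    }
  })
}
```

There is no scheduler, metadata database, or message broker in this configuration, which is the point of the comparison with self-hosted Airflow above.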
There is almost no scenario where a startup or small team should run a self-hosted database or cache in production. The managed service gives you automated backups, point-in-time recovery, failover, patching, and monitoring for a modest premium over the raw compute cost.
```hcl
# Good: managed database with automated operations
resource "aws_db_instance" "myapp" {
  identifier                   = "${module.labels.prefix}myapp-db"
  engine                       = "postgres"
  engine_version               = "15.4"
  instance_class               = "db.t4g.medium"
  multi_az                     = true # Automatic failover
  backup_retention_period      = 14   # 14-day point-in-time recovery
  auto_minor_version_upgrade   = true # Security patches applied automatically
  storage_encrypted            = true
  performance_insights_enabled = true # Built-in query monitoring
  deletion_protection          = true
}
```
```hcl
# Bad: self-hosted PostgreSQL on an EC2 instance
resource "aws_instance" "postgres" {
  ami           = "ami-0abcdef1234567890"
  instance_type = "m5.large"

  # Now you must:
  # - Install and configure PostgreSQL
  # - Set up streaming replication for HA
  # - Configure automated backups to object storage
  # - Test backup restores quarterly
  # - Apply OS security patches monthly
  # - Apply PostgreSQL patches on your schedule
  # - Monitor replication lag, connections, disk, memory
  # - Handle failover manually or build automation
  # - Manage SSL certificates for connections
  # - None of this is in the Terraform above
}
```
The same logic applies to caches. A managed Redis/Valkey instance with automatic failover, patching, and backup costs marginally more than the equivalent EC2 instance and saves dozens of hours per quarter in operational toil.
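A managed cache can be sketched the same way as the database example. This assumes AWS ElastiCache; the identifier, node size, snapshot retention, and `aws_elasticache_subnet_group.private` are hypothetical placeholders rather than values from the original examples.

```hcl
# Sketch only: names, sizes, and the subnet group are assumptions.
resource "aws_elasticache_replication_group" "myapp" {
  replication_group_id = "myapp-cache"
  description          = "Managed cache for myapp"
  engine               = "redis" # Newer provider versions may also accept Valkey
  node_type            = "cache.t4g.medium"
  num_cache_clusters   = 2 # Primary plus one replica

  automatic_failover_enabled = true # Replica is promoted if the primary fails
  multi_az_enabled           = true
  at_rest_encryption_enabled = true
  transit_encryption_enabled = true
  snapshot_retention_limit   = 7    # Keep daily snapshots for 7 days
  auto_minor_version_upgrade = true # Patches applied in the maintenance window

  subnet_group_name = aws_elasticache_subnet_group.private.name # Hypothetical, defined elsewhere
}
```

Failover, patching, and backups are each a single attribute here; on a self-hosted instance, each one is a project.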
Self-hosted monitoring stacks (Prometheus + Grafana + Alertmanager + Loki) are four services that each need their own storage, scaling, and high availability. When your monitoring is down, you are blind to everything else being down. Managed monitoring services eliminate this circular dependency.
| Component | Self-Hosted | Managed Alternative |
|---|---|---|
| Metrics collection | Prometheus (+ storage, HA, federation) | Managed Prometheus / cloud metrics |
| Visualization | Grafana (+ database, auth, HA) | Managed Grafana / cloud dashboards |
| Alerting | Alertmanager (+ dedup, routing, HA) | Cloud alerting / managed alert rules |
| Log aggregation | Loki or ELK (+ storage, retention, indexing) | Cloud logging service |
The irony of self-hosted monitoring: the one service that must be available when everything else is failing is the one you built yourself on the same infrastructure that is failing. Managed monitoring runs on the provider's infrastructure, independent of your workloads.
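As a rough illustration of how little there is to operate on the managed side, the sketch below assumes Amazon Managed Service for Prometheus for metrics and CloudWatch Logs for log aggregation. The alias, log group name, and 30-day retention are illustrative assumptions.

```hcl
# Sketch only: alias, log group name, and retention period are assumptions.
resource "aws_prometheus_workspace" "myapp" {
  alias = "myapp-metrics" # Storage, HA, and scaling are the provider's problem
}

resource "aws_cloudwatch_log_group" "myapp" {
  name              = "/ecs/myapp"
  retention_in_days = 30 # Retention is one setting, not a storage cluster to run
}
```

Neither resource runs on your own compute, so an outage in your workloads does not take the monitoring down with it.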
Self-hosting is justified when -- and only when -- the managed service genuinely cannot meet a hard requirement. Document every exception in an ADR with this structure:
If your platform team would not accept the operational burden of maintaining it, do not self-host it. Use the managed service -- that is the paved road. Self-hosted Kubernetes needs a dedicated platform engineer. Self-hosted monitoring needs an observability engineer. If those roles do not exist on your team, the managed equivalent is the correct choice.
| Concept | AWS | GCP | Azure |
|---|---|---|---|
| Managed containers (standard) | ECS Fargate | Cloud Run / GKE Autopilot | Container Apps |
| Managed containers (GPU) | ECS with EC2 capacity providers | GKE with GPU node pools | AKS with GPU node pools |
| Managed Kubernetes | EKS (if you must) | GKE Autopilot | AKS |
| Managed PostgreSQL | RDS PostgreSQL / Aurora | Cloud SQL / AlloyDB | Azure Database for PostgreSQL |
| Managed Redis/cache | ElastiCache / MemoryDB | Memorystore | Azure Cache for Redis |
| Managed workflow engine | MWAA (Airflow) / Step Functions | Cloud Composer / Workflows | (no direct Airflow equivalent) / Logic Apps |
| Managed Prometheus | Amazon Managed Prometheus | Cloud Monitoring (built-in) | Azure Monitor (Prometheus) |
| Managed Grafana | Amazon Managed Grafana | Cloud Monitoring dashboards | Azure Managed Grafana |
| Managed log aggregation | CloudWatch Logs | Cloud Logging | Azure Monitor Logs |
Working implementations in examples/:
- examples/managed-container-service.md -- Complete managed container deployment with spot/preemptible capacity, circuit breaker rollback, auto-scaling, and zero-downtime rolling updates -- no cluster management required
- examples/managed-data-stack.md -- Production-grade managed database and cache with automated backups, failover, encryption, and monitoring -- contrasted against the self-hosted equivalent to illustrate the operations tax

When designing or reviewing service hosting decisions: