From atum-workflows
Cloud architecture pattern library — Well-Architected Framework principles (operational excellence, security, reliability, performance efficiency, cost optimization, sustainability) from AWS / GCP / Azure / Oracle, multi-region deployment strategies (active-active, active-passive, pilot light, warm standby), high availability patterns (multi-AZ databases, load balancer health checks, circuit breakers, retry with exponential backoff and jitter), disaster recovery (RPO / RTO definition, backup strategies, cross-region replication, runbook documentation, DR testing cadence), cost optimization (Reserved Instances vs Savings Plans vs Spot, right-sizing, idle resource detection, FinOps practices, AWS Cost Anomaly Detection, GCP Recommender, Azure Advisor), serverless vs containers vs VMs decision framework, networking (VPC peering, Transit Gateway, Direct Connect / Interconnect / ExpressRoute, IPv6 dual-stack, private endpoints), identity (IAM least privilege, OIDC for CI/CD, AWS SSO / GCP IAM / Azure AD, federated identity), data residency and compliance (GDPR data location, HIPAA, FedRAMP), and the multi-cloud vs single-cloud trade-off. Use when designing a new system in the cloud, migrating from on-prem, choosing between AWS / GCP / Azure / Cloudflare for a specific workload, planning a DR strategy, or running a cost optimization exercise. Differentiates from terraform-patterns (infrastructure-as-code execution) and kubernetes-patterns (workload orchestration) by focusing on the architectural decisions that those tools then implement.
```
npx claudepluginhub arnwaldn/atum-plugins-collection --plugin atum-workflows
```

This skill uses the workspace's default tool permissions.
This skill covers **high-level cloud architecture decisions**: provider choice, HA / DR strategy, networking, identity, FinOps. It complements `terraform-patterns` (execution) and `kubernetes-patterns` (workloads).
Ground rule: before writing any Terraform, know WHY you are choosing that infrastructure. Architecture > implementation.
```
Workload type
├── Static site (marketing, blog, docs)
│   └── Cloudflare Pages, Vercel, Netlify (never EC2/GKE)
├── Event-driven serverless app (webhooks, cron, lightweight GraphQL)
│   ├── AWS Lambda + API Gateway + DynamoDB
│   ├── Cloudflare Workers + D1 + R2
│   └── GCP Cloud Run + Firestore
├── Traditional full-stack web app (Next.js, Rails, Django)
│   ├── Vercel / Netlify (Next.js frontend)
│   ├── Railway / Fly.io / Render (simple backend)
│   └── AWS ECS Fargate / GCP Cloud Run (more control)
├── Containerized microservices at large scale
│   └── EKS / GKE / AKS + service mesh
├── HPC / batch processing
│   └── AWS Batch, GCP Dataflow, Spot/Preemptible VMs
├── ML training / inference
│   ├── Training: AWS SageMaker, GCP Vertex AI, Modal, RunPod
│   └── Inference: AWS Bedrock, GCP Vertex, Replicate, Together AI
├── Stateful workloads at very large scale (massive databases)
│   └── EKS/GKE + cloud-managed DB (RDS, Cloud SQL, Cosmos DB)
└── On-prem / sovereign cloud
    └── OpenShift, Rancher, Nutanix
```
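The tree above can be encoded as a simple lookup table, handy for sanity-checking a proposed choice during design review. The workload keys and mappings below are an illustrative subset of the tree, not an exhaustive catalog:

```python
# Illustrative mapping of workload type to default platform candidates,
# mirroring the decision tree above.
WORKLOAD_PLATFORMS = {
    "static_site": ["Cloudflare Pages", "Vercel", "Netlify"],
    "serverless_event_driven": ["AWS Lambda", "Cloudflare Workers", "GCP Cloud Run"],
    "fullstack_web": ["Vercel", "Railway", "AWS ECS Fargate"],
    "microservices_at_scale": ["EKS", "GKE", "AKS"],
    "hpc_batch": ["AWS Batch", "GCP Dataflow"],
    "ml_training": ["AWS SageMaker", "GCP Vertex AI"],
    "on_prem": ["OpenShift", "Rancher", "Nutanix"],
}

def recommend(workload: str) -> list:
    """Return candidate platforms for a workload type; fail loudly on unknown input."""
    try:
        return WORKLOAD_PLATFORMS[workload]
    except KeyError:
        raise ValueError(f"Unknown workload type: {workload!r}") from None
```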
```
Region: eu-west-3
├── AZ a: RDS Primary (eu-west-3a)
├── AZ b: RDS synchronous Standby (eu-west-3b)
└── AZ c: RDS async Read Replica (eu-west-3c)
```

Automatic failover: a → b in 60-90 seconds (Multi-AZ).
```
Internet
    ↓
Application Load Balancer (multi-AZ)
    ↓ (health check on /health)
Target group:
├── instance-1 (eu-west-3a) ✅
├── instance-2 (eu-west-3b) ✅
└── instance-3 (eu-west-3c) ❌ (drained)
```
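The load balancer marks a target healthy or drains it based on responses from the health path. A minimal sketch of such an endpoint using only Python's standard library (the path and response shape are illustrative; a real endpoint would also probe critical dependencies):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    """Minimal health endpoint for load-balancer checks."""

    def do_GET(self):
        if self.path == "/health":
            # A real check would also verify DB connectivity, cache, etc.
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep health-check noise out of the access logs

# To serve: HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```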
```python
import time
import random

def call_with_retry(fn, max_retries=3, base_delay=1):
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
```
Without jitter, all clients retry at exactly the same moment → a thundering herd that makes the outage worse.
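Retries pair naturally with the circuit breaker pattern mentioned in the pattern list: once a dependency keeps failing, stop calling it instead of hammering it. A minimal sketch (the thresholds and timings are illustrative defaults, not prescriptions):

```python
import time

class CircuitBreaker:
    """Open the circuit after `failure_threshold` consecutive failures;
    reject calls until `reset_timeout` seconds pass, then allow a trial call."""

    def __init__(self, failure_threshold=5, reset_timeout=30):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: dependency assumed down")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit again
        return result
```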
| Strategy | RPO | RTO | Cost | Use case |
|---|---|---|---|---|
| Backup & Restore | Hours | Hours-Days | $ | Non-critical, dev/test |
| Pilot Light | Minutes | Hours | $$ | Important non-critical |
| Warm Standby | Seconds | Minutes | $$$ | Production critical |
| Active-Active Multi-region | ~0 | ~0 | $$$$ | Mission critical |
RTO and RPO are business decisions, not technical ones. Ask leadership.
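Once the business has stated its targets, the table above can be applied mechanically. A sketch of that mapping, with thresholds that are illustrative approximations of the table's orders of magnitude:

```python
def pick_dr_strategy(rpo_seconds: float, rto_seconds: float) -> str:
    """Map RPO/RTO targets (in seconds) to the cheapest strategy that
    meets them, using rough thresholds from the table above."""
    HOUR = 3600
    if rpo_seconds <= 60 and rto_seconds <= 60:
        return "Active-Active Multi-region"
    if rpo_seconds <= 60 and rto_seconds <= HOUR:
        return "Warm Standby"
    if rpo_seconds <= HOUR and rto_seconds <= 24 * HOUR:
        return "Pilot Light"
    return "Backup & Restore"
```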
```
Backup cadence:
├── Automatic DB snapshots (hourly)
├── Daily logical dump (pg_dump → S3)
├── Cross-region replication (S3 lifecycle)
└── Continuous log backup (PITR — point-in-time recovery)

Retention:
├── Snapshots: 7 days
├── Daily backups: 30 days
├── Monthly backups: 12 months
└── Yearly backups: 7 years (compliance)
```

Restore tests are mandatory at least quarterly. An untested backup is not a backup.
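The retention tiers above amount to a tiered pruning rule. A sketch that decides whether a given backup is still inside its window, with the windows copied from the list above:

```python
from datetime import datetime, timedelta

# Retention windows from the policy above
RETENTION = {
    "snapshot": timedelta(days=7),
    "daily": timedelta(days=30),
    "monthly": timedelta(days=365),
    "yearly": timedelta(days=7 * 365),
}

def should_keep(tier: str, taken_at: datetime, now: datetime) -> bool:
    """True if a backup of the given tier is still within its retention window."""
    return now - taken_at <= RETENTION[tier]
```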
```
Region eu-west-3
└── VPC 10.0.0.0/16
    ├── AZ a (10.0.0.0/19)
    │   ├── Public subnet 10.0.0.0/22 (ALB, NAT GW, Bastion)
    │   ├── Private subnet 10.0.16.0/22 (App tier)
    │   └── Database subnet 10.0.24.0/24 (RDS)
    ├── AZ b (10.0.32.0/19) — same structure
    └── AZ c (10.0.64.0/19) — same structure
```
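An addressing plan like this can be sanity-checked before any Terraform is written, using Python's standard `ipaddress` module (CIDRs copied from the plan above):

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
az_blocks = [ipaddress.ip_network(c) for c in
             ("10.0.0.0/19", "10.0.32.0/19", "10.0.64.0/19")]
subnets_az_a = [ipaddress.ip_network(c) for c in
                ("10.0.0.0/22", "10.0.16.0/22", "10.0.24.0/24")]

# Every AZ block must sit inside the VPC CIDR
assert all(block.subnet_of(vpc) for block in az_blocks)
# AZ blocks must not overlap each other
assert not any(a.overlaps(b) for i, a in enumerate(az_blocks)
               for b in az_blocks[i + 1:])
# AZ a's subnets must sit inside AZ a's /19
assert all(s.subnet_of(az_blocks[0]) for s in subnets_az_a)
print("addressing plan is consistent")
```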
| Method | Use case |
|---|---|
| VPC Peering | 2 VPCs same region, simple |
| Transit Gateway (AWS) / Hub-and-Spoke (GCP NCC) | 5+ VPCs, multi-region |
| PrivateLink (AWS) / Service Connect (GCP) | Expose a private service to other VPCs |
| Direct Connect / Interconnect / ExpressRoute | On-prem ↔ cloud, dedicated bandwidth |
| VPN Site-to-Site | On-prem ↔ cloud, simpler and cheaper |
To reach S3 / DynamoDB / Cloud Storage without traversing the public Internet, use VPC endpoints: Gateway Endpoints (S3, DynamoDB) or Interface Endpoints (PrivateLink) for other services.
Bad: `arn:aws:iam::aws:policy/AdministratorAccess` everywhere
Good: custom policies with explicit `Resource:` and `Condition:`
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::my-bucket/uploads/*",
      "Condition": {
        "IpAddress": { "aws:SourceIp": ["10.0.0.0/16"] }
      }
    }
  ]
}
```
GitHub Actions ──OIDC──> AWS IAM Role ──> AWS Resources
No long-lived AWS access keys in GitHub Secrets. The OIDC token is short-lived (1 h max).
Human users should NEVER have IAM users with access keys. Always SSO + temporary credentials.
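In GitHub Actions, the OIDC flow above typically looks like the sketch below. The role ARN and region are placeholders; `aws-actions/configure-aws-credentials` is the official action that performs the token-for-credentials exchange:

```yaml
permissions:
  id-token: write   # allow the job to request an OIDC token
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-deploy  # placeholder
          aws-region: eu-west-3
      - run: aws sts get-caller-identity  # temporary credentials, no stored keys
```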
```
Typical AWS bill breakdown:
├── Compute (EC2, ECS, Lambda) — 30-50%
├── Database (RDS, DynamoDB, ElastiCache) — 15-25%
├── Storage (S3, EBS) — 10-20%
├── Data transfer (egress, cross-AZ) — 5-15% ⚠️ often underestimated
├── Networking (NAT, ALB, VPN) — 5-10%
└── Other (CloudWatch, Route53, etc.) — 5%
```
```
Environment: prod | staging | dev
Project: api | frontend | analytics
Owner: team-platform | team-data
CostCenter: engineering | marketing
```

Without tags, costs cannot be allocated per team or project.
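A sketch of what that allocation looks like once tags exist: grouping billing line items by a tag key. The line-item shape here is illustrative, not the Cost Explorer API:

```python
from collections import defaultdict

def costs_by_tag(line_items, tag_key):
    """Sum cost line items by the value of one tag; untagged spend is
    bucketed under 'untagged' so the allocation gap stays visible."""
    totals = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "untagged")
        totals[owner] += item["cost_usd"]
    return dict(totals)

items = [
    {"cost_usd": 120.0, "tags": {"Owner": "team-platform"}},
    {"cost_usd": 80.0, "tags": {"Owner": "team-data"}},
    {"cost_usd": 15.0, "tags": {}},  # untagged — the allocation gap
]
print(costs_by_tag(items, "Owner"))
# → {'team-platform': 120.0, 'team-data': 80.0, 'untagged': 15.0}
```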
Single-cloud
Pros: deep integration, minimal operational overhead, concentrated expertise
Cons: vendor lock-in, single point of failure if AWS goes down

Multi-cloud
Pros: true HA, negotiation leverage, regulatory fit (data sovereignty)
Cons: 10x complexity, heavy management costs, rare expertise, least-common-denominator services

Multi-cloud for DR only
Pros: very robust DR without the operational cost of active-active
Cons: complicated to test, moderate cost

Recommendation: single-cloud by default. Multi-cloud only when justified by compliance, regulation, or business criticality.
Related skills:
- terraform-patterns (this plugin)
- kubernetes-patterns (this plugin)
- security-expert (atum-compliance)
- compliance-expert (atum-compliance)
- penetration-tester (atum-compliance)
- ci-cd-engineer (atum-stack-backend)