Skill

infrastructure-architecture

This skill should be used when the user asks to "design cloud infrastructure", "plan network topology", "define HA/DR strategy", "set up cloud landing zones", or "optimize cloud costs". [EXPLICIT] Also triggers on mentions of VPC, Kubernetes, serverless, multi-AZ, IAM, reserved instances, chaos testing, or any compute/network/storage platform design. Use this skill even if the user only mentions a single infrastructure concern — the full platform context is always relevant. [EXPLICIT]

From jm-adk

Install

Run in your terminal

npx claudepluginhub javimontano/jm-adk-alfa

Tool Access

This skill is limited to using the following tools:

Read, Write, Edit, Glob, Grep, Bash

Supporting Assets


agents/guardian.md

agents/lead.md

agents/specialist.md

agents/support.md

evals/evals.json

knowledge/body-of-knowledge.md

knowledge/knowledge-graph.md

prompts/meta.md

prompts/primary.md

prompts/variations/deep.md

prompts/variations/quick.md

references/infra-arch-patterns.md

templates/output.docx.md

templates/output.html

Skill Content

Infrastructure Architecture: Platform & Runtime Design

Infrastructure architecture designs where and how software runs — compute resources, network topology, data storage, high availability, disaster recovery, identity management, and cost optimization. It answers: "How do we provide a platform for applications?"

Guiding Principle

The best infrastructure is invisible infrastructure. The platform exists so that applications can run, not to be admired. It is designed for reliability, cost-efficiency, and self-service. If developers have to file tickets to deploy, the infrastructure has failed its mission.

Infrastructure Philosophy

Infrastructure as Code, always. If it is not in code, it does not exist. Terraform, Pulumi, or CDK; never manual console changes in production. [EXPLICIT]
HA/DR is not optional. Multi-AZ is the minimum. RPO and RTO are defined BEFORE the design, not after the first incident. [EXPLICIT]
FinOps from Day 1. Cloud cost is not controlled at the end; it is designed in from the start. Reserved instances, right-sizing, and cost alerts are part of the design. [EXPLICIT]

Inputs

The user provides a system or platform name as $ARGUMENTS. Parse $1 as the platform/system name used throughout all output artifacts. [EXPLICIT]

Parameters:

{MODO}: piloto-auto (default) | desatendido | supervisado | paso-a-paso
- piloto-auto (autopilot): Automatic for infrastructure analysis and network design; human-in-the-loop (HITL) for HA/DR decisions and cost commitments. [EXPLICIT]
- desatendido (unattended): Zero interruptions. Infrastructure is documented automatically, and assumptions are recorded. [EXPLICIT]
- supervisado (supervised): Autonomous, with checkpoints at network topology, compute strategy, and cost optimization. [EXPLICIT]
- paso-a-paso (step-by-step): Confirms every VPC, subnet, compute choice, storage tier, and cost recommendation. [EXPLICIT]
{FORMATO}: markdown (default) | html | dual
{VARIANTE}: ejecutiva (executive, ~40% of the content: S1 network topology + S4 HA/DR + S7 cost optimization) | técnica (technical, full 7 sections, default)

Before generating architecture, detect infrastructure context:

!find . -name "*.tf" -o -name "*.yaml" -o -name "Dockerfile" -o -name "*.hcl" | head -20

If reference materials exist, load them:

Read ${CLAUDE_SKILL_DIR}/references/cloud-patterns.md
Read ${CLAUDE_SKILL_DIR}/references/cost-models.md

When to Use

Designing cloud infrastructure (AWS, Azure, GCP) or on-premises platforms
Planning network topology (VPCs, subnets, firewalls, load balancers, CDN)
Defining HA/DR strategy (availability zones, failover, backup, recovery)
Designing IAM model (service accounts, roles, least privilege, zero trust)
Planning cost optimization (reserved instances, spot, auto-scaling, right-sizing)
Establishing cloud landing zones (account structure, guardrails, compliance)
Capacity planning (compute, storage, bandwidth for growth and peaks)

When NOT to Use

Internal software structure → metodologia-software-architecture
End-to-end solution design → metodologia-solutions-architecture
Enterprise portfolio alignment → metodologia-enterprise-architecture
Build pipelines and security controls → metodologia-devsecops-architecture

Delivery Structure: 7 Sections

S1: Network Topology

Design of network architecture ensuring connectivity, segmentation, security, and resilience. [EXPLICIT]

VPC/Network Architecture: Subnets by tier:

Public: load balancers, NAT gateways, bastion hosts
Private: application servers (no inbound from internet)
Protected: databases, sensitive data (no outbound)
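
The three-tier layout above can be sketched as an address plan. This is a minimal illustration using Python's standard `ipaddress` module; the VPC range `10.0.0.0/16`, the `/20` subnet size, and the zone names are assumptions chosen for the example, not values prescribed by this skill.

```python
import ipaddress

def plan_subnets(vpc_cidr, zones, tiers=("public", "private", "protected"), new_prefix=20):
    """Carve one equal-sized subnet per (tier, zone) pair out of the VPC range."""
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = vpc.subnets(new_prefix=new_prefix)  # generator of non-overlapping blocks
    plan = {}
    for tier in tiers:
        for zone in zones:
            try:
                plan[(tier, zone)] = next(blocks)
            except StopIteration:
                raise ValueError("VPC range too small for the requested subnets") from None
    return plan

# Hypothetical VPC range and zone names, for illustration only.
plan = plan_subnets("10.0.0.0/16", ["az-a", "az-b", "az-c"])
for (tier, zone), block in plan.items():
    print(f"{tier:<10} {zone}  {block}")
```

Allocating every tier in every zone up front keeps the plan symmetric, which makes multi-AZ routing and security rules easier to reason about. The unused blocks remain available for later growth.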

Connectivity: Intra-region, inter-region, VPN, Direct Connect/dedicated circuits

Firewalls & Security Groups: Network ACLs (stateless), Security Groups (stateful), least privilege

Load Balancing: L4 (NLB), L7 (ALB), geographic (Route 53, CloudFront)

DNS & CDN: Public DNS, private DNS (service discovery), CDN (cache globally)

DDoS & WAF: Shield/Cloudflare for DDoS, WAF for application-layer attacks

S2: Compute & Containers

Strategy for running workloads — VMs, containers, or serverless. [EXPLICIT]

VMs: Full control, best for legacy/compliance. Trade-off: more management overhead.
Containers (Docker/K8s): Standardized, portable. Best for microservices. Trade-off: orchestration complexity.
Serverless: No infra management, pay per invocation. Best for event-driven. Trade-off: cold start, vendor lock-in, cost at scale.

Kubernetes Architecture (if containers):

Control plane (3+ nodes), worker nodes (auto-scale), add-ons (ingress, autoscaler, monitoring)

Auto-Scaling: Horizontal (stateless), vertical (stateful); metrics: CPU, memory, custom, queue depth

Resource Limits: Requests (guaranteed), limits (max); balanced for predictability vs. flexibility
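
For the horizontal case, the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler is a useful mental model: desired replicas = ceil(current replicas × current metric / target metric). The sketch below reproduces that rule in plain Python. The replica bounds and the 10% tolerance band are illustrative defaults, and real autoscalers add stabilization windows and readiness handling that are omitted here.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Proportional scaling rule, clamped to [min_replicas, max_replicas]."""
    ratio = current_metric / target_metric
    # Skip scaling when the metric is already close to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))

# Hypothetical reading: 4 pods averaging 90% CPU against a 60% target.
print(desired_replicas(4, current_metric=90, target_metric=60))
```

The same function works for queue depth or any custom metric, as long as the metric scales roughly linearly with load per replica.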

S3: Storage & Data

Data persistence — performance, reliability, cost. [EXPLICIT]

Block Storage: Virtual hard drives for IOPS-intensive workloads (databases)
Object Storage: Distributed, durable, cheap at scale (backups, logs, media, data lake)
File Storage: Shared filesystem (NFS) for multi-instance access

Database Hosting: Managed (RDS/Cloud SQL: less ops, more cost) vs. self-managed (full control, more ops)

Backup & DR:

RPO: acceptable data loss; RTO: acceptable downtime
Backup frequency vs. storage cost trade-off
Geographically separate backup location; restore testing mandatory
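
The RPO/RTO definitions above translate into a simple check against a backup schedule. The sketch below assumes periodic backups with no continuous replication, so the worst-case data loss equals one full backup interval. The `RecoveryPlan` type and its field names are hypothetical, introduced only for this illustration.

```python
from dataclasses import dataclass

@dataclass
class RecoveryPlan:
    backup_interval_min: float        # time between successive backups
    restore_duration_min: float       # as measured in the latest restore test
    detection_delay_min: float = 0.0  # time before the restore actually starts

def worst_case_data_loss(plan):
    # A failure just before the next backup loses one full interval of writes.
    return plan.backup_interval_min

def worst_case_downtime(plan):
    return plan.detection_delay_min + plan.restore_duration_min

def meets_objectives(plan, rpo_min, rto_min):
    """True when both worst-case figures fit within the stated objectives."""
    return (worst_case_data_loss(plan) <= rpo_min
            and worst_case_downtime(plan) <= rto_min)

# Hypothetical plan: hourly backups, 45-minute restore, 10-minute detection.
plan = RecoveryPlan(backup_interval_min=60, restore_duration_min=45, detection_delay_min=10)
print(meets_objectives(plan, rpo_min=60, rto_min=60))
```

Feeding `restore_duration_min` from an actual restore test, rather than an estimate, is what makes the mandatory restore testing above pay off.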

Data Tiering: Hot (SSD), warm (standard), cold (archive/Glacier); lifecycle policies for automatic transitions
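
A lifecycle policy of this kind boils down to an age-based rule. The following sketch picks a tier from the number of days since an object was last accessed. The 30-day and 90-day thresholds are assumptions for illustration; real thresholds depend on access patterns and on the provider's minimum storage durations and retrieval fees.

```python
def storage_tier(days_since_last_access, warm_after_days=30, cold_after_days=90):
    """Return 'hot', 'warm', or 'cold' for an object of the given age."""
    if days_since_last_access < 0:
        raise ValueError("age cannot be negative")
    if days_since_last_access >= cold_after_days:
        return "cold"
    if days_since_last_access >= warm_after_days:
        return "warm"
    return "hot"

# Hypothetical object ages in days.
for age in (3, 45, 400):
    print(age, storage_tier(age))
```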

S4: HA & Disaster Recovery (Multi-AZ, Chaos)

Strategy for surviving failures and maintaining continuity. [EXPLICIT]

Failure Modes:

Single instance → redundancy (replicas, failover)
Zone failure → multi-AZ deployment
Region failure → multi-region replication
Data corruption → immutable backups, point-in-time recovery
Application bug → blue-green, canary releases
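
The value of redundancy in the first three rows can be estimated with basic probability. The sketch below assumes that replicas fail independently and that any single healthy replica can serve traffic. Correlated failures (shared dependencies, bad deployments) break that assumption, so treat the result as an upper bound rather than a forecast.

```python
def combined_availability(single, replicas):
    """Availability of a group where one healthy replica is enough.

    Assumes independent failures: the group is down only when every
    replica is down at the same time.
    """
    if not 0.0 <= single <= 1.0 or replicas < 1:
        raise ValueError("single must be in [0, 1] and replicas >= 1")
    return 1 - (1 - single) ** replicas

def downtime_minutes_per_year(availability):
    return (1 - availability) * 365 * 24 * 60

# Hypothetical figure: each zone deployment is 99% available on its own.
for n in (1, 2, 3):
    a = combined_availability(0.99, n)
    print(n, round(a, 6), round(downtime_minutes_per_year(a), 1))
```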

Multi-Region: Active-passive (lower cost, longer RTO) vs. active-active (higher cost, low RTO, eventual consistency)

Failover Mechanisms: DNS-based, load balancer, database replica promotion; automatic vs. manual

Chaos Testing: Regularly kill instances, fail services, simulate zone failures. Tools: Gremlin, LitmusChaos, Chaos Monkey. Goal: validate assumptions before production incidents.

S5: IAM & Platform Security

Identity and access management for infrastructure resources. [EXPLICIT]

Identity Federation: SSO via SAML/OIDC, LDAP for legacy
Service Accounts: Short-lived credentials, least-privilege IAM roles
Secrets Management: Vault/AWS Secrets Manager/Azure Key Vault; centralized, rotated, audited
Network Segmentation: Public/private/protected tiers with explicit allow rules
Encryption: In transit (TLS 1.2+), at rest (storage, backups); key management with rotation
Compliance & Audit: CloudTrail/audit logs (immutable), compliance scanning, secrets scanning
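
Least privilege can be partly checked mechanically. As a minimal sketch, the function below scans an AWS-style IAM policy document (the JSON shape with `Statement`, `Effect`, `Action`, and `Resource`) for `Allow` statements that combine a wildcard action with a wildcard resource. The function name and the sample policy are hypothetical, and a real review would also consider conditions, `NotAction`, and resource-level scoping.

```python
def _as_list(value):
    """IAM allows a single value or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]

def overly_broad_statements(policy):
    """Return Allow statements granting wildcard actions on all resources."""
    findings = []
    for stmt in _as_list(policy.get("Statement")):
        if stmt.get("Effect") != "Allow":
            continue
        actions = _as_list(stmt.get("Action"))
        resources = _as_list(stmt.get("Resource"))
        wildcard_action = any(a == "*" or a.endswith(":*") for a in actions)
        if wildcard_action and "*" in resources:
            findings.append(stmt)
    return findings

# Hypothetical policy for illustration.
sample = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:*", "Resource": "*"},
        {"Effect": "Allow", "Action": ["s3:GetObject"], "Resource": "arn:aws:s3:::example-bucket/*"},
    ],
}
print(len(overly_broad_statements(sample)))
```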

S6: Cloud Landing Zone & Governance

Foundation for safe, scalable, compliant cloud deployment. [EXPLICIT]

Account Structure: Management (billing/guardrails), shared services (logging/monitoring/security), workload accounts (dev/staging/prod per app or team)

Guardrails: Preventive (SCPs: no public S3 buckets) + Detective (Config/Security Hub: monitor violations)

Tagging Strategy: Owner, environment, cost center, application, compliance. Enables: cost allocation, resource discovery, compliance audits.
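
A tagging strategy only helps if it is enforced. The sketch below reports which required tags a resource lacks. The exact key spellings (for example `cost-center`) are assumptions; adapt them to the organization's own convention.

```python
# Tag keys derived from the strategy above; the spellings are illustrative.
REQUIRED_TAGS = ("owner", "environment", "cost-center", "application", "compliance")

def missing_tags(resource_tags, required=REQUIRED_TAGS):
    """Return required tag keys that are absent or have an empty value."""
    present = {
        key.strip().lower()
        for key, value in resource_tags.items()
        if str(value).strip()
    }
    return [tag for tag in required if tag not in present]

# Hypothetical resource with an incomplete tag set.
print(missing_tags({"Owner": "team-payments", "environment": "prod"}))
```

A check like this can run in CI against planned resources, so untagged resources are caught before they start accruing unallocated cost.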

Billing & Cost Allocation: Tag-based allocation, budgets & alerts, reserved instances, savings plans

Network: Hub-and-spoke (centralized shared VPC), Transit Gateway, central DNS

S7: Cost Optimization

Strategies for reducing cloud spend without sacrificing performance or reliability. [EXPLICIT]

Right-Sizing: Analyze utilization, downsize over-provisioned resources, review monthly
Commitment Discounts: Reserved Instances (30-70% off, steady-state), Savings Plans (flexible), Spot (70-90% off, fault-tolerant batch)
Managed vs. Self-Managed: Break-even analysis: when operational cost exceeds unit cost savings
Auto-Scaling: Scale down off-peak (save 30-50%), predictive scaling, cost-aware metrics
Storage Optimization: Delete unused resources, compress data (Parquet vs. CSV), lifecycle policies
Data Transfer: Minimize inter-AZ and NAT Gateway egress; CDN reduces origin bandwidth
Monitoring & Governance: Cost dashboards, anomaly detection, chargeback model
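
The commitment-discount decision comes down to a break-even utilization. A reserved commitment is paid for every hour, while on-demand is paid only for hours used, so the commitment wins once utilization exceeds 1 minus the discount. The sketch below works through that arithmetic. The hourly rate and the 40% discount are hypothetical, and upfront-payment options and resale are ignored.

```python
HOURS_PER_MONTH = 730  # common approximation: 8760 hours / 12 months

def monthly_costs(on_demand_hourly, discount, utilization):
    """Return (on_demand_cost, reserved_cost) for one instance-month."""
    on_demand = on_demand_hourly * HOURS_PER_MONTH * utilization
    reserved = on_demand_hourly * (1 - discount) * HOURS_PER_MONTH
    return on_demand, reserved

def break_even_utilization(discount):
    """Utilization above which the commitment is cheaper than on-demand."""
    return 1 - discount

# Hypothetical instance at 1.00 per hour with a 40% commitment discount.
for utilization in (0.5, 0.6, 0.9):
    od, ri = monthly_costs(1.00, 0.40, utilization)
    print(utilization, round(od, 2), round(ri, 2))
```

This is why the startup edge case favors avoiding commitments early on: with unpredictable load, utilization may sit below the break-even point.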

Trade-off Matrix

Decision	Enables	Constrains	When to Use
Multi-AZ	Survive zone failure	~2x cost, complexity	Critical workloads, availability SLA
Multi-Region	Survive region failure, global low latency	Very high cost, eventual consistency	Global app, strict RPO/RTO
RDS Managed DB	Less ops overhead	Higher cost, less control	Most workloads, HA required
Self-Managed DB	Control, potentially lower cost	High ops burden, backup responsibility	Specialized needs, sufficient ops team
Kubernetes	Flexibility, standard, portable	Ops complexity	Polyglot, stateless, K8s-experienced teams
Serverless	No infra management	Cold start, vendor lock-in, cost at scale	Event-driven, unpredictable load
Reserved Instances	30-70% discount	Inflexibility, upfront cost	Predictable, steady-state workloads
Spot Instances	70-90% discount	Interruption risk	Fault-tolerant batch, non-critical

Assumptions

Workload requirements understood (performance, availability, compliance)
Cloud platform chosen (AWS, Azure, GCP, or hybrid)
Budget exists for infrastructure
Team has cloud operations capability (or is building it)
Security and compliance requirements known
Infrastructure-as-code practices assumed (Terraform, CloudFormation)

Limits

Does not design application software (see metodologia-software-architecture)
Does not design end-to-end solutions (see metodologia-solutions-architecture)
Focuses on cloud infrastructure; on-premises design is a parallel effort

Edge Cases

On-Premises to Cloud Migration: Existing workloads must move with minimal disruption. Hybrid period: on-prem and cloud coexist. Approach: strangler fig, VPN connectivity, staged migration. [EXPLICIT]

Multi-Cloud (AWS + Azure + GCP): No unified API; complexity increases significantly. Solution: abstraction layer (Kubernetes), consistent tagging, multi-cloud governance. [EXPLICIT]

Highly Regulated (Financial, Healthcare): Data residency: data cannot leave country/region. Dedicated accounts, encryption, audit trails, periodic assessment. [EXPLICIT]

Extreme Scale (Millions of Users): Handle 10x-100x load without degradation. Cost critical from start. Global infrastructure, caching at every level, spot for batch. [EXPLICIT]

Cost-Constrained Startup: Limited budget, unpredictable growth. Serverless where possible, auto-scaling, spot instances, avoid reserved instances initially. [EXPLICIT]

Validation Gate

Before finalizing delivery, verify:

Cross-References

metodologia-software-architecture: Defines application requirements that infrastructure must support
metodologia-solutions-architecture: Integration patterns constrain network topology; observability stack runs on infrastructure
metodologia-enterprise-architecture: Technology radar and governance guide infrastructure decisions
metodologia-devsecops-architecture: Pipeline deploys to infrastructure; security gates verify compliance

Output Format Protocol

Format	Default	Description
`markdown`	Yes	Rich Markdown + Mermaid diagrams. Token-efficient.
`html`	On demand	Branded HTML (Design System). Visual impact.
`dual`	On demand	Both formats.

Default output is Markdown with embedded Mermaid diagrams. HTML generation requires explicit {FORMATO}=html parameter. [EXPLICIT]

Output Artifact

Primary: A-04_Infrastructure_Architecture_Deep.html — Executive summary, network topology, compute strategy, storage/database architecture, HA/DR strategy, IAM/security, cloud landing zone, cost optimization.

Secondary: Network diagram (VPC topology), auto-scaling policy, backup/recovery runbook, security compliance checklist, cost optimization quick wins.

Author: Javier Montaño | Last updated: 2026-03-12
