Infrastructure engineering discipline: infrastructure-as-code principles, deliverable quality standards, environment parity, change management, security posture, observability, incident response, policy-as-code, supply chain integrity, and disaster recovery. Invoke whenever a task involves any interaction with infrastructure work: provisioning, configuring, deploying, monitoring, or operating infrastructure systems.
npx claudepluginhub xobotyi/cc-foundry --plugin infrastructure
Slash command: /infrastructure:devops
**Declarative, reproducible, observable, secure by default.**
Every infrastructure failure traces to one of four root causes:
This skill prevents all four.
- [${CLAUDE_SKILL_DIR}/references/iac-principles.md]: GitOps v1.0 spec, 12-factor methodology, state management comparison (stateful/stateless/GitOps), configuration design patterns, IaC maturity data
- [${CLAUDE_SKILL_DIR}/references/observability.md]: Three pillars deep dive, SLI/SLO/error budget framework, DORA metrics, toil reduction patterns, common observability gaps, sampling strategies
- [${CLAUDE_SKILL_DIR}/references/change-management.md]: Deployment strategy comparison table, release engineering principles, rollback requirements checklist, progressive delivery tooling
- [${CLAUDE_SKILL_DIR}/references/security-posture.md]: Zero trust NIST domains, JIT access patterns, certificate lifecycle, machine identity governance, supply chain integrity, policy-as-code tooling
- [${CLAUDE_SKILL_DIR}/references/disaster-recovery.md]: RTO/RPO sizing by business impact, service tiering, recovery architecture trade-offs, chaos engineering approaches and tooling
- [${CLAUDE_SKILL_DIR}/references/testing.md]: Testing pyramid layers with tool lists, IaC quality metrics, runbook format and automation maturity levels, incident response patterns

These apply to all infrastructure work regardless of tool.
Pinning example:

- Unpinned: `image: nginx:latest` / `version: ">=2.0"` / `hashicorp/consul:*`
- Pinned: `image: nginx:1.25.4@sha256:6a5...` / `version: "2.3.1"` / `hashicorp/consul:1.17.2`

See [${CLAUDE_SKILL_DIR}/references/security-posture.md] for supply chain integrity requirements.

Secrets example:

- Plaintext: `db_password: "hunter2"` in a YAML file committed to git
- Plaintext: `ENV DB_PASS=hunter2` baked into a Dockerfile
- Managed: `db_password: "{{ vault_db_password }}"` with Ansible Vault
- Managed: `password_file: /run/secrets/db_pass` with Docker secrets or SOPS

Blast radius example:
See [${CLAUDE_SKILL_DIR}/references/change-management.md] for strategy comparison tables and rollback requirements.
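The pinning rule from the example above can be checked mechanically. The following is a minimal sketch, not part of the skill itself: the function names and the `FLOATING_TAGS` set are hypothetical, and the classification (digest, exact tag, or unpinned) is an assumption about how one might grade image references and version constraints.

```python
import re

# Illustrative set of floating tags; not exhaustive.
FLOATING_TAGS = {"latest", "stable", "edge", "*"}
# A full sha256 digest at the end of the reference.
DIGEST = re.compile(r"@sha256:[0-9a-f]{64}$")


def pin_level(ref: str) -> str:
    """Classify an image reference as 'digest', 'tag', or 'unpinned'."""
    if DIGEST.search(ref):
        return "digest"
    name_and_tag = ref.rsplit("/", 1)[-1]  # ignore registry host and port
    if ":" not in name_and_tag:
        return "unpinned"  # no tag means the registry default
    tag = name_and_tag.split(":", 1)[1]
    return "unpinned" if tag in FLOATING_TAGS else "tag"


def is_exact_version(constraint: str) -> bool:
    """True for an exact version such as "2.3.1"; ranges and wildcards fail."""
    return re.fullmatch(r"\d+\.\d+\.\d+", constraint) is not None
```

Under these assumptions `nginx:latest` and `hashicorp/consul:*` grade as unpinned, `hashicorp/consul:1.17.2` as tag-pinned, and a reference carrying a full digest as digest-pinned. A lint step could reject anything graded unpinned.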
Drift, the divergence between declared state and actual state, is the single most common source of deployment failures. Fewer than one-third of organizations continuously monitor for drift; the rest discover it during outages.
Sources of drift: manual changes ("quick fixes" via CLI/console), failed partial applies, external system changes, provider API updates, auto-scalers and controllers modifying resources legitimately.
Detection approaches by tool type:
absent declarations for cleanup
Drift example:
Reconciliation safety: Configure exclusion lists before enabling auto-sync. Controllers (HPA, cert-manager, Crossplane) legitimately modify resources, and auto-remediating those changes creates reconciliation loops that destabilize the cluster. Start with detection + alerting, then enable auto-remediation after confidence in exclusion lists is established.
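The detection-first approach with an exclusion list might look like the sketch below. It assumes declared and actual state are available as nested dictionaries; `EXCLUDED_FIELDS`, `flatten`, and `detect_drift` are hypothetical names, and a real tool would read state from its own backend.

```python
from typing import Any

# Hypothetical exclusion list: fields that controllers such as an HPA
# are expected to change, so they are reported separately.
EXCLUDED_FIELDS = {"spec.replicas"}


def flatten(obj: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into dotted paths, e.g. spec.replicas."""
    flat: dict[str, Any] = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def detect_drift(declared: dict, actual: dict, excluded: set[str] = EXCLUDED_FIELDS):
    """Return (drifted, ignored): paths whose actual value differs from declared."""
    want, have = flatten(declared), flatten(actual)
    drifted, ignored = {}, {}
    for path in sorted(want.keys() | have.keys()):
        if want.get(path) == have.get(path):
            continue
        # Excluded paths are still recorded, but never flagged for remediation.
        target = ignored if path in excluded else drifted
        target[path] = {"declared": want.get(path), "actual": have.get(path)}
    return drifted, ignored
```

Keeping the ignored changes in a separate result, instead of dropping them, lets a team review whether the exclusion list is still accurate before turning on auto-remediation.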
Three gaps to close between development and production:
Unavoidable differences (scale, domain names, credentials) are handled through environment-specific variables, not through separate code paths. If dev and prod use different infrastructure definitions, they will diverge.
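One way to picture "variables, not separate code paths" is a single definition rendered with per-environment values. This is a sketch under assumptions: `EnvVars`, `ENVIRONMENTS`, `render_service`, the domains, and the secret reference strings are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvVars:
    """Only the unavoidable differences live here (hypothetical fields)."""
    domain: str
    replicas: int
    secret_ref: str  # a reference to a secret, never the secret value


ENVIRONMENTS = {
    "dev": EnvVars(domain="dev.example.test", replicas=1, secret_ref="vault:dev/db"),
    "prod": EnvVars(domain="example.test", replicas=6, secret_ref="vault:prod/db"),
}


def render_service(env: str) -> dict:
    """Build the service definition; the structure is identical in every environment."""
    v = ENVIRONMENTS[env]
    return {
        "image": "app:1.4.2",  # same pinned artifact everywhere
        "healthcheck": "/health",
        "domain": v.domain,
        "replicas": v.replicas,
        "db_password_from": v.secret_ref,
    }
```

Because there is one `render_service` function, dev and prod cannot grow different keys or different images; only the values in `ENVIRONMENTS` vary.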
Parity example:
Indicators of parity problems:
Preventing drift: Use IaC to provision all environments from the same definitions. Use containerization to package applications with their dependencies. Use continuous drift detection (state file comparison or GitOps reconciliation) to catch divergence before it causes incidents. Treat environments as immutable — replace rather than patch. The four pillars of parity: standardization, immutability, automation, observability.
Every infrastructure change must pass:
tflint, checkov, ansible-lint, trivy, kics
[${CLAUDE_SKILL_DIR}/references/disaster-recovery.md]
Watch for these signals of degrading infrastructure code quality:
(ansible-lint profiles progress through min → basic → moderate → safety → shared → production)
Every alert with a rote, algorithmic response must be automated. If the response to a page is always the same steps, that response is a candidate for automated remediation.
Blameless postmortems: Focus on system flaws, not human error. "What went wrong" not "who caused it." Document: timeline, root cause, contributing factors, action items to prevent recurrence. Every postmortem produces concrete action items — not just "be more careful."
Toil management: Toil is operational work that is manual, repetitive, automatable, tactical, without enduring value, and grows linearly with service size. Target: less than 50% of time on toil. If a team exceeds this, step back and automate. Use SLOs to make data-driven decisions — ignore certain operational tasks if doing so does not consume the error budget.
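The error budget arithmetic behind that decision can be sketched as follows. The function name and the request-count inputs are assumptions; the formula itself is the usual one, where the budget is the share of requests the SLO allows to fail.

```python
def error_budget_remaining(slo: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent).

    slo is the target success ratio, e.g. 0.999 for "three nines".
    """
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    if total <= 0:
        return 1.0  # no traffic yet, nothing spent
    allowed_failures = (1 - slo) * total
    return 1 - failed / allowed_failures
```

For example, with a 99.9% SLO over 1,000,000 requests the budget is about 1,000 failures, so 250 failures leave roughly three quarters of the budget. A task whose neglect does not move this number is a candidate for deferral.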
Self-service reduces toil: Move 80-90% of routine requests to self-service via web forms, scripts, APIs, or pull requests to configuration files. Reserve human handling for edge cases while automating the common path.
See [${CLAUDE_SKILL_DIR}/references/testing.md] for runbook automation maturity levels and incident response patterns.
Infrastructure documentation is part of the deliverable. Undocumented infrastructure is a liability — it becomes opaque to everyone except the person who built it, and eventually to them too.
Every infrastructure component must have:
Runbooks are specific commands, not prose. Every step must be executable without interpretation:
- `systemctl restart caddy && curl -f http://localhost:80/health`
- `psql -h localhost -U app -c "SELECT 1" && echo "DB OK" || echo "DB FAIL"`

Runbook automation maturity: Start with manual runbooks. Once a procedure is validated through repeated execution, convert to semi-automated (scripts with human oversight at decision points), then fully automated (event-driven, no human intervention) for rote responses to known alerts. Every automated runbook must produce an audit trail and include a rollback path. See [${CLAUDE_SKILL_DIR}/references/testing.md] for automation patterns and success check requirements.
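A semi-automated runbook with an audit trail and a rollback path could be sketched like this. The `run_runbook` function and the step tuple layout are hypothetical, and the `ok` and `fail` commands are stand-ins so the sketch runs anywhere; a real runbook would substitute the service's own restart and health-check commands.

```python
import subprocess
import sys
from datetime import datetime, timezone


def run_runbook(steps, audit_log):
    """Run (name, action, rollback) steps in order; roll back on first failure.

    Each action and rollback is an argv list. Every execution is appended
    to audit_log so the run leaves a trail. Returns True on full success.
    """
    completed = []
    for name, action, rollback in steps:
        result = subprocess.run(action, capture_output=True, text=True)
        audit_log.append({
            "step": name,
            "at": datetime.now(timezone.utc).isoformat(),
            "exit_code": result.returncode,
        })
        if result.returncode != 0:
            # Undo already-completed steps in reverse order.
            for done_name, done_rollback in reversed(completed):
                rb = subprocess.run(done_rollback, capture_output=True, text=True)
                audit_log.append({
                    "step": f"rollback:{done_name}",
                    "at": datetime.now(timezone.utc).isoformat(),
                    "exit_code": rb.returncode,
                })
            return False
        completed.append((name, rollback))
    return True


# Stand-in commands that exit 0 and 1 respectively.
ok = [sys.executable, "-c", "raise SystemExit(0)"]
fail = [sys.executable, "-c", "raise SystemExit(1)"]
```

Calling `run_runbook([("restart", ok, ok), ("health", fail, ok)], log)` returns `False` and leaves three audit entries: the restart, the failed health check, and the rollback of the restart.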
When writing infrastructure: Apply all principles silently. Use the deliverable checklist as a completion gate — do not declare work done until every item is checked. Flag security, observability, or recoverability gaps as blocking issues, not suggestions.
When reviewing infrastructure: Cite the specific principle violated. Show the fix inline — not just what's wrong, but what correct looks like. Prioritize: security gaps first, then observability gaps, then drift/parity issues, then documentation gaps.
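The review ordering above can be expressed as a small sort. This is only an illustrative sketch: the `Finding` fields and the `PRIORITY` mapping are hypothetical, with drift and parity sharing a rank because the text groups them.

```python
from dataclasses import dataclass

# Order taken from the review guidance: lower number is reported first.
PRIORITY = {"security": 0, "observability": 1, "drift": 2, "parity": 2, "documentation": 3}


@dataclass
class Finding:
    category: str   # one of the PRIORITY keys
    principle: str  # the specific principle violated
    fix: str        # what correct looks like


def order_findings(findings: list[Finding]) -> list[Finding]:
    """Sort findings by review priority; sorted() is stable, so ties keep input order."""
    return sorted(findings, key=lambda f: PRIORITY[f.category])
```

Requiring both `principle` and `fix` on every finding mirrors the guidance that a review cites the violated principle and shows the correction inline.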
This skill defines the discipline that all infrastructure tool skills assume.
Workflow:
- ansible: playbook design, role structure, inventory, vault
- containers: Docker/Podman, Compose, image optimization
- proxmox: VM/LXC provisioning, storage, networking, clustering
- unraid: arrays, Docker, VMs, shares, plugins
- networking: VLANs, firewalls, DNS, reverse proxies, VPN, TLS

The tool skills handle how to implement. This skill handles what good looks like.
IMPORTANT: Before declaring any infrastructure task complete, return to the deliverable checklist above and verify every item. An unchecked item is an unfinished deliverable.