Skill

nw-infrastructure-and-observability

Infrastructure as Code patterns (Terraform, Kubernetes), observability design (SLOs, metrics, alerting, dashboards), and pipeline security stages. Load when designing infrastructure, observability, or security scanning.

From nw
Install
1
Run in your terminal
$
npx claudepluginhub nwave-ai/nwave --plugin nw
Tool Access

This skill uses the workspace's default tool permissions.

Skill Content

Infrastructure as Code and Observability

Terraform Patterns

Module Structure

main.tf (resource definitions) | variables.tf (input declarations) | outputs.tf (output declarations) | versions.tf (provider/terraform version constraints) | README.md (module docs).

State Management

Remote backend: S3/GCS/Azure Blob with state locking. State locking: DynamoDB/Cloud Storage/Azure Blob lease. Workspace strategy: one workspace per environment (dev/staging/prod).

Security

Never commit secrets -- use secret managers | Encrypt state at rest | Use OIDC for CI/CD auth | Least privilege IAM roles.

IaC Principles (Kief Morris)

Reproducibility (same input, same output) | Idempotency (safe to run multiple times) | Immutability (replace, do not modify) | Version control (track all changes).

IaC Patterns

  • Stack pattern: Complete infrastructure as single unit
  • Library pattern: Reusable infrastructure modules
  • Pipeline pattern: Infrastructure changes through CI/CD

Kubernetes Patterns

Core Concepts

Pods | Deployments | Services | Ingress | ConfigMaps | Secrets | PersistentVolumes | RBAC | NetworkPolicies | PodSecurityPolicies | Operators | Custom Resources | Controllers.

Production Patterns

Multi-tenancy with namespaces | Resource quotas and limits | Pod disruption budgets | Horizontal and vertical autoscaling.

Deployment Template

apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .name }}
  labels:
    app: {{ .name }}
    version: {{ .version }}
spec:
  replicas: {{ .replicas }}
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 0
  template:
    spec:
      containers:
      - name: {{ .name }}
        image: {{ .image }}:{{ .tag }}
        resources:
          requests:
            memory: {{ .memoryRequest }}
            cpu: {{ .cpuRequest }}
          limits:
            memory: {{ .memoryLimit }}
            cpu: {{ .cpuLimit }}
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5

HPA Template

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ .name }}
  minReplicas: {{ .minReplicas }}
  maxReplicas: {{ .maxReplicas }}
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

Observability Design

SLO Design

Availability SLO: successful_requests / total_requests * 100

  • 99.9% = 8.76h downtime/year | 99.95% = 4.38h | 99.99% = 52.6min
  • Error budget = 100% - SLO target

Latency SLO: requests_under_threshold / total_requests * 100

  • 99% of requests < 200ms | 99.9% of requests < 1000ms

Metrics Methods

RED Method (request-driven services): Rate (requests/sec) | Errors (error rate %) | Duration (latency p50, p90, p99).

USE Method (resources -- CPU, memory, disk): Utilization (% used) | Saturation (queue depth, waiting requests) | Errors (error counts).

Four Golden Signals (Google SRE): Latency | Traffic | Errors | Saturation.

SLO-Based Alerting

  • Fast burn: >14.4x burn rate for 1 hour -> page
  • Slow burn: >6x burn rate for 6 hours -> ticket
  • Budget nearly exhausted: >50% consumed -> warning

Alert structure: alertname | severity | service | SLO name | current value | threshold | runbook URL | dashboard URL.

Dashboard Design (per service)

Request rate (RPS) | Error rate (%) | Latency distribution (p50, p90, p99) | SLO status and error budget | Resource utilization (CPU, memory) | Dependency health.

Three Pillars of Observability (Charity Majors)

  • Logs: Event records with structured context. Use structured logging with correlation IDs.
  • Metrics: Numeric measurements over time. Use RED/USE/Golden Signals.
  • Traces: Request flow across services. Use distributed tracing with sampling.

Principles: high cardinality is essential | debug in production | understand unknown unknowns.

Pipeline Security

Security Stages

Pre-commit: Secrets scanning (pre-commit hooks) | linting. Tools: pre-commit | gitleaks | detect-secrets.

Commit stage: SAST | dependency scanning (SCA) | license compliance | secrets scanning. Tools: Semgrep/CodeQL/Bandit/SonarQube (SAST) | Dependabot/Snyk/Trivy (SCA) | Gitleaks/TruffleHog (secrets).

Build stage: Container image scanning | SBOM generation | image signing. Tools: Trivy/Grype/Clair (scanning) | Syft/CycloneDX (SBOM) | Cosign/Notary (signing).

Pre-production: DAST | API security testing | infrastructure security scanning. Tools: OWASP ZAP/Nuclei (DAST) | Checkov/tfsec/Terrascan (infrastructure).

Runtime: Runtime security monitoring | network policy enforcement | admission control. Tools: Falco/Sysdig (runtime) | OPA Gatekeeper/Kyverno (admission).

Secrets Management

Principles: never commit secrets | use short-lived credentials | rotate regularly | audit access.

  • External secrets: fetch from vault at runtime (HashiCorp Vault | AWS Secrets Manager | GCP Secret Manager)
  • SOPS: encrypt secrets in git with GPG/KMS (for GitOps workflows)

Supply Chain Security

  • SBOM: Software Bill of Materials in SPDX or CycloneDX format, generated during build
  • SLSA levels: L1 (documented build) | L2 (version control + build service) | L3 (isolated builds + signed provenance) | L4 (two-party review + hermetic builds)
Stats
Parent Repo Stars299
Parent Repo Forks37
Last CommitMar 20, 2026