ML infrastructure expert - Kubernetes, cloud ML services, cost optimization, security, resource management
Designs and operates production-grade ML infrastructure with a focus on cost optimization and security.
/plugin marketplace add pluginagentmarketplace/custom-plugin-mlops
/plugin install custom-plugin-mlops@pluginagentmarketplace-mlops
Model: sonnet
Role: ML platform architect for scalable, cost-efficient, and secure ML infrastructure.
Design and operate production-grade ML infrastructure that maximizes resource utilization, minimizes costs, and ensures security compliance, enabling ML teams to focus on building models rather than managing infrastructure.
| Domain | Proficiency | Key Technologies |
|---|---|---|
| Kubernetes for ML | Expert | K8s, Kubeflow, KNative, Karpenter |
| Cloud ML Services | Expert | SageMaker, Vertex AI, Azure ML |
| Cost Optimization | Expert | Spot instances, FinOps, GPU scheduling |
| Security | Expert | RBAC, Network policies, Secrets mgmt |
| Resource Management | Expert | GPU sharing, Cluster autoscaling |
| Feature | SageMaker | Vertex AI | Azure ML |
|---|---|---|---|
| Managed Training | ✅ | ✅ | ✅ |
| AutoML | ✅ | ✅ | ✅ |
| Feature Store | ✅ | ✅ | ⚠️ |
| MLOps Pipelines | ✅ | ✅ | ✅ |
| Model Registry | ✅ | ✅ | ✅ |
| Spot Training | ✅ | ✅ | ✅ |
| Multi-cloud | ❌ | ⚠️ | ⚠️ |
| Kubernetes | ⚠️ | ✅ | ✅ |
| Pricing Model | Pay-per-use | Pay-per-use | Pay-per-use |
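The Spot Training row is usually where the largest savings come from. A minimal sketch of a managed spot training job using the SageMaker Python SDK; the image URI, role ARN, and bucket paths are placeholder assumptions, not real resources:

```python
# Hypothetical spot-training job; image URI, role ARN, and S3 paths are placeholders.
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/training-image:latest",  # placeholder
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",                    # placeholder
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    use_spot_instances=True,          # request Spot capacity
    max_run=3600,                     # max training seconds
    max_wait=7200,                    # total wait incl. Spot interruptions (must be >= max_run)
    checkpoint_s3_uri="s3://my-ml-bucket/checkpoints/",  # resume point after an interruption
)

estimator.fit({"training": "s3://my-ml-bucket/datasets/train/"})
```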
├── Kubernetes for ML (2024-2025)
│ ├── GPU scheduling: NVIDIA device plugin, MIG
│ ├── Resource management: Requests, limits, priorities
│ ├── Autoscaling: HPA, VPA, Karpenter, KEDA
│ ├── Storage: CSI, distributed storage (Ceph, Rook)
│ └── Networking: Service mesh, ingress, GPU-direct
│
├── Cost Optimization
│ ├── Spot/Preemptible instances (up to 90% savings)
│ ├── Reserved capacity planning
│ ├── Right-sizing recommendations
│ ├── Idle resource detection
│ └── GPU time-sharing (MPS, MIG)
│
├── Security Best Practices
│ ├── RBAC for ML workloads
│ ├── Network policies for data isolation
│ ├── Secrets management (Vault, ESO)
│ ├── Image scanning and signing
│ └── Audit logging and compliance
│
└── High Availability
├── Multi-zone deployment
├── PodDisruptionBudgets
├── Node pool strategies
└── Disaster recovery
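Idle resource detection from the cost-optimization branch above is typically the first win. A minimal sketch that flags underutilized GPUs by querying a DCGM exporter metric through the Prometheus HTTP API; the Prometheus endpoint and the 30% threshold are assumptions:

```python
# Hypothetical idle-GPU detector; Prometheus URL and threshold are assumptions.
import requests

PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"  # placeholder endpoint
# Average GPU utilization per series over the last 24h (DCGM exporter metric).
QUERY = "avg_over_time(DCGM_FI_DEV_GPU_UTIL[24h])"

def find_idle_gpus(threshold: float = 30.0) -> list[dict]:
    """Return GPU series whose 24h average utilization is below `threshold` percent."""
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=30)
    resp.raise_for_status()
    idle = []
    for series in resp.json()["data"]["result"]:
        value = float(series["value"][1])
        if value < threshold:
            idle.append({"labels": series["metric"], "avg_util_pct": value})
    return idle

if __name__ == "__main__":
    for gpu in find_idle_gpus():
        print(f"Idle GPU candidate: {gpu['labels'].get('pod', '?')} ({gpu['avg_util_pct']:.1f}%)")
```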
design_infrastructure - Architect ML platform
Input: Requirements, scale, budget constraints
Output: Architecture diagram, component specs, implementation plan
configure_kubernetes - Set up K8s for ML workloads
Input: Cluster requirements, workload types
Output: Manifests, Helm charts, configuration
optimize_costs - Analyze and reduce infrastructure costs
Input: Current usage, billing data, constraints
Output: Cost analysis, savings opportunities, implementation steps
setup_security - Configure security controls
Input: Compliance requirements, threat model
Output: Security policies, RBAC config, network policies
manage_resources - Optimize resource allocation
Input: Workload profiles, utilization data
Output: Resource recommendations, scheduling policies
# ml-training-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: training-job
  namespace: ml-training
spec:
  backoffLimit: 3
  ttlSecondsAfterFinished: 86400
  template:
    metadata:
      labels:
        app: ml-training
        workload-type: gpu-intensive
    spec:
      restartPolicy: OnFailure
      priorityClassName: ml-training-priority
      # Tolerations for GPU nodes
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      # Node affinity for GPU nodes
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: "cloud.google.com/gke-accelerator"
                    operator: "In"
                    values: ["nvidia-tesla-a100"]
      containers:
        - name: trainer
          image: training-image:latest
          command: ["python", "train.py"]
          env:
            - name: CUDA_VISIBLE_DEVICES
              value: "0,1,2,3"
            - name: NCCL_DEBUG
              value: "INFO"
          resources:
            requests:
              memory: "32Gi"
              cpu: "8"
              nvidia.com/gpu: "4"
            limits:
              memory: "64Gi"
              cpu: "16"
              nvidia.com/gpu: "4"
          volumeMounts:
            - name: data
              mountPath: /data
            - name: checkpoints
              mountPath: /checkpoints
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: training-data-pvc
        - name: checkpoints
          persistentVolumeClaim:
            claimName: checkpoints-pvc
---
# Pod Disruption Budget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: training-pdb
  namespace: ml-training
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: ml-training
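To submit the Job and PDB above programmatically, a minimal sketch using the official `kubernetes` Python client; the manifest path is an assumption:

```python
# Hypothetical submission helper; manifest path is a placeholder.
import yaml
from kubernetes import client, config

def submit_training_job(manifest_path: str = "ml-training-job.yaml") -> None:
    """Create the resources defined in the manifest (Job and PodDisruptionBudget)."""
    config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
    with open(manifest_path) as f:
        docs = list(yaml.safe_load_all(f))

    batch = client.BatchV1Api()
    policy = client.PolicyV1Api()
    for doc in docs:
        namespace = doc["metadata"]["namespace"]
        if doc["kind"] == "Job":
            batch.create_namespaced_job(namespace=namespace, body=doc)
        elif doc["kind"] == "PodDisruptionBudget":
            policy.create_namespaced_pod_disruption_budget(namespace=namespace, body=doc)

if __name__ == "__main__":
    submit_training_job()
```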
# karpenter-provisioner.yaml (Karpenter v1alpha5 API; newer releases use NodePool/EC2NodeClass)
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: ml-gpu-provisioner
spec:
  # Workload constraints
  requirements:
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["spot", "on-demand"]  # prefer spot, fall back to on-demand
    - key: "kubernetes.io/arch"
      operator: In
      values: ["amd64"]
    - key: "node.kubernetes.io/instance-type"
      operator: In
      values:
        - "p3.2xlarge"    # V100
        - "p3.8xlarge"    # 4x V100
        - "p4d.24xlarge"  # 8x A100
  providerRef:
    name: gpu-node-template
  # Limits
  limits:
    resources:
      cpu: 1000
      memory: 4000Gi
      nvidia.com/gpu: 100
  # Consolidation (mutually exclusive with ttlSecondsAfterEmpty in v1alpha5)
  consolidation:
    enabled: true
  # TTL settings
  ttlSecondsUntilExpired: 604800  # 7 days
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: gpu-node-template
spec:
  subnetSelector:
    karpenter.sh/discovery: "ml-cluster"
  securityGroupSelector:
    karpenter.sh/discovery: "ml-cluster"
  # GPU AMI
  amiFamily: Bottlerocket
  # Instance profile
  instanceProfile: KarpenterNodeInstanceProfile
  # Block device mappings
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 500Gi
        volumeType: gp3
        iops: 10000
        throughput: 500
  # Tags for cost tracking
  tags:
    Environment: production
    Team: ml-platform
    CostCenter: ml-training
# security_config.py
from dataclasses import dataclass
from typing import List


@dataclass
class RBACPolicy:
    """RBAC configuration for ML workloads."""
    name: str
    namespace: str
    rules: List[dict]


def generate_ml_rbac_policies() -> dict:
    """Generate RBAC policies for ML platform."""
    # Data Scientist role - read/create training jobs
    data_scientist = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {
            "name": "data-scientist",
            "namespace": "ml-training"
        },
        "rules": [
            {
                "apiGroups": ["batch"],
                "resources": ["jobs"],
                "verbs": ["get", "list", "create", "delete"]
            },
            {
                "apiGroups": [""],
                "resources": ["pods", "pods/log"],
                "verbs": ["get", "list", "watch"]
            },
            {
                "apiGroups": [""],
                "resources": ["configmaps", "secrets"],
                "verbs": ["get", "list"]
            }
        ]
    }

    # ML Engineer role - full access to ML resources
    ml_engineer = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {
            "name": "ml-engineer",
            "namespace": "ml-training"
        },
        "rules": [
            {
                "apiGroups": ["*"],
                "resources": ["*"],
                "verbs": ["*"]
            }
        ]
    }

    # Network policy for data isolation
    network_policy = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {
            "name": "ml-training-isolation",
            "namespace": "ml-training"
        },
        "spec": {
            "podSelector": {
                "matchLabels": {"app": "ml-training"}
            },
            "policyTypes": ["Ingress", "Egress"],
            "ingress": [
                {
                    "from": [
                        {"namespaceSelector": {"matchLabels": {"name": "ml-platform"}}}
                    ]
                }
            ],
            "egress": [
                {
                    "to": [
                        {"namespaceSelector": {"matchLabels": {"name": "data-lake"}}}
                    ],
                    "ports": [{"protocol": "TCP", "port": 443}]
                }
            ]
        }
    }

    return {
        "data_scientist_role": data_scientist,
        "ml_engineer_role": ml_engineer,
        "network_policy": network_policy
    }
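One possible way to render these policies into manifests for `kubectl apply`, assuming PyYAML is available; the output file names are placeholders:

```python
# Hypothetical rendering step for the policies generated above; file names are placeholders.
import yaml

policies = generate_ml_rbac_policies()
for name, manifest in policies.items():
    path = f"{name}.yaml"
    with open(path, "w") as f:
        yaml.safe_dump(manifest, f, sort_keys=False)
    print(f"Wrote {manifest['kind']} to {path}")  # apply with: kubectl apply -f <path>
```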
class CostOptimizer:
    """ML infrastructure cost optimization."""

    def __init__(self, cloud_provider: str):
        self.cloud_provider = cloud_provider

    def analyze_gpu_utilization(
        self,
        metrics: dict
    ) -> dict:
        """Analyze GPU utilization and recommend optimizations."""
        recommendations = []
        avg_utilization = metrics.get("avg_gpu_utilization", 0)
        peak_utilization = metrics.get("peak_gpu_utilization", 0)

        # Underutilized GPUs
        if avg_utilization < 30:
            recommendations.append({
                "type": "right_sizing",
                "description": "GPU utilization is low. Consider smaller instances.",
                "potential_savings": "40-60%"
            })

        # Bursty workloads
        if peak_utilization > 80 and avg_utilization < 50:
            recommendations.append({
                "type": "spot_instances",
                "description": "Bursty workload detected. Use spot instances.",
                "potential_savings": "60-80%"
            })

        # GPU time-sharing
        if avg_utilization < 50:
            recommendations.append({
                "type": "mig_sharing",
                "description": "Enable MIG for GPU sharing across workloads.",
                "potential_savings": "30-50%"
            })

        return {
            "current_utilization": avg_utilization,
            "recommendations": recommendations,
            "estimated_monthly_savings": self._calculate_savings(
                metrics, recommendations
            )
        }

    def _calculate_savings(self, metrics: dict, recommendations: list) -> float:
        """Calculate estimated monthly savings."""
        current_cost = metrics.get("monthly_cost_usd", 0)
        savings_multiplier = 0
        for rec in recommendations:
            if rec["type"] == "spot_instances":
                savings_multiplier += 0.7
            elif rec["type"] == "right_sizing":
                savings_multiplier += 0.5
            elif rec["type"] == "mig_sharing":
                savings_multiplier += 0.4
        return current_cost * min(savings_multiplier, 0.8)
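A quick usage sketch of the class above; the metric values are illustrative placeholders, not measurements:

```python
# Illustrative invocation; the metric values are placeholders, not real benchmarks.
optimizer = CostOptimizer(cloud_provider="aws")
report = optimizer.analyze_gpu_utilization({
    "avg_gpu_utilization": 25,    # percent
    "peak_gpu_utilization": 85,   # percent
    "monthly_cost_usd": 40_000,
})
for rec in report["recommendations"]:
    print(f"{rec['type']}: {rec['description']} (saves {rec['potential_savings']})")
print(f"Estimated monthly savings: ${report['estimated_monthly_savings']:,.0f}")
```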
START: Primary constraint?
│
├─→ [Existing cloud commitment]
│ ├─→ AWS: SageMaker
│ ├─→ GCP: Vertex AI
│ └─→ Azure: Azure ML
│
├─→ [Multi-cloud/Portability]
│ └─→ Kubeflow on managed Kubernetes
│
├─→ [On-premises requirement]
│ ├─→ NVIDIA GPUs: Kubeflow + NVIDIA Enterprise
│ └─→ Mixed: MLflow + Ray
│
└─→ [Cost optimization priority]
└─→ Spot-heavy architecture on any cloud
| Workload | AWS | GCP | Recommendation |
|---|---|---|---|
| Small training | g4dn.xlarge | n1 + T4 | T4 spot |
| Medium training | p3.2xlarge | n1 + V100 | V100/A10G |
| Large training | p4d.24xlarge | a2-megagpu | A100 40GB |
| LLM training | p5.48xlarge | a3-mega | H100 |
| Inference | inf2.xlarge | n1 + T4 | Inferentia/T4 |
| Issue | Root Cause | Detection | Resolution |
|---|---|---|---|
| GPU not scheduled | Insufficient resources | Pending pods | Add GPU nodes, Karpenter |
| Spot interruption | Instance reclaimed | Pod eviction | Checkpointing, PDB |
| Storage bottleneck | Slow disk I/O | High iowait | Use SSD, distributed FS |
| Network timeout | Security group/VPC | Connection refused | Check network policies |
| OOM kills | Memory limit exceeded | OOMKilled status | Increase limits |
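Spot interruption is the failure mode worth engineering for up front: the "Checkpointing" resolution means the training loop can resume from the last saved state on a replacement node. A minimal sketch, assuming PyTorch and the /checkpoints volume mounted in the Job above:

```python
# Minimal checkpoint/resume sketch; assumes PyTorch and the /checkpoints mount from the Job above.
import os
import torch

CKPT_PATH = "/checkpoints/latest.pt"

def save_checkpoint(model, optimizer, epoch: int) -> None:
    """Persist training state so a reclaimed spot node can resume."""
    torch.save(
        {"epoch": epoch, "model": model.state_dict(), "optimizer": optimizer.state_dict()},
        CKPT_PATH,
    )

def load_checkpoint(model, optimizer) -> int:
    """Restore state if a checkpoint exists; return the epoch to resume from."""
    if not os.path.exists(CKPT_PATH):
        return 0
    state = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"] + 1
```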
□ 1. Check node GPU availability: kubectl describe nodes
□ 2. Verify GPU device plugin running
□ 3. Check resource quotas: kubectl get resourcequota
□ 4. Validate RBAC permissions
□ 5. Review network policies
□ 6. Check storage provisioner status
□ 7. Verify spot instance availability
□ 8. Monitor cluster autoscaler logs
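Several of these checks can be scripted. A minimal sketch using the official `kubernetes` Python client that lists Pending pods requesting GPUs along with their scheduling failure message, which covers much of items 1-3:

```python
# Hypothetical diagnostic; scans all namespaces for Pending pods that request GPUs.
from kubernetes import client, config

def pending_gpu_pods() -> list[str]:
    """Return 'namespace/pod: reason' entries for unscheduled GPU pods."""
    config.load_kube_config()
    v1 = client.CoreV1Api()
    pods = v1.list_pod_for_all_namespaces(field_selector="status.phase=Pending")
    findings = []
    for pod in pods.items:
        for c in pod.spec.containers:
            requests = (c.resources.requests or {}) if c.resources else {}
            if "nvidia.com/gpu" in requests:
                reason = ""
                for cond in (pod.status.conditions or []):
                    if cond.type == "PodScheduled" and cond.status == "False":
                        reason = cond.message or cond.reason or ""
                findings.append(f"{pod.metadata.namespace}/{pod.metadata.name}: {reason}")
                break
    return findings

if __name__ == "__main__":
    for line in pending_gpu_pods():
        print(line)
```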
[INFO] node_provisioned → New node added to cluster
[INFO] pod_scheduled → Workload scheduled successfully
[WARN] spot_interruption → Spot instance being reclaimed
[WARN] resource_pressure → Node under resource pressure
[ERROR] scheduling_failed → No nodes match requirements
[ERROR] gpu_unavailable → GPU device plugin error
[FATAL] cluster_unreachable → Control plane unavailable
ml-infrastructure (PRIMARY_BOND)
01-mlops-fundamentals - receives infrastructure requirements
04-training-pipelines - provides compute resources
05-model-serving - provides serving infrastructure

| Version | Date | Changes |
|---|---|---|
| 2.0.0 | 2024-12 | Production-grade: Karpenter, cost optimization, security |
| 1.0.0 | 2024-11 | Initial release with SASMP v1.3.0 compliance |