Master model serving - inference optimization, scaling, deployment, edge serving
Deploy ML models for production inference using BentoML and Triton with quantization, ONNX optimization, and Kubernetes auto-scaling. Use when you need to serve models with low latency and high throughput.
```
/plugin marketplace add pluginagentmarketplace/custom-plugin-mlops
/plugin install custom-plugin-mlops@pluginagentmarketplace-mlops
```
This skill inherits all available tools. When active, it can use any tool Claude has access to.
Bundled files:
- assets/config.yaml
- assets/schema.json
- references/GUIDE.md
- references/PATTERNS.md
- scripts/validate.py

Learn: Deploy ML models for production inference with optimization.
| Attribute | Value |
|---|---|
| Bonded Agent | 05-model-serving |
| Difficulty | Intermediate to Advanced |
| Duration | 35 hours |
| Prerequisites | mlops-basics, training-pipelines |
Platform Comparison:
| Platform | Multi-framework | Dynamic Batching | Kubernetes |
|---|---|---|---|
| TorchServe | PyTorch only | ✅ | ✅ |
| Triton | ✅ | ✅ | ✅ |
| BentoML | ✅ | ✅ | ✅ |
| Seldon | ✅ | ⚠️ | ✅ |
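As a concrete example of multi-framework serving, here is a minimal Triton client sketch using the `tritonclient` HTTP API. The model name `model` and the TorchScript-style tensor names `input__0`/`output__0` are assumptions for illustration; use whatever names the deployed model's config declares.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a Triton server (assumed to listen on the default HTTP port)
client = httpclient.InferenceServerClient(url="localhost:8000")

# Tensor names follow the TorchScript backend's convention; adjust to
# match the deployed model's config.pbtxt.
inp = httpclient.InferInput("input__0", [1, 3, 224, 224], "FP32")
inp.set_data_from_numpy(np.random.rand(1, 3, 224, 224).astype(np.float32))

result = client.infer(model_name="model", inputs=[inp])
output = result.as_numpy("output__0")
print(output.shape)
```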
Service Definition:
```python
import bentoml
import numpy as np
import torch

@bentoml.service(resources={"gpu": 1, "memory": "4Gi"})
class ModelService:
    def __init__(self):
        self.model = bentoml.pytorch.load_model("model:latest")

    @bentoml.api(route="/predict")
    async def predict(self, input_array: np.ndarray) -> dict:
        # Run inference without tracking gradients
        with torch.no_grad():
            predictions = self.model(input_array)
        return {"predictions": predictions.tolist()}
```
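Once defined, the service can be served locally (e.g. `bentoml serve service:ModelService`) and exercised with a plain HTTP client; the payload shape below is a hypothetical example.

```python
import requests

# Assumes the BentoML service is running locally on its default port (3000)
resp = requests.post(
    "http://localhost:3000/predict",
    json={"input_array": [[0.1, 0.2, 0.3]]},  # hypothetical input shape
)
print(resp.json())
```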
Optimization Techniques:
```python
import torch

# 1. Dynamic quantization: convert Linear layers to INT8 at runtime
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# 2. ONNX export for framework-agnostic serving
torch.onnx.export(model, sample_input, "model.onnx")

# 3. TensorRT conversion for NVIDIA GPUs: build an optimized engine
#    from the ONNX graph (see the sketch below)
import tensorrt as trt
```
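A minimal sketch of the ONNX-to-TensorRT step, assuming TensorRT 8.x's Python API and the `model.onnx` produced by the export above; production builds typically add calibration data for INT8.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # enable FP16 kernels where supported

# Serialize the optimized engine for deployment
engine_bytes = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(engine_bytes)
```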
Expected Speedups:
| Technique | Speedup | Accuracy Impact |
|---|---|---|
| FP16 | 2-3x | <1% |
| INT8 | 3-4x | 1-2% |
| TensorRT | 5-10x | <1% |
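These numbers vary by model and hardware, so measure rather than assume; a minimal latency-benchmark sketch, assuming a callable `model` and a representative `sample_input`:

```python
import time
import torch

def benchmark(model, sample_input, warmup=10, iters=100):
    # Warm up so lazy initialization and caching don't skew the numbers;
    # for GPU models, also call torch.cuda.synchronize() around the timers.
    with torch.no_grad():
        for _ in range(warmup):
            model(sample_input)
        start = time.perf_counter()
        for _ in range(iters):
            model(sample_input)
        elapsed = time.perf_counter() - start
    return elapsed / iters * 1000  # mean latency in ms

# Hypothetical comparison between the original and quantized models
print(f"baseline:  {benchmark(model, sample_input):.2f} ms")
print(f"quantized: {benchmark(quantized_model, sample_input):.2f} ms")
```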
Kubernetes HPA:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-serving
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
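CPU utilization is a coarse proxy for serving load. If a metrics adapter such as prometheus-adapter is installed, the HPA can target request throughput instead; a sketch, where the metric name `requests_per_second` is an assumption about what the adapter exposes:

```yaml
metrics:
  - type: Pods
    pods:
      metric:
        name: requests_per_second  # assumed to be exposed via a metrics adapter
      target:
        type: AverageValue
        averageValue: "100"
```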
```python
# templates/serving.py
from fastapi import FastAPI
import numpy as np
import torch

app = FastAPI()

class ProductionServer:
    def __init__(self, model_path: str):
        # Load a TorchScript model and switch to inference mode
        self.model = torch.jit.load(model_path)
        self.model.eval()

    def predict(self, inputs: np.ndarray) -> np.ndarray:
        with torch.no_grad():
            # Cast to float32: np.array of Python floats defaults to float64
            tensor = torch.from_numpy(inputs).float()
            outputs = self.model(tensor)
        return outputs.numpy()

server = ProductionServer("model.pt")

@app.post("/predict")
async def predict(data: dict):
    inputs = np.array(data["inputs"])
    predictions = server.predict(inputs)
    return {"predictions": predictions.tolist()}
```
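The template can be smoke-tested in-process with FastAPI's TestClient before deploying; the input shape here is hypothetical and must match what the TorchScript model expects.

```python
from fastapi.testclient import TestClient
from serving import app  # assumes the template above is saved as serving.py

client = TestClient(app)
resp = client.post("/predict", json={"inputs": [[0.1, 0.2, 0.3]]})
assert resp.status_code == 200
print(resp.json())
```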
| Issue | Cause | Solution |
|---|---|---|
| High latency | No optimization | Apply quantization, batching |
| Cold starts | Serverless | Pre-warming, min replicas |
| OOM | Model too large | Optimize, reduce batch size |
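For the cold-start row, a common mitigation is a dummy inference at startup so the first real request doesn't pay for lazy initialization; a minimal sketch against the FastAPI template above, where the input shape is an assumption:

```python
import numpy as np

@app.on_event("startup")
async def warm_up():
    # Dummy forward pass triggers JIT compilation, memory allocation, etc.
    server.predict(np.zeros((1, 3), dtype=np.float32))  # hypothetical shape
```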
| Version | Date | Changes |
|---|---|---|
| 2.0.0 | 2024-12 | Production-grade with optimization |
| 1.0.0 | 2024-11 | Initial release |