Master model serving - inference optimization, scaling, deployment, edge serving
Deploy ML models for production inference using BentoML and Triton with quantization, ONNX optimization, and Kubernetes auto-scaling. Use when you need to serve models with low latency and high throughput.
```
/plugin marketplace add pluginagentmarketplace/custom-plugin-mlops
/plugin install custom-plugin-mlops@pluginagentmarketplace-mlops
```
This skill inherits all available tools. When active, it can use any tool Claude has access to.
Bundled files:
- assets/config.yaml
- assets/schema.json
- references/GUIDE.md
- references/PATTERNS.md
- scripts/validate.py

Learn: Deploy ML models for production inference with optimization.
| Attribute | Value |
|---|---|
| Bonded Agent | 05-model-serving |
| Difficulty | Intermediate to Advanced |
| Duration | 35 hours |
| Prerequisites | mlops-basics, training-pipelines |
Platform Comparison:
| Platform | Multi-framework | Dynamic Batching | Kubernetes |
|---|---|---|---|
| TorchServe | PyTorch only | ✅ | ✅ |
| Triton | ✅ | ✅ | ✅ |
| BentoML | ✅ | ✅ | ✅ |
| Seldon | ✅ | ⚠️ | ✅ |
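As a concrete example of multi-framework serving, here is a minimal Triton client sketch using the `tritonclient` HTTP API. The model name `model` and the TorchScript-style tensor names `input__0`/`output__0` are assumptions for illustration; use whatever names the deployed model's config declares.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a Triton server (assumed to listen on the default HTTP port)
client = httpclient.InferenceServerClient(url="localhost:8000")

# Tensor names follow the TorchScript backend's convention; adjust to
# match the deployed model's config.pbtxt.
inp = httpclient.InferInput("input__0", [1, 3, 224, 224], "FP32")
inp.set_data_from_numpy(np.random.rand(1, 3, 224, 224).astype(np.float32))

result = client.infer(model_name="model", inputs=[inp])
output = result.as_numpy("output__0")
print(output.shape)
```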
Service Definition:
```python
import bentoml
import numpy as np
import torch

@bentoml.service(resources={"gpu": 1, "memory": "4Gi"})
class ModelService:
    def __init__(self):
        self.model = bentoml.pytorch.load_model("model:latest")

    @bentoml.api(route="/predict")
    async def predict(self, input_array: np.ndarray) -> dict:
        # Run inference without tracking gradients
        with torch.no_grad():
            predictions = self.model(input_array)
        return {"predictions": predictions.tolist()}
```
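Once defined, the service can be served locally (e.g. `bentoml serve service:ModelService`) and exercised with a plain HTTP client; the payload shape below is a hypothetical example.

```python
import requests

# Assumes the BentoML service is running locally on its default port (3000)
resp = requests.post(
    "http://localhost:3000/predict",
    json={"input_array": [[0.1, 0.2, 0.3]]},  # hypothetical input shape
)
print(resp.json())
```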
Optimization Techniques:
```python
import torch

# 1. Dynamic quantization: convert Linear layers to INT8 at runtime
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# 2. ONNX export for framework-agnostic serving
torch.onnx.export(model, sample_input, "model.onnx")

# 3. TensorRT conversion for NVIDIA GPUs: build an optimized engine
#    from the ONNX graph (see the sketch below)
import tensorrt as trt
```
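A minimal sketch of the ONNX-to-TensorRT step, assuming TensorRT 8.x's Python API and the `model.onnx` produced by the export above; production builds typically add calibration data for INT8.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # enable FP16 kernels where supported

# Serialize the optimized engine for deployment
engine_bytes = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(engine_bytes)
```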
Expected Speedups:
| Technique | Speedup | Accuracy Impact |
|---|---|---|
| FP16 | 2-3x | <1% |
| INT8 | 3-4x | 1-2% |
| TensorRT | 5-10x | <1% |
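These numbers vary by model and hardware, so measure rather than assume; a minimal latency-benchmark sketch, assuming a callable `model` and a representative `sample_input`:

```python
import time
import torch

def benchmark(model, sample_input, warmup=10, iters=100):
    # Warm up so lazy initialization and caching don't skew the numbers;
    # for GPU models, also call torch.cuda.synchronize() around the timers.
    with torch.no_grad():
        for _ in range(warmup):
            model(sample_input)
        start = time.perf_counter()
        for _ in range(iters):
            model(sample_input)
        elapsed = time.perf_counter() - start
    return elapsed / iters * 1000  # mean latency in ms

# Hypothetical comparison between the original and quantized models
print(f"baseline:  {benchmark(model, sample_input):.2f} ms")
print(f"quantized: {benchmark(quantized_model, sample_input):.2f} ms")
```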
Kubernetes HPA:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-serving
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
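CPU utilization is a coarse proxy for serving load. If a metrics adapter such as prometheus-adapter is installed, the HPA can target request throughput instead; a sketch, where the metric name `requests_per_second` is an assumption about what the adapter exposes:

```yaml
metrics:
  - type: Pods
    pods:
      metric:
        name: requests_per_second  # assumed to be exposed via a metrics adapter
      target:
        type: AverageValue
        averageValue: "100"
```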
```python
# templates/serving.py
from fastapi import FastAPI
import numpy as np
import torch

app = FastAPI()

class ProductionServer:
    def __init__(self, model_path: str):
        # Load a TorchScript model and switch to inference mode
        self.model = torch.jit.load(model_path)
        self.model.eval()

    def predict(self, inputs: np.ndarray) -> np.ndarray:
        with torch.no_grad():
            # Cast to float32: np.array of Python floats defaults to float64
            tensor = torch.from_numpy(inputs).float()
            outputs = self.model(tensor)
        return outputs.numpy()

server = ProductionServer("model.pt")

@app.post("/predict")
async def predict(data: dict):
    inputs = np.array(data["inputs"])
    predictions = server.predict(inputs)
    return {"predictions": predictions.tolist()}
```
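The template can be smoke-tested in-process with FastAPI's TestClient before deploying; the input shape here is hypothetical and must match what the TorchScript model expects.

```python
from fastapi.testclient import TestClient
from serving import app  # assumes the template above is saved as serving.py

client = TestClient(app)
resp = client.post("/predict", json={"inputs": [[0.1, 0.2, 0.3]]})
assert resp.status_code == 200
print(resp.json())
```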
| Issue | Cause | Solution |
|---|---|---|
| High latency | No optimization | Apply quantization, batching |
| Cold starts | Serverless | Pre-warming, min replicas |
| OOM | Model too large | Optimize, reduce batch size |
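For the cold-start row, a common mitigation is a dummy inference at startup so the first real request doesn't pay for lazy initialization; a minimal sketch against the FastAPI template above, where the input shape is an assumption:

```python
import numpy as np

@app.on_event("startup")
async def warm_up():
    # Dummy forward pass triggers JIT compilation, memory allocation, etc.
    server.predict(np.zeros((1, 3), dtype=np.float32))  # hypothetical shape
```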
| Version | Date | Changes |
|---|---|---|
| 2.0.0 | 2024-12 | Production-grade with optimization |
| 1.0.0 | 2024-11 | Initial release |