From model-deployment
Deploys ML models to production with FastAPI for serving predictions, Docker for containerization, and Kubernetes for orchestration. Handles monitoring, drift detection, latency issues, health checks, and version conflicts.
npx claudepluginhub secondsky/claude-skills --plugin model-deployment
Slash command: /model-deployment:model-deployment
Deploy trained models to production with proper serving and monitoring.
| Method | Use Case | Latency |
|---|---|---|
| REST API | Web services | Medium |
| Batch | Large-scale processing | N/A |
| Streaming | Real-time | Low |
| Edge | On-device | Very low |
from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import numpy as np
app = FastAPI()
# Load the model once at import time so every request reuses it
model = joblib.load('model.pkl')
class PredictionRequest(BaseModel):
    features: list[float]
class PredictionResponse(BaseModel):
    prediction: float
    probability: float
@app.get('/health')
def health():
    return {'status': 'healthy'}
@app.post('/predict', response_model=PredictionResponse)
def predict(request: PredictionRequest):
    # scikit-learn style models expect a 2D array: one row per sample
    features = np.array(request.features).reshape(1, -1)
    # Cast NumPy scalars to plain floats so the response serializes cleanly
    prediction = float(model.predict(features)[0])
    probability = float(model.predict_proba(features)[0].max())
    return PredictionResponse(prediction=prediction, probability=probability)
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY model.pkl .
COPY app.py .
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
from datetime import datetime
class ModelMonitor:
    def __init__(self):
        self.predictions = []
        self.latencies = []
    def log_prediction(self, input_data, prediction, latency):
        # Keep the full record for later drift analysis
        self.predictions.append({
            'input': input_data,
            'prediction': prediction,
            'latency': latency,
            'timestamp': datetime.now()
        })
        # Track latency separately so latency statistics are cheap to compute
        self.latencies.append(latency)
    def detect_drift(self, reference_distribution):
        # Compare current predictions to reference
        pass
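The `detect_drift` stub above leaves the comparison itself open. One common option, and the one the reference notes name, is a two-sample Kolmogorov-Smirnov test. The sketch below computes only the KS statistic in pure Python (no p-value); the function names `ks_statistic` and `has_drifted` are hypothetical, and the 0.1 default threshold is an assumption that would need tuning per model.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest gap between the empirical CDFs of two samples (0.0 to 1.0)."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    # Both CDFs are step functions, so the largest gap occurs at an observed value.
    for x in set(ref) | set(cur):
        ref_cdf = bisect_right(ref, x) / len(ref)
        cur_cdf = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(ref_cdf - cur_cdf))
    return gap


def has_drifted(reference, current, threshold=0.1):
    """Flag drift when the KS statistic exceeds an assumed threshold."""
    return ks_statistic(reference, current) > threshold
```

A `detect_drift` method could call `has_drifted` with the stored prediction values as `current`. A statistic of 0 means the two samples have identical empirical distributions, and 1 means they do not overlap at all.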
# 1. Save trained model
import joblib
joblib.dump(model, 'model.pkl')
# 2. Create FastAPI app (see references/fastapi-production-server.md)
# app.py with /predict and /health endpoints
# 3. Create Dockerfile
cat > Dockerfile << 'EOF'
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py model.pkl ./
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
EOF
# 4. Build and test locally
docker build -t model-api:v1.0.0 .
docker run -p 8000:8000 model-api:v1.0.0
# 5. Push to registry
docker tag model-api:v1.0.0 registry.example.com/model-api:v1.0.0
docker push registry.example.com/model-api:v1.0.0
# 6. Deploy to Kubernetes
kubectl apply -f deployment.yaml
kubectl rollout status deployment/model-api
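Step 6 applies a `deployment.yaml` that is not shown here. As a rough sketch of what a minimal manifest might contain, the hypothetical helper below builds an `apps/v1` Deployment as a plain dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The replica count, the `app` label, and the function name are assumptions; the container name `model-api` and port 8000 follow the commands above.

```python
import json


def build_deployment(name, image, replicas=2, port=8000):
    """Return a minimal apps/v1 Deployment manifest as a plain dict."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the pod template labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": port}],
                        }
                    ]
                },
            },
        },
    }


manifest = build_deployment("model-api", "registry.example.com/model-api:v1.0.0")
print(json.dumps(manifest, indent=2))
```

Redirecting the output to a file such as `deployment.json` gives something that can be applied in place of the YAML file. Probes and resource limits would be added under the container entry.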
Problem: Load balancer sends traffic to unhealthy pods, causing 503 errors.
Solution: Implement both liveness and readiness probes:
# app.py
from fastapi import HTTPException
@app.get("/health")  # Liveness: Is service alive?
async def health():
    return {"status": "healthy"}
@app.get("/ready")  # Readiness: Can handle traffic?
async def ready():
    try:
        # model_store is assumed to be defined elsewhere in app.py
        _ = model_store.model  # Verify model loaded
        return {"status": "ready"}
    except Exception:
        raise HTTPException(status_code=503, detail="Not ready")
# deployment.yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 30
readinessProbe:
  httpGet:
    path: /ready
    port: 8000
  initialDelaySeconds: 5
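The readiness endpoint relies on a `model_store` object whose definition is not shown. A minimal sketch of such a holder, assuming it only needs to load the model once and raise while it is missing, could look like the following. `ModelStore`, `ModelNotLoadedError`, and the pluggable `loader` argument are hypothetical names; the default loader uses `pickle`, and `joblib.load` could be passed instead.

```python
import pickle
import threading


class ModelNotLoadedError(RuntimeError):
    """Raised when the model is requested before loading has finished."""


class ModelStore:
    """Holds the loaded model and reports whether it is ready to serve."""

    def __init__(self, path, loader=None):
        self._path = path
        self._loader = loader or self._load_pickle
        self._model = None
        self._lock = threading.Lock()

    @staticmethod
    def _load_pickle(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)

    def load(self):
        """Load the model once; safe to call from several threads."""
        with self._lock:
            if self._model is None:
                self._model = self._loader(self._path)

    @property
    def is_ready(self):
        return self._model is not None

    @property
    def model(self):
        if self._model is None:
            raise ModelNotLoadedError(f"model at {self._path} is not loaded")
        return self._model
```

With this shape, accessing `model_store.model` before `load()` has completed raises, so the `/ready` handler returns 503 and Kubernetes keeps the pod out of rotation until the model is in memory.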
Problem: FileNotFoundError: model.pkl when container starts.
Solution: Verify model file is copied in Dockerfile and path matches:
# ❌ Wrong: Model in wrong directory (code expects /app/model.pkl)
COPY model.pkl /app/models/
# ✅ Correct: Consistent paths
COPY model.pkl /models/model.pkl
ENV MODEL_PATH=/models/model.pkl
# In Python:
import os
model_path = os.getenv("MODEL_PATH", "/models/model.pkl")
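To make a path mismatch obvious at startup, the lookup can fail early with a message that shows what the container actually contains. The helper below is a sketch under that assumption; `resolve_model_path` is a hypothetical name, and only the `MODEL_PATH` variable and its default come from the snippet above.

```python
import os
from pathlib import Path


def resolve_model_path(env=None, default="/models/model.pkl"):
    """Return the model path from MODEL_PATH, failing early if it is missing."""
    env = os.environ if env is None else env
    path = Path(env.get("MODEL_PATH", default))
    if not path.is_file():
        parent = path.parent
        # List the directory so the error shows what was actually copied.
        found = sorted(p.name for p in parent.iterdir()) if parent.is_dir() else []
        raise FileNotFoundError(
            f"model file not found at {path}; "
            f"contents of {parent}: {found or 'directory missing or empty'}"
        )
    return path
```

Calling this once before loading the model turns a late `FileNotFoundError` into a startup failure whose log line names both the expected path and the files present.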
Problem: Invalid inputs crash API with unhandled exceptions.
Solution: Use Pydantic for automatic validation:
from typing import List
import numpy as np
from pydantic import BaseModel, Field, validator
# Pydantic v1 style; v2 renames these to min_length/max_length and @field_validator
class PredictionRequest(BaseModel):
    features: List[float] = Field(..., min_items=1, max_items=100)
    @validator('features')
    def validate_finite(cls, v):
        # Reject NaN and infinity before they reach the model
        if not all(np.isfinite(val) for val in v):
            raise ValueError("All features must be finite")
        return v
# FastAPI auto-validates and returns 422 for invalid requests
@app.post("/predict")
async def predict(request: PredictionRequest):
    # Request is guaranteed valid here
    pass
Problem: Model performance degrades over time and no one notices until users complain.
Solution: Implement drift detection (see references/model-monitoring-drift.md):
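import time
import numpy as np
monitor = ModelMonitor(reference_data=training_data, drift_threshold=0.1)
@app.post("/predict")
async def predict(request: PredictionRequest):
    features = np.array(request.features).reshape(1, -1)
    start = time.perf_counter()
    prediction = model.predict(features)
    latency = time.perf_counter() - start  # Seconds spent in inference
    monitor.log_prediction(features, prediction, latency)
    # Alert if drift detected
    if monitor.should_retrain():
        alert_manager.send_alert("Model drift detected - retrain recommended")
    # Convert the NumPy array to a list so it is JSON-serializable
    return {"prediction": prediction.tolist()}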
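The `should_retrain()` check above is not defined in this section. One way it might work, sketched here with hypothetical names, is to keep a rolling window of recent predictions and compare it to the reference sample with the Population Stability Index (PSI). The window size, minimum sample count, bin count, and the use of PSI itself are assumptions; only the 0.1 threshold mirrors the snippet above.

```python
import math
from collections import deque


def psi(reference, current, bins=10):
    """Population Stability Index between two numeric samples."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back when all reference values match

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


class RollingDriftMonitor:
    """Keeps the most recent predictions and compares them to a reference."""

    def __init__(self, reference, drift_threshold=0.1, window=500, min_samples=50):
        self.reference = list(reference)
        self.drift_threshold = drift_threshold
        self.min_samples = min_samples
        self.recent = deque(maxlen=window)

    def log_prediction(self, prediction):
        self.recent.append(float(prediction))

    def should_retrain(self):
        if len(self.recent) < self.min_samples:
            return False  # not enough recent data to judge
        return psi(self.reference, list(self.recent)) > self.drift_threshold
```

The minimum sample count keeps a handful of unusual requests from triggering an alert right after a deployment, and the bounded deque keeps memory use constant.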
Problem: Pod killed by Kubernetes OOMKiller, service goes down.
Solution: Set memory/CPU limits and requests:
resources:
  requests:
    memory: "512Mi"  # Guaranteed
    cpu: "500m"
  limits:
    memory: "1Gi"  # Max allowed
    cpu: "1000m"
# Monitor actual usage:
kubectl top pods
Problem: New model version has bugs and there is no way to revert quickly.
Solution: Tag images with versions, keep previous deployment:
# Deploy with version tag
kubectl set image deployment/model-api model-api=registry/model-api:v1.2.0
# If issues, rollback to previous
kubectl rollout undo deployment/model-api
# Or specify version
kubectl set image deployment/model-api model-api=registry/model-api:v1.1.0
Problem: Processing 10,000 predictions one-by-one takes hours.
Solution: Implement batch endpoint:
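class BatchPredictionRequest(BaseModel):
    instances: list[list[float]]  # One inner list of features per sample
@app.post("/predict/batch")
async def predict_batch(request: BatchPredictionRequest):
    # Process all at once (vectorized)
    features = np.array(request.instances)
    predictions = model.predict(features)  # Much faster than a per-row loop
    return {"predictions": predictions.tolist()}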
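For very large requests, a single vectorized call can use a lot of memory. A possible middle ground, sketched below with hypothetical helper names, is to split the instances into fixed-size chunks and run one vectorized call per chunk. The chunk size of 1000 is an assumption, and `predict_fn` stands in for something like `model.predict`.

```python
def iter_chunks(instances, chunk_size):
    """Yield consecutive slices of at most chunk_size rows."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(instances), chunk_size):
        yield instances[start:start + chunk_size]


def predict_in_chunks(predict_fn, instances, chunk_size=1000):
    """Call predict_fn once per chunk and concatenate the results in order."""
    predictions = []
    for chunk in iter_chunks(instances, chunk_size):
        predictions.extend(predict_fn(chunk))
    return predictions
```

With 10,000 instances and the default chunk size this makes 10 model calls instead of 10,000, while keeping each input array small.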
Problem: Deploying model that fails basic tests, breaking production.
Solution: Validate in CI pipeline (see references/cicd-ml-models.md):
# .github/workflows/deploy.yml
- name: Validate model performance
  run: |
    python scripts/validate_model.py \
      --model model.pkl \
      --test-data test.csv \
      --min-accuracy 0.85  # Fail if below threshold
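The workflow step calls `scripts/validate_model.py`, which is not shown in this section. The sketch below is one guess at its shape, not the actual script: it takes the three flags from the step above, assumes the test CSV has a header row with numeric columns and the label in the last column, and returns a non-zero exit code when accuracy falls below the threshold.

```python
import argparse
import csv


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("labels and predictions must be non-empty and equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def load_test_data(path):
    """Read a CSV with a header row; assumes the last column is the label."""
    with open(path, newline="") as fh:
        rows = list(csv.reader(fh))[1:]
    features = [[float(x) for x in row[:-1]] for row in rows]
    labels = [float(row[-1]) for row in rows]
    return features, labels


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True)
    parser.add_argument("--test-data", required=True)
    parser.add_argument("--min-accuracy", type=float, required=True)
    args = parser.parse_args(argv)

    import joblib  # imported lazily so the helpers above need only the stdlib

    model = joblib.load(args.model)
    features, labels = load_test_data(args.test_data)
    predictions = [float(p) for p in model.predict(features)]
    score = accuracy(labels, predictions)
    print(f"accuracy={score:.4f} (minimum {args.min_accuracy})")
    return 0 if score >= args.min_accuracy else 1

# In the script itself, finish with: raise SystemExit(main())
```

The pipeline stops because of the exit code: GitHub Actions marks a step as failed when its command exits non-zero, so later deploy steps do not run.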
Load reference files for detailed implementations:
FastAPI Production Server: Load references/fastapi-production-server.md for complete production-ready FastAPI implementation with error handling, validation (Pydantic models), logging, health/readiness probes, batch predictions, model versioning, middleware, exception handlers, and performance optimizations (caching, async)
Model Monitoring & Drift: Load references/model-monitoring-drift.md for ModelMonitor implementation with KS-test drift detection, Jensen-Shannon divergence, Prometheus metrics integration, alert configuration (Slack, email), continuous monitoring service, and dashboard endpoints
Containerization & Deployment: Load references/containerization-deployment.md for multi-stage Dockerfiles, model versioning in containers, Docker Compose setup, A/B testing with Nginx, Kubernetes deployments (rolling update, blue-green, canary), GitHub Actions CI/CD, and deployment checklists
CI/CD for ML Models: Load references/cicd-ml-models.md for complete GitHub Actions pipeline with model validation, data validation, automated testing, security scanning, performance benchmarks, automated rollback, and deployment strategies