Deep expertise in ML/CV model selection, training pipelines, and inference architecture. Use when designing machine learning systems, computer vision pipelines, or AI-powered features.
```
/plugin marketplace add alirezarezvani/claude-cto-team
/plugin install cto-team@cto-team-marketplace
```

This skill inherits all available tools. When active, it can use any tool Claude has access to.
`model-catalog.md`

Provides specialized guidance for machine learning and computer vision system design, model selection, and production deployment.
## Model Selection Decision Tree

```
Use Case Identified
│
├─► Text/Language Tasks
│   ├─► Classification → BERT, DistilBERT, or API (OpenAI, Claude)
│   ├─► Generation → GPT-4, Claude, Llama (self-hosted)
│   ├─► Embeddings → OpenAI Ada, sentence-transformers
│   └─► Search/RAG → Vector DB + Embeddings + LLM
│
├─► Computer Vision Tasks
│   ├─► Classification → ResNet, EfficientNet, ViT
│   ├─► Object Detection → YOLOv8, DETR, Faster R-CNN
│   ├─► Segmentation → SAM, Mask R-CNN, U-Net
│   ├─► OCR → Tesseract, PaddleOCR, Cloud Vision API
│   └─► Face Recognition → InsightFace, DeepFace
│
├─► Audio Tasks
│   ├─► Speech-to-Text → Whisper, DeepSpeech, Cloud APIs
│   ├─► Text-to-Speech → ElevenLabs, Coqui TTS
│   └─► Audio Classification → PANNs, AudioSet models
│
└─► Structured Data
    ├─► Tabular → XGBoost, LightGBM, CatBoost
    ├─► Time Series → Prophet, ARIMA, Transformer-based
    └─► Recommendations → Two-tower, matrix factorization
```
## API vs. Self-Hosted Decision

| Factor | API Preferred | Self-Hosted Preferred |
|---|---|---|
| Volume | < 10K requests/month | > 100K requests/month |
| Latency | > 500ms acceptable | < 100ms required |
| Customization | General use case | Domain-specific fine-tuning |
| Data Privacy | Non-sensitive data | PII, HIPAA, financial |
| Team Expertise | No ML engineers | ML team available |
| Budget | Predictable per-call costs | High volume justifies infra |
## API Costs (Example: OpenAI GPT-4)
- Input: $0.03/1K tokens
- Output: $0.06/1K tokens
- Average request: 500 input + 200 output tokens
- Cost per request: $0.027
- 100K requests/month: $2,700
## Self-Hosted Costs (Example: Llama 70B)
- GPU instance: $3/hour (A100 40GB)
- Throughput: ~50 requests/minute = 3K/hour
- Cost per request: $0.001
- 100K requests/month: ~$100 in GPU time + ~$500 engineering time (assumes instances run only while serving; an always-on A100 is closer to $2,190/month at 730 hours)
## Break-even Analysis
- < 50K requests/month: API likely cheaper
- > 50K requests/month: self-hosted may be cheaper, provided GPU utilization stays high
- Factor in: engineering time, ops burden, model quality
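As a sanity check on this arithmetic, a small sketch using the example figures above (GPT-4 token prices, a $3/hour A100 at ~3K requests/hour, and ~730 hours in a month when always-on):

```python
# Break-even sketch: API per-request cost vs. self-hosted GPU cost.
API_COST_PER_REQ = 500 / 1000 * 0.03 + 200 / 1000 * 0.06  # = $0.027

def self_hosted_cost(requests_per_month, always_on=True,
                     gpu_hourly=3.0, req_per_hour=3000):
    # Always-on GPU bills ~730 hours/month regardless of traffic;
    # otherwise bill only the hours actually spent serving.
    hours = 730 if always_on else requests_per_month / req_per_hour
    return gpu_hourly * hours

for volume in (10_000, 50_000, 100_000, 500_000):
    api = volume * API_COST_PER_REQ
    hosted = self_hosted_cost(volume)
    print(f"{volume:>8,} req/mo  API ${api:>8,.0f}  self-hosted ${hosted:>8,.0f}")
```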
## ML Platform Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                         DATA LAYER                          │
├─────────────────────────────────────────────────────────────┤
│   Data Sources  →  ETL  →  Feature Store  →  Training Data  │
│   (S3, DBs)      (Airflow)    (Feast)        (Versioned)    │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                       TRAINING LAYER                        │
├─────────────────────────────────────────────────────────────┤
│  Experiment Tracking  →  Training Jobs  →  Model Registry   │
│  (MLflow, W&B)           (SageMaker)       (MLflow, S3)     │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                        SERVING LAYER                        │
├─────────────────────────────────────────────────────────────┤
│   Model Server  →  Load Balancer  →  Monitoring             │
│   (TorchServe)     (K8s/ELB)         (Prometheus)           │
└─────────────────────────────────────────────────────────────┘
```
| Component | Options | Recommendation |
|---|---|---|
| Feature Store | Feast, Tecton, SageMaker | Feast (open source), Tecton (enterprise) |
| Experiment Tracking | MLflow, Weights & Biases, Neptune | MLflow (free), W&B (best UX) |
| Training Orchestration | Kubeflow, SageMaker, Vertex AI | SageMaker (AWS), Vertex (GCP) |
| Model Registry | MLflow, SageMaker, custom S3 | MLflow (standard) |
| Model Serving | TorchServe, TFServing, Triton | Triton (multi-framework) |
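For the experiment-tracking column, a minimal MLflow sketch (the tracking URI, experiment name, and metric values here are placeholders, not conventions):

```python
# Minimal MLflow experiment tracking: params, a metric, and an artifact.
import mlflow

mlflow.set_tracking_uri("http://mlflow.internal:5000")  # placeholder URI
mlflow.set_experiment("object-detection")

with mlflow.start_run():
    mlflow.log_params({"lr": 3e-4, "batch_size": 16})
    mlflow.log_metric("val_mAP", 0.412)
    mlflow.log_artifact("model.onnx")  # attach the trained artifact to the run
```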
## Serving Pattern: Real-Time (Synchronous)

Best for: Low-latency requirements, simple integration
```
Client → API Gateway → Model Server → Response
                            │
                      Load Balancer
                            │
                     ┌──────┴──────┐
                     │             │
                 Model Pod     Model Pod
```
Latency targets: commonly P50 < 100ms and P95 < 300ms end-to-end for interactive features; model inference is only part of that budget.
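A minimal sketch of the synchronous pattern as a FastAPI endpoint; the `predict_fn` stub stands in for a real TorchServe/Triton call behind the load balancer:

```python
# Synchronous serving: one request in, one prediction out.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    inputs: list[float]

def predict_fn(inputs: list[float]) -> float:
    # Stand-in for the real model server call.
    return sum(inputs) / max(len(inputs), 1)

@app.post("/predict")
async def predict(req: PredictRequest):
    return {"prediction": predict_fn(req.inputs)}
```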
## Serving Pattern: Async (Queue-Based)

Best for: Long-running inference, batch processing
```
Client → API → Queue (SQS) → Worker → Result Store → Webhook/Poll
                                            │
                                        S3/Redis
```
Use when: inference takes seconds or longer, results can be delivered asynchronously (webhook or polling), or traffic is bursty enough that a queue smooths the load.
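A hedged sketch of the worker side with boto3 and SQS; the queue URL and the `handle` function are placeholders:

```python
# Async worker loop: long-poll the queue, run inference, ack the message.
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/inference-jobs"  # placeholder

def handle(job: dict) -> dict:
    # Stand-in for the long-running inference call.
    return {"job_id": job["id"], "status": "done"}

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                               WaitTimeSeconds=20)  # long polling
    for msg in resp.get("Messages", []):
        result = handle(json.loads(msg["Body"]))
        # Persist `result` to the result store (S3/Redis), then ack.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```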
## Serving Pattern: Edge Deployment

Best for: Privacy, offline capability, ultra-low latency
```
┌─────────────────────────────────────────┐
│               EDGE DEVICE               │
│  ┌─────────┐    ┌─────────────────────┐ │
│  │ Camera  │───▶│  Optimized Model    │ │
│  └─────────┘    │  (ONNX, TFLite)     │ │
│                 └─────────────────────┘ │
│                           │             │
│                     Local Result        │
└─────────────────────────────────────────┘
                            │
                     Sync to Cloud
                     (non-blocking)
```
Model optimization for edge: quantize (INT8), prune, and export to a lightweight runtime (ONNX, TFLite) sized to the device's memory and compute budget.
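As one example of the export step, a minimal PyTorch-to-ONNX sketch; MobileNetV3 and the 224×224 input size are illustrative choices, not requirements:

```python
# Export a small pretrained classifier to ONNX for edge runtimes.
import torch
import torchvision

model = torchvision.models.mobilenet_v3_small(weights="DEFAULT").eval()
dummy = torch.randn(1, 3, 224, 224)  # one RGB frame at the model's input size
torch.onnx.export(model, dummy, "model.onnx", opset_version=17,
                  input_names=["image"], output_names=["logits"])
```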
## Real-Time Computer Vision Pipeline

```
Camera Stream → Frame Extraction → Preprocessing → Model → Postprocessing → Output
      │                │                │            │            │
    RTSP/           1-30 FPS         Resize,      Batch or    NMS, tracking,
    WebRTC                           normalize    single      annotation
```
Performance optimization: decode and preprocess on GPU where possible (DALI), batch frames when latency allows, and skip frames under load rather than falling behind the stream.
## Pipeline Components
1. **Input Processing**
- Video decode: FFmpeg, OpenCV
- Frame buffer: Ring buffer for temporal context
- Preprocessing: NVIDIA DALI (GPU), OpenCV (CPU)
2. **Detection**
- Model: YOLOv8 (speed), DETR (accuracy)
- Batch size: 1-8 depending on latency requirements
- Confidence threshold: 0.5-0.7 typical
3. **Post-processing**
- NMS (Non-Maximum Suppression); see the sketch after this list
- Tracking: SORT, DeepSORT, ByteTrack
- Smoothing: Kalman filter for stable boxes
4. **Output**
- Annotations: Bounding boxes, labels, confidence
- Events: Trigger on detection (webhook, queue)
- Storage: Frame + metadata to S3/DB
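To make step 3 concrete, a minimal NumPy implementation of greedy NMS; the `[x1, y1, x2, y2]` box format is an assumption here, and production code would normally use the framework's built-in:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression over [x1, y1, x2, y2] boxes."""
    order = scores.argsort()[::-1]  # highest-confidence boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the top box with all remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = ((boxes[order[1:], 2] - boxes[order[1:], 0]) *
                 (boxes[order[1:], 3] - boxes[order[1:], 1]))
        iou = inter / (area_i + areas - inter)
        # Drop boxes that overlap the kept box too much.
        order = order[1:][iou <= iou_thresh]
    return keep
```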
## RAG (Retrieval-Augmented Generation)

```
User Query → Embedding → Vector Search → Context Retrieval → LLM → Response
                              │
                          Vector DB
                  (Pinecone, Weaviate,
                   Chroma, pgvector)
```
Vector DB Selection:
| Database | Best For | Limitations |
|---|---|---|
| Pinecone | Managed, scale | Cost at scale |
| Weaviate | Self-hosted, features | Operational overhead |
| Chroma | Simple, local dev | Not for production scale |
| pgvector | PostgreSQL users | Performance at >1M vectors |
| Qdrant | Performance | Newer, smaller community |
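A minimal retrieval sketch using sentence-transformers and brute-force cosine similarity; at production scale the NumPy search is replaced by one of the vector databases above. The model name and documents are illustrative:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["Refund policy: 30 days.", "Shipping takes 3-5 business days."]
doc_vecs = encoder.encode(docs, normalize_embeddings=True)

def retrieve(query: str, k: int = 1) -> list[str]:
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q  # cosine similarity (vectors are unit-normalized)
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

# The retrieved context is then prepended to the LLM prompt.
context = retrieve("How long do refunds last?")
```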
## Multi-Model LLM Gateway

```
┌─────────────────────────────────────────────────────────────┐
│                         API GATEWAY                         │
│              Rate limiting, auth, request routing           │
└─────────────────────────────────────────────────────────────┘
                              │
                ┌─────────────┼─────────────┐
                │             │             │
                ▼             ▼             ▼
           ┌────────┐    ┌────────┐    ┌────────┐
           │ GPT-4  │    │ Claude │    │ Local  │
           │  API   │    │  API   │    │ Llama  │
           └────────┘    └────────┘    └────────┘
                              │
                         Model Router
                 (cost/latency/capability)
```
Multi-model strategy: route each request by cost, latency, and capability, e.g. a local model for routine or sensitive traffic and a frontier API (GPT-4, Claude) when quality matters most.
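One possible shape for the router, sketched with placeholder handlers; the routing signals and their order are assumptions to be tuned per workload:

```python
# Placeholder clients; real GPT-4/Claude/local-Llama calls go here.
def call_frontier_api(prompt): ...
def call_cheap_api(prompt): ...
def call_local_llama(prompt): ...

def route(prompt: str, *, needs_reasoning: bool, sensitive: bool):
    if sensitive:
        return call_local_llama(prompt)   # keep PII off external APIs
    if needs_reasoning:
        return call_frontier_api(prompt)  # highest capability, highest cost
    return call_cheap_api(prompt)         # default: cheapest adequate model
```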
## Memory Optimization Techniques

| Technique | Memory Reduction | Speed Impact |
|---|---|---|
| FP16 (Half Precision) | 50% | Neutral to faster |
| INT8 Quantization | 75% | 10-20% slower |
| INT4 Quantization | 87.5% | 20-40% slower |
| Gradient Checkpointing | 60-80% | 20-30% slower |
| Model Sharding | Distributed | Communication overhead |
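For the FP16 row, a minimal Hugging Face loading sketch; the model name is illustrative, and `device_map="auto"` assumes the accelerate package is installed:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",    # illustrative; any causal LM
    torch_dtype=torch.float16,     # halves weight memory vs. FP32
    device_map="auto",             # shard across available GPUs
)
```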
A runnable sketch of dynamic batching; the `model` handle is assumed to expose an async `predict_batch`:

```python
# Dynamic batching: group concurrent requests into a single model call.
import asyncio

class DynamicBatcher:
    def __init__(self, model, max_batch=32, max_wait_ms=50):
        self.model = model
        self.queue = []                     # pending (request, future) pairs
        self.max_batch = max_batch
        self.max_wait = max_wait_ms / 1000  # seconds

    async def add_request(self, request):
        future = asyncio.get_running_loop().create_future()
        self.queue.append((request, future))
        # Flush immediately when the batch is full; otherwise wait out
        # the timeout so other requests can join.
        if len(self.queue) >= self.max_batch:
            await self.process_batch()
        else:
            await asyncio.sleep(self.max_wait)
            if self.queue:
                await self.process_batch()
        return await future

    async def process_batch(self):
        batch, self.queue = self.queue[:self.max_batch], self.queue[self.max_batch:]
        if not batch:
            return
        results = await self.model.predict_batch([req for req, _ in batch])
        # Hand each caller its own result via the paired future.
        for (_, future), result in zip(batch, results):
            future.set_result(result)
```
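The `max_wait_ms` timeout bounds how much latency any single request can pick up from batching, so the two knobs trade throughput (larger batches) against tail latency.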
## Production Monitoring Metrics

| Metric | What It Measures | Alert Threshold |
|---|---|---|
| Latency (P95) | Response time | > 2x baseline |
| Throughput | Requests/second | < 80% capacity |
| Error Rate | Failed predictions | > 1% |
| Model Drift | Distribution shift | PSI > 0.2 |
| Data Quality | Input anomalies | > 5% anomalies |
Drift detection:

```
Training Distribution ───┐
                         ├──► Statistical Test ──► Alert
Production Distribution ─┘
         (PSI, KS test, JS divergence)
```
Population Stability Index (PSI): PSI = Σ (actual% − expected%) × ln(actual% / expected%) over binned feature values. A common rule of thumb: < 0.1 stable, 0.1-0.2 moderate shift, > 0.2 significant shift (hence the alert threshold above).
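A minimal NumPy sketch of PSI over a single feature; the bin count and the epsilon floor are conventional choices, not fixed parameters:

```python
import numpy as np

def psi(expected, actual, bins=10):
    # Bin both samples on the training distribution's quantiles.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range production values
    e_pct = np.histogram(expected, edges)[0] / len(expected)
    a_pct = np.histogram(actual, edges)[0] / len(actual)
    # Floor the proportions to avoid log(0) and division by zero.
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```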
## Quick Reference: Model Selection

| Use Case | Recommended Model | Latency | Cost |
|---|---|---|---|
| Text Classification | DistilBERT | 10ms | Low |
| Text Generation | GPT-4 / Claude | 1-5s | Medium |
| Image Classification | EfficientNet-B0 | 5ms | Low |
| Object Detection | YOLOv8-n | 10ms | Low |
| Object Detection (Accurate) | YOLOv8-x | 50ms | Medium |
| Semantic Segmentation | SAM | 100ms | Medium |
| Speech-to-Text | Whisper-base | Real-time | Low |
| Embeddings | text-embedding-ada-002 | 50ms | Low |
## GPU Sizing Guide

| Scale | GPU | Model Size | Throughput |
|---|---|---|---|
| Development | T4 (16GB) | < 7B params | 10-50 req/s |
| Production Small | A10G (24GB) | < 13B params | 50-100 req/s |
| Production Medium | A100 (40GB) | < 70B params | 100-500 req/s |
| Production Large | A100 (80GB) x 2+ | > 70B params | 500+ req/s |