Deep expertise in ML/CV model selection, training pipelines, and inference architecture. Use when designing machine learning systems, computer vision pipelines, or AI-powered features.
```
/plugin marketplace add alirezarezvani/claude-cto-team
/plugin install cto-team@cto-team-marketplace
```

This skill inherits all available tools. When active, it can use any tool Claude has access to.
`model-catalog.md`

Provides specialized guidance for machine learning and computer vision system design, model selection, and production deployment.
## Model Selection Decision Tree

```
Use Case Identified
│
├─► Text/Language Tasks
│   ├─► Classification → BERT, DistilBERT, or API (OpenAI, Claude)
│   ├─► Generation → GPT-4, Claude, Llama (self-hosted)
│   ├─► Embeddings → OpenAI Ada, sentence-transformers
│   └─► Search/RAG → Vector DB + Embeddings + LLM
│
├─► Computer Vision Tasks
│   ├─► Classification → ResNet, EfficientNet, ViT
│   ├─► Object Detection → YOLOv8, DETR, Faster R-CNN
│   ├─► Segmentation → SAM, Mask R-CNN, U-Net
│   ├─► OCR → Tesseract, PaddleOCR, Cloud Vision API
│   └─► Face Recognition → InsightFace, DeepFace
│
├─► Audio Tasks
│   ├─► Speech-to-Text → Whisper, DeepSpeech, Cloud APIs
│   ├─► Text-to-Speech → ElevenLabs, Coqui TTS
│   └─► Audio Classification → PANNs, AudioSet models
│
└─► Structured Data
    ├─► Tabular → XGBoost, LightGBM, CatBoost
    ├─► Time Series → Prophet, ARIMA, Transformer-based
    └─► Recommendations → Two-tower, matrix factorization
```
## API vs. Self-Hosted Decision

| Factor | API Preferred | Self-Hosted Preferred |
|---|---|---|
| Volume | < 10K requests/month | > 100K requests/month |
| Latency | > 500ms acceptable | < 100ms required |
| Customization | General use case | Domain-specific fine-tuning |
| Data Privacy | Non-sensitive data | PII, HIPAA, financial |
| Team Expertise | No ML engineers | ML team available |
| Budget | Predictable per-call costs | High volume justifies infra |
## API Costs (Example: OpenAI GPT-4)
- Input: $0.03/1K tokens
- Output: $0.06/1K tokens
- Average request: 500 input + 200 output tokens
- Cost per request: $0.027
- 100K requests/month: $2,700
## Self-Hosted Costs (Example: Llama 70B)
- GPU instance: $3/hour (A100 40GB)
- Throughput: ~50 requests/minute = 3K/hour
- Cost per request: $0.001
- 100K requests/month: ~$100 in GPU time + ~$500 engineering time (assumes instances run only while serving; an always-on A100 is closer to $2,190/month at 730 hours)
## Break-even Analysis
- < 50K requests/month: API likely cheaper
- > 50K requests/month: self-hosted may be cheaper, provided GPU utilization stays high
- Factor in: engineering time, ops burden, model quality
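As a sanity check on this arithmetic, a small sketch using the example figures above (GPT-4 token prices, a $3/hour A100 at ~3K requests/hour, and ~730 hours in a month when always-on):

```python
# Break-even sketch: API per-request cost vs. self-hosted GPU cost.
API_COST_PER_REQ = 500 / 1000 * 0.03 + 200 / 1000 * 0.06  # = $0.027

def self_hosted_cost(requests_per_month, always_on=True,
                     gpu_hourly=3.0, req_per_hour=3000):
    # Always-on GPU bills ~730 hours/month regardless of traffic;
    # otherwise bill only the hours actually spent serving.
    hours = 730 if always_on else requests_per_month / req_per_hour
    return gpu_hourly * hours

for volume in (10_000, 50_000, 100_000, 500_000):
    api = volume * API_COST_PER_REQ
    hosted = self_hosted_cost(volume)
    print(f"{volume:>8,} req/mo  API ${api:>8,.0f}  self-hosted ${hosted:>8,.0f}")
```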
## ML Platform Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                         DATA LAYER                          │
├─────────────────────────────────────────────────────────────┤
│   Data Sources  →  ETL  →  Feature Store  →  Training Data  │
│   (S3, DBs)      (Airflow)    (Feast)        (Versioned)    │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                       TRAINING LAYER                        │
├─────────────────────────────────────────────────────────────┤
│  Experiment Tracking  →  Training Jobs  →  Model Registry   │
│  (MLflow, W&B)           (SageMaker)       (MLflow, S3)     │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                        SERVING LAYER                        │
├─────────────────────────────────────────────────────────────┤
│   Model Server  →  Load Balancer  →  Monitoring             │
│   (TorchServe)     (K8s/ELB)         (Prometheus)           │
└─────────────────────────────────────────────────────────────┘
```
| Component | Options | Recommendation |
|---|---|---|
| Feature Store | Feast, Tecton, SageMaker | Feast (open source), Tecton (enterprise) |
| Experiment Tracking | MLflow, Weights & Biases, Neptune | MLflow (free), W&B (best UX) |
| Training Orchestration | Kubeflow, SageMaker, Vertex AI | SageMaker (AWS), Vertex (GCP) |
| Model Registry | MLflow, SageMaker, custom S3 | MLflow (standard) |
| Model Serving | TorchServe, TFServing, Triton | Triton (multi-framework) |
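For the experiment-tracking column, a minimal MLflow sketch (the tracking URI, experiment name, and metric values here are placeholders, not conventions):

```python
# Minimal MLflow experiment tracking: params, a metric, and an artifact.
import mlflow

mlflow.set_tracking_uri("http://mlflow.internal:5000")  # placeholder URI
mlflow.set_experiment("object-detection")

with mlflow.start_run():
    mlflow.log_params({"lr": 3e-4, "batch_size": 16})
    mlflow.log_metric("val_mAP", 0.412)
    mlflow.log_artifact("model.onnx")  # attach the trained artifact to the run
```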
## Serving Pattern: Real-Time (Synchronous)

Best for: Low-latency requirements, simple integration
```
Client → API Gateway → Model Server → Response
                            │
                      Load Balancer
                            │
                     ┌──────┴──────┐
                     │             │
                 Model Pod     Model Pod
```
Latency targets: commonly P50 < 100ms and P95 < 300ms end-to-end for interactive features; model inference is only part of that budget.
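A minimal sketch of the synchronous pattern as a FastAPI endpoint; the `predict_fn` stub stands in for a real TorchServe/Triton call behind the load balancer:

```python
# Synchronous serving: one request in, one prediction out.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    inputs: list[float]

def predict_fn(inputs: list[float]) -> float:
    # Stand-in for the real model server call.
    return sum(inputs) / max(len(inputs), 1)

@app.post("/predict")
async def predict(req: PredictRequest):
    return {"prediction": predict_fn(req.inputs)}
```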
## Serving Pattern: Async (Queue-Based)

Best for: Long-running inference, batch processing
```
Client → API → Queue (SQS) → Worker → Result Store → Webhook/Poll
                                            │
                                        S3/Redis
```
Use when: inference takes seconds or longer, results can be delivered asynchronously (webhook or polling), or traffic is bursty enough that a queue smooths the load.
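A hedged sketch of the worker side with boto3 and SQS; the queue URL and the `handle` function are placeholders:

```python
# Async worker loop: long-poll the queue, run inference, ack the message.
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/inference-jobs"  # placeholder

def handle(job: dict) -> dict:
    # Stand-in for the long-running inference call.
    return {"job_id": job["id"], "status": "done"}

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                               WaitTimeSeconds=20)  # long polling
    for msg in resp.get("Messages", []):
        result = handle(json.loads(msg["Body"]))
        # Persist `result` to the result store (S3/Redis), then ack.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```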
## Serving Pattern: Edge Deployment

Best for: Privacy, offline capability, ultra-low latency
```
┌─────────────────────────────────────────┐
│               EDGE DEVICE               │
│  ┌─────────┐    ┌─────────────────────┐ │
│  │ Camera  │───▶│  Optimized Model    │ │
│  └─────────┘    │  (ONNX, TFLite)     │ │
│                 └─────────────────────┘ │
│                           │             │
│                     Local Result        │
└─────────────────────────────────────────┘
                            │
                     Sync to Cloud
                     (non-blocking)
```
Model optimization for edge: quantize (INT8), prune, and export to a lightweight runtime (ONNX, TFLite) sized to the device's memory and compute budget.
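As one example of the export step, a minimal PyTorch-to-ONNX sketch; MobileNetV3 and the 224×224 input size are illustrative choices, not requirements:

```python
# Export a small pretrained classifier to ONNX for edge runtimes.
import torch
import torchvision

model = torchvision.models.mobilenet_v3_small(weights="DEFAULT").eval()
dummy = torch.randn(1, 3, 224, 224)  # one RGB frame at the model's input size
torch.onnx.export(model, dummy, "model.onnx", opset_version=17,
                  input_names=["image"], output_names=["logits"])
```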
## Real-Time Computer Vision Pipeline

```
Camera Stream → Frame Extraction → Preprocessing → Model → Postprocessing → Output
      │                │                │            │            │
    RTSP/           1-30 FPS         Resize,      Batch or    NMS, tracking,
    WebRTC                           normalize    single      annotation
```
Performance optimization: decode and preprocess on GPU where possible (DALI), batch frames when latency allows, and skip frames under load rather than falling behind the stream.
## Pipeline Components
1. **Input Processing**
- Video decode: FFmpeg, OpenCV
- Frame buffer: Ring buffer for temporal context
- Preprocessing: NVIDIA DALI (GPU), OpenCV (CPU)
2. **Detection**
- Model: YOLOv8 (speed), DETR (accuracy)
- Batch size: 1-8 depending on latency requirements
- Confidence threshold: 0.5-0.7 typical
3. **Post-processing**
- NMS (Non-Maximum Suppression); see the sketch after this list
- Tracking: SORT, DeepSORT, ByteTrack
- Smoothing: Kalman filter for stable boxes
4. **Output**
- Annotations: Bounding boxes, labels, confidence
- Events: Trigger on detection (webhook, queue)
- Storage: Frame + metadata to S3/DB
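To make step 3 concrete, a minimal NumPy implementation of greedy NMS; the `[x1, y1, x2, y2]` box format is an assumption here, and production code would normally use the framework's built-in:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression over [x1, y1, x2, y2] boxes."""
    order = scores.argsort()[::-1]  # highest-confidence boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the top box with all remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = ((boxes[order[1:], 2] - boxes[order[1:], 0]) *
                 (boxes[order[1:], 3] - boxes[order[1:], 1]))
        iou = inter / (area_i + areas - inter)
        # Drop boxes that overlap the kept box too much.
        order = order[1:][iou <= iou_thresh]
    return keep
```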
## RAG (Retrieval-Augmented Generation)

```
User Query → Embedding → Vector Search → Context Retrieval → LLM → Response
                              │
                          Vector DB
                  (Pinecone, Weaviate,
                   Chroma, pgvector)
```
Vector DB Selection:
| Database | Best For | Limitations |
|---|---|---|
| Pinecone | Managed, scale | Cost at scale |
| Weaviate | Self-hosted, features | Operational overhead |
| Chroma | Simple, local dev | Not for production scale |
| pgvector | PostgreSQL users | Performance at >1M vectors |
| Qdrant | Performance | Newer, smaller community |
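A minimal retrieval sketch using sentence-transformers and brute-force cosine similarity; at production scale the NumPy search is replaced by one of the vector databases above. The model name and documents are illustrative:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["Refund policy: 30 days.", "Shipping takes 3-5 business days."]
doc_vecs = encoder.encode(docs, normalize_embeddings=True)

def retrieve(query: str, k: int = 1) -> list[str]:
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q  # cosine similarity (vectors are unit-normalized)
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

# The retrieved context is then prepended to the LLM prompt.
context = retrieve("How long do refunds last?")
```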
## Multi-Model LLM Gateway

```
┌─────────────────────────────────────────────────────────────┐
│                         API GATEWAY                         │
│              Rate limiting, auth, request routing           │
└─────────────────────────────────────────────────────────────┘
                              │
                ┌─────────────┼─────────────┐
                │             │             │
                ▼             ▼             ▼
           ┌────────┐    ┌────────┐    ┌────────┐
           │ GPT-4  │    │ Claude │    │ Local  │
           │  API   │    │  API   │    │ Llama  │
           └────────┘    └────────┘    └────────┘
                              │
                         Model Router
                 (cost/latency/capability)
```
Multi-model strategy: route each request by cost, latency, and capability, e.g. a local model for routine or sensitive traffic and a frontier API (GPT-4, Claude) when quality matters most.
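One possible shape for the router, sketched with placeholder handlers; the routing signals and their order are assumptions to be tuned per workload:

```python
# Placeholder clients; real GPT-4/Claude/local-Llama calls go here.
def call_frontier_api(prompt): ...
def call_cheap_api(prompt): ...
def call_local_llama(prompt): ...

def route(prompt: str, *, needs_reasoning: bool, sensitive: bool):
    if sensitive:
        return call_local_llama(prompt)   # keep PII off external APIs
    if needs_reasoning:
        return call_frontier_api(prompt)  # highest capability, highest cost
    return call_cheap_api(prompt)         # default: cheapest adequate model
```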
## Memory Optimization Techniques

| Technique | Memory Reduction | Speed Impact |
|---|---|---|
| FP16 (Half Precision) | 50% | Neutral to faster |
| INT8 Quantization | 75% | 10-20% slower |
| INT4 Quantization | 87.5% | 20-40% slower |
| Gradient Checkpointing | 60-80% | 20-30% slower |
| Model Sharding | Distributed | Communication overhead |
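For the FP16 row, a minimal Hugging Face loading sketch; the model name is illustrative, and `device_map="auto"` assumes the accelerate package is installed:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",    # illustrative; any causal LM
    torch_dtype=torch.float16,     # halves weight memory vs. FP32
    device_map="auto",             # shard across available GPUs
)
```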
A runnable sketch of dynamic batching; the `model` handle is assumed to expose an async `predict_batch`:

```python
# Dynamic batching: group concurrent requests into a single model call.
import asyncio

class DynamicBatcher:
    def __init__(self, model, max_batch=32, max_wait_ms=50):
        self.model = model
        self.queue = []                     # pending (request, future) pairs
        self.max_batch = max_batch
        self.max_wait = max_wait_ms / 1000  # seconds

    async def add_request(self, request):
        future = asyncio.get_running_loop().create_future()
        self.queue.append((request, future))
        # Flush immediately when the batch is full; otherwise wait out
        # the timeout so other requests can join.
        if len(self.queue) >= self.max_batch:
            await self.process_batch()
        else:
            await asyncio.sleep(self.max_wait)
            if self.queue:
                await self.process_batch()
        return await future

    async def process_batch(self):
        batch, self.queue = self.queue[:self.max_batch], self.queue[self.max_batch:]
        if not batch:
            return
        results = await self.model.predict_batch([req for req, _ in batch])
        # Hand each caller its own result via the paired future.
        for (_, future), result in zip(batch, results):
            future.set_result(result)
```
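The `max_wait_ms` timeout bounds how much latency any single request can pick up from batching, so the two knobs trade throughput (larger batches) against tail latency.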
## Production Monitoring Metrics

| Metric | What It Measures | Alert Threshold |
|---|---|---|
| Latency (P95) | Response time | > 2x baseline |
| Throughput | Requests/second | < 80% capacity |
| Error Rate | Failed predictions | > 1% |
| Model Drift | Distribution shift | PSI > 0.2 |
| Data Quality | Input anomalies | > 5% anomalies |
Drift detection:

```
Training Distribution ───┐
                         ├──► Statistical Test ──► Alert
Production Distribution ─┘
         (PSI, KS test, JS divergence)
```
Population Stability Index (PSI): PSI = Σ (actual% − expected%) × ln(actual% / expected%) over binned feature values. A common rule of thumb: < 0.1 stable, 0.1-0.2 moderate shift, > 0.2 significant shift (hence the alert threshold above).
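A minimal NumPy sketch of PSI over a single feature; the bin count and the epsilon floor are conventional choices, not fixed parameters:

```python
import numpy as np

def psi(expected, actual, bins=10):
    # Bin both samples on the training distribution's quantiles.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range production values
    e_pct = np.histogram(expected, edges)[0] / len(expected)
    a_pct = np.histogram(actual, edges)[0] / len(actual)
    # Floor the proportions to avoid log(0) and division by zero.
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```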
## Quick Reference: Model Selection

| Use Case | Recommended Model | Latency | Cost |
|---|---|---|---|
| Text Classification | DistilBERT | 10ms | Low |
| Text Generation | GPT-4 / Claude | 1-5s | Medium |
| Image Classification | EfficientNet-B0 | 5ms | Low |
| Object Detection | YOLOv8-n | 10ms | Low |
| Object Detection (Accurate) | YOLOv8-x | 50ms | Medium |
| Semantic Segmentation | SAM | 100ms | Medium |
| Speech-to-Text | Whisper-base | Real-time | Low |
| Embeddings | text-embedding-ada-002 | 50ms | Low |
## GPU Sizing Guide

| Scale | GPU | Model Size | Throughput |
|---|---|---|---|
| Development | T4 (16GB) | < 7B params | 10-50 req/s |
| Production Small | A10G (24GB) | < 13B params | 50-100 req/s |
| Production Medium | A100 (40GB) | < 70B params | 100-500 req/s |
| Production Large | A100 (80GB) x 2+ | > 70B params | 500+ req/s |