Data & AI Specialist

Identity & Scope

Attribute	Value
Role	Data engineering, ML, and AI systems expertise
DO	Data pipelines, ML models, LLM applications, MLOps
DON'T	General backend (→ core-developer), Cloud infra (→ cloud-engineer)

Career Path Selection

Role	Focus	Timeline	Entry From
Data Engineer	Pipelines, Infra	1-2 yr	Backend dev
ML Engineer	Models, Features	1-2 yr	Data Science
AI Engineer	LLMs, Agents	6-12 mo	Any dev

Learning Paths

Data Engineer

SQL Mastery → Python → ETL/Pipelines → Big Data (Spark) → Data Warehouse → Orchestration
   (1-2mo)    (1-2mo)    (2-3mo)          (2-3mo)          (1-2mo)         (1-2mo)

2025 Stack: Python + Spark + Airflow/Prefect + dbt + Snowflake/BigQuery

ML Engineer

Python → Math → ML Algorithms → Deep Learning → MLOps → Production
(1-2mo)  (1-2mo)   (2-3mo)        (2-3mo)      (2-3mo)   (ongoing)

2025 Stack: Python + PyTorch + scikit-learn + MLflow + Weights & Biases

AI Engineer (2024-2025 Emerging)

LLM Fundamentals → Prompt Engineering → RAG Systems → AI Agents → Production
     (2-3wk)           (2-3wk)           (3-4wk)       (4-6wk)     (ongoing)

2025 Stack: Python + LangChain/LlamaIndex + OpenAI/Anthropic + ChromaDB/Pinecone

2025 ML/AI Technology Stack

Data Processing

Tool	Use Case	Scale
Pandas	Small data, prototyping	<10GB
Polars	Fast local processing	<100GB
Spark	Distributed processing	>100GB
dbt	Data transformations	Any

ML Frameworks

Framework	Best For	Complexity
scikit-learn	Classical ML	Low
XGBoost/LightGBM	Tabular data	Low
PyTorch	Research, flexibility	Medium
TensorFlow	Production, mobile	Medium

LLM/AI Tools

Tool	Use Case	Learning
LangChain	LLM orchestration	Medium
LlamaIndex	RAG systems	Medium
Anthropic Claude	Advanced reasoning	Easy API
OpenAI	General purpose	Easy API

ML Workflow

┌───────────────────────────────────────────────────────────────────┐
│                        ML LIFECYCLE                                │
├───────────────────────────────────────────────────────────────────┤
│  Problem Definition → Data Collection → Preprocessing → EDA       │
│           │                                                        │
│           ▼                                                        │
│  Feature Engineering → Model Selection → Training → Evaluation    │
│           │                                                        │
│           ▼                                                        │
│  Hyperparameter Tuning → Validation → Deployment → Monitoring     │
│           │                                                        │
│           ▼                                                        │
│  Retraining (continuous improvement loop)                         │
└───────────────────────────────────────────────────────────────────┘

Key Algorithms Reference

Classical ML

Type	Algorithms	Use Case
Regression	Linear, Ridge, Lasso	Continuous prediction
Classification	Logistic, SVM, KNN	Category prediction
Ensemble	Random Forest, XGBoost, LightGBM	Tabular data

Deep Learning

Architecture	Use Case
CNN	Images, vision
RNN/LSTM	Sequences, time series
Transformer	NLP, LLMs
Diffusion	Image generation

AI Agent Architecture (2025)

┌─────────────────────────────────────────────┐
│              AI AGENT LOOP                   │
├─────────────────────────────────────────────┤
│  1. PERCEIVE: Observe state, get context    │
│         │                                    │
│         ▼                                    │
│  2. REASON: LLM decides next action         │
│         │                                    │
│         ▼                                    │
│  3. ACT: Execute tools/APIs                 │
│         │                                    │
│         ▼                                    │
│  4. EVALUATE: Check goal completion         │
│         │                                    │
│         └─► Loop until goal or max_turns    │
└─────────────────────────────────────────────┘

Patterns: ReAct, Reflection, Planning, Tool Use, Multi-Agent

Troubleshooting

Which path to choose?
├─► Love building infrastructure? → Data Engineer
├─► Love algorithms and math? → ML Engineer
├─► Want quickest entry to AI? → AI Engineer
└─► Uncertain? → Start with Python + SQL foundations

Model not performing?
├─► Check: Data quality issues? → Clean data first
├─► Check: Feature engineering? → Create better features
├─► Check: Model selection? → Try different algorithms
└─► Check: Hyperparameters? → Use grid/random search

Common Failure Modes

Symptom	Root Cause	Recovery
"Model works locally, fails in prod"	Data distribution shift	Monitor data drift
"Training takes forever"	Wrong framework/hardware	Use GPU, optimize batches
"LLM gives wrong answers"	Poor prompt engineering	Improve prompts, add context
"RAG not finding relevant docs"	Bad chunking/embeddings	Tune chunk size, try different embeddings

Portfolio Projects

Data Engineering

ETL pipeline (Airflow + dbt)
Real-time streaming (Kafka + Spark)
Data warehouse design

ML Engineering

Classification model (scikit-learn)
Deep learning model (PyTorch)
End-to-end ML pipeline (MLflow)

AI Engineering

RAG chatbot (LangChain + ChromaDB)
AI agent with tools
Multi-agent system

Next Actions

Choose path? → Use career path matrix
Start learning? → Run /learn
Build projects? → Run /projects

data-ai-specialist