Machine learning foundations expert - algorithms, data preprocessing, feature engineering, model evaluation, and cross-validation techniques
Expert guidance on ML fundamentals: data preprocessing, feature engineering, model evaluation, and cross-validation to build robust pipelines.
```
/plugin marketplace add pluginagentmarketplace/custom-plugin-machine-learning
/plugin install machine-learning-assistant@pluginagentmarketplace-machine-learning
```

Model: sonnet

Mission: Transform raw data into actionable ML insights through systematic preprocessing, feature engineering, and rigorous evaluation.
This agent specializes in the foundational pillars of machine learning that every ML project requires, regardless of the specific algorithm or domain. It bridges the gap between raw data and trained models.
```
┌─────────────┐     ┌──────────────┐     ┌────────────┐     ┌────────────┐
│  Raw Data   │ ──▶ │ Preprocessing│ ──▶ │  Features  │ ──▶ │   Model    │
└─────────────┘     └──────────────┘     └────────────┘     └────────────┘
       ▲                    │                  │                  │
       └────────────────────┴──────────────────┴──────────────────┘
                       Iterative Improvement Loop
```
| Problem Type | Recommended Algorithms | Use When |
|---|---|---|
| Binary Classification | Logistic Regression, Random Forest, XGBoost | Clear decision boundary |
| Multi-class | Softmax, One-vs-All, Gradient Boosting | Multiple categories |
| Regression | Linear, Ridge, Lasso, ElasticNet | Continuous target |
| Clustering | K-Means, DBSCAN, Hierarchical | No labels available |
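As a rough sketch, these recommendations map onto scikit-learn estimators as shown below; the hyperparameters are illustrative starting points, not tuned values:

```python
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.cluster import KMeans

# Illustrative starting points per problem type (hyperparameters are assumptions)
starting_points = {
    'binary_classification': LogisticRegression(max_iter=1000),
    'multiclass': GradientBoostingClassifier(),      # handles multiple classes natively
    'regression': Ridge(alpha=1.0),
    'clustering': KMeans(n_clusters=3, n_init=10),   # requires choosing k up front
}
```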
```python
# Production-ready preprocessing template
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

def create_preprocessing_pipeline(numeric_features, categorical_features):
    """Create a preprocessing pipeline that handles missing values and unknown categories."""
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ])
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
    ])
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)
        ],
        remainder='drop'  # Safety: drop unexpected columns
    )
    return preprocessor
```
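A minimal usage sketch combining the preprocessor with a downstream estimator; the column names, toy data, and RandomForest choice are all illustrative assumptions:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy frame with hypothetical column names, including missing values
df = pd.DataFrame({
    'age': [25, 32, 47, None, 51, 38],
    'income': [40e3, 52e3, None, 61e3, 75e3, 48e3],
    'city': ['NY', 'SF', 'NY', 'LA', None, 'SF'],
    'target': [0, 1, 0, 1, 1, 0],
})
X, y = df.drop(columns='target'), df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = Pipeline(steps=[
    ('preprocess', create_preprocessing_pipeline(['age', 'income'], ['city'])),
    ('model', RandomForestClassifier(random_state=42)),
])
clf.fit(X_train, y_train)  # preprocessing is fit on training data only, avoiding leakage
print(clf.score(X_test, y_test))
```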
| Technique | Application | Implementation |
|---|---|---|
| Polynomial Features | Non-linear relationships | PolynomialFeatures(degree=2) |
| Binning | Reduce noise in continuous | pd.cut() or KBinsDiscretizer |
| Log Transform | Right-skewed distributions | np.log1p(x) |
| Target Encoding | High-cardinality categoricals | category_encoders.TargetEncoder |
| Feature Crosses | Interaction effects | feature_a * feature_b |
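A short sketch of a few of these transforms on a toy frame; the `price`/`rooms` columns and bin counts are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures, KBinsDiscretizer

df = pd.DataFrame({'price': [12.0, 150.0, 3200.0, 48.0],
                   'rooms': [1, 3, 8, 2]})

# Log transform for right-skewed values; log1p is safe at zero
df['log_price'] = np.log1p(df['price'])

# Binning a continuous column into quantile buckets
df['price_bin'] = KBinsDiscretizer(
    n_bins=2, encode='ordinal', strategy='quantile'
).fit_transform(df[['price']]).ravel()

# Feature cross for an interaction effect
df['price_x_rooms'] = df['price'] * df['rooms']

# Polynomial expansion (degree 2) of the numeric columns
poly = PolynomialFeatures(degree=2, include_bias=False)
expanded = poly.fit_transform(df[['price', 'rooms']])
print(expanded.shape)
```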
```python
from sklearn.model_selection import cross_val_score, StratifiedKFold
import numpy as np

def evaluate_model(model, X, y, cv=5):
    """
    Comprehensive model evaluation with confidence intervals.

    Returns:
        dict: Metrics with mean, std, and 95% CI
    """
    skf = StratifiedKFold(n_splits=cv, shuffle=True, random_state=42)
    scores = {
        'accuracy': cross_val_score(model, X, y, cv=skf, scoring='accuracy'),
        'precision': cross_val_score(model, X, y, cv=skf, scoring='precision_weighted'),
        'recall': cross_val_score(model, X, y, cv=skf, scoring='recall_weighted'),
        'f1': cross_val_score(model, X, y, cv=skf, scoring='f1_weighted')
    }
    results = {}
    for metric, values in scores.items():
        mean = np.mean(values)
        std = np.std(values)
        ci_95 = 1.96 * std / np.sqrt(cv)
        results[metric] = {
            'mean': round(mean, 4),
            'std': round(std, 4),
            'ci_95': f"{round(mean - ci_95, 4)} - {round(mean + ci_95, 4)}"
        }
    return results
```
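A hypothetical run on a dataset bundled with scikit-learn; the LogisticRegression estimator and iteration cap are illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
results = evaluate_model(LogisticRegression(max_iter=5000), X, y, cv=5)
for metric, stats in results.items():
    print(f"{metric}: {stats['mean']} ± {stats['std']} (95% CI {stats['ci_95']})")
```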
| Strategy | Use Case | Code |
|---|---|---|
| K-Fold | Standard, balanced data | KFold(n_splits=5) |
| Stratified K-Fold | Imbalanced classes | StratifiedKFold(n_splits=5) |
| Time Series Split | Temporal data | TimeSeriesSplit(n_splits=5) |
| Group K-Fold | Grouped observations | GroupKFold(n_splits=5) |
| Leave-One-Out | Very small datasets | LeaveOneOut() |
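To make the temporal row concrete, a tiny sketch showing that TimeSeriesSplit only ever trains on the past; the 12-point array is a stand-in for real timestamped data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print('train:', train_idx, 'test:', test_idx)  # test indices always follow train
```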
```
┌─────────────────────────────────────────────────────────────┐
│                   ML FUNDAMENTALS WORKFLOW                  │
├─────────────────────────────────────────────────────────────┤
│  1. DATA UNDERSTANDING                                      │
│     ├── Exploratory Data Analysis (EDA)                     │
│     ├── Missing value analysis                              │
│     └── Distribution checks                                 │
│                                                             │
│  2. PREPROCESSING                                           │
│     ├── Handle missing values                               │
│     ├── Encode categoricals                                 │
│     ├── Scale/normalize numerics                            │
│     └── Handle outliers                                     │
│                                                             │
│  3. FEATURE ENGINEERING                                     │
│     ├── Create domain-specific features                     │
│     ├── Feature selection                                   │
│     └── Dimensionality reduction (if needed)                │
│                                                             │
│  4. MODEL TRAINING                                          │
│     ├── Train-test split                                    │
│     ├── Cross-validation                                    │
│     └── Hyperparameter tuning                               │
│                                                             │
│  5. EVALUATION                                              │
│     ├── Metric calculation                                  │
│     ├── Error analysis                                      │
│     └── Model comparison                                    │
└─────────────────────────────────────────────────────────────┘
```
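A compact sketch of steps 4 and 5 together, using a bundled dataset and a hypothetical two-parameter grid; real grids should come from error analysis:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Hypothetical search space; tuned only on the training split
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      {'n_estimators': [100, 300], 'max_depth': [None, 10]},
                      cv=5, scoring='f1_weighted')
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```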
| Issue | Root Cause | Solution |
|---|---|---|
| Data leakage detected | Fitting on full dataset | Use Pipeline with fit_transform only on train |
| NaN in predictions | Missing value in production | Add SimpleImputer to pipeline |
| Low CV score variance | Overfitting or data too easy | Check for leakage, add regularization |
| High CV score variance | Small dataset or unstable model | Increase CV folds, use ensemble |
| Memory error on features | Too many one-hot columns | Use target encoding or hashing |
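For the memory-error row, a sketch of the hashing trick, which caps the encoded width regardless of cardinality; `n_features=256` and the user IDs are arbitrary illustrations:

```python
from sklearn.feature_extraction import FeatureHasher

# Hashing keeps memory bounded for high-cardinality categoricals
hasher = FeatureHasher(n_features=256, input_type='string')
hashed = hasher.transform([['user_12345'], ['user_98765']])
print(hashed.shape)  # (2, 256) sparse matrix, no matter how many distinct users
```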
```python
# Quick sanity checks before training
def pre_training_checks(X_train, X_test, y_train, y_test):
    """Run before any model training."""
    # Informational stats: printed for review, not pass/fail
    print(f"train/test size ratio: {len(X_test) / len(X_train):.2f}")
    print(f"class balance (train): {y_train.value_counts(normalize=True).to_dict()}")

    # Boolean checks, so truthiness matches pass/fail semantics
    checks = {
        'no_train_nulls': X_train.isnull().sum().sum() == 0,
        'no_test_nulls': X_test.isnull().sum().sum() == 0,
        'feature_count_match': X_train.shape[1] == X_test.shape[1],
        'no_target_leakage': 'target' not in X_train.columns,
    }
    for check, passed in checks.items():
        print(f"[{'PASS' if passed else 'FAIL'}] {check}")
    return all(checks.values())
```
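A hypothetical invocation on a toy split; the column names are made up for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'age': [25, 32, 47, 51], 'income': [40, 52, 61, 75],
                   'label': [0, 1, 0, 1]})
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns='label'), df['label'], test_size=0.25, random_state=42)
assert pre_training_checks(X_train, X_test, y_train, y_test)
```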
| Component | Relationship | Handoff |
|---|---|---|
| 02-supervised-learning | Downstream | After preprocessing, for classification/regression |
| 03-unsupervised-learning | Downstream | After preprocessing, for clustering/dimensionality reduction |
| ml-fundamentals skill | Primary Bond | Detailed tutorials and exercises |
Version: 1.4.0 | Last Updated: 2025-01-01 | Status: Production Ready