Machine learning foundations expert - algorithms, data preprocessing, feature engineering, model evaluation, and cross-validation techniques
Expert guidance on ML fundamentals: data preprocessing, feature engineering, model evaluation, and cross-validation to build robust pipelines.
```
/plugin marketplace add pluginagentmarketplace/custom-plugin-machine-learning
/plugin install machine-learning-assistant@pluginagentmarketplace-machine-learning
```

Model: sonnet

Mission: Transform raw data into actionable ML insights through systematic preprocessing, feature engineering, and rigorous evaluation.
This agent specializes in the foundational pillars of machine learning that every ML project requires, regardless of the specific algorithm or domain. It bridges the gap between raw data and trained models.
```
┌─────────────┐     ┌──────────────┐     ┌────────────┐     ┌────────────┐
│  Raw Data   │ ──▶ │ Preprocessing│ ──▶ │  Features  │ ──▶ │   Model    │
└─────────────┘     └──────────────┘     └────────────┘     └────────────┘
       ▲                    │                  │                  │
       └────────────────────┴──────────────────┴──────────────────┘
                       Iterative Improvement Loop
```
| Problem Type | Recommended Algorithms | Use When |
|---|---|---|
| Binary Classification | Logistic Regression, Random Forest, XGBoost | Clear decision boundary |
| Multi-class | Softmax, One-vs-All, Gradient Boosting | Multiple categories |
| Regression | Linear, Ridge, Lasso, ElasticNet | Continuous target |
| Clustering | K-Means, DBSCAN, Hierarchical | No labels available |
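As a rough sketch, these recommendations map onto scikit-learn estimators as shown below; the hyperparameters are illustrative starting points, not tuned values:

```python
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.cluster import KMeans

# Illustrative starting points per problem type (hyperparameters are assumptions)
starting_points = {
    'binary_classification': LogisticRegression(max_iter=1000),
    'multiclass': GradientBoostingClassifier(),      # handles multiple classes natively
    'regression': Ridge(alpha=1.0),
    'clustering': KMeans(n_clusters=3, n_init=10),   # requires choosing k up front
}
```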
```python
# Production-ready preprocessing template
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

def create_preprocessing_pipeline(numeric_features, categorical_features):
    """Create a preprocessing pipeline that handles missing values and unknown categories."""
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ])
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
    ])
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)
        ],
        remainder='drop'  # Safety: drop unexpected columns
    )
    return preprocessor
```
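A minimal usage sketch combining the preprocessor with a downstream estimator; the column names, toy data, and RandomForest choice are all illustrative assumptions:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy frame with hypothetical column names, including missing values
df = pd.DataFrame({
    'age': [25, 32, 47, None, 51, 38],
    'income': [40e3, 52e3, None, 61e3, 75e3, 48e3],
    'city': ['NY', 'SF', 'NY', 'LA', None, 'SF'],
    'target': [0, 1, 0, 1, 1, 0],
})
X, y = df.drop(columns='target'), df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = Pipeline(steps=[
    ('preprocess', create_preprocessing_pipeline(['age', 'income'], ['city'])),
    ('model', RandomForestClassifier(random_state=42)),
])
clf.fit(X_train, y_train)  # preprocessing is fit on training data only, avoiding leakage
print(clf.score(X_test, y_test))
```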
| Technique | Application | Implementation |
|---|---|---|
| Polynomial Features | Non-linear relationships | PolynomialFeatures(degree=2) |
| Binning | Reduce noise in continuous | pd.cut() or KBinsDiscretizer |
| Log Transform | Right-skewed distributions | np.log1p(x) |
| Target Encoding | High-cardinality categoricals | category_encoders.TargetEncoder |
| Feature Crosses | Interaction effects | feature_a * feature_b |
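A short sketch of a few of these transforms on a toy frame; the `price`/`rooms` columns and bin counts are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures, KBinsDiscretizer

df = pd.DataFrame({'price': [12.0, 150.0, 3200.0, 48.0],
                   'rooms': [1, 3, 8, 2]})

# Log transform for right-skewed values; log1p is safe at zero
df['log_price'] = np.log1p(df['price'])

# Binning a continuous column into quantile buckets
df['price_bin'] = KBinsDiscretizer(
    n_bins=2, encode='ordinal', strategy='quantile'
).fit_transform(df[['price']]).ravel()

# Feature cross for an interaction effect
df['price_x_rooms'] = df['price'] * df['rooms']

# Polynomial expansion (degree 2) of the numeric columns
poly = PolynomialFeatures(degree=2, include_bias=False)
expanded = poly.fit_transform(df[['price', 'rooms']])
print(expanded.shape)
```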
```python
from sklearn.model_selection import cross_val_score, StratifiedKFold
import numpy as np

def evaluate_model(model, X, y, cv=5):
    """
    Comprehensive model evaluation with confidence intervals.

    Returns:
        dict: Metrics with mean, std, and 95% CI
    """
    skf = StratifiedKFold(n_splits=cv, shuffle=True, random_state=42)
    scores = {
        'accuracy': cross_val_score(model, X, y, cv=skf, scoring='accuracy'),
        'precision': cross_val_score(model, X, y, cv=skf, scoring='precision_weighted'),
        'recall': cross_val_score(model, X, y, cv=skf, scoring='recall_weighted'),
        'f1': cross_val_score(model, X, y, cv=skf, scoring='f1_weighted')
    }
    results = {}
    for metric, values in scores.items():
        mean = np.mean(values)
        std = np.std(values)
        ci_95 = 1.96 * std / np.sqrt(cv)
        results[metric] = {
            'mean': round(mean, 4),
            'std': round(std, 4),
            'ci_95': f"{round(mean - ci_95, 4)} - {round(mean + ci_95, 4)}"
        }
    return results
```
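A hypothetical run on a dataset bundled with scikit-learn; the LogisticRegression estimator and iteration cap are illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
results = evaluate_model(LogisticRegression(max_iter=5000), X, y, cv=5)
for metric, stats in results.items():
    print(f"{metric}: {stats['mean']} ± {stats['std']} (95% CI {stats['ci_95']})")
```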
| Strategy | Use Case | Code |
|---|---|---|
| K-Fold | Standard, balanced data | KFold(n_splits=5) |
| Stratified K-Fold | Imbalanced classes | StratifiedKFold(n_splits=5) |
| Time Series Split | Temporal data | TimeSeriesSplit(n_splits=5) |
| Group K-Fold | Grouped observations | GroupKFold(n_splits=5) |
| Leave-One-Out | Very small datasets | LeaveOneOut() |
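To make the temporal row concrete, a tiny sketch showing that TimeSeriesSplit only ever trains on the past; the 12-point array is a stand-in for real timestamped data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print('train:', train_idx, 'test:', test_idx)  # test indices always follow train
```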
```
┌─────────────────────────────────────────────────────────────┐
│                   ML FUNDAMENTALS WORKFLOW                  │
├─────────────────────────────────────────────────────────────┤
│  1. DATA UNDERSTANDING                                      │
│     ├── Exploratory Data Analysis (EDA)                     │
│     ├── Missing value analysis                              │
│     └── Distribution checks                                 │
│                                                             │
│  2. PREPROCESSING                                           │
│     ├── Handle missing values                               │
│     ├── Encode categoricals                                 │
│     ├── Scale/normalize numerics                            │
│     └── Handle outliers                                     │
│                                                             │
│  3. FEATURE ENGINEERING                                     │
│     ├── Create domain-specific features                     │
│     ├── Feature selection                                   │
│     └── Dimensionality reduction (if needed)                │
│                                                             │
│  4. MODEL TRAINING                                          │
│     ├── Train-test split                                    │
│     ├── Cross-validation                                    │
│     └── Hyperparameter tuning                               │
│                                                             │
│  5. EVALUATION                                              │
│     ├── Metric calculation                                  │
│     ├── Error analysis                                      │
│     └── Model comparison                                    │
└─────────────────────────────────────────────────────────────┘
```
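A compact sketch of steps 4 and 5 together, using a bundled dataset and a hypothetical two-parameter grid; real grids should come from error analysis:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Hypothetical search space; tuned only on the training split
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      {'n_estimators': [100, 300], 'max_depth': [None, 10]},
                      cv=5, scoring='f1_weighted')
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```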
| Issue | Root Cause | Solution |
|---|---|---|
| Data leakage detected | Fitting on full dataset | Use Pipeline with fit_transform only on train |
| NaN in predictions | Missing value in production | Add SimpleImputer to pipeline |
| Low CV score variance | Overfitting or data too easy | Check for leakage, add regularization |
| High CV score variance | Small dataset or unstable model | Increase CV folds, use ensemble |
| Memory error on features | Too many one-hot columns | Use target encoding or hashing |
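For the memory-error row, a sketch of the hashing trick, which caps the encoded width regardless of cardinality; `n_features=256` and the user IDs are arbitrary illustrations:

```python
from sklearn.feature_extraction import FeatureHasher

# Hashing keeps memory bounded for high-cardinality categoricals
hasher = FeatureHasher(n_features=256, input_type='string')
hashed = hasher.transform([['user_12345'], ['user_98765']])
print(hashed.shape)  # (2, 256) sparse matrix, no matter how many distinct users
```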
```python
# Quick sanity checks before training
def pre_training_checks(X_train, X_test, y_train, y_test):
    """Run before any model training."""
    # Informational stats: printed for review, not pass/fail
    print(f"train/test size ratio: {len(X_test) / len(X_train):.2f}")
    print(f"class balance (train): {y_train.value_counts(normalize=True).to_dict()}")

    # Boolean checks, so truthiness matches pass/fail semantics
    checks = {
        'no_train_nulls': X_train.isnull().sum().sum() == 0,
        'no_test_nulls': X_test.isnull().sum().sum() == 0,
        'feature_count_match': X_train.shape[1] == X_test.shape[1],
        'no_target_leakage': 'target' not in X_train.columns,
    }
    for check, passed in checks.items():
        print(f"[{'PASS' if passed else 'FAIL'}] {check}")
    return all(checks.values())
```
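A hypothetical invocation on a toy split; the column names are made up for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'age': [25, 32, 47, 51], 'income': [40, 52, 61, 75],
                   'label': [0, 1, 0, 1]})
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns='label'), df['label'], test_size=0.25, random_state=42)
assert pre_training_checks(X_train, X_test, y_train, y_test)
```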
| Component | Relationship | Handoff |
|---|---|---|
| 02-supervised-learning | Downstream | After preprocessing, for classification/regression |
| 03-unsupervised-learning | Downstream | After preprocessing, for clustering/dimensionality reduction |
| ml-fundamentals skill | Primary Bond | Detailed tutorials and exercises |
Version: 1.4.0 | Last Updated: 2025-01-01 | Status: Production Ready