Supervised Learning Skill

Build, tune, and evaluate classification and regression models.

Quick Start

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='f1_weighted')
print(f"CV F1: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")
print(f"Test Accuracy: {model.score(X_test, y_test):.4f}")

Key Topics

1. Classification Algorithms

Algorithm	Best For	Complexity
Logistic Regression	Baseline, interpretable	O(n*d)
Random Forest	Tabular, general	O(ndtrees)
XGBoost	Competitions, accuracy	O(ndtrees)
SVM	High-dim, small data	O(n²)

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

classifiers = {
    'lr': LogisticRegression(max_iter=1000, class_weight='balanced'),
    'rf': RandomForestClassifier(n_estimators=100, class_weight='balanced'),
    'xgb': XGBClassifier(n_estimators=100, eval_metric='logloss')
}

2. Regression Algorithms

Algorithm	Best For	Key Param
Ridge	Multicollinearity	alpha
Lasso	Feature selection	alpha
Random Forest	Non-linear	n_estimators
XGBoost	Best accuracy	learning_rate

3. Hyperparameter Tuning

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

param_dist = {
    'n_estimators': randint(50, 300),
    'max_depth': randint(3, 15),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10)
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_dist,
    n_iter=50,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,
    random_state=42
)

search.fit(X_train, y_train)
print(f"Best params: {search.best_params_}")
print(f"Best CV score: {search.best_score_:.4f}")

4. Handling Class Imbalance

Technique	Implementation
Class Weights	`class_weight='balanced'`
SMOTE	`imblearn.over_sampling.SMOTE()`
Threshold Tuning	Adjust prediction threshold

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

pipeline = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier())
])

5. Model Comparison

from sklearn.model_selection import cross_validate
import pandas as pd

def compare_models(models, X, y, cv=5):
    results = []
    for name, model in models.items():
        cv_results = cross_validate(
            model, X, y, cv=cv,
            scoring=['accuracy', 'f1_weighted', 'roc_auc_ovr_weighted'],
            return_train_score=True
        )
        results.append({
            'model': name,
            'train_acc': cv_results['train_accuracy'].mean(),
            'test_acc': cv_results['test_accuracy'].mean(),
            'test_f1': cv_results['test_f1_weighted'].mean(),
            'test_auc': cv_results['test_roc_auc_ovr_weighted'].mean()
        })
    return pd.DataFrame(results).round(4)

Best Practices

DO

Start with a simple baseline
Use stratified splits for classification
Log all hyperparameters
Check for overfitting (train vs test gap)
Use early stopping for boosting

DON'T

Don't tune on test set
Don't ignore class imbalance
Don't skip feature importance analysis
Don't use accuracy for imbalanced data

Exercises

Exercise 1: Model Selection

# TODO: Compare 3 different classifiers using cross-validation
# Report F1 score for each

Exercise 2: Hyperparameter Tuning

# TODO: Use RandomizedSearchCV to tune XGBoost
# Find optimal n_estimators, max_depth, learning_rate

Unit Test Template

import pytest
from sklearn.datasets import make_classification

def test_classifier_trains():
    """Test classifier can fit and predict."""
    X, y = make_classification(n_samples=100, random_state=42)
    model = get_classifier()

    model.fit(X[:80], y[:80])
    predictions = model.predict(X[80:])

    assert len(predictions) == 20
    assert set(predictions).issubset({0, 1})

def test_handles_imbalance():
    """Test model handles imbalanced classes."""
    X, y = make_classification(n_samples=100, weights=[0.9, 0.1])
    model = get_balanced_classifier()

    model.fit(X, y)
    predictions = model.predict(X)

    # Should predict both classes
    assert len(set(predictions)) == 2

Troubleshooting

Problem	Cause	Solution
Overfitting	Model too complex	Reduce depth, add regularization
Underfitting	Model too simple	Increase complexity
Class imbalance	Skewed data	Use SMOTE or class weights
Slow training	Large data	Use LightGBM, reduce estimators

Related Resources

Agent: 02-supervised-learning
Previous: ml-fundamentals
Next: clustering

Version: 1.4.0 | Status: Production Ready

supervised-learning

Supervised Learning Skill

Quick Start

Key Topics

1. Classification Algorithms

2. Regression Algorithms

3. Hyperparameter Tuning

4. Handling Class Imbalance

5. Model Comparison

Best Practices

DO

DON'T

Exercises

Exercise 1: Model Selection

Exercise 2: Hyperparameter Tuning

Unit Test Template

Troubleshooting

Related Resources

Similar Skills