Skill

split-strategy

From ds

Selects and implements train/validation/test split strategies based on data characteristics like time, groups, imbalance, and size. Guides sklearn usage for model evaluation frameworks.

Python

Pandas

ai-ml

npx claudepluginhub andikarachman/data-science-plugin --plugin ds

Tool Access

This skill uses the workspace's default tool permissions.

Last commit: Feb 24, 2026

Split Strategy

Select the right train/validation/test split based on your data characteristics. Follow the decision tree below.

Decision Tree

1. Is there a time dimension?

Yes -> Temporal split: Train on past, validate on recent, test on most recent. Never shuffle across time.

Use TimeSeriesSplit for cross-validation
Set the embargo gap (the gap argument of TimeSeriesSplit, counted in rows) to at least the largest feature look-back window

No -> Continue to next question.
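
A minimal sketch of the temporal split with an embargo, assuming rows are already sorted by time. The data and the 7-row look-back (LOOKBACK) are hypothetical placeholders, not values from this skill.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical data: 100 time-ordered rows (the row index stands in for the timestamp).
X = np.arange(100).reshape(-1, 1)

# Assumption: the longest feature look-back window spans 7 rows.
LOOKBACK = 7
tscv = TimeSeriesSplit(n_splits=4, gap=LOOKBACK)

folds = list(tscv.split(X))
for train_idx, val_idx in folds:
    # LOOKBACK rows are skipped between the end of training and the start of validation.
    print(f"train 0..{train_idx[-1]} | val {val_idx[0]}..{val_idx[-1]}")
```

Each fold trains on an expanding window of past rows and validates on the block that follows the gap, so no validation row has a feature window reaching back into training rows.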

2. Are observations grouped?

Examples: multiple rows per customer, multiple images per patient, repeated measurements.

Yes -> Group-aware split: Keep all observations of a group in the same fold.

Use GroupKFold or GroupShuffleSplit
Never let the same group appear in both train and test

No -> Continue.
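
A minimal sketch of a group-aware holdout with GroupShuffleSplit, including a check that no group lands on both sides. The customer IDs and features are synthetic placeholders.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

# Hypothetical data: 200 rows from 40 customers, 5 rows per customer.
groups = np.repeat(np.arange(40), 5)
X = rng.normal(size=(200, 3))
y = rng.integers(0, 2, size=200)

# Note: test_size is the fraction of groups held out, not the fraction of rows.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=groups))

train_groups = set(groups[train_idx])
test_groups = set(groups[test_idx])
overlap = train_groups & test_groups
print(f"{len(train_groups)} train groups, {len(test_groups)} test groups, overlap: {len(overlap)}")
```

An empty overlap set is the property to verify before trusting any score computed on the held-out rows.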

3. Is the target imbalanced?

Imbalanced here means the minority class is less than 10% of all rows.

Yes -> Stratified split: Preserve class ratios across folds.

Use StratifiedKFold or StratifiedShuffleSplit
Combine with group awareness if needed: StratifiedGroupKFold

No -> Simple random split: Standard train_test_split with a fixed random_state.
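
A minimal sketch of stratified folds on an imbalanced target, counting how many positives each validation fold receives. The 5% positive rate and the placeholder features are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced target: 50 positives in 1,000 rows (5% minority class).
y = np.zeros(1000, dtype=int)
y[:50] = 1
X = np.arange(1000).reshape(-1, 1)  # placeholder features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Count minority-class rows in each validation fold.
positives_per_fold = [int(y[val_idx].sum()) for _, val_idx in skf.split(X, y)]
print(positives_per_fold)
```

Every validation fold should carry about the same share of positives as the full dataset. When groups are also present, StratifiedGroupKFold accepts a groups argument in split in the same way GroupKFold does.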

4. Is the dataset small?

Small here means fewer than 5,000 rows.

Yes -> Cross-validation: Use 5-fold or 10-fold CV instead of a single holdout. Report mean and std of metrics.

No -> Single holdout is fine (70/15/15 or 80/10/10).
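
A minimal sketch of cross-validation on a small dataset, reporting the mean and standard deviation of the metric. The synthetic dataset and the LogisticRegression model are stand-ins chosen for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical small dataset (500 rows) standing in for real data.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy"
)

# Report the spread across folds, not just a single number.
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f} over {len(scores)} folds")
```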

Split Ratios

Dataset Size	Recommended Split	Notes
<1,000	Leave-one-out or 10-fold CV	Every data point matters
1,000-10,000	5-fold CV or 80/10/10	CV preferred for reliable estimates
10,000-100,000	80/10/10	Single holdout usually sufficient
>100,000	90/5/5 or 95/2.5/2.5	Even a few percent yields thousands of test rows
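
A minimal sketch of an 80/10/10 holdout built from two train_test_split calls. The helper name three_way_split and its parameters are hypothetical, not part of sklearn; it is only suitable when the decision tree above has led to a simple random split.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def three_way_split(X, y, val_frac=0.10, test_frac=0.10, seed=42):
    """Hypothetical helper: random train/validation/test split by fraction."""
    n = len(X)
    # Convert fractions to absolute row counts so both holdouts have exact sizes.
    n_val, n_test = round(n * val_frac), round(n * test_frac)
    # First carve off the test set, then take validation from what remains.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=n_test, random_state=seed
    )
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=n_val, random_state=seed
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)


# Placeholder data: 10,000 rows, 2 features.
X = np.arange(20_000).reshape(-1, 2)
y = np.arange(10_000) % 2

(X_train, y_train), (X_val, y_val), (X_test, y_test) = three_way_split(X, y)
print(len(X_train), len(X_val), len(X_test))
```

Passing absolute row counts as test_size avoids having to rescale the validation fraction after the test rows are removed.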

Common Mistakes

Shuffling time-series data -- Destroys temporal structure, causes leakage
Fitting preprocessors before splitting -- Scalers and encoders must be fit on the training data only
Using test set for tuning -- Test set should be touched only once, at the very end
Ignoring groups -- Correlated observations in different folds inflate performance estimates
Not setting random seed -- Results are not reproducible without random_state
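
A minimal sketch of how the preprocessing and test-set mistakes can be avoided together: split first, wrap the scaler and model in a pipeline so the scaler is refit on each training fold, and score the test set once at the end. The dataset and model are synthetic stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical dataset standing in for real data.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Split before any preprocessing; the test set is set aside until the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Inside cross_val_score the pipeline refits the scaler on each training fold,
# so validation rows never influence the scaling statistics.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Final step: fit on all training rows and score the test set exactly once.
model.fit(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}, test accuracy: {test_score:.3f}")
```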

Implementation

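from sklearn.model_selection import train_test_split, StratifiedKFold, GroupKFold, TimeSeriesSplit

# Assumes X is a pandas DataFrame, y a pandas Series, and groups an array
# with one group label per row.

# Simple stratified split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Temporal split (rows must already be sorted by time)
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

# Group-aware split (no group appears in both train and validation)
gkf = GroupKFold(n_splits=5)
for train_idx, val_idx in gkf.split(X, y, groups=groups):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]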