```
npx claudepluginhub andikarachman/data-science-plugin --plugin ds
```

This skill uses the workspace's default tool permissions.
Select the right train/validation/test split based on your data characteristics. Follow the decision tree below.
Skills in this plugin:

- Splits datasets (e.g., CSV files) into training, validation, and test sets with configurable ratios and stratification, using Python for ML workflows. Activates on dataset-splitting requests.
- Guides structured ML analysis for clustering, classification, regression, time series forecasting, statistical testing, model comparison, and data analysis. Enforces markdown analysis cells after code and final summaries. Auto-activates on relevant keywords; hands off SQL to the bigquery skill.
- Enforces ML rigor: baseline comparisons against dummy/linear models, cross-validation, interpretation, and leakage prevention, with sklearn templates.
1. Is the data time-ordered (time series)?
   - Yes -> Temporal split: train on the past, validate on recent data, test on the most recent. Never shuffle across time. Use `TimeSeriesSplit` for cross-validation.
   - No -> Continue to the next question.
2. Are observations grouped? Examples: multiple rows per customer, multiple images per patient, repeated measurements.
   - Yes -> Group-aware split: keep all observations of a group in the same fold. Use `GroupKFold` or `GroupShuffleSplit`.
   - No -> Continue.
3. Is the target imbalanced (minority class <10% of total)?
   - Yes -> Stratified split: preserve class ratios across folds. Use `StratifiedKFold` or `StratifiedShuffleSplit` (`StratifiedGroupKFold` when combined with groups).
   - No -> Simple random split: standard `train_test_split` with a fixed seed.
4. Is the dataset small (fewer than 5,000 rows)?
   - Yes -> Cross-validation: use 5-fold or 10-fold CV instead of a single holdout. Report the mean and standard deviation of metrics.
   - No -> A single holdout is fine (70/15/15 or 80/10/10).
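The small-dataset branch above can be sketched with scikit-learn's `cross_val_score`; the logistic regression model and synthetic dataset here are placeholders, not part of the original guide:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Small synthetic dataset standing in for a <5,000-row problem
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# 5-fold CV instead of a single holdout; report mean and std of the metric
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(f"Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```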
| Dataset Size | Recommended Split | Notes |
|---|---|---|
| <1,000 | Leave-one-out or 10-fold CV | Every data point matters |
| 1,000-10,000 | 5-fold CV or 80/10/10 | CV preferred for reliable estimates |
| 10,000-100,000 | 80/10/10 | Single holdout usually sufficient |
| >100,000 | 90/5/5 or 95/2.5/2.5 | Large test sets are unnecessary |
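An 80/10/10 holdout from the table can be built with two chained `train_test_split` calls (the synthetic dataset is a placeholder; note the second split takes 1/9 of the remaining 90%, which is 10% of the total):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, random_state=42)

# First carve off the 10% test set...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42
)
# ...then take 1/9 of the remaining 90% as validation (10% of the total)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=1 / 9, stratify=y_train, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 8000 1000 1000
```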
Always set `random_state` so splits are reproducible.

```python
from sklearn.model_selection import (
    train_test_split, StratifiedKFold, GroupKFold, TimeSeriesSplit
)

# Simple stratified split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Temporal split: earlier observations train, later ones validate
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]

# Group-aware split: each group stays within a single fold
gkf = GroupKFold(n_splits=5)
for train_idx, val_idx in gkf.split(X, y, groups=groups):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
```