Methodology for building a leakage-safe data pipeline — split before preprocess, fit transforms on train only, time-aware splits for temporal data, deterministic shuffle. Activate when the user asks "how do I split my data", "data pipeline best practice", "is my normalizer leaking", "how to set up a dataset for curryTrain", or shows a pipeline that fits a transform on the full dataset.
```
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train
```

This skill uses the workspace's default tool permissions.
The single most under-respected rule in deep learning: **fit every transform on training data only, ever**. Violating this rule is the most common silent way to produce results that don't replicate.
"Could information from validation or test data have leaked into the training pipeline?"
If you cannot answer "no, by construction", you are not yet through Stage 1.
Every dataset adapter in curryTrain follows this order, always:
raw → split → fit_on_train_only → apply_to_all → load_loader
- `.fit(...)` runs on the train split alone; the fitted transform is then frozen.
- `DataLoader` with an appropriate batch size; shuffle on the train loader only, with a deterministic seed.

If any of these steps is out of order, the pipeline is leaking.
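A minimal sketch of that order, assuming a NumPy feature matrix `X_raw` and sklearn's `StandardScaler` (the split fractions and the choice of transform are illustrative, not part of curryTrain's API):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# raw → split: decide the split indices before computing any statistics
rng = np.random.default_rng(seed)
idx = rng.permutation(len(X_raw))
n_train, n_val = int(0.8 * len(X_raw)), int(0.1 * len(X_raw))
train_idx, val_idx, test_idx = np.split(idx, [n_train, n_train + n_val])

# fit_on_train_only: the scaler sees only train rows, then is frozen
scaler = StandardScaler().fit(X_raw[train_idx])

# apply_to_all: the same frozen transform is applied to every split
X_train = scaler.transform(X_raw[train_idx])
X_val   = scaler.transform(X_raw[val_idx])
X_test  = scaler.transform(X_raw[test_idx])
```

The final stage then wraps each split in a `DataLoader`, as shown in the shuffle section below.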
For data with timestamps:
- train: `t < t_split_train_val`
- val: `t_split_train_val ≤ t < t_split_val_test`
- test: `t ≥ t_split_val_test`

Never do `random_split` on temporal data. Never use any future information (e.g. moving averages computed over all time) in a feature.
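A sketch of those boundaries, assuming `t` is a NumPy array of per-sample timestamps aligned row-for-row with `X` (all names are illustrative):

```python
# Each sample falls in exactly one region of the timeline
train_mask = t < t_split_train_val
val_mask   = (t_split_train_val <= t) & (t < t_split_val_test)
test_mask  = t >= t_split_val_test

X_train, X_val, X_test = X[train_mask], X[val_mask], X[test_mask]

# Any derived feature (e.g. a moving average) may only use rows with earlier t
```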
The dataloader for the train split shuffles. Use a Generator seeded with the run's seed so the shuffle is reproducible:
```python
import torch
from torch.utils.data import DataLoader

g = torch.Generator()
g.manual_seed(seed)
train_loader = DataLoader(train_set, shuffle=True, generator=g)  # plus batch_size etc. as needed
```
Validation and test loaders never shuffle.
This is `assert_no_leak_in_data_pipeline` from stage1-preflight-asserts. The probe: fit one copy of the transform on train only and another on train + val, apply both to the same 10 train samples, and require the pipeline's output to match the train-only fit. This catches the common bug of `StandardScaler().fit(full_dataset)` instead of `.fit(train)`.
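The real assert lives in `skills/stage1-preflight-asserts`; the sketch below is only an assumed reconstruction of the idea, using sklearn-style transforms (`clone`, `fit`, `transform`) and made-up names:

```python
import numpy as np
from sklearn.base import clone

def probe_transform_for_leak(fitted_transform, X_train, n_probe=10, atol=1e-7):
    """Assumed sketch, not curryTrain's implementation.

    Re-fit a fresh copy of the transform on the train split alone and
    compare both transforms on the same few train samples. A mismatch
    means the pipeline's transform was fit on more than train.
    """
    probe = X_train[:n_probe]
    train_only = clone(fitted_transform).fit(X_train)
    if not np.allclose(fitted_transform.transform(probe),
                       train_only.transform(probe), atol=atol):
        raise AssertionError("transform was not fit on the train split alone")
```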
Curriculum learning, data deduplication, quality filtering — these belong to Stage 4 Scale-up at the earliest. In Stage 1, just produce a clean, leakage-safe pipeline. Don't optimize the data ordering until you've proven the model works at all.
1. Ask the user for their split rule. If they cannot articulate it, that's the first problem to solve.
2. Inspect their pipeline code. Look for `.fit(` calls. Every `.fit` call must be on a train-only subset.
3. If they have temporal data, check that the split is time-aware. Random splits on temporal data are a frequent bug.
4. Wire up `assert_no_leak_in_data_pipeline` in their preflight. Run it once. Failures here block Stage 2.
5. Confirm the train `DataLoader` is the only one with `shuffle=True` and the only one using a seeded `Generator`.
Red flags:

- `scaler.fit(X_all)` followed by per-split application → leakage. Re-do as `scaler.fit(X_train)`.
- `train_test_split(X, y, random_state=...)` on temporal data → time leak.

Related skills:

- `skills/stage1-preflight-asserts` (assert 6 specifically).
- `skills/stage2-overfit-single-batch` — once leakage is ruled out, sanity check the model.