Methodology for building a leakage-safe data pipeline — split before preprocess, fit transforms on train only, time-aware splits for temporal data, deterministic shuffle. Activate when the user asks "how do I split my data", "data pipeline best practice", "is my normalizer leaking", "how to set up a dataset for curryTrain", or shows a pipeline that fits a transform on the full dataset.
```
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train
```

This skill uses the workspace's default tool permissions.
The single most under-respected rule in deep learning: **fit every transform on training data only, ever**. Violating this rule is the most common silent way to produce results that don't replicate.
"Could information from validation or test data have leaked into the training pipeline?"
If you cannot answer "no, by construction", you are not yet through Stage 1.
Every dataset adapter in curryTrain follows this order, always:
raw → split → fit_on_train_only → apply_to_all → load_loader
- `.fit(...)` runs on the train split alone; the fitted transform is then frozen.
- `DataLoader` with an appropriate batch size; shuffle on the train loader only, with a deterministic seed.

If any of these steps is out of order, the pipeline is leaking.
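A minimal sketch of that order, assuming a NumPy feature matrix `X_raw` and sklearn's `StandardScaler` (the split fractions and the choice of transform are illustrative, not part of curryTrain's API):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# raw → split: decide the split indices before computing any statistics
rng = np.random.default_rng(seed)
idx = rng.permutation(len(X_raw))
n_train, n_val = int(0.8 * len(X_raw)), int(0.1 * len(X_raw))
train_idx, val_idx, test_idx = np.split(idx, [n_train, n_train + n_val])

# fit_on_train_only: the scaler sees only train rows, then is frozen
scaler = StandardScaler().fit(X_raw[train_idx])

# apply_to_all: the same frozen transform is applied to every split
X_train = scaler.transform(X_raw[train_idx])
X_val   = scaler.transform(X_raw[val_idx])
X_test  = scaler.transform(X_raw[test_idx])
```

The final stage then wraps each split in a `DataLoader`, as shown in the shuffle section below.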
For data with timestamps:
- train: `t < t_split_train_val`
- val: `t_split_train_val ≤ t < t_split_val_test`
- test: `t ≥ t_split_val_test`

Never do `random_split` on temporal data. Never use any future information (e.g. moving averages computed over all time) in a feature.
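A sketch of those boundaries, assuming `t` is a NumPy array of per-sample timestamps aligned row-for-row with `X` (all names are illustrative):

```python
# Each sample falls in exactly one region of the timeline
train_mask = t < t_split_train_val
val_mask   = (t_split_train_val <= t) & (t < t_split_val_test)
test_mask  = t >= t_split_val_test

X_train, X_val, X_test = X[train_mask], X[val_mask], X[test_mask]

# Any derived feature (e.g. a moving average) may only use rows with earlier t
```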
The dataloader for the train split shuffles. Use a Generator seeded with the run's seed so the shuffle is reproducible:
```python
import torch
from torch.utils.data import DataLoader

g = torch.Generator()
g.manual_seed(seed)
train_loader = DataLoader(train_set, shuffle=True, generator=g)  # plus batch_size etc. as needed
```
Validation and test loaders never shuffle.
This is `assert_no_leak_in_data_pipeline` from stage1-preflight-asserts. The probe: fit one copy of the transform on train only and another on train + val, apply both to the same 10 train samples, and require the pipeline's output to match the train-only fit. This catches the common bug of `StandardScaler().fit(full_dataset)` instead of `.fit(train)`.
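The real assert lives in `skills/stage1-preflight-asserts`; the sketch below is only an assumed reconstruction of the idea, using sklearn-style transforms (`clone`, `fit`, `transform`) and made-up names:

```python
import numpy as np
from sklearn.base import clone

def probe_transform_for_leak(fitted_transform, X_train, n_probe=10, atol=1e-7):
    """Assumed sketch, not curryTrain's implementation.

    Re-fit a fresh copy of the transform on the train split alone and
    compare both transforms on the same few train samples. A mismatch
    means the pipeline's transform was fit on more than train.
    """
    probe = X_train[:n_probe]
    train_only = clone(fitted_transform).fit(X_train)
    if not np.allclose(fitted_transform.transform(probe),
                       train_only.transform(probe), atol=atol):
        raise AssertionError("transform was not fit on the train split alone")
```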
Curriculum learning, data deduplication, quality filtering — these belong to Stage 4 Scale-up at the earliest. In Stage 1, just produce a clean, leakage-safe pipeline. Don't optimize the data ordering until you've proven the model works at all.
1. Ask the user for their split rule. If they cannot articulate it, that's the first problem to solve.
2. Inspect their pipeline code. Look for `.fit(` calls. Every `.fit` call must be on a train-only subset.
3. If they have temporal data, check that the split is time-aware. Random splits on temporal data are a frequent bug.
4. Wire up `assert_no_leak_in_data_pipeline` in their preflight. Run it once. Failures here block Stage 2.
5. Confirm the train `DataLoader` is the only one with `shuffle=True` and the only one using a seeded `Generator`.
Red flags:

- `scaler.fit(X_all)` followed by per-split application → leakage. Re-do as `scaler.fit(X_train)`.
- `train_test_split(X, y, random_state=...)` on temporal data → time leak.

Related skills:

- `skills/stage1-preflight-asserts` (assert 6 specifically).
- `skills/stage2-overfit-single-batch` — once leakage is ruled out, sanity check the model.