A canonical set of low-cost assertions to run before any non-trivial training, catching the most common "silent" bugs (zero_grad missed, train/eval mode wrong, wrong tensor shape, label leakage in transforms). Activate when the user asks "what should I check before training", "preflight checks", "is my training set up correctly", or any time a fresh model is about to be trained.
Install:

```
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train
```

This skill uses the workspace's default tool permissions.
A short, deterministic set of assertions that runs in a few seconds and catches the most common bugs that would otherwise waste hours of compute. These run **before the first real training step**, not as part of normal training.
"If something is silently wrong with my pipeline, will I know within 10 seconds?"
Each assert is one function that either passes silently or raises a structured `PreflightError` with a remediation hint. Implementations live in `template/curry_train/infra/preflight.py`.
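A minimal sketch of what such a structured error type could look like. This is hypothetical: the real class lives in `template/curry_train/infra/preflight.py`, and the field names here are inferred from the printing snippet later in this document (`e.code`, `e.message`, `e.remediation`).

```python
from dataclasses import dataclass

# Hypothetical sketch of the error type; fields are inferred from how
# errors are printed later in this document, not from the real source.
@dataclass
class PreflightError(Exception):
    code: str
    message: str
    remediation: str

err = PreflightError(
    code="P001",
    message="parameter fc.weight received no gradient",
    remediation="check for .detach() or requires_grad=False in the forward pass",
)
```

Subclassing `Exception` lets a check either raise the error directly or hand it back to a collector.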
1. `assert_zero_grad_idempotent(model, optimizer)`: Run two consecutive `optimizer.zero_grad(set_to_none=True)` calls and confirm no parameter has a non-None `.grad`. Catches optimizers that hold dangling state.
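A minimal sketch of this check, assuming a standard PyTorch model and optimizer (the real implementation is in `template/curry_train/infra/preflight.py`):

```python
import torch
import torch.nn as nn

def assert_zero_grad_idempotent(model, optimizer):
    # Two consecutive zero_grad(set_to_none=True) calls; afterwards
    # every parameter's .grad must be None.
    optimizer.zero_grad(set_to_none=True)
    optimizer.zero_grad(set_to_none=True)
    stale = [n for n, p in model.named_parameters() if p.grad is not None]
    if stale:
        raise RuntimeError(f"gradients not cleared for: {stale}")

model = nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
model(torch.randn(3, 4)).sum().backward()  # populate .grad first
assert_zero_grad_idempotent(model, opt)    # passes silently
```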
2. `assert_train_eval_mode_clean(model)`: Toggle `model.train()` and `model.eval()` once each, confirming `model.training` ends up where the caller asked. Catches modules that override `.train()`/`.eval()` incorrectly (common in custom modules with sub-optimizers).
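A sketch of the idea, with a deliberately buggy module whose `.train()` override forgets to set the flag. This is an illustration, not the real implementation:

```python
import torch.nn as nn

def assert_train_eval_mode_clean(model):
    model.train()
    if not all(m.training for m in model.modules()):
        raise RuntimeError("a submodule ignored .train()")
    model.eval()
    if any(m.training for m in model.modules()):
        raise RuntimeError("a submodule ignored .eval()")

class BadModule(nn.Module):
    def train(self, mode=True):
        return self  # bug: never updates self.training or its children

model = nn.Sequential(nn.Linear(2, 2), BadModule())
try:
    assert_train_eval_mode_clean(model)
    caught = False
except RuntimeError:
    caught = True  # BadModule stays stuck in train mode after .eval()
```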
3. `assert_input_shape_contract(model, dummy_batch)`: Run one forward pass on `dummy_batch` and confirm the output shape matches the model's documented contract. Catches silent broadcasting bugs (the most insidious source of wrong-but-not-erroring training).
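A sketch of the shape check. The explicit `expected_shape` argument is an illustrative assumption; the real signature takes only `(model, dummy_batch)` and reads the contract from the model package:

```python
import torch
import torch.nn as nn

def assert_input_shape_contract(model, dummy_batch, expected_shape):
    # expected_shape is hypothetical here; the real check reads the
    # model's documented contract instead of taking it as an argument.
    out = model(dummy_batch)
    if tuple(out.shape) != tuple(expected_shape):
        raise RuntimeError(
            f"output shape {tuple(out.shape)} != contract {tuple(expected_shape)}"
        )

model = nn.Linear(8, 3)
dummy = torch.randn(5, 8)
assert_input_shape_contract(model, dummy, (5, 3))  # passes silently
```

Why it matters: a `(5,)` target against a `(5, 1)` prediction silently broadcasts to `(5, 5)` in many elementwise losses, training on garbage without erroring.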
4. `assert_init_loss_is_reasonable(model, dummy_batch, loss_fn)`: For a classification head with C classes, the initial loss should be close to `-log(1/C)` (cross-entropy at a uniform prediction). Reject if the initial loss is more than 2× that value — usually means the final-layer bias was not initialized correctly, or a softmax was applied before the loss.
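A quick worked sketch of the `-log(1/C)` heuristic, using an illustrative 10-class head with near-zero logits:

```python
import math
import torch
import torch.nn as nn

C = 10  # illustrative class count
head = nn.Linear(32, C)
nn.init.zeros_(head.bias)

x = torch.randn(64, 32) * 0.01           # near-zero features -> near-uniform logits
y = torch.randint(0, C, (64,))
loss = nn.CrossEntropyLoss()(head(x), y).item()

expected = -math.log(1.0 / C)            # = log(C), about 2.303 for C = 10
assert loss < 2 * expected, f"init loss {loss:.3f}, expected about {expected:.3f}"
```

A loss far above `2 * expected` at step zero usually means a bad final-layer bias or a softmax applied before a loss that already expects raw logits.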
5. `assert_grad_flows_to_inputs(model, dummy_batch, loss_fn)`: Backward once on the loss computed from the output, then verify every `nn.Parameter` that requires gradient has a non-zero gradient. Catches dead layers and detached subgraphs.
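A sketch of this check, demonstrated against a module with a deliberate `.detach()` bug (an illustration, not the real implementation):

```python
import torch
import torch.nn as nn

def assert_grad_flows(model, x, y, loss_fn):
    model.zero_grad(set_to_none=True)
    loss_fn(model(x), y).backward()
    dead = [n for n, p in model.named_parameters()
            if p.requires_grad and (p.grad is None or p.grad.abs().sum() == 0)]
    if dead:
        raise RuntimeError(f"no gradient reached: {dead}")

class DetachedBug(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = nn.Linear(4, 4)
        self.b = nn.Linear(4, 2)
    def forward(self, x):
        return self.b(self.a(x).detach())  # bug: .detach() cuts grads to self.a

x, y = torch.randn(8, 4), torch.randint(0, 2, (8,))
try:
    assert_grad_flows(DetachedBug(), x, y, nn.CrossEntropyLoss())
    caught = False
except RuntimeError:
    caught = True  # self.a's parameters receive no gradient
```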
6. `assert_no_leak_in_data_pipeline(train_pipeline, val_pipeline)`: Take 10 samples from train and val, and confirm the normalizer / tokenizer / feature-selector was fit on train only by checking that re-fitting on val would change it. See stage1-data-pipeline for the full procedure.
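A heavily simplified sketch of the probe idea, using a mean-only "normalizer" on plain floats. The real procedure (stage1-data-pipeline) probes the actual pipeline's fitted objects; everything below is illustrative:

```python
def fit_mean(samples):
    return sum(samples) / len(samples)

def assert_no_leak(fitted_mean, train_probe, val_probe, tol=1e-9):
    train_fit = fit_mean(train_probe)
    val_fit = fit_mean(val_probe)
    # The stored statistic must match a fresh fit on train...
    if abs(fitted_mean - train_fit) > tol:
        raise RuntimeError("statistic does not match a train-only fit")
    # ...and re-fitting on val must change it; if it does not, either val
    # leaked into the fit or the probe is inconclusive.
    if abs(fitted_mean - val_fit) <= tol:
        raise RuntimeError("re-fitting on val leaves the statistic unchanged")

train = [1.0, 2.0, 3.0]
val = [10.0, 20.0, 30.0]
assert_no_leak(fit_mean(train), train, val)  # passes: fit on train only
```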
7. `assert_dropout_off_in_eval(model)`: Forward the same input through `model.eval()` twice and confirm bit-exact equality of outputs. Catches stochastic layers that were left active in eval mode.
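A sketch of the bit-exact comparison, with a deliberately buggy module that hard-codes dropout on (illustrative, not the real implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def assert_dropout_off_in_eval(model, x):
    model.eval()
    with torch.no_grad():
        a, b = model(x), model(x)
    if not torch.equal(a, b):  # bit-exact comparison, not allclose
        raise RuntimeError("model output is stochastic in eval mode")

class StuckDropout(nn.Module):
    def forward(self, x):
        # bug: training=True keeps dropout active, ignoring .eval()
        return F.dropout(x, p=0.5, training=True)

x = torch.randn(4, 16)
assert_dropout_off_in_eval(nn.Sequential(nn.Linear(16, 16), nn.Dropout(0.5)), x)  # passes
try:
    assert_dropout_off_in_eval(StuckDropout(), x)
    caught = False
except RuntimeError:
    caught = True
```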
8. `assert_optimizer_groups_cover_all_params(model, optimizer)`: Iterate over `model.parameters()` and confirm each one appears in some `optimizer.param_groups[*].params`. Catches the "I added a new module but forgot to put its params in the optimizer" bug.
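A sketch of the coverage check, with the exact bug it targets reproduced on purpose (illustrative, not the real implementation):

```python
import torch
import torch.nn as nn

def assert_optimizer_groups_cover_all_params(model, optimizer):
    covered = {id(p) for g in optimizer.param_groups for p in g["params"]}
    missing = [n for n, p in model.named_parameters()
               if p.requires_grad and id(p) not in covered]
    if missing:
        raise RuntimeError(f"params in no optimizer group: {missing}")

model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))
opt = torch.optim.SGD(model[0].parameters(), lr=0.1)  # bug: forgot model[1]
try:
    assert_optimizer_groups_cover_all_params(model, opt)
    caught = False
except RuntimeError:
    caught = True  # 1.weight and 1.bias are missing from the optimizer
```

Comparing by `id()` handles the common case where the same tensor object is shared across groups.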
```python
import sys

from curry_train.infra.preflight import run_preflight

errors = run_preflight(
    model=model,
    optimizer=optimizer,
    loss_fn=loss_fn,
    dummy_batch=dummy_batch,
    train_pipeline=train_pipeline,
    val_pipeline=val_pipeline,
    asserts="all",  # or a list of names
)
if errors:
    for e in errors:
        print(f"[{e.code}] {e.message} → fix: {e.remediation}")
    sys.exit(1)
```
`run_preflight` returns a list of `PreflightError`s (empty on success) rather than raising on the first failure, so the user sees all problems at once.
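The collect-don't-raise behavior can be sketched in a few lines. This is a hypothetical simplification of what `run_preflight` does internally, not the real implementation:

```python
def run_preflight_sketch(checks):
    # Run every check; collect failures instead of raising on the first one.
    errors = []
    for check in checks:
        try:
            check()
        except Exception as e:  # the real code would catch PreflightError
            errors.append(e)
    return errors  # empty list on success

def ok():
    pass

def fails():
    raise ValueError("boom")

errors = run_preflight_sketch([ok, fails])  # one error collected, ok still ran
```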
If the user has not yet wired up `run_preflight()`, copy the boilerplate above into their training entry point, right before the optimizer-step loop.
If `dummy_batch` is missing, point them at stage1-scaffolder — every model package should expose a `dummy_batch()` factory.
After the first run, walk through any failed asserts in the order listed above (1 before 2 before 3, etc.). The order is intentional: later asserts assume earlier ones pass.
Treat preflight failures as blocking. Do not proceed to Stage 2 until all eight pass.
Common symptoms and the assert that catches them:

- Missed `optimizer.zero_grad()` between batches → assert 1.
- A `tensor.detach()` slipped in somewhere (detached subgraph) → assert 5.

Related:

- skills/stage1-scaffolder — what to scaffold so the asserts have something to check.
- skills/stage2-overfit-single-batch — the next sanity step once preflight passes.
- template/curry_train/infra/preflight.py — the implementations.