From curry-train
The canonical sanity check — train on 2–3 examples until the loss is near zero, proving the entire forward/backward/optimizer pipeline can fit the data. Activate when the user asks "how do I sanity check my model", "overfit one batch", "is my pipeline working", "loss is not decreasing", or before any real training of a fresh architecture.
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train

This skill uses the workspace's default tool permissions.
The cheapest and highest-information sanity check in deep learning. If your model **cannot** drive the loss on 2–3 fixed examples to near zero, the pipeline is broken — there is no point trying to train on a real dataset.
"Can the optimizer drive the training loss on a tiny fixed batch all the way to (near) zero?"
If yes: pipeline plumbing is correct. Move to Stage 3. If no: stop, fix the bug, do not try a larger run "to see what happens".
Pick 2–3 examples. Use dummy_batch() from the model package, or take the first batch from the train loader and pin it. Save it to disk so successive runs are bit-identical.
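A minimal sketch of pinning a batch, assuming a standard PyTorch DataLoader named `train_loader` that yields (inputs, targets) pairs — the names and file path are placeholders, not part of the skill:

```python
import torch

# Grab the first real batch once and freeze it on disk so every run
# trains on bit-identical tensors.
inputs, targets = next(iter(train_loader))
torch.save({"inputs": inputs, "targets": targets}, "fixed_batch.pt")

# Subsequent runs reload exactly the same batch:
batch = torch.load("fixed_batch.pt")
inputs, targets = batch["inputs"], batch["targets"]
```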
Disable everything that adds noise: put dropout layers in eval() mode, or set p=0.

Train for ~500 steps on the same batch. No data shuffling, no new examples. Same forward, same backward, same optimizer step, repeated.
Plot or print the loss every step. It must decrease monotonically (with the usual jitter of momentum-based optimizers) and approach zero. For classification, "near zero" means < 0.05; for regression, < 1e-4 of the original loss; for autoregressive LM, < 0.5 if vocab is large.
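A minimal sketch of the loop, assuming a classification model plus the pinned `inputs`/`targets` from above; the optimizer choice (Adam), LR, and step count are placeholder defaults, not the skill's prescribed script:

```python
import torch

# Zero out dropout so repeated steps on the same fixed batch are deterministic.
for m in model.modules():
    if isinstance(m, torch.nn.Dropout):
        m.p = 0.0

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

losses = []
for step in range(500):
    optimizer.zero_grad()
    logits = model(inputs)           # same pinned batch, every step
    loss = loss_fn(logits, targets)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
    if step % 50 == 0:
        print(f"step {step:4d}  loss {loss.item():.6f}")

# Classification threshold from above: "near zero" means < 0.05.
assert losses[-1] < 0.05, "could not overfit the fixed batch: pipeline bug"
```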
If loss does not approach zero, you have a bug. Diagnose in this order before tweaking anything else:
1. The optimizer never received the model's parameters. (Confirm optimizer.state_dict()['state'] is non-empty after one step.)
2. Gradients do not reach every layer. (Run stage2-grad-flow-viz to find dead layers.)
3. The loss function gets the wrong inputs. (A softmax applied before CrossEntropyLoss is a double-softmax bug.)

Generalization is hard; memorization of 2 examples is trivial. A correct pipeline can always memorize. A broken pipeline often fails first at memorization, and the failure mode is informative: it tells you the bug is in plumbing, not in hyperparameters.
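A sketch of the first two checks in code, assuming the model and Adam optimizer from the loop above. (Caveat: a stateless optimizer such as plain SGD without momentum keeps an empty state dict even when wired correctly, so check 1 only applies to stateful optimizers.)

```python
# 1. Did the optimizer actually receive the model's parameters?
#    For a stateful optimizer like Adam, per-parameter state must be
#    non-empty after one step.
assert len(optimizer.state_dict()["state"]) > 0, \
    "optimizer has no state -- was it constructed with the wrong params?"

# 2. Do gradients reach every trainable layer? Dead layers show up as
#    parameters whose .grad is None (or all zeros) after backward().
for name, p in model.named_parameters():
    if p.requires_grad and (p.grad is None or p.grad.abs().sum() == 0):
        print(f"no gradient flowing to: {name}")
```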
If you skip this step and go straight to a real training run, you may spend hours waiting to discover that, e.g., your loss never decreases — and you won't be able to tell whether the model is "learning slowly" or "broken".
Confirm the user has a dummy_batch() or a saved fixed batch. If not, help them create one (just save the first real batch to a .pt file).
Help them produce a 30-line script that loads the fixed batch, runs the loop, and plots the loss. The simpler the script, the more useful.
After running, ask them to share the loss trajectory. If it didn't go near zero, walk through the diagnosis order above.
If the loss did go near zero, congratulate them — they're through Stage 2's most important check. Suggest stage2-init-loss-check and stage2-grad-flow-viz as the remaining two checks.
If the loss decreases but plateaus far above zero, suspect frozen parameters (requires_grad=False somewhere) or an LR that is way too low.

Related skills:
- skills/stage2-init-loss-check — verify the initial loss is sane before training starts.
- skills/stage2-grad-flow-viz — if overfit fails, look for dead layers.
- skills/stage1-preflight-asserts — should have caught most of these before this stage.