From curry-train
The canonical sanity check — train on 2–3 examples until the loss is near zero, proving the entire forward/backward/optimizer pipeline can fit the data. Activate when the user asks "how do I sanity check my model", "overfit one batch", "is my pipeline working", "loss is not decreasing", or before any real training of a fresh architecture.
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train

This skill uses the workspace's default tool permissions.
The cheapest and highest-information sanity check in deep learning. If your model **cannot** drive the loss on 2–3 fixed examples to near zero, the pipeline is broken — there is no point trying to train on a real dataset.
"Can the optimizer drive the training loss on a tiny fixed batch all the way to (near) zero?"
If yes: pipeline plumbing is correct. Move to Stage 3. If no: stop, fix the bug, do not try a larger run "to see what happens".
Pick 2–3 examples. Use dummy_batch() from the model package, or take the first batch from the train loader and pin it. Save it to disk so successive runs are bit-identical.
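A minimal sketch of pinning a batch, assuming a standard PyTorch DataLoader named `train_loader` that yields (inputs, targets) pairs — the names and file path are placeholders, not part of the skill:

```python
import torch

# Grab the first real batch once and freeze it on disk so every run
# trains on bit-identical tensors.
inputs, targets = next(iter(train_loader))
torch.save({"inputs": inputs, "targets": targets}, "fixed_batch.pt")

# Subsequent runs reload exactly the same batch:
batch = torch.load("fixed_batch.pt")
inputs, targets = batch["inputs"], batch["targets"]
```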
Disable everything that adds noise: put dropout layers in eval() mode, or set p=0.

Train for ~500 steps on the same batch. No data shuffling, no new examples. Same forward, same backward, same optimizer step, repeated.
Plot or print the loss every step. It must decrease monotonically (with the usual jitter of momentum-based optimizers) and approach zero. For classification, "near zero" means < 0.05; for regression, < 1e-4 of the original loss; for autoregressive LM, < 0.5 if vocab is large.
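A minimal sketch of the loop, assuming a classification model plus the pinned `inputs`/`targets` from above; the optimizer choice (Adam), LR, and step count are placeholder defaults, not the skill's prescribed script:

```python
import torch

# Zero out dropout so repeated steps on the same fixed batch are deterministic.
for m in model.modules():
    if isinstance(m, torch.nn.Dropout):
        m.p = 0.0

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

losses = []
for step in range(500):
    optimizer.zero_grad()
    logits = model(inputs)           # same pinned batch, every step
    loss = loss_fn(logits, targets)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
    if step % 50 == 0:
        print(f"step {step:4d}  loss {loss.item():.6f}")

# Classification threshold from above: "near zero" means < 0.05.
assert losses[-1] < 0.05, "could not overfit the fixed batch: pipeline bug"
```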
If loss does not approach zero, you have a bug. Diagnose in this order before tweaking anything else:
1. The optimizer never received the model's parameters. (Confirm optimizer.state_dict()['state'] is non-empty after one step.)
2. Gradients do not reach every layer. (Run stage2-grad-flow-viz to find dead layers.)
3. The loss function gets the wrong inputs. (A softmax applied before CrossEntropyLoss is a double-softmax bug.)

Generalization is hard; memorization of 2 examples is trivial. A correct pipeline can always memorize. A broken pipeline often fails first at memorization, and the failure mode is informative: it tells you the bug is in plumbing, not in hyperparameters.
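A sketch of the first two checks in code, assuming the model and Adam optimizer from the loop above. (Caveat: a stateless optimizer such as plain SGD without momentum keeps an empty state dict even when wired correctly, so check 1 only applies to stateful optimizers.)

```python
# 1. Did the optimizer actually receive the model's parameters?
#    For a stateful optimizer like Adam, per-parameter state must be
#    non-empty after one step.
assert len(optimizer.state_dict()["state"]) > 0, \
    "optimizer has no state -- was it constructed with the wrong params?"

# 2. Do gradients reach every trainable layer? Dead layers show up as
#    parameters whose .grad is None (or all zeros) after backward().
for name, p in model.named_parameters():
    if p.requires_grad and (p.grad is None or p.grad.abs().sum() == 0):
        print(f"no gradient flowing to: {name}")
```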
If you skip this step and go straight to a real training run, you may spend hours waiting to discover that, e.g., your loss never decreases — and you won't be able to tell whether the model is "learning slowly" or "broken".
Confirm the user has a dummy_batch() or a saved fixed batch. If not, help them create one (just save the first real batch to a .pt file).
Help them produce a 30-line script that loads the fixed batch, runs the loop, and plots the loss. The simpler the script, the more useful.
After running, ask them to share the loss trajectory. If it didn't go near zero, walk through the diagnosis order above.
If the loss did go near zero, congratulate them — they're through Stage 2's most important check. Suggest stage2-init-loss-check and stage2-grad-flow-viz as the remaining two checks.
If the loss decreases but plateaus far above zero, suspect frozen parameters (requires_grad=False somewhere) or an LR that is way too low.

Related skills:
- skills/stage2-init-loss-check — verify the initial loss is sane before training starts.
- skills/stage2-grad-flow-viz — if overfit fails, look for dead layers.
- skills/stage1-preflight-asserts — should have caught most of these before this stage.