From curry-train
Verify the loss at step 0 matches the value implied by a uniform-random model — typically -log(1/C) for C-way cross-entropy. Catches initialization bugs, double-softmax, missing bias init, and hidden activation issues. Activate when the user asks "what should my initial loss be", "init loss seems wrong", "is my model initialized correctly", or right after building a new model.
```shell
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train
```
This skill uses the workspace's default tool permissions.
A 1-second check that catches a class of subtle bugs you'd otherwise discover hours into training.
"Does the loss at step 0 match what an uninformative model would produce?"
For a freshly initialized model that has not seen data, the loss should be close to the value an uninformative (uniform) model would produce. The exact value depends on the loss:
| Loss | Expected init value |
|---|---|
| Cross-entropy over C classes (uniform softmax) | -log(1/C) = log(C) |
| Binary cross-entropy (uniform sigmoid) | log(2) ≈ 0.693 |
| MSE on standardized targets | mean(target²) ≈ 1.0 if targets are unit-variance |
| MAE on standardized targets | mean(\|target\|) ≈ 0.8 if targets are standard normal |
| InfoNCE (contrastive, batch size N) | log(N) |
| Token-level LM cross-entropy with vocab V | log(V) |
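The table's expected values can be computed directly — a minimal sketch in plain Python (the class, batch, and vocab sizes below are illustrative, not prescribed by the skill):

```python
import math

C = 10        # number of classes (illustrative)
N = 256       # contrastive batch size (illustrative)
V = 50257     # vocab size (illustrative)

ce_uniform = -math.log(1 / C)   # == math.log(C): C-way cross-entropy
bce_uniform = math.log(2)       # binary cross-entropy at p = 0.5
infonce = math.log(N)           # InfoNCE over N in-batch candidates
lm_ce = math.log(V)             # token-level LM cross-entropy

print(f"CE({C} classes): {ce_uniform:.3f}")   # ≈ 2.303
print(f"BCE:            {bce_uniform:.3f}")   # ≈ 0.693
print(f"InfoNCE(N={N}): {infonce:.3f}")       # ≈ 5.545
print(f"LM CE(V={V}): {lm_ce:.3f}")           # ≈ 10.825
```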
If the observed initial loss is more than 2× the expected value, the model is producing confident wrong predictions at init — usually because:

- the weight initialization scale is too large, so the logits start far from zero;
- a softmax (or log-softmax) is applied before the loss (e.g. before `CrossEntropyLoss`, which expects raw logits);
- the final-layer bias is missing its initialization or set to large values.

If the observed initial loss is lower than expected, the model has somehow already encoded prior information — which usually means a bug in the data loader (label leakage) or in initialization.
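The "confident wrong predictions" failure mode is easy to reproduce without a real model — a plain-Python sketch (scales and sizes are illustrative) comparing the cross-entropy of random logits at a small vs. a large init scale:

```python
import math
import random

def mean_ce(num_classes, scale, n_samples=2000, seed=0):
    """Mean cross-entropy of logits ~ N(0, scale) against a fixed label (class 0)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        logits = [rng.gauss(0.0, scale) for _ in range(num_classes)]
        # stable log-sum-exp, then -log softmax(logits)[0]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[0]
    return total / n_samples

C = 10
print(f"expected:   {math.log(C):.3f}")        # ≈ 2.303
print(f"scale 0.01: {mean_ce(C, 0.01):.3f}")   # ≈ log(C): near-uniform softmax
print(f"scale 10:   {mean_ce(C, 10.0):.3f}")   # >> log(C): confident and wrong
```

With a tiny scale the softmax is nearly uniform and the loss sits at log(C); with a large scale the model bets heavily on an arbitrary class and the loss blows past the 2× band.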
```python
import math

import torch


def check_init_loss(model, dummy_batch, loss_fn, num_classes):
    """Assert that an untrained model's loss is close to log(num_classes)."""
    model.eval()
    with torch.no_grad():
        logits = model(dummy_batch.x)
        loss = loss_fn(logits, dummy_batch.y).item()
    expected = math.log(num_classes)
    ratio = loss / expected
    if not (0.7 < ratio < 1.5):
        raise AssertionError(
            f"init loss {loss:.4f} ≠ expected {expected:.4f} "
            f"(ratio {ratio:.2f}). See stage2-init-loss-check skill."
        )
```
This is implemented as assert_init_loss_is_reasonable in template/curry_train/infra/preflight.py.
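When the user reports a loss number rather than running code, the same band check can be applied without any framework — a hypothetical plain-Python helper (the name `init_loss_ratio_ok` and its band bounds mirror the snippet above, not an API of the skill):

```python
import math

def init_loss_ratio_ok(observed_loss, num_classes, lo=0.7, hi=1.5):
    """Return (ok, ratio): is the observed init loss within the expected band?

    Hypothetical framework-agnostic helper; the (0.7, 1.5) band matches
    the torch-based check above.
    """
    expected = math.log(num_classes)
    ratio = observed_loss / expected
    return lo < ratio < hi, ratio

ok, ratio = init_loss_ratio_ok(2.31, num_classes=10)
print(ok, round(ratio, 2))   # True 1.0  (2.31 / log(10) ≈ 1.003)
```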
If the dataset is class-imbalanced (priors p_1, ..., p_C ≠ uniform), set the final-layer bias to log(p_i) so the init loss is lower than -log(1/C) and matches the cross-entropy of the prior:
```python
final_layer.bias.data = torch.log(class_priors)
```
Init loss should then drop from log(C) to the entropy of the prior H(p).
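The prior-bias claim is easy to verify numerically — a plain-Python sketch with an illustrative 3-class prior, showing that a model whose softmax reproduces the prior scores H(p) rather than log(C):

```python
import math

priors = [0.7, 0.2, 0.1]                 # illustrative class frequencies
bias = [math.log(p) for p in priors]     # the final-layer bias trick

# softmax(bias) recovers the prior exactly
z = sum(math.exp(b) for b in bias)
softmax = [math.exp(b) / z for b in bias]

# Expected CE when labels are drawn from the prior = entropy H(p)
entropy = -sum(p * math.log(p) for p in priors)

print([round(s, 3) for s in softmax])    # [0.7, 0.2, 0.1]
print(f"H(p) = {entropy:.3f}  vs  log(C) = {math.log(3):.3f}")
# H(p) ≈ 0.802 < log(3) ≈ 1.099
```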
1. Ask the user what loss they're using and how many classes (or vocab size) the head produces.
2. Compute the expected init loss from the table above.
3. Have the user run a single forward pass on a dummy batch (no backward, no optimizer step) and report the loss.
4. Compare. If the observed loss is within 2× of the expected value, pass. If not, work through the common causes above.
5. If the user is doing class-imbalanced classification, suggest the final-layer bias trick.
- For losses not covered by the table, there may be no clean analogue of -log(1/C). Don't enforce the equation; instead, sanity check that the loss is finite (not NaN or infinity).
- The check assumes `loss_fn` accepts raw logits. Many bugs are caused by violating this assumption.
- Related: `skills/stage1-preflight-asserts` (assert 4).
- Next: `skills/stage2-overfit-single-batch` — the next thing to do if init loss looks fine.