From curry-train
Verify the loss at step 0 matches the value implied by a uniform-random model — typically -log(1/C) for C-way cross-entropy. Catches initialization bugs, double-softmax, missing bias init, and hidden activation issues. Activate when the user asks "what should my initial loss be", "init loss seems wrong", "is my model initialized correctly", or right after building a new model.
```shell
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train
```
This skill uses the workspace's default tool permissions.
A 1-second check that catches a class of subtle bugs you'd otherwise discover hours into training.
"Does the loss at step 0 match what an uninformative model would produce?"
For a freshly initialized model that has not seen data, the loss should be close to the value an uninformative (uniform) model would produce. The exact value depends on the loss:
| Loss | Expected init value |
|---|---|
| Cross-entropy over C classes (uniform softmax) | -log(1/C) = log(C) |
| Binary cross-entropy (uniform sigmoid) | log(2) ≈ 0.693 |
| MSE on standardized targets | mean(target²) ≈ 1.0 if targets are unit-variance |
| MAE on standardized targets | mean(\|target\|) ≈ 0.8 if targets are standard normal |
| InfoNCE (contrastive, batch size N) | log(N) |
| Token-level LM cross-entropy with vocab V | log(V) |
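The table's expected values can be computed directly — a minimal sketch in plain Python (the class, batch, and vocab sizes below are illustrative, not prescribed by the skill):

```python
import math

C = 10        # number of classes (illustrative)
N = 256       # contrastive batch size (illustrative)
V = 50257     # vocab size (illustrative)

ce_uniform = -math.log(1 / C)   # == math.log(C): C-way cross-entropy
bce_uniform = math.log(2)       # binary cross-entropy at p = 0.5
infonce = math.log(N)           # InfoNCE over N in-batch candidates
lm_ce = math.log(V)             # token-level LM cross-entropy

print(f"CE({C} classes): {ce_uniform:.3f}")   # ≈ 2.303
print(f"BCE:            {bce_uniform:.3f}")   # ≈ 0.693
print(f"InfoNCE(N={N}): {infonce:.3f}")       # ≈ 5.545
print(f"LM CE(V={V}): {lm_ce:.3f}")           # ≈ 10.825
```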
If the observed initial loss is more than 2× the expected value, the model is producing confident wrong predictions at init — usually because:

- the weight initialization scale is too large, so the logits start far from zero;
- a softmax (or log-softmax) is applied before the loss (e.g. before `CrossEntropyLoss`, which expects raw logits);
- the final-layer bias is missing its initialization or set to large values.

If the observed initial loss is lower than expected, the model has somehow already encoded prior information — which usually means a bug in the data loader (label leakage) or in initialization.
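The "confident wrong predictions" failure mode is easy to reproduce without a real model — a plain-Python sketch (scales and sizes are illustrative) comparing the cross-entropy of random logits at a small vs. a large init scale:

```python
import math
import random

def mean_ce(num_classes, scale, n_samples=2000, seed=0):
    """Mean cross-entropy of logits ~ N(0, scale) against a fixed label (class 0)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        logits = [rng.gauss(0.0, scale) for _ in range(num_classes)]
        # stable log-sum-exp, then -log softmax(logits)[0]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[0]
    return total / n_samples

C = 10
print(f"expected:   {math.log(C):.3f}")        # ≈ 2.303
print(f"scale 0.01: {mean_ce(C, 0.01):.3f}")   # ≈ log(C): near-uniform softmax
print(f"scale 10:   {mean_ce(C, 10.0):.3f}")   # >> log(C): confident and wrong
```

With a tiny scale the softmax is nearly uniform and the loss sits at log(C); with a large scale the model bets heavily on an arbitrary class and the loss blows past the 2× band.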
```python
import math

import torch


def check_init_loss(model, dummy_batch, loss_fn, num_classes):
    """Assert that an untrained model's loss is close to log(num_classes)."""
    model.eval()
    with torch.no_grad():
        logits = model(dummy_batch.x)
        loss = loss_fn(logits, dummy_batch.y).item()
    expected = math.log(num_classes)
    ratio = loss / expected
    if not (0.7 < ratio < 1.5):
        raise AssertionError(
            f"init loss {loss:.4f} ≠ expected {expected:.4f} "
            f"(ratio {ratio:.2f}). See stage2-init-loss-check skill."
        )
```
This is implemented as assert_init_loss_is_reasonable in template/curry_train/infra/preflight.py.
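When the user reports a loss number rather than running code, the same band check can be applied without any framework — a hypothetical plain-Python helper (the name `init_loss_ratio_ok` and its band bounds mirror the snippet above, not an API of the skill):

```python
import math

def init_loss_ratio_ok(observed_loss, num_classes, lo=0.7, hi=1.5):
    """Return (ok, ratio): is the observed init loss within the expected band?

    Hypothetical framework-agnostic helper; the (0.7, 1.5) band matches
    the torch-based check above.
    """
    expected = math.log(num_classes)
    ratio = observed_loss / expected
    return lo < ratio < hi, ratio

ok, ratio = init_loss_ratio_ok(2.31, num_classes=10)
print(ok, round(ratio, 2))   # True 1.0  (2.31 / log(10) ≈ 1.003)
```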
If the dataset is class-imbalanced (priors p_1, ..., p_C ≠ uniform), set the final-layer bias to log(p_i) so the init loss is lower than -log(1/C) and matches the cross-entropy of the prior:
```python
final_layer.bias.data = torch.log(class_priors)
```
Init loss should then drop from log(C) to the entropy of the prior H(p).
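The prior-bias claim is easy to verify numerically — a plain-Python sketch with an illustrative 3-class prior, showing that a model whose softmax reproduces the prior scores H(p) rather than log(C):

```python
import math

priors = [0.7, 0.2, 0.1]                 # illustrative class frequencies
bias = [math.log(p) for p in priors]     # the final-layer bias trick

# softmax(bias) recovers the prior exactly
z = sum(math.exp(b) for b in bias)
softmax = [math.exp(b) / z for b in bias]

# Expected CE when labels are drawn from the prior = entropy H(p)
entropy = -sum(p * math.log(p) for p in priors)

print([round(s, 3) for s in softmax])    # [0.7, 0.2, 0.1]
print(f"H(p) = {entropy:.3f}  vs  log(C) = {math.log(3):.3f}")
# H(p) ≈ 0.802 < log(3) ≈ 1.099
```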
1. Ask the user what loss they're using and how many classes (or vocab size) the head produces.
2. Compute the expected init loss from the table above.
3. Have the user run a single forward pass on a dummy batch (no backward, no optimizer step) and report the loss.
4. Compare. If the observed loss is within 2× of the expected value, pass. If not, work through the common causes above.
5. If the user is doing class-imbalanced classification, suggest the final-layer bias trick.
- For losses not covered by the table, there may be no clean analogue of -log(1/C). Don't enforce the equation; instead, sanity check that the loss is finite (not NaN or infinity).
- The check assumes `loss_fn` accepts raw logits. Many bugs are caused by violating this assumption.
- Related: `skills/stage1-preflight-asserts` (assert 4).
- Next: `skills/stage2-overfit-single-batch` — the next thing to do if init loss looks fine.