From curry-train
Verify that activation statistics and gradient magnitudes are width-invariant under muP parameterization, so that hyperparameters tuned at small width transfer zero-shot to large width. Activate when the user asks "muP", "muTransfer", "tune small predict big", "coord check", "width-invariant init", or before any large-scale training where they want to skip per-width hyperparameter tuning.
`npx claudepluginhub curryfromuestc/curry-train --plugin curry-train`

This skill uses the workspace's default tool permissions.
A diagnostic that proves your model is muP-compliant — meaning hyperparameters (LR, init scale, multipliers) tuned at, say, width 256 transfer zero-shot to width 4096. This is **the** way to avoid re-tuning at every model size.
"If I tune learning rate at a small width, will the same learning rate be optimal at a 16× larger width?"
If muP coord check passes: yes, and you can save enormous compute. If it fails: tuning at small scale predicts nothing, and you are back to per-width sweeps.
muP (maximal update parameterization) is a set of init-scale and learning-rate multipliers that keep activations and updates O(1) per coordinate as width grows. The key consequence: hyperparameter optima become approximately width-independent, so tuning at a small width predicts the optimum at a large one.
The coord check is the test that confirms muP is correctly implemented for your specific model.
If they do not overlap, muP is not correctly implemented in your model — and per-width transfer will not work.
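As a concrete illustration, here is a minimal sketch of Adam-style muP multipliers for hidden (matrix-like) weights. The helper name `mup_scales` and the flat dict format are assumptions for illustration; the exact per-parameter-type rules are in the Tensor Programs V appendix:

```python
import math

def mup_scales(width, base_width=256, base_lr=1e-3):
    """Hypothetical helper: Adam-style muP multipliers for hidden weights.
    Init std shrinks like 1/sqrt(fan_in); the hidden learning rate and the
    output-logit multiplier shrink like 1/width relative to base_width."""
    m = width / base_width  # width multiplier
    return {
        "init_std": 1.0 / math.sqrt(width),  # per-coordinate init scale
        "hidden_lr": base_lr / m,            # Adam LR for hidden matrices
        "output_mult": 1.0 / m,              # multiplier on output logits
    }

# Doubling the width halves the hidden LR and the output multiplier.
print(mup_scales(512))
```

This is why tuning `base_lr` at `base_width` is meaningful: the multipliers, not a fresh sweep, carry it to larger widths.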
For a transformer block at step k, record per layer:

- activation std (per coordinate)
- gradient magnitude
- relative weight drift ‖W_k − W_0‖_F / ‖W_0‖_F

Each of these should be similar across widths; their derivative with respect to width should be ~0. If activation std at width 1024 is 1.4× that at width 256, muP is broken.
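The drift ratio can be sketched framework-free; the helper name and the list-of-lists weight format here are assumptions for illustration:

```python
import math

def rel_frobenius_drift(w0, wk):
    """Relative weight drift ||W_k - W_0||_F / ||W_0||_F,
    with weights as nested lists of floats."""
    num = math.sqrt(sum((a - b) ** 2
                        for r0, rk in zip(w0, wk)
                        for a, b in zip(r0, rk)))
    den = math.sqrt(sum(a ** 2 for row in w0 for a in row))
    return num / den

# One off-diagonal change of size 1 against an identity-like 2x2 weight:
print(rel_frobenius_drift([[1, 0], [0, 1]], [[1, 1], [0, 1]]))  # → ~0.707
```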
Lives at template/curry_train/prevalidate/mup_coord.py. Sketch:
```python
from collections import defaultdict

def coord_check(build_model_at_width, widths=(256, 512, 1024),
                num_steps=3, batch_size=8):
    """Run a few training steps at each width and record per-layer stats.

    Returns dict[width -> dict[layer_name -> dict[step -> stats]]].
    Caller plots width-overlaid curves.
    """
    results = {}
    for w in widths:
        model = build_model_at_width(w)
        opt = ...  # muP-aware optimizer; see Tensor Programs V appendix
        records = defaultdict(dict)
        # register forward hooks here to record per-layer activation std
        for step in range(num_steps):
            x, y = make_batch(batch_size)
            opt.zero_grad()
            out = model(x)
            loss = loss_fn(out, y)
            loss.backward()
            opt.step()
            for name, mod in model.named_modules():
                ...  # record activation std for this layer at this step
        results[w] = records
    return results
```
The reference for what build_model_at_width should look like is the Tensor Programs V paper (Yang et al. 2022) and EleutherAI's practitioner blog post.
1. Confirm the user's model is intended to be muP-parameterized. If they're using standard PyTorch init, don't run a coord check — instead point them to the muP literature first.
2. Have them factor their model so that the architecture is parameterized by width: build_model(width=...). Most transformers can be parameterized this way without code surgery.
3. Run the coord check at three widths (256, 512, 1024). Plot per-layer activation std for each width, overlaid.
4. Pass criterion: the width-1024 curve is within ~10% of the width-256 curve at every layer index. If not, list the layers where they diverge — those layers are the bug.
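The pass criterion reduces to a small comparison helper; the function name and the flat {layer: std} format are assumptions for illustration:

```python
def divergent_layers(stats_small, stats_large, tol=0.10):
    """Return layers where the large-width activation std deviates from
    the small-width std by more than tol (relative)."""
    return [layer for layer, s in stats_small.items()
            if abs(stats_large[layer] - s) > tol * s]

# "mlp.out" is 1.4x its small-width std, so it is flagged as the bug:
print(divergent_layers({"attn.qk": 1.0, "mlp.out": 1.0},
                       {"attn.qk": 1.05, "mlp.out": 1.4}))  # → ['mlp.out']
```

The returned layer names are exactly the list to hand back to the user as the likely location of the parameterization bug.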
Common bugs surfaced:
- An init or multiplier missing its 1/sqrt(d) factor.
- Attention scaling Q^T K / sqrt(d) missing, or using the wrong base width.

Stage 3 is about proving an idea is worth scaling. If the user wants to use small-scale ablations to predict large-scale behavior, the only honest way is muP-correct training. Otherwise, small-scale results don't predict large-scale ones.
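On the attention-scaling bug: muP prescribes 1/d rather than 1/sqrt(d) scaling of attention logits in width. One common convention (an assumption here; check it against Tensor Programs V and your codebase) uses sqrt(base_d)/d, which scales like 1/d in width while reducing to the standard 1/sqrt(d) exactly at the base head dimension, so behavior at the tuning width is unchanged:

```python
import math

def attn_scale(head_dim, base_head_dim=64, mup=True):
    """Attention logit scale. Standard parameterization: 1/sqrt(d).
    muP-style (assumed sqrt(base_d)/d convention): matches the standard
    scale at head_dim == base_head_dim, then decays like 1/d."""
    if not mup:
        return 1.0 / math.sqrt(head_dim)
    return math.sqrt(base_head_dim) / head_dim

# At the base head dim both agree; at 2x width the muP scale is 1/d-like.
print(attn_scale(64), attn_scale(64, mup=False), attn_scale(128))
```

Using a wrong `base_head_dim` here is exactly the "wrong base width" failure the coord check surfaces.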
Coord check is the gate: if it fails, scaling decisions based on small-scale numbers are unreliable.
- skills/stage3-scaling-fit — fits the predicted-vs-actual curve at multiple sizes; muP makes the prediction defensible.
- skills/stage2-grad-flow-viz — informal predecessor of the coord check.
- template/curry_train/prevalidate/mup_coord.py — the coord-check implementation.