From curry-train
Verify that activation statistics and gradient magnitudes are width-invariant under muP parameterization, so that hyperparameters tuned at small width transfer zero-shot to large width. Activate when the user asks "muP", "muTransfer", "tune small predict big", "coord check", "width-invariant init", or before any large-scale training where they want to skip per-width hyperparameter tuning.
`npx claudepluginhub curryfromuestc/curry-train --plugin curry-train`

This skill uses the workspace's default tool permissions.
A diagnostic that proves your model is muP-compliant — meaning hyperparameters (LR, init scale, multipliers) tuned at, say, width 256 transfer zero-shot to width 4096. This is **the** way to avoid re-tuning at every model size.
"If I tune learning rate at a small width, will the same learning rate be optimal at a 16× larger width?"
If muP coord check passes: yes, and you can save enormous compute. If it fails: tuning at small scale predicts nothing, and you are back to per-width sweeps.
muP (maximal update parameterization) is a set of init-scale and learning-rate multipliers that keep activations and updates O(1) per coordinate as width grows. The key consequence: hyperparameter optima become approximately width-independent, so tuning at a small width predicts the optimum at a large one.
The coord check is the test that confirms muP is correctly implemented for your specific model.
If they do not overlap, muP is not correctly implemented in your model — and per-width transfer will not work.
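As a concrete illustration, here is a minimal sketch of Adam-style muP multipliers for hidden (matrix-like) weights. The helper name `mup_scales` and the flat dict format are assumptions for illustration; the exact per-parameter-type rules are in the Tensor Programs V appendix:

```python
import math

def mup_scales(width, base_width=256, base_lr=1e-3):
    """Hypothetical helper: Adam-style muP multipliers for hidden weights.
    Init std shrinks like 1/sqrt(fan_in); the hidden learning rate and the
    output-logit multiplier shrink like 1/width relative to base_width."""
    m = width / base_width  # width multiplier
    return {
        "init_std": 1.0 / math.sqrt(width),  # per-coordinate init scale
        "hidden_lr": base_lr / m,            # Adam LR for hidden matrices
        "output_mult": 1.0 / m,              # multiplier on output logits
    }

# Doubling the width halves the hidden LR and the output multiplier.
print(mup_scales(512))
```

This is why tuning `base_lr` at `base_width` is meaningful: the multipliers, not a fresh sweep, carry it to larger widths.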
For a transformer block at step k, record per layer:

- activation std (per coordinate)
- gradient magnitude
- relative weight drift ‖W_k − W_0‖_F / ‖W_0‖_F

Each of these should be similar across widths; their derivative with respect to width should be ~0. If activation std at width 1024 is 1.4× that at width 256, muP is broken.
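The drift ratio can be sketched framework-free; the helper name and the list-of-lists weight format here are assumptions for illustration:

```python
import math

def rel_frobenius_drift(w0, wk):
    """Relative weight drift ||W_k - W_0||_F / ||W_0||_F,
    with weights as nested lists of floats."""
    num = math.sqrt(sum((a - b) ** 2
                        for r0, rk in zip(w0, wk)
                        for a, b in zip(r0, rk)))
    den = math.sqrt(sum(a ** 2 for row in w0 for a in row))
    return num / den

# One off-diagonal change of size 1 against an identity-like 2x2 weight:
print(rel_frobenius_drift([[1, 0], [0, 1]], [[1, 1], [0, 1]]))  # → ~0.707
```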
Lives at template/curry_train/prevalidate/mup_coord.py. Sketch:
```python
from collections import defaultdict

def coord_check(build_model_at_width, widths=(256, 512, 1024),
                num_steps=3, batch_size=8):
    """Run a few training steps at each width and record per-layer stats.

    Returns dict[width -> dict[layer_name -> dict[step -> stats]]].
    Caller plots width-overlaid curves.
    """
    results = {}
    for w in widths:
        model = build_model_at_width(w)
        opt = ...  # muP-aware optimizer; see Tensor Programs V appendix
        records = defaultdict(dict)
        # register forward hooks here to record per-layer activation std
        for step in range(num_steps):
            x, y = make_batch(batch_size)
            opt.zero_grad()
            out = model(x)
            loss = loss_fn(out, y)
            loss.backward()
            opt.step()
            for name, mod in model.named_modules():
                ...  # record activation std for this layer at this step
        results[w] = records
    return results
```
The reference for what build_model_at_width should look like is the Tensor Programs V paper (Yang et al. 2022) and EleutherAI's practitioner blog post.
1. Confirm the user's model is intended to be muP-parameterized. If they're using standard PyTorch init, don't run a coord check — instead point them to the muP literature first.
2. Have them factor their model so that the architecture is parameterized by width: build_model(width=...). Most transformers can be parameterized this way without code surgery.
3. Run the coord check at three widths (256, 512, 1024). Plot per-layer activation std for each width, overlaid.
4. Pass criterion: the width-1024 curve is within ~10% of the width-256 curve at every layer index. If not, list the layers where they diverge — those layers are the bug.
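The pass criterion reduces to a small comparison helper; the function name and the flat {layer: std} format are assumptions for illustration:

```python
def divergent_layers(stats_small, stats_large, tol=0.10):
    """Return layers where the large-width activation std deviates from
    the small-width std by more than tol (relative)."""
    return [layer for layer, s in stats_small.items()
            if abs(stats_large[layer] - s) > tol * s]

# "mlp.out" is 1.4x its small-width std, so it is flagged as the bug:
print(divergent_layers({"attn.qk": 1.0, "mlp.out": 1.0},
                       {"attn.qk": 1.05, "mlp.out": 1.4}))  # → ['mlp.out']
```

The returned layer names are exactly the list to hand back to the user as the likely location of the parameterization bug.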
Common bugs surfaced:
- An init or multiplier missing its 1/sqrt(d) factor.
- Attention scaling Q^T K / sqrt(d) missing, or using the wrong base width.

Stage 3 is about proving an idea is worth scaling. If the user wants to use small-scale ablations to predict large-scale behavior, the only honest way is muP-correct training. Otherwise, small-scale results don't predict large-scale ones.
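On the attention-scaling bug: muP prescribes 1/d rather than 1/sqrt(d) scaling of attention logits in width. One common convention (an assumption here; check it against Tensor Programs V and your codebase) uses sqrt(base_d)/d, which scales like 1/d in width while reducing to the standard 1/sqrt(d) exactly at the base head dimension, so behavior at the tuning width is unchanged:

```python
import math

def attn_scale(head_dim, base_head_dim=64, mup=True):
    """Attention logit scale. Standard parameterization: 1/sqrt(d).
    muP-style (assumed sqrt(base_d)/d convention): matches the standard
    scale at head_dim == base_head_dim, then decays like 1/d."""
    if not mup:
        return 1.0 / math.sqrt(head_dim)
    return math.sqrt(base_head_dim) / head_dim

# At the base head dim both agree; at 2x width the muP scale is 1/d-like.
print(attn_scale(64), attn_scale(64, mup=False), attn_scale(128))
```

Using a wrong `base_head_dim` here is exactly the "wrong base width" failure the coord check surfaces.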
Coord check is the gate: if it fails, scaling decisions based on small-scale numbers are unreliable.
- skills/stage3-scaling-fit — fits the predicted-vs-actual curve at multiple sizes; muP makes the prediction defensible.
- skills/stage2-grad-flow-viz — informal predecessor of the coord check.
- template/curry_train/prevalidate/mup_coord.py — the coord-check implementation.