From curry-train
Per-layer histograms of gradient magnitudes and activation statistics — used to detect dead layers, exploding gradients, or pathological depth scaling early. Activate when the user asks "is my gradient flow healthy", "are any layers dead", "exploding gradients", "vanishing gradients", or "what's a coord check" (related to muP).
Install:

npx claudepluginhub curryfromuestc/curry-train --plugin curry-train

This skill uses the workspace's default tool permissions.
A few-second-per-step diagnostic that surfaces dead layers, exploding gradients, and depth-scaling pathologies before they cause silent training failures.
"Are gradients flowing through every layer at the right scale?"
For each parameter (or each named module), at the first training step (or every K steps as a watchdog), record:
| Quantity | What it tells you |
|---|---|
| ‖p.grad‖₂ (per parameter) | Whether the layer is learning at all. Zero → dead layer. |
| ‖p.grad‖₂ / ‖p‖₂ (per parameter) | Relative update size. Healthy ≈ O(1e-3 to 1e-1). |
| mean(activation), std(activation) (per module forward output) | Whether activations stay O(1) across depth. |
| max(\|activation\|) (per module forward output) | Peak activation magnitude; sudden spikes suggest saturation or blow-up. |
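A minimal sketch of computing the table's quantities for a single layer (names here are illustrative, not from the curry-train codebase; a plain `nn.Linear` stands in for a real module):

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(8, 8)
x = torch.randn(4, 8)
out = layer(x)
out.sum().backward()  # toy loss, just to populate .grad

for name, p in layer.named_parameters():
    grad_norm = p.grad.norm().item()           # ||p.grad||_2
    rel_update = grad_norm / p.norm().item()   # ||p.grad||_2 / ||p||_2
    print(f"{name}: grad_norm={grad_norm:.3e} rel={rel_update:.3e}")

# Activation statistics from the module's forward output.
print(f"act mean={out.mean().item():.3f} std={out.std().item():.3f} "
      f"max_abs={out.abs().max().item():.3f}")
```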
Plot or table these by layer index. Two visual signals are diagnostic:
- Vanishing: ‖p.grad‖ shrinks by orders of magnitude with depth. Likely missing residual connections, or wrong init.
- Exploding: ‖p.grad‖ grows with depth. Likely missing layer norms, or an activation function with no bounded output.

A healthy modern transformer has activation std stable across depth (the "coord check" property in muP).
For models that intend to use muP for scale transfer (stage3-mup-coord-check), the per-coordinate statistics must be width-invariant. Specifically: training a width-256 model and a width-1024 model with muP should give the same activation std at every layer index, after the first few steps.
If this is not the case, muP transfer will not work. The grad flow visualization is the diagnostic that surfaces the problem.
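The width comparison can be sketched as follows. This is a toy MLP with standard (non-muP) init, so the two widths will generally *not* agree; it illustrates only the procedure of recording per-layer activation std at two widths and comparing. All names are illustrative.

```python
import torch

def act_stds(width, depth=4, seed=0):
    # Record per-layer activation std for a toy MLP of the given width.
    torch.manual_seed(seed)
    layers = [torch.nn.Linear(width, width) for _ in range(depth)]
    x = torch.randn(32, width)
    stds = []
    for layer in layers:
        x = torch.relu(layer(x))
        stds.append(x.std().item())
    return stds

narrow, wide = act_stds(256), act_stds(1024)
# Under muP, these per-layer stds should match across widths after a
# few training steps; a large mismatch means scale transfer won't work.
for i, (a, b) in enumerate(zip(narrow, wide)):
    print(f"layer {i}: width-256 std={a:.3f}  width-1024 std={b:.3f}")
```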
import torch

class GradFlowProbe:
    """Record per-parameter grad norms and per-module activation stats."""
    def __init__(self, model):
        self.records = []
        for name, p in model.named_parameters():
            if p.requires_grad:
                # n=name binds the current name into each hook's closure.
                p.register_hook(lambda g, n=name: self._record_grad(n, g))
        for name, m in model.named_modules():
            m.register_forward_hook(
                lambda mod, inp, out, n=name: self._record_act(n, out)
            )
    def _record_grad(self, name, g):
        self.records.append(("grad", name, g.norm().item()))
    def _record_act(self, name, out):
        # Some modules return tuples or None; only record plain tensors.
        if isinstance(out, torch.Tensor):
            self.records.append(("act", name, out.std().item(), out.mean().item()))
Implementations live at template/curry_train/infra/preflight.py:probe_grad_flow.
1. Wire up the probe to a single training step (forward + backward only; no optimizer step needed).
2. Group results by layer index. For a transformer, "layer index" is the depth of each block.
3. Plot grad norm vs. layer index on a log scale. Plot activation std vs. layer index on a linear scale.
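These steps can be sketched end to end on a toy 3-block encoder. All names here are illustrative assumptions, not the curry-train `probe_grad_flow` implementation; grouping by block uses the `layers.<i>.` prefix in PyTorch's parameter names:

```python
import re
import torch
from collections import defaultdict

torch.manual_seed(0)
model = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True),
    num_layers=3,
)
x = torch.randn(2, 5, 32)
model(x).sum().backward()  # forward + backward only; no optimizer step

# Group per-parameter grad norms by block (layer) index.
by_layer = defaultdict(list)
for name, p in model.named_parameters():
    m = re.match(r"layers\.(\d+)\.", name)
    if m:
        by_layer[int(m.group(1))].append(p.grad.norm().item())

for idx in sorted(by_layer):
    total = sum(g ** 2 for g in by_layer[idx]) ** 0.5
    print(f"block {idx}: grad norm {total:.3e}")
```

Feeding `total` per block into a log-scale plot gives the vanishing/exploding picture described above.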
Diagnosis:
- Check optimizer.param_groups (see stage1-preflight-asserts, assert 8).
- Check nn.init overrides.
- If the user is targeting muP, confirm activation std is flat across depth within a single width, and flat across widths after a few steps. If not, the model is not muP-compliant.
Common causes:

- Missing nn.LayerNorm after attention → activation std drift with depth.
- A frozen parameter (requires_grad=False) leaked into a "learning" config → dead layer.

Related skills:

- skills/stage1-preflight-asserts (assert 5).
- skills/stage2-overfit-single-batch: should pass if grad flow is healthy.
- skills/stage3-mup-coord-check: formal muP coord check, builds on this skill.