From curry-train
Per-layer histograms of gradient magnitudes and activation statistics — used to detect dead layers, exploding gradients, or pathological depth scaling early. Activate when the user asks "is my gradient flow healthy", "are any layers dead", "exploding gradients", "vanishing gradients", or "what's a coord check" (related to muP).
Install:

npx claudepluginhub curryfromuestc/curry-train --plugin curry-train

This skill uses the workspace's default tool permissions.
A few-second-per-step diagnostic that surfaces dead layers, exploding gradients, and depth-scaling pathologies before they cause silent training failures.
"Are gradients flowing through every layer at the right scale?"
For each parameter (or each named module), at the first training step (or every K steps as a watchdog), record:
| Quantity | What it tells you |
|---|---|
| ‖p.grad‖₂ (per parameter) | Whether the layer is learning at all. Zero → dead layer. |
| ‖p.grad‖₂ / ‖p‖₂ (per parameter) | Relative update size. Healthy ≈ O(1e-3 to 1e-1). |
| mean(activation), std(activation) (per module forward output) | Whether activations stay O(1) across depth. |
| max(\|activation\|) (per module forward output) | Peak activation magnitude; sudden spikes suggest saturation or blow-up. |
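A minimal sketch of computing the table's quantities for a single layer (names here are illustrative, not from the curry-train codebase; a plain `nn.Linear` stands in for a real module):

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(8, 8)
x = torch.randn(4, 8)
out = layer(x)
out.sum().backward()  # toy loss, just to populate .grad

for name, p in layer.named_parameters():
    grad_norm = p.grad.norm().item()           # ||p.grad||_2
    rel_update = grad_norm / p.norm().item()   # ||p.grad||_2 / ||p||_2
    print(f"{name}: grad_norm={grad_norm:.3e} rel={rel_update:.3e}")

# Activation statistics from the module's forward output.
print(f"act mean={out.mean().item():.3f} std={out.std().item():.3f} "
      f"max_abs={out.abs().max().item():.3f}")
```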
Plot or table these by layer index. Two visual signals are diagnostic:
- Vanishing: ‖p.grad‖ shrinks by orders of magnitude with depth. Likely missing residual connections, or wrong init.
- Exploding: ‖p.grad‖ grows with depth. Likely missing layer norms, or an activation function with no bounded output.

A healthy modern transformer has activation std stable across depth (the "coord check" property in muP).
For models that intend to use muP for scale transfer (stage3-mup-coord-check), the per-coordinate statistics must be width-invariant. Specifically: training a width-256 model and a width-1024 model with muP should give the same activation std at every layer index, after the first few steps.
If this is not the case, muP transfer will not work. The grad flow visualization is the diagnostic that surfaces the problem.
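The width comparison can be sketched as follows. This is a toy MLP with standard (non-muP) init, so the two widths will generally *not* agree; it illustrates only the procedure of recording per-layer activation std at two widths and comparing. All names are illustrative.

```python
import torch

def act_stds(width, depth=4, seed=0):
    # Record per-layer activation std for a toy MLP of the given width.
    torch.manual_seed(seed)
    layers = [torch.nn.Linear(width, width) for _ in range(depth)]
    x = torch.randn(32, width)
    stds = []
    for layer in layers:
        x = torch.relu(layer(x))
        stds.append(x.std().item())
    return stds

narrow, wide = act_stds(256), act_stds(1024)
# Under muP, these per-layer stds should match across widths after a
# few training steps; a large mismatch means scale transfer won't work.
for i, (a, b) in enumerate(zip(narrow, wide)):
    print(f"layer {i}: width-256 std={a:.3f}  width-1024 std={b:.3f}")
```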
import torch

class GradFlowProbe:
    """Record per-parameter grad norms and per-module activation stats."""
    def __init__(self, model):
        self.records = []
        for name, p in model.named_parameters():
            if p.requires_grad:
                # n=name binds the current name into each hook's closure.
                p.register_hook(lambda g, n=name: self._record_grad(n, g))
        for name, m in model.named_modules():
            m.register_forward_hook(
                lambda mod, inp, out, n=name: self._record_act(n, out)
            )
    def _record_grad(self, name, g):
        self.records.append(("grad", name, g.norm().item()))
    def _record_act(self, name, out):
        # Some modules return tuples or None; only record plain tensors.
        if isinstance(out, torch.Tensor):
            self.records.append(("act", name, out.std().item(), out.mean().item()))
Implementations live at template/curry_train/infra/preflight.py:probe_grad_flow.
1. Wire up the probe to a single training step (forward + backward only; no optimizer step needed).
2. Group results by layer index. For a transformer, "layer index" is the depth of each block.
3. Plot grad norm vs. layer index on a log scale. Plot activation std vs. layer index on a linear scale.
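These steps can be sketched end to end on a toy 3-block encoder. All names here are illustrative assumptions, not the curry-train `probe_grad_flow` implementation; grouping by block uses the `layers.<i>.` prefix in PyTorch's parameter names:

```python
import re
import torch
from collections import defaultdict

torch.manual_seed(0)
model = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True),
    num_layers=3,
)
x = torch.randn(2, 5, 32)
model(x).sum().backward()  # forward + backward only; no optimizer step

# Group per-parameter grad norms by block (layer) index.
by_layer = defaultdict(list)
for name, p in model.named_parameters():
    m = re.match(r"layers\.(\d+)\.", name)
    if m:
        by_layer[int(m.group(1))].append(p.grad.norm().item())

for idx in sorted(by_layer):
    total = sum(g ** 2 for g in by_layer[idx]) ** 0.5
    print(f"block {idx}: grad norm {total:.3e}")
```

Feeding `total` per block into a log-scale plot gives the vanishing/exploding picture described above.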
Diagnosis:
- Check optimizer.param_groups (see stage1-preflight-asserts, assert 8).
- Check nn.init overrides.
- If the user is targeting muP, confirm activation std is flat across depth within a single width, and flat across widths after a few steps. If not, the model is not muP-compliant.
Common causes:

- Missing nn.LayerNorm after attention → activation std drift with depth.
- A frozen parameter (requires_grad=False) leaked into a "learning" config → dead layer.

Related skills:

- skills/stage1-preflight-asserts (assert 5).
- skills/stage2-overfit-single-batch: should pass if grad flow is healthy.
- skills/stage3-mup-coord-check: formal muP coord check, builds on this skill.