From curry-train
Root Mean Square LayerNorm — drop the mean-subtraction from LayerNorm, keep only the RMS-based scaling. Used by Llama, Qwen, and most modern LLMs. Activate when the user asks "RMSNorm", "Llama norm", "RMS layer norm", "skip mean centering", or compares RMSNorm vs LayerNorm.
`npx claudepluginhub curryfromuestc/curry-train --plugin curry-train`

This skill uses the workspace's default tool permissions.
A simplified LayerNorm: divide by the root-mean-square of the activations along the feature dimension, then scale by a learnable per-feature weight. No mean subtraction, no learnable bias. Standard component in Llama-family models.
y = x / sqrt(mean(x²) + eps) * gamma
where gamma is a learnable per-feature scale (initialized to 1).
vs LayerNorm:
y = (x - mean(x)) / sqrt(var(x) + eps) * gamma + beta

RMSNorm drops mean(x) and beta, keeping only the RMS-based scaling.
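To make the comparison concrete, here is an illustrative PyTorch sketch (not part of the primitive; gamma and beta are left at their identity initializations):

```python
import torch

x = torch.randn(4, 2048)
eps = 1e-6

# LayerNorm: subtract the per-row mean, then divide by the standard deviation.
mean = x.mean(-1, keepdim=True)
layernorm = (x - mean) / torch.sqrt(x.var(-1, unbiased=False, keepdim=True) + eps)

# RMSNorm: skip the centering, divide by the root mean square.
rmsnorm = x / torch.sqrt(x.pow(2).mean(-1, keepdim=True) + eps)

# The two coincide whenever each row already has zero mean.
x0 = x - mean
ln0 = (x0 - x0.mean(-1, keepdim=True)) / torch.sqrt(x0.var(-1, unbiased=False, keepdim=True) + eps)
rms0 = x0 / torch.sqrt(x0.pow(2).mean(-1, keepdim=True) + eps)
print(torch.allclose(ln0, rms0, atol=1e-5))  # True
```

Usage of the curry-train primitive: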
```python
from curry_train.primitives import RMSNorm

norm = RMSNorm(d_model=2048, eps=1e-6)
y = norm(x)  # shape preserved
```
- If you need mean subtraction or a learnable bias, use the LayerNorm primitive for those.
- Compute mean(x², dim=-1, keepdim=True) in fp32 even when activations are bf16, to avoid underflow.
- eps = 1e-6 is the Llama default; some codebases use 1e-5.

# Reference
```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))  # per-feature scale, init to 1
        self.eps = eps

    def forward(self, x):
        # Accumulate the squared mean in fp32, then cast the scale back to x's dtype.
        norm = x.float().pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return (x * norm.to(x.dtype)) * self.weight
```
Because the only statistic is a mean of squares over d_model, the Welford-style numerical stability of LayerNorm doesn't apply; RMSNorm's single reduction is fine in fp32.

V1: stub at template/curry_train/primitives/rmsnorm.py. PyTorch 2.4+ has torch.nn.functional.rms_norm; HuggingFace's LlamaRMSNorm is the canonical reference.
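As a quick sanity check (a sketch assuming PyTorch >= 2.4 and the import path shown above), the primitive should agree with the built-in functional op in fp32:

```python
import torch
import torch.nn.functional as F

from curry_train.primitives import RMSNorm

# Compare the curry-train primitive against torch.nn.functional.rms_norm
# (available in PyTorch 2.4+). In fp32 the two should match closely.
d_model = 2048
norm = RMSNorm(d_model=d_model, eps=1e-6)

x = torch.randn(3, 16, d_model)
ref = F.rms_norm(x, (d_model,), weight=norm.weight, eps=1e-6)
print(torch.allclose(norm(x), ref, atol=1e-5))  # expected: True
```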
- skills/primitive-gqattention: surrounded by RMSNorm in the pre-norm transformer layout (see the sketch below).
- template/curry_train/primitives/norms.py (V2+ extension).
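For context, a minimal sketch of that pre-norm layout (the `attn` and `mlp` submodules here are placeholders, not curry-train APIs):

```python
import torch.nn as nn

from curry_train.primitives import RMSNorm


class PreNormBlock(nn.Module):
    """Sketch of a pre-norm transformer block: each sublayer reads a normalized
    copy of the residual stream; the residuals themselves are never normalized.
    `attn` and `mlp` are placeholder submodules."""

    def __init__(self, d_model, attn, mlp, eps=1e-6):
        super().__init__()
        self.attn_norm = RMSNorm(d_model=d_model, eps=eps)
        self.mlp_norm = RMSNorm(d_model=d_model, eps=eps)
        self.attn = attn
        self.mlp = mlp

    def forward(self, x):
        x = x + self.attn(self.attn_norm(x))  # norm -> attention -> residual add
        x = x + self.mlp(self.mlp_norm(x))    # norm -> MLP -> residual add
        return x
```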