From curry-train
Root Mean Square LayerNorm — drop the mean-subtraction from LayerNorm, keep only the RMS-based scaling. Used by Llama, Qwen, and most modern LLMs. Activate when the user asks "RMSNorm", "Llama norm", "RMS layer norm", "skip mean centering", or compares RMSNorm vs LayerNorm.
`npx claudepluginhub curryfromuestc/curry-train --plugin curry-train`

This skill uses the workspace's default tool permissions.
A simplified LayerNorm: divide by the root-mean-square of the activations along the feature dimension, then scale by a learnable per-feature weight. No mean subtraction, no learnable bias. Standard component in Llama-family models.
y = x / sqrt(mean(x²) + eps) * gamma
where gamma is a learnable per-feature scale (initialized to 1).
vs LayerNorm:
y = (x - mean(x)) / sqrt(var(x) + eps) * gamma + beta

RMSNorm drops mean(x) and beta, keeping only the RMS-based scaling.
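To make the comparison concrete, here is an illustrative PyTorch sketch (not part of the primitive; gamma and beta are left at their identity initializations):

```python
import torch

x = torch.randn(4, 2048)
eps = 1e-6

# LayerNorm: subtract the per-row mean, then divide by the standard deviation.
mean = x.mean(-1, keepdim=True)
layernorm = (x - mean) / torch.sqrt(x.var(-1, unbiased=False, keepdim=True) + eps)

# RMSNorm: skip the centering, divide by the root mean square.
rmsnorm = x / torch.sqrt(x.pow(2).mean(-1, keepdim=True) + eps)

# The two coincide whenever each row already has zero mean.
x0 = x - mean
ln0 = (x0 - x0.mean(-1, keepdim=True)) / torch.sqrt(x0.var(-1, unbiased=False, keepdim=True) + eps)
rms0 = x0 / torch.sqrt(x0.pow(2).mean(-1, keepdim=True) + eps)
print(torch.allclose(ln0, rms0, atol=1e-5))  # True
```

Usage of the curry-train primitive: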
```python
from curry_train.primitives import RMSNorm

norm = RMSNorm(d_model=2048, eps=1e-6)
y = norm(x)  # shape preserved
```
- If you need mean subtraction or a learnable bias, use the LayerNorm primitive for those.
- Compute mean(x², dim=-1, keepdim=True) in fp32 even when activations are bf16, to avoid underflow.
- eps = 1e-6 is the Llama default; some codebases use 1e-5.

# Reference
```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))  # per-feature scale, init to 1
        self.eps = eps

    def forward(self, x):
        # Accumulate the squared mean in fp32, then cast the scale back to x's dtype.
        norm = x.float().pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return (x * norm.to(x.dtype)) * self.weight
```
Because the only statistic is a mean of squares over d_model, the Welford-style numerical stability of LayerNorm doesn't apply; RMSNorm's single reduction is fine in fp32.

V1: stub at template/curry_train/primitives/rmsnorm.py. PyTorch 2.4+ has torch.nn.functional.rms_norm; HuggingFace's LlamaRMSNorm is the canonical reference.
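As a quick sanity check (a sketch assuming PyTorch >= 2.4 and the import path shown above), the primitive should agree with the built-in functional op in fp32:

```python
import torch
import torch.nn.functional as F

from curry_train.primitives import RMSNorm

# Compare the curry-train primitive against torch.nn.functional.rms_norm
# (available in PyTorch 2.4+). In fp32 the two should match closely.
d_model = 2048
norm = RMSNorm(d_model=d_model, eps=1e-6)

x = torch.randn(3, 16, d_model)
ref = F.rms_norm(x, (d_model,), weight=norm.weight, eps=1e-6)
print(torch.allclose(norm(x), ref, atol=1e-5))  # expected: True
```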
- skills/primitive-gqattention: surrounded by RMSNorm in the pre-norm transformer layout (see the sketch below).
- template/curry_train/primitives/norms.py (V2+ extension).
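For context, a minimal sketch of that pre-norm layout (the `attn` and `mlp` submodules here are placeholders, not curry-train APIs):

```python
import torch.nn as nn

from curry_train.primitives import RMSNorm


class PreNormBlock(nn.Module):
    """Sketch of a pre-norm transformer block: each sublayer reads a normalized
    copy of the residual stream; the residuals themselves are never normalized.
    `attn` and `mlp` are placeholder submodules."""

    def __init__(self, d_model, attn, mlp, eps=1e-6):
        super().__init__()
        self.attn_norm = RMSNorm(d_model=d_model, eps=eps)
        self.mlp_norm = RMSNorm(d_model=d_model, eps=eps)
        self.attn = attn
        self.mlp = mlp

    def forward(self, x):
        x = x + self.attn(self.attn_norm(x))  # norm -> attention -> residual add
        x = x + self.mlp(self.mlp_norm(x))    # norm -> MLP -> residual add
        return x
```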