From curry-train
A standard warmup-then-cosine learning rate schedule that prevents early divergence and produces stable long-run training. Activate when the user asks "what learning rate schedule", "warmup", "cosine schedule", "no warmup is bad", "schedule diverges at start", or before any long run.
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train
This skill uses the workspace's default tool permissions.
The default LR schedule for transformer training: a short linear warmup followed by a cosine decay to a small floor. Robust, well-understood, and almost always a sensible default.
"What LR schedule should I use for a long, stable run?"
peak_lr → from `stage3-lr-range-test`, divided by 10
warmup → linear from 0 to peak_lr over K_warmup steps
decay → cosine from peak_lr to floor_lr (typically peak_lr / 10) over remaining steps
floor → constant floor_lr after decay completes
Total schedule length = warmup + decay (+ optional flat tail).
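The pieces above combine into a single piecewise function (same names as above; T is the total schedule length in steps, t the optimizer step):

$$
\mathrm{lr}(t) =
\begin{cases}
\mathrm{peak\_lr}\cdot \dfrac{t}{K_{\mathrm{warmup}}} & 0 \le t < K_{\mathrm{warmup}} \\[6pt]
\mathrm{floor\_lr} + \left(\mathrm{peak\_lr} - \mathrm{floor\_lr}\right)\cdot \tfrac{1}{2}\left(1 + \cos\!\left(\pi\,\dfrac{t - K_{\mathrm{warmup}}}{T - K_{\mathrm{warmup}}}\right)\right) & K_{\mathrm{warmup}} \le t < T \\[6pt]
\mathrm{floor\_lr} & t \ge T
\end{cases}
$$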
At step 0 with Adam, the second-moment estimate v is still close to zero, so the effective step size is large for any nonzero gradient. Without warmup, this routinely produces early loss spikes and outright divergence in the first few hundred steps.
Warmup (linear from 0) tames this. Aggressive Adam-based training without warmup is roulette.
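A quick way to see the effect: at the very first Adam update (with bias correction), m_hat / sqrt(v_hat) ≈ sign(g), so every parameter moves by roughly the full lr no matter how tiny or huge its gradient is. A toy sketch, assuming default Adam settings and a hypothetical tensor:

import torch

p = torch.nn.Parameter(torch.zeros(4))
opt = torch.optim.Adam([p], lr=1e-3)
p.grad = torch.tensor([1e-6, 1e-2, 1.0, 100.0])  # gradients spanning eight orders of magnitude
before = p.detach().clone()
opt.step()
print(p.detach() - before)  # every entry is ~ -1e-3: the full lr, regardless of gradient scale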
The suggested warmup lengths are not magic numbers. The right warmup is "long enough that the grad norm during warmup is stable". Watch the grad norm in the early steps; if it's bouncing, extend the warmup.
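One way to watch it, sketched for a standard PyTorch loop (model, step, and warmup_steps are assumed to already be in scope):

import torch

# max_norm=inf computes and returns the total grad norm without actually clipping.
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=float("inf"))
if step < warmup_steps and step % 10 == 0:
    print(f"step {step}: grad_norm={float(grad_norm):.3f}")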
Cosine decay is empirically the most robust default among the common alternatives (linear decay, step decay, inverse-square-root, constant LR).
Cosine has no hyperparameters of its own (just the two endpoints), is monotonically decreasing, and leaves the model time near the floor LR for fine convergence.
import math
from torch.optim.lr_scheduler import LambdaLR

def warmup_cosine_schedule(optimizer, peak_lr, total_steps, warmup_steps,
                           floor_ratio=0.1):
    """Multiplier-based schedule: returns a LambdaLR.

    multiplier(step) = step / warmup_steps                                   if step < warmup_steps
                     = floor + (1 - floor) * 0.5 * (1 + cos(progress * pi))  otherwise
    where progress = (step - warmup_steps) / (total_steps - warmup_steps).

    peak_lr is informational here: set it as the optimizer's lr; this scheduler
    only returns a multiplier on top of it.
    """
    def lr_lambda(step):
        if step < warmup_steps:
            return step / warmup_steps
        if step >= total_steps:
            return floor_ratio
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        return floor_ratio + (1 - floor_ratio) * 0.5 * (1 + math.cos(progress * math.pi))

    return LambdaLR(optimizer, lr_lambda)
The peak_lr is set as the optimizer's lr argument; the scheduler returns a multiplier.
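A usage sketch (hypothetical numbers; model and train_step are assumed to exist and are not part of this skill):

import torch

peak_lr = 3e-4          # hypothetical: LR-range-test result divided by 10
total_steps = 100_000   # hypothetical run length
warmup_steps = 1_000

optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)  # peak_lr goes on the optimizer
scheduler = warmup_cosine_schedule(optimizer, peak_lr, total_steps, warmup_steps)

for step in range(total_steps):
    loss = train_step()   # assumed: runs a forward pass and returns the loss
    loss.backward()
    optimizer.step()
    scheduler.step()      # once per optimizer step, after optimizer.step()
    optimizer.zero_grad()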
Confirm peak_lr came from stage3-lr-range-test divided by 10. If the user picked a number from a paper, verify it's reasonable for their architecture and batch size.
Pick warmup_steps based on model size from the table above. If unsure, default to 1000 — it's almost never wrong by enough to matter.
Set floor_ratio to 0.1 unless the user has a specific reason for a different floor. Floor below 0.01× peak_lr rarely helps.
Implement the schedule in the training entry point. Confirm it by dry-stepping the scheduler (no training needed) and printing the lr at steps 0, warmup_steps/2, warmup_steps, and total_steps/2. The values should match the formula.
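A verification sketch (dummy parameter; reuses warmup_cosine_schedule and assumes peak_lr, total_steps, and warmup_steps are already set):

import torch

dummy = torch.nn.Parameter(torch.zeros(1))
opt = torch.optim.AdamW([dummy], lr=peak_lr)
sched = warmup_cosine_schedule(opt, peak_lr, total_steps, warmup_steps)

check_at = {0, warmup_steps // 2, warmup_steps, total_steps // 2}
for step in range(total_steps):
    if step in check_at:
        print(f"step {step}: lr={opt.param_groups[0]['lr']:.6g}")
    opt.step()
    sched.step()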
After the first long run, add an LR curve to the run journal and compare it across runs in runs-diff.
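A minimal logging sketch (hypothetical lr_curve.csv path; the run journal and runs-diff formats are whatever your workspace uses):

# Inside the training loop; in practice open the file once outside the loop.
with open("lr_curve.csv", "a") as f:
    f.write(f"{step},{optimizer.param_groups[0]['lr']}\n")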
Prerequisite: stage3-lr-range-test — the schedule needs a defensible peak LR as input.
Pitfall: setting total_steps to a different value than the actual run length → the schedule either never reaches the floor or decays too fast.
skills/stage3-lr-range-test — produces the peak_lr.
skills/stage5-loss-spike-rollback — handles early divergence if warmup is insufficient.
skills/stage5-checkpoint-cadence — the schedule must coordinate with checkpoint resume.
get_cosine_schedule_with_warmup (Hugging Face transformers) — reference implementation.