From curry-train
A standard warmup-then-cosine learning rate schedule that prevents early divergence and produces stable long-run training. Activate when the user asks "what learning rate schedule", "warmup", "cosine schedule", "no warmup is bad", "schedule diverges at start", or before any long run.
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train
This skill uses the workspace's default tool permissions.
The default LR schedule for transformer training: a short linear warmup followed by a cosine decay to a small floor. Robust, well-understood, and almost always a sensible default.
"What LR schedule should I use for a long, stable run?"
peak_lr → from `stage3-lr-range-test`, divided by 10
warmup → linear from 0 to peak_lr over K_warmup steps
decay → cosine from peak_lr to floor_lr (typically peak_lr / 10) over remaining steps
floor → constant floor_lr after decay completes
Total schedule length = warmup + decay (+ optional flat tail).
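The pieces above combine into a single piecewise function (same names as above; T is the total schedule length in steps, t the optimizer step):

$$
\mathrm{lr}(t) =
\begin{cases}
\mathrm{peak\_lr}\cdot \dfrac{t}{K_{\mathrm{warmup}}} & 0 \le t < K_{\mathrm{warmup}} \\[6pt]
\mathrm{floor\_lr} + \left(\mathrm{peak\_lr} - \mathrm{floor\_lr}\right)\cdot \tfrac{1}{2}\left(1 + \cos\!\left(\pi\,\dfrac{t - K_{\mathrm{warmup}}}{T - K_{\mathrm{warmup}}}\right)\right) & K_{\mathrm{warmup}} \le t < T \\[6pt]
\mathrm{floor\_lr} & t \ge T
\end{cases}
$$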
At step 0 with Adam, the second-moment estimate v is still close to zero, so the effective step size is large for any nonzero gradient. Without warmup, this routinely produces early loss spikes and outright divergence in the first few hundred steps.
Warmup (linear from 0) tames this. Aggressive Adam-based training without warmup is roulette.
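A quick way to see the effect: at the very first Adam update (with bias correction), m_hat / sqrt(v_hat) ≈ sign(g), so every parameter moves by roughly the full lr no matter how tiny or huge its gradient is. A toy sketch, assuming default Adam settings and a hypothetical tensor:

import torch

p = torch.nn.Parameter(torch.zeros(4))
opt = torch.optim.Adam([p], lr=1e-3)
p.grad = torch.tensor([1e-6, 1e-2, 1.0, 100.0])  # gradients spanning eight orders of magnitude
before = p.detach().clone()
opt.step()
print(p.detach() - before)  # every entry is ~ -1e-3: the full lr, regardless of gradient scale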
The suggested warmup lengths are not magic numbers. The right warmup is "long enough that the grad norm during warmup is stable". Watch the grad norm in the early steps; if it's bouncing, extend the warmup.
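One way to watch it, sketched for a standard PyTorch loop (model, step, and warmup_steps are assumed to already be in scope):

import torch

# max_norm=inf computes and returns the total grad norm without actually clipping.
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=float("inf"))
if step < warmup_steps and step % 10 == 0:
    print(f"step {step}: grad_norm={float(grad_norm):.3f}")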
Cosine decay is empirically the most robust default among the common alternatives (linear decay, step decay, inverse-square-root, constant LR).
Cosine has no hyperparameters of its own (just the two endpoints), is monotonically decreasing, and leaves the model time near the floor LR for fine convergence.
import math
from torch.optim.lr_scheduler import LambdaLR

def warmup_cosine_schedule(optimizer, peak_lr, total_steps, warmup_steps,
                           floor_ratio=0.1):
    """Multiplier-based schedule: returns a LambdaLR.

    multiplier(step) = step / warmup_steps                                   if step < warmup_steps
                     = floor + (1 - floor) * 0.5 * (1 + cos(progress * pi))  otherwise
    where progress = (step - warmup_steps) / (total_steps - warmup_steps).

    peak_lr is informational here: set it as the optimizer's lr; this scheduler
    only returns a multiplier on top of it.
    """
    def lr_lambda(step):
        if step < warmup_steps:
            return step / warmup_steps
        if step >= total_steps:
            return floor_ratio
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        return floor_ratio + (1 - floor_ratio) * 0.5 * (1 + math.cos(progress * math.pi))

    return LambdaLR(optimizer, lr_lambda)
The peak_lr is set as the optimizer's lr argument; the scheduler returns a multiplier.
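A usage sketch (hypothetical numbers; model and train_step are assumed to exist and are not part of this skill):

import torch

peak_lr = 3e-4          # hypothetical: LR-range-test result divided by 10
total_steps = 100_000   # hypothetical run length
warmup_steps = 1_000

optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)  # peak_lr goes on the optimizer
scheduler = warmup_cosine_schedule(optimizer, peak_lr, total_steps, warmup_steps)

for step in range(total_steps):
    loss = train_step()   # assumed: runs a forward pass and returns the loss
    loss.backward()
    optimizer.step()
    scheduler.step()      # once per optimizer step, after optimizer.step()
    optimizer.zero_grad()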
Confirm peak_lr came from stage3-lr-range-test divided by 10. If the user picked a number from a paper, verify it's reasonable for their architecture and batch size.
Pick warmup_steps based on model size from the table above. If unsure, default to 1000 — it's almost never wrong by enough to matter.
Set floor_ratio to 0.1 unless the user has a specific reason for a different floor. Floor below 0.01× peak_lr rarely helps.
Implement the schedule in the training entry point. Confirm it by dry-stepping the scheduler (no training needed) and printing the lr at steps 0, warmup_steps/2, warmup_steps, and total_steps/2. The values should match the formula.
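A verification sketch (dummy parameter; reuses warmup_cosine_schedule and assumes peak_lr, total_steps, and warmup_steps are already set):

import torch

dummy = torch.nn.Parameter(torch.zeros(1))
opt = torch.optim.AdamW([dummy], lr=peak_lr)
sched = warmup_cosine_schedule(opt, peak_lr, total_steps, warmup_steps)

check_at = {0, warmup_steps // 2, warmup_steps, total_steps // 2}
for step in range(total_steps):
    if step in check_at:
        print(f"step {step}: lr={opt.param_groups[0]['lr']:.6g}")
    opt.step()
    sched.step()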
After the first long run, add an LR curve to the run journal and compare it across runs in runs-diff.
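A minimal logging sketch (hypothetical lr_curve.csv path; the run journal and runs-diff formats are whatever your workspace uses):

# Inside the training loop; in practice open the file once outside the loop.
with open("lr_curve.csv", "a") as f:
    f.write(f"{step},{optimizer.param_groups[0]['lr']}\n")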
Prerequisite: stage3-lr-range-test — the schedule needs a defensible peak LR as input.
Pitfall: setting total_steps to a different value than the actual run length → the schedule either never reaches the floor or decays too fast.
skills/stage3-lr-range-test — produces the peak_lr.
skills/stage5-loss-spike-rollback — handles early divergence if warmup is insufficient.
skills/stage5-checkpoint-cadence — the schedule must coordinate with checkpoint resume.
get_cosine_schedule_with_warmup (Hugging Face transformers) — reference implementation.