An automated recovery procedure for loss spikes during long-running training — detect a spike, roll back to a recent checkpoint, skip a window of batches, resume. Modeled on the PaLM training paper. Activate when the user asks "loss spike", "training spiked then crashed", "recover from divergence", "PaLM rollback recipe", or experiences instability mid-run.
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train

This skill uses the workspace's default tool permissions.
A specific recovery recipe for training instabilities that produce a loss spike: rather than killing the run or letting it diverge, roll back to a recent checkpoint, skip a small window of training batches, and resume. Empirically robust for transformer training at scale.
"When training spikes, can I recover the run instead of restarting from scratch?"
Often: yes, with the right recipe.
The recipe is appropriate when the run was healthy until a sharp, sustained spike, especially one that is plausibly data-correlated (see the PaLM rationale below).

The recipe is not appropriate for true divergence (kill the run via stage3-kill-criterion and re-tune) or for misconfigurations that should have been caught before launch (stage1-preflight-asserts).

This recipe is non-trivial because it requires a detector for sustained spikes, a checkpoint cadence fine-grained enough to roll back to, and a data loader that can skip a window of batches. The detector:
```python
import collections
from typing import Optional


class LossSpikeWatchdog:
    """Flags a loss spike that is sustained relative to the recent rolling minimum."""

    def __init__(self, *, threshold_ratio=5.0, sustain_steps=100,
                 min_history=1000):
        self.recent = collections.deque(maxlen=min_history)
        self.threshold_ratio = threshold_ratio
        self.sustain_steps = sustain_steps
        self._spike_start = None

    def update(self, loss: float, step: int) -> Optional[dict]:
        self.recent.append(loss)
        if len(self.recent) < self.recent.maxlen:
            # Not enough history yet to establish a baseline.
            return None
        rolling_min = min(self.recent)
        if loss > self.threshold_ratio * rolling_min:
            if self._spike_start is None:
                self._spike_start = step
            if step - self._spike_start >= self.sustain_steps:
                # Spike sustained: trigger rollback.
                trigger_step = self._spike_start
                self._spike_start = None  # reset so one spike fires only once
                return {
                    "rollback_to_step": max(0, trigger_step - 100),
                    "skip_batches": 200,
                    "reason": "loss_spike_sustained",
                }
        else:
            self._spike_start = None
        return None
```
The training loop checks the watchdog every step; on trigger, it loads the rollback checkpoint and applies the data-skip.
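A minimal sketch of that loop, assuming a deterministic step-to-batch mapping; `trainer`, `dataset.batch_at`, `total_steps`, and `load_checkpoint` are hypothetical stand-ins for your own harness, not curry-train APIs:

```python
watchdog = LossSpikeWatchdog()
step = 0
while step < total_steps:
    loss = trainer.train_step(dataset.batch_at(step))  # deterministic step -> batch
    event = watchdog.update(loss, step)
    if event is None:
        step += 1
        continue
    # Restore model/optimizer state, then jump past the suspect data window.
    trainer.load_checkpoint(event["rollback_to_step"])
    step = event["rollback_to_step"] + event["skip_batches"]
    watchdog = LossSpikeWatchdog()  # reset loss history after rollback
```

The jump from rollback_to_step directly to rollback_to_step + skip_batches is what implements the data-skip: the batches in between are never trained on.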
Loss spike rollback and stage3-kill-criterion work together:
A reasonable composite policy: rollback up to 2 times; on the 3rd spike within the same run, kill.
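A sketch of that counter, with the kill expressed as an exception; the class name and the choice to raise are illustrative:

```python
class RollbackBudget:
    """Allow up to max_rollbacks recoveries; the next sustained spike kills the run."""

    def __init__(self, max_rollbacks: int = 2):
        self.max_rollbacks = max_rollbacks
        self.used = 0

    def decide(self, event: dict) -> dict:
        if self.used >= self.max_rollbacks:
            # Third sustained spike in the same run: stop and re-tune
            # (see stage3-kill-criterion).
            raise RuntimeError(f"kill after {self.used} rollbacks: {event['reason']}")
        self.used += 1
        return event  # proceed with this rollback
```

In the loop sketch above, this slots in as `event = budget.decide(event)` just before loading the checkpoint.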
Confirm checkpoint cadence is fine-grained enough. The user needs a checkpoint every ~100 steps for this to work; see stage5-checkpoint-cadence.
Confirm the data loader supports skipping by index. If it's a DataLoader over a deterministic, map-style dataset, this is easy. If it's a stream (WebDataset, MosaicML streaming), confirm the offset semantics before relying on the skip.
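A sketch of both cases, assuming batch-level skip granularity; `skipped_batches` and `skipped_stream` are illustrative helpers:

```python
import itertools

def skipped_batches(dataset, start_batch: int, batch_size: int):
    # Map-style dataset: jump straight to the first sample of the target batch.
    start = start_batch * batch_size
    for i in range(start, len(dataset), batch_size):
        yield [dataset[j] for j in range(i, min(i + batch_size, len(dataset)))]

def skipped_stream(batch_iter, n_skip: int):
    # Pure stream: no random access, so consume and discard n_skip batches.
    return itertools.islice(batch_iter, n_skip, None)
```

Note that the stream case still pays the I/O cost of reading the skipped batches, which is worth checking before relying on large skip windows.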
Wire up LossSpikeWatchdog to the training loop's step callback (on_step_end in curry_train.loop).
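A sketch of that wiring; beyond the on_step_end name, the callback signature and the loop's rollback hook are assumptions:

```python
watchdog = LossSpikeWatchdog()

def on_step_end(step: int, metrics: dict) -> None:
    # Assumed signature: the loop passes the step index and a metrics dict.
    event = watchdog.update(metrics["loss"], step)
    if event is not None:
        loop.request_rollback(event)  # hypothetical hook on the loop object
```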
Add the watchdog's trigger event to the run journal — every rollback should be visible after the fact for analysis.
After a successful rollback, log the rollback metadata (from-step, to-step, batches-skipped, reason). This is essential for runs-diff and for distinguishing "clean run" from "recovered run" in subsequent analysis.
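For instance, as an append-only JSON-lines record; the field names follow the prose above, and the journal path is illustrative:

```python
import json
import time

def journal_rollback(event: dict, current_step: int, path: str = "run_journal.jsonl"):
    record = {
        "ts": time.time(),
        "kind": "rollback",
        "from_step": current_step,
        "to_step": event["rollback_to_step"],
        "batches_skipped": event["skip_batches"],
        "reason": event["reason"],
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```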
The PaLM paper observed that loss spikes correlated with specific data batches in combination with the model's parameter state at that point, rather than with universally bad data. Without the skip, resuming from rollback would re-encounter the same trigger; with the skip, the run progresses past it.
If your data is well-curated and you don't observe data-correlated spikes, you may not need the skip — but it's cheap insurance to leave it in.
- skills/stage5-checkpoint-cadence — defines the lightweight-checkpoint cadence required for rollback.
- skills/stage3-kill-criterion — the upstream kill rule when rollback can't save the run.
- skills/stage5-run-journal — journals every rollback event.