Decide how often to checkpoint, what to checkpoint (full vs parameter-only), and how many to keep — balancing recovery, rollback, and storage. Activate when the user asks "how often should I checkpoint", "checkpoint policy", "rollback checkpoint", "DCP setup", "best-K checkpoints", or before any long-running training.
```
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train
```

This skill uses the workspace's default tool permissions.
A checkpoint policy that supports both **recovery** (resume from any failure) and **rollback** (loss-spike recovery), with reasonable storage cost.
"If this run crashes or spikes at step T, what's the worst-case resume point and how much storage do I commit?"
A good policy distinguishes:
| Role | Cadence | Contents | Retention |
|---|---|---|---|
| Rollback | every ~100 steps | parameters + optimizer state + RNG state | last 5 (rolling) |
| Recovery | every ~1000 steps | full state (params + optim + scheduler + dataloader + RNG) | last 3 (rolling) |
| Best-K | on dev metric improvement | parameters only | top 3 (by dev metric) |
This gives:
- failure recovery at ~1000-step resolution
- loss-spike rollback at ~100-step resolution (used by stage5-loss-spike-rollback)
- the best dev-metric checkpoints preserved for model selection

For a model with P parameters in fp32 and Adam optimizer, a full checkpoint is roughly 4P bytes of parameters plus 8P bytes of optimizer moment buffers, i.e. about 12P bytes. For a 7B model (P ≈ 7e9): ~84 GB per full checkpoint. Storing 5 rollback + 3 recovery + 3 best-K = 11 full checkpoints ≈ 900 GB. Most users keep rollback checkpoints as parameter-only (lighter) and only the recovery checkpoints as full state.
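A back-of-envelope helper for that arithmetic (a sketch; the function name is illustrative, and it assumes fp32 parameters at 4 bytes each plus Adam's two fp32 moment buffers at 8 bytes per parameter):

```python
def checkpoint_storage_gb(n_params: float, n_full: int, n_param_only: int = 0) -> float:
    """Rough storage committed by a retention policy, in GB."""
    full = n_params * 12  # 4 bytes of params + 8 bytes of Adam moments
    lite = n_params * 4   # parameters only
    return (n_full * full + n_param_only * lite) / 1e9

print(checkpoint_storage_gb(7e9, n_full=11))                 # ~924 GB, all 11 kept as full state
print(checkpoint_storage_gb(7e9, n_full=6, n_param_only=5))  # ~644 GB, parameter-only rollback
```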
For sharded training (FSDP, ZeRO-3), use PyTorch DCP (torch.distributed.checkpoint), which writes checkpoints that can be resumed at any future world size:
```python
import torch.distributed.checkpoint as dcp

state = {
    "model": model,          # FSDP-wrapped; DCP calls .state_dict() on objects that expose it
    "optimizer": optimizer,  # likewise handled via its state_dict()/load_state_dict()
    "scheduler": scheduler.state_dict(),
    "rng": get_rng_state(),  # project helper capturing RNG state
    "step": step,
    "dataloader": dataloader.state_dict(),  # requires a stateful dataloader
}
dcp.save(state, checkpoint_id=f"runs/<run-id>/recovery-{step:08d}")
```
The corresponding load:

```python
dcp.load(state, checkpoint_id=...)  # restores into `state` in place
```
DCP supports partial saves (for rollback's parameter-only checkpoints) and deals with rank changes (resume on different topology). See primitive-dcp for details.
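A fuller resume sketch under the same assumptions (set_rng_state and resume_step are hypothetical counterparts to the helpers above; model, optimizer, scheduler, and dataloader are freshly constructed before loading):

```python
import torch.distributed.checkpoint as dcp

# Rebuild the same state structure with freshly constructed objects;
# DCP restores tensor state into them and plain values into the dict.
state = {
    "model": model,
    "optimizer": optimizer,
    "scheduler": scheduler.state_dict(),
    "rng": None,
    "step": -1,
    "dataloader": dataloader.state_dict(),
}
dcp.load(state, checkpoint_id=f"runs/<run-id>/recovery-{resume_step:08d}")

scheduler.load_state_dict(state["scheduler"])
dataloader.load_state_dict(state["dataloader"])
set_rng_state(state["rng"])  # hypothetical counterpart of get_rng_state()
step = state["step"]         # continue the loop from here
```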
```
runs/<run-id>/
├── rollback/
│   ├── step-00500000.dcp/
│   ├── step-00500100.dcp/
│   └── ... (rolling, last 5)
├── recovery/
│   ├── step-00500000.dcp/
│   └── ... (rolling, last 3)
└── best-k/
    ├── dev-loss-2.341-step-00450000.dcp/
    └── ... (top 3 by metric)
```
A latest symlink (latest -> recovery/step-XXX.dcp) makes resume trivial; see the sketch below.
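One way to maintain that symlink safely (a sketch; update_latest is a hypothetical helper): create a temporary symlink, then rename it over latest, since rename is atomic on POSIX filesystems.

```python
import os

def update_latest(run_dir: str, target: str) -> None:
    """Atomically point run_dir/latest at target (a path relative to run_dir)."""
    tmp = os.path.join(run_dir, "latest.tmp")
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(target, tmp)
    os.replace(tmp, os.path.join(run_dir, "latest"))  # atomic rename over the old link

update_latest("runs/<run-id>", "recovery/step-00500000.dcp")
```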
On resume, runs-diff should show "run-X resumed from step Y" as a top-line fact. For a loss-spike-rollback resume, additionally load the data-skip offset.
Estimate per-checkpoint size from P and the format. If > 10 GB, push the user toward parameter-only rollback checkpoints to control storage.
Set up the three cadences in the config:
```yaml
checkpoint:
  rollback_every: 100
  recovery_every: 1000
  best_k: 3
  best_k_metric: dev_loss
  best_k_direction: minimize
  retention:
    rollback_keep: 5
    recovery_keep: 3
```
Wire save_checkpoint (in curry_train.loop) to the three cadences.
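A sketch of that wiring, assuming the config above is available as cfg; the keyword arguments to save_checkpoint are illustrative, not curry_train's actual signature:

```python
best_metric = float("inf")  # cfg.best_k_direction == "minimize"

def maybe_checkpoint(step: int, dev_metric: float | None = None) -> None:
    """Call once per training step; dev_metric is passed only on eval steps."""
    global best_metric
    if step % cfg.rollback_every == 0:
        save_checkpoint(kind="rollback", step=step)  # parameter-only if so configured
    if step % cfg.recovery_every == 0:
        save_checkpoint(kind="recovery", step=step)  # full state
    if dev_metric is not None and dev_metric < best_metric:
        best_metric = dev_metric
        save_checkpoint(kind="best_k", step=step, metric=dev_metric)
```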
Test resume before committing to a long run: launch a 30-minute test run, kill it at step 500, resume it, and confirm the resumed run reaches the same loss curve as one that ran straight through.
Add cleanup logic so old rollback checkpoints are deleted automatically.
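A minimal pruning sketch over the layout above (prune_rolling is a hypothetical helper; it relies on the zero-padded step names sorting lexicographically in step order):

```python
import shutil
from pathlib import Path

def prune_rolling(ckpt_dir: Path, keep: int) -> None:
    """Delete all but the newest `keep` step-XXXXXXXX.dcp directories."""
    ckpts = sorted(p for p in ckpt_dir.iterdir() if p.name.startswith("step-"))
    for old in ckpts[:-keep]:
        shutil.rmtree(old)

# Run after each save, per the retention config:
prune_rolling(Path("runs/<run-id>/rollback"), keep=5)
prune_rolling(Path("runs/<run-id>/recovery"), keep=3)
```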
Related skills:
- skills/primitive-dcp — distributed checkpoint implementation.
- skills/stage5-loss-spike-rollback — uses the rollback cadence.
- skills/stage5-run-journal — journal every checkpoint event.
- skills/stage5-warmup-cosine — schedule must be resumable from any step.