Decide how often to checkpoint, what to checkpoint (full vs parameter-only), and how many to keep — balancing recovery, rollback, and storage. Activate when the user asks "how often should I checkpoint", "checkpoint policy", "rollback checkpoint", "DCP setup", "best-K checkpoints", or before any long-running training.
```
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train
```

This skill uses the workspace's default tool permissions.
A checkpoint policy that supports both **recovery** (resume from any failure) and **rollback** (loss-spike recovery), with reasonable storage cost.
"If this run crashes or spikes at step T, what's the worst-case resume point and how much storage do I commit?"
A good policy distinguishes:
| Role | Cadence | Contents | Retention |
|---|---|---|---|
| Rollback | every ~100 steps | parameters + optimizer state + RNG state | last 5 (rolling) |
| Recovery | every ~1000 steps | full state (params + optim + scheduler + dataloader + RNG) | last 3 (rolling) |
| Best-K | on dev metric improvement | parameters only | top 3 (by dev metric) |
This gives:
- failure recovery at ~1000-step resolution
- loss-spike rollback at ~100-step resolution (used by stage5-loss-spike-rollback)
- the best dev-metric checkpoints preserved for model selection

For a model with P parameters in fp32 and Adam optimizer, a full checkpoint is roughly 4P bytes of parameters plus 8P bytes of optimizer moment buffers, i.e. about 12P bytes. For a 7B model (P ≈ 7e9): ~84 GB per full checkpoint. Storing 5 rollback + 3 recovery + 3 best-K = 11 full checkpoints ≈ 900 GB. Most users keep rollback checkpoints as parameter-only (lighter) and only the recovery checkpoints as full state.
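A back-of-envelope helper for that arithmetic (a sketch; the function name is illustrative, and it assumes fp32 parameters at 4 bytes each plus Adam's two fp32 moment buffers at 8 bytes per parameter):

```python
def checkpoint_storage_gb(n_params: float, n_full: int, n_param_only: int = 0) -> float:
    """Rough storage committed by a retention policy, in GB."""
    full = n_params * 12  # 4 bytes of params + 8 bytes of Adam moments
    lite = n_params * 4   # parameters only
    return (n_full * full + n_param_only * lite) / 1e9

print(checkpoint_storage_gb(7e9, n_full=11))                 # ~924 GB, all 11 kept as full state
print(checkpoint_storage_gb(7e9, n_full=6, n_param_only=5))  # ~644 GB, parameter-only rollback
```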
For sharded training (FSDP, ZeRO-3), use PyTorch DCP (torch.distributed.checkpoint), which writes checkpoints that can be resumed at any future world size:
```python
import torch.distributed.checkpoint as dcp

state = {
    "model": model,          # FSDP-wrapped; DCP calls .state_dict() on objects that expose it
    "optimizer": optimizer,  # likewise handled via its state_dict()/load_state_dict()
    "scheduler": scheduler.state_dict(),
    "rng": get_rng_state(),  # project helper capturing RNG state
    "step": step,
    "dataloader": dataloader.state_dict(),  # requires a stateful dataloader
}
dcp.save(state, checkpoint_id=f"runs/<run-id>/recovery-{step:08d}")
```
The corresponding load:

```python
dcp.load(state, checkpoint_id=...)  # restores into `state` in place
```
DCP supports partial saves (for rollback's parameter-only checkpoints) and deals with rank changes (resume on different topology). See primitive-dcp for details.
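A fuller resume sketch under the same assumptions (set_rng_state and resume_step are hypothetical counterparts to the helpers above; model, optimizer, scheduler, and dataloader are freshly constructed before loading):

```python
import torch.distributed.checkpoint as dcp

# Rebuild the same state structure with freshly constructed objects;
# DCP restores tensor state into them and plain values into the dict.
state = {
    "model": model,
    "optimizer": optimizer,
    "scheduler": scheduler.state_dict(),
    "rng": None,
    "step": -1,
    "dataloader": dataloader.state_dict(),
}
dcp.load(state, checkpoint_id=f"runs/<run-id>/recovery-{resume_step:08d}")

scheduler.load_state_dict(state["scheduler"])
dataloader.load_state_dict(state["dataloader"])
set_rng_state(state["rng"])  # hypothetical counterpart of get_rng_state()
step = state["step"]         # continue the loop from here
```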
```
runs/<run-id>/
├── rollback/
│   ├── step-00500000.dcp/
│   ├── step-00500100.dcp/
│   └── ... (rolling, last 5)
├── recovery/
│   ├── step-00500000.dcp/
│   └── ... (rolling, last 3)
└── best-k/
    ├── dev-loss-2.341-step-00450000.dcp/
    └── ... (top 3 by metric)
```
A latest symlink (latest -> recovery/step-XXX.dcp) makes resume trivial; see the sketch below.
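One way to maintain that symlink safely (a sketch; update_latest is a hypothetical helper): create a temporary symlink, then rename it over latest, since rename is atomic on POSIX filesystems.

```python
import os

def update_latest(run_dir: str, target: str) -> None:
    """Atomically point run_dir/latest at target (a path relative to run_dir)."""
    tmp = os.path.join(run_dir, "latest.tmp")
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(target, tmp)
    os.replace(tmp, os.path.join(run_dir, "latest"))  # atomic rename over the old link

update_latest("runs/<run-id>", "recovery/step-00500000.dcp")
```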
On resume, runs-diff should show "run-X resumed from step Y" as a top-line fact. For a loss-spike-rollback resume, additionally load the data-skip offset.
Estimate per-checkpoint size from P and the format. If > 10 GB, push the user toward parameter-only rollback checkpoints to control storage.
Set up the three cadences in the config:
```yaml
checkpoint:
  rollback_every: 100
  recovery_every: 1000
  best_k: 3
  best_k_metric: dev_loss
  best_k_direction: minimize
  retention:
    rollback_keep: 5
    recovery_keep: 3
```
Wire save_checkpoint (in curry_train.loop) to the three cadences.
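A sketch of that wiring, assuming the config above is available as cfg; the keyword arguments to save_checkpoint are illustrative, not curry_train's actual signature:

```python
best_metric = float("inf")  # cfg.best_k_direction == "minimize"

def maybe_checkpoint(step: int, dev_metric: float | None = None) -> None:
    """Call once per training step; dev_metric is passed only on eval steps."""
    global best_metric
    if step % cfg.rollback_every == 0:
        save_checkpoint(kind="rollback", step=step)  # parameter-only if so configured
    if step % cfg.recovery_every == 0:
        save_checkpoint(kind="recovery", step=step)  # full state
    if dev_metric is not None and dev_metric < best_metric:
        best_metric = dev_metric
        save_checkpoint(kind="best_k", step=step, metric=dev_metric)
```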
Test resume before committing to a long run: launch a 30-minute test run, kill it at step 500, resume it, and confirm the resumed run reaches the same loss curve as one that ran straight through.
Add cleanup logic so old rollback checkpoints are deleted automatically.
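A minimal pruning sketch over the layout above (prune_rolling is a hypothetical helper; it relies on the zero-padded step names sorting lexicographically in step order):

```python
import shutil
from pathlib import Path

def prune_rolling(ckpt_dir: Path, keep: int) -> None:
    """Delete all but the newest `keep` step-XXXXXXXX.dcp directories."""
    ckpts = sorted(p for p in ckpt_dir.iterdir() if p.name.startswith("step-"))
    for old in ckpts[:-keep]:
        shutil.rmtree(old)

# Run after each save, per the retention config:
prune_rolling(Path("runs/<run-id>/rollback"), keep=5)
prune_rolling(Path("runs/<run-id>/recovery"), keep=3)
```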
Related skills:
- skills/primitive-dcp — distributed checkpoint implementation.
- skills/stage5-loss-spike-rollback — uses the rollback cadence.
- skills/stage5-run-journal — journal every checkpoint event.
- skills/stage5-warmup-cosine — schedule must be resumable from any step.