Distributed checkpoint format that survives changes in world size and parallelism topology. Built on torch.distributed.checkpoint. Activate when the user asks "DCP", "distributed checkpoint", "resume on different topology", "FSDP checkpoint", or "save sharded model".
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train

This skill uses the workspace's default tool permissions.
Save and load model + optimizer state in a sharded format that does not lock you into the world size or parallelism topology used at save time.
Wraps torch.distributed.checkpoint to save state in a per-shard format. On load, the runtime can reshard automatically to match the current topology — e.g., save with TP=4 / DP=2, resume with TP=2 / DP=4.
from curry_train.primitives import DCP

# Save
DCP.save(
    state={"model": model, "optimizer": optimizer, ...},
    checkpoint_id="runs/<id>/recovery-step-00500000",
)

# Load (any topology)
DCP.load(
    state={"model": model, "optimizer": optimizer, ...},
    checkpoint_id="runs/<id>/recovery-step-00500000",
)
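For orientation, here is a minimal sketch of what a wrapper like this might do on top of the public torch.distributed.checkpoint API (PyTorch >= 2.2). The helper names are illustrative, not the actual curry-train implementation:

# Sketch only: a thin DCP wrapper on the public PyTorch API.
# get_state_dict/set_state_dict handle FSDP/TP-sharded (DTensor) parameters.
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict

def save_checkpoint(model, optimizer, checkpoint_id):
    model_sd, optim_sd = get_state_dict(model, optimizer)
    dcp.save({"model": model_sd, "optimizer": optim_sd}, checkpoint_id=checkpoint_id)

def load_checkpoint(model, optimizer, checkpoint_id):
    # Build a state dict shaped for the *current* topology, load the saved
    # shards into it, then push the values back into the live objects.
    model_sd, optim_sd = get_state_dict(model, optimizer)
    dcp.load({"model": model_sd, "optimizer": optim_sd}, checkpoint_id=checkpoint_id)
    set_state_dict(model, optimizer,
                   model_state_dict=model_sd, optim_state_dict=optim_sd)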
Recommended save state:

state = {
    "model": model,
    "optimizer": optimizer,
    "scheduler": scheduler.state_dict(),
    "rng": {
        "torch": torch.get_rng_state(),
        "torch_cuda": torch.cuda.get_rng_state_all(),
        "numpy": np.random.get_state(),
        "python": random.getstate(),
    },
    "step": step,  # current global step (an int)
    "dataloader": dataloader.state_dict(),  # see torchdata.StatefulDataLoader
}
The model and optimizer are sharded; everything else is small enough to be replicated.
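On resume, the replicated extras have to be pushed back by hand. A minimal sketch, assuming the key layout shown above (the helper name is this guide's, not a fixed API):

import random
import numpy as np
import torch

def restore_extras(state, scheduler, dataloader):
    # Re-seed every RNG stream so data order and dropout masks line up.
    # Note: the per-device CUDA states assume the same local device count.
    torch.set_rng_state(state["rng"]["torch"])
    torch.cuda.set_rng_state_all(state["rng"]["torch_cuda"])
    np.random.set_state(state["rng"]["numpy"])
    random.setstate(state["rng"]["python"])
    scheduler.load_state_dict(state["scheduler"])
    dataloader.load_state_dict(state["dataloader"])
    return state["step"]

On disk, each checkpoint directory looks like this: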
runs/<id>/recovery-step-00500000/
├── .metadata
├── __0_0.distcp
├── __0_1.distcp
├── __1_0.distcp
├── __1_1.distcp
└── ... (one .distcp per (rank, shard) pair)
The .metadata file describes the sharding so any future rank assignment can read its shard.
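The same layout can also be flattened into a single torch.save file for offline inspection or single-process inference; PyTorch ships a converter for this (real API, shown as a sketch):

# Consolidate a sharded DCP checkpoint into one torch.save file.
# Runs in a single process; no process group needed.
from torch.distributed.checkpoint.format_utils import dcp_to_torch_save

dcp_to_torch_save("runs/<id>/recovery-step-00500000", "consolidated.pt")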
Why not torch.save? For a sharded model, torch.save(state_dict) either gathers everything to rank 0 (slow, large) or only saves local shards (not resumable under a different topology). Use DCP once training is distributed enough that torch.save to a single file becomes a bottleneck; for single-process runs, torch.save is fine and DCP's overhead isn't worth it.

V1: stub at template/curry_train/primitives/dcp.py. Reference: PyTorch torch.distributed.checkpoint (available since 2.0).
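For contrast, the gather-to-rank-0 path that DCP avoids looks roughly like this (real PyTorch API; treat as a sketch, with `model` standing in for the FSDP/TP-wrapped module):

# The slow path: materialize the full, unsharded state dict and have
# rank 0 write it out. Memory and wall-clock scale with total model size.
import torch
import torch.distributed as dist
from torch.distributed.checkpoint.state_dict import get_model_state_dict, StateDictOptions

full_sd = get_model_state_dict(
    model, options=StateDictOptions(full_state_dict=True, cpu_offload=True)
)
if dist.get_rank() == 0:
    torch.save(full_sd, "full_checkpoint.pt")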
Related skills:
skills/stage5-checkpoint-cadence — policy that decides when to save.
skills/primitive-distributed-optimizer — FSDP requires DCP for resumable checkpoints.
skills/stage5-loss-spike-rollback — uses rollback-frequency DCP saves.