Distributed checkpoint format that survives changes in world size and parallelism topology. Built on torch.distributed.checkpoint. Activate when the user asks "DCP", "distributed checkpoint", "resume on different topology", "FSDP checkpoint", or "save sharded model".
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train

This skill uses the workspace's default tool permissions.
Save and load model + optimizer state in a sharded format that does not lock you into the world size or parallelism topology used at save time.
Wraps torch.distributed.checkpoint to save state in a per-shard format. On load, the runtime can reshard automatically to match the current topology — e.g., save with TP=4 / DP=2, resume with TP=2 / DP=4.
from curry_train.primitives import DCP

# Save
DCP.save(
    state={"model": model, "optimizer": optimizer, ...},
    checkpoint_id="runs/<id>/recovery-step-00500000",
)

# Load (any topology)
DCP.load(
    state={"model": model, "optimizer": optimizer, ...},
    checkpoint_id="runs/<id>/recovery-step-00500000",
)
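For orientation, here is a minimal sketch of what a wrapper like this might do on top of the public torch.distributed.checkpoint API (PyTorch >= 2.2). The helper names are illustrative, not the actual curry-train implementation:

# Sketch only: a thin DCP wrapper on the public PyTorch API.
# get_state_dict/set_state_dict handle FSDP/TP-sharded (DTensor) parameters.
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict

def save_checkpoint(model, optimizer, checkpoint_id):
    model_sd, optim_sd = get_state_dict(model, optimizer)
    dcp.save({"model": model_sd, "optimizer": optim_sd}, checkpoint_id=checkpoint_id)

def load_checkpoint(model, optimizer, checkpoint_id):
    # Build a state dict shaped for the *current* topology, load the saved
    # shards into it, then push the values back into the live objects.
    model_sd, optim_sd = get_state_dict(model, optimizer)
    dcp.load({"model": model_sd, "optimizer": optim_sd}, checkpoint_id=checkpoint_id)
    set_state_dict(model, optimizer,
                   model_state_dict=model_sd, optim_state_dict=optim_sd)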
Recommended save state:

state = {
    "model": model,
    "optimizer": optimizer,
    "scheduler": scheduler.state_dict(),
    "rng": {
        "torch": torch.get_rng_state(),
        "torch_cuda": torch.cuda.get_rng_state_all(),
        "numpy": np.random.get_state(),
        "python": random.getstate(),
    },
    "step": step,  # current global step (an int)
    "dataloader": dataloader.state_dict(),  # see torchdata.StatefulDataLoader
}
The model and optimizer are sharded; everything else is small enough to be replicated.
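On resume, the replicated extras have to be pushed back by hand. A minimal sketch, assuming the key layout shown above (the helper name is this guide's, not a fixed API):

import random
import numpy as np
import torch

def restore_extras(state, scheduler, dataloader):
    # Re-seed every RNG stream so data order and dropout masks line up.
    # Note: the per-device CUDA states assume the same local device count.
    torch.set_rng_state(state["rng"]["torch"])
    torch.cuda.set_rng_state_all(state["rng"]["torch_cuda"])
    np.random.set_state(state["rng"]["numpy"])
    random.setstate(state["rng"]["python"])
    scheduler.load_state_dict(state["scheduler"])
    dataloader.load_state_dict(state["dataloader"])
    return state["step"]

On disk, each checkpoint directory looks like this: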
runs/<id>/recovery-step-00500000/
├── .metadata
├── __0_0.distcp
├── __0_1.distcp
├── __1_0.distcp
├── __1_1.distcp
└── ... (one .distcp per (rank, shard) pair)
The .metadata file describes the sharding so any future rank assignment can read its shard.
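The same layout can also be flattened into a single torch.save file for offline inspection or single-process inference; PyTorch ships a converter for this (real API, shown as a sketch):

# Consolidate a sharded DCP checkpoint into one torch.save file.
# Runs in a single process; no process group needed.
from torch.distributed.checkpoint.format_utils import dcp_to_torch_save

dcp_to_torch_save("runs/<id>/recovery-step-00500000", "consolidated.pt")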
Why not torch.save? For a sharded model, torch.save(state_dict) either gathers everything to rank 0 (slow, large) or only saves local shards (not resumable under a different topology). Use DCP once training is distributed enough that torch.save to a single file becomes a bottleneck; for single-process runs, torch.save is fine and DCP's overhead isn't worth it.

V1: stub at template/curry_train/primitives/dcp.py. Reference: PyTorch torch.distributed.checkpoint (available since 2.0).
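For contrast, the gather-to-rank-0 path that DCP avoids looks roughly like this (real PyTorch API; treat as a sketch, with `model` standing in for the FSDP/TP-wrapped module):

# The slow path: materialize the full, unsharded state dict and have
# rank 0 write it out. Memory and wall-clock scale with total model size.
import torch
import torch.distributed as dist
from torch.distributed.checkpoint.state_dict import get_model_state_dict, StateDictOptions

full_sd = get_model_state_dict(
    model, options=StateDictOptions(full_state_dict=True, cpu_offload=True)
)
if dist.get_rank() == 0:
    torch.save(full_sd, "full_checkpoint.pt")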
Related skills:
skills/stage5-checkpoint-cadence — policy that decides when to save.
skills/primitive-distributed-optimizer — FSDP requires DCP for resumable checkpoints.
skills/stage5-loss-spike-rollback — uses rollback-frequency DCP saves.