From curry-train
Optimizer state sharding across DP ranks (ZeRO-1, ZeRO-2, ZeRO-3 / FSDP). Reduces per-rank memory by sharding optimizer-state, gradient, and/or parameter copies. Activate when the user asks about "ZeRO", "FSDP", "optimizer sharding", "distributed optimizer", or "OOM in optimizer".
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train

This skill uses the workspace's default tool permissions.
Optimizer state sharded across DP ranks. Implements the ZeRO family of memory optimizations: ZeRO-1 (optimizer states), ZeRO-2 (+ gradients), ZeRO-3 / FSDP (+ parameters).
Standard data-parallel: every rank holds full copies of params, gradients, and optimizer state.
ZeRO partitions these across the data-parallel group:
- ZeRO-1: shards the optimizer states.
- ZeRO-2: additionally shards the gradients.
- ZeRO-3 / FSDP: additionally shards the parameters themselves.
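As a rough illustration of what each stage saves, the back-of-envelope sketch below uses the usual mixed-precision Adam accounting of 2 bytes of weights + 2 bytes of gradients + 12 bytes of fp32 optimizer state (master weights, momentum, variance) per parameter. The 7B model size and 8-way DP group are illustrative numbers, not anything curry-train prescribes.

```python
# Back-of-envelope per-rank memory for weights / grads / optimizer state only
# (activations and temporary buffers excluded).
def per_rank_gb(n_params: float, dp_ranks: int, stage: str) -> float:
    weights = 2 * n_params   # bf16/fp16 weights
    grads = 2 * n_params     # bf16/fp16 gradients
    opt = 12 * n_params      # fp32 master weights + momentum + variance
    if stage == "zero1":     # shard optimizer states
        total = weights + grads + opt / dp_ranks
    elif stage == "zero2":   # + shard gradients
        total = weights + grads / dp_ranks + opt / dp_ranks
    elif stage == "zero3":   # + shard parameters (FSDP full shard)
        total = (weights + grads + opt) / dp_ranks
    else:                    # plain DDP: everything replicated
        total = weights + grads + opt
    return total / 1e9

for stage in ("ddp", "zero1", "zero2", "zero3"):
    print(f"{stage}: {per_rank_gb(7e9, dp_ranks=8, stage=stage):.1f} GB per rank")
```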
PyTorch's FSDP is the canonical implementation today.
```python
import torch

from curry_train.primitives import DistributedOptimizer

# Wrap the model
model = DistributedOptimizer.wrap(
    model,
    sharding="full",                   # one of: "no_shard", "shard_grad_op" (ZeRO-2), "full" (ZeRO-3)
    mixed_precision="bf16",            # "bf16", "fp16", or "fp32"
    cpu_offload=False,
    backward_prefetch="backward_pre",  # FSDP all-gather scheduling
    use_orig_params=True,
)

# Build the optimizer over the (sharded) parameters
optimizer = torch.optim.AdamW(model.parameters(), lr=cfg.lr)
```
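For reference, here is a minimal sketch of the same wrapping done directly with PyTorch FSDP, the canonical implementation mentioned above. The toy model, dtypes, and learning rate are illustrative assumptions; a real setup would also pass an auto-wrap policy so each transformer block becomes its own FSDP unit.

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    ShardingStrategy,
    MixedPrecision,
    BackwardPrefetch,
    CPUOffload,
)

# Assumes torch.distributed is already initialized (e.g. launched via torchrun).
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # "full" -> ZeRO-3
    mixed_precision=MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    ),
    cpu_offload=CPUOffload(offload_params=False),
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
    use_orig_params=True,
)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
```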
The cost: one all-gather + one reduce-scatter per layer per training step. On an NVLink-connected single node this is fast; across nodes it becomes the dominant communication cost.
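For a sense of scale, a quick back-of-envelope under exactly that assumption (one bf16 all-gather of the parameters plus one bf16 reduce-scatter of the gradients per step); the 7B size is an illustrative number:

```python
n_params = 7e9
bytes_per_elem = 2  # bf16
all_gather = n_params * bytes_per_elem      # gather the full parameter set once per step
reduce_scatter = n_params * bytes_per_elem  # reduce-scatter the gradients once per step
print(f"~{(all_gather + reduce_scatter) / 1e9:.0f} GB of inter-rank traffic per step")  # ~28 GB
```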
Configuration guidance:
- sharding="shard_grad_op" (ZeRO-2): a reasonable middle ground; shards optimizer state and gradients.
- sharding="full" (ZeRO-3): the default for any model that doesn't fit unsharded. 2026 community consensus is FSDP for 7B–70B fine-tuning on a single node.
- Set use_orig_params=True to avoid conflicts.
- Checkpoint FSDP-wrapped models with primitive-dcp.

Pitfalls:
- "full" adds a per-layer communication overhead; for small models it can be slower than plain DDP (a minimal DDP baseline is sketched after this list). Don't use it preemptively.
- use_orig_params=False is faster but interacts badly with TP and complex models. Default to True.

Implementation status: V1 is a stub at template/curry_train/primitives/distributed_optimizer.py. PyTorch's torch.distributed.fsdp.FullyShardedDataParallel is the canonical implementation; the DeepSpeed ZeRO optimizer is the alternative.
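For comparison, the plain-DDP baseline referenced above: everything is replicated on every rank, so the only collective is a gradient all-reduce per step. A minimal sketch; the toy model and learning rate are placeholders, not curry-train API.

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes torch.distributed is already initialized (e.g. launched via torchrun).
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024)).cuda()
model = DDP(model, device_ids=[torch.cuda.current_device()])
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
```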
Related skills:
- skills/primitive-parallel-state — the DP group used for sharding.
- skills/primitive-dcp — required to checkpoint an FSDP-wrapped model.
- skills/stage4-parallel-primitive-intro — when to add this.