From curry-train
Optimizer state sharding across DP ranks (ZeRO-1, ZeRO-2, ZeRO-3 / FSDP). Reduces per-rank memory by sharding optimizer-state, gradient, and/or parameter copies. Activate when the user asks about "ZeRO", "FSDP", "optimizer sharding", "distributed optimizer", or "OOM in optimizer".
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train

This skill uses the workspace's default tool permissions.
Optimizer state sharded across DP ranks. Implements the ZeRO family of memory optimizations: ZeRO-1 (optimizer states), ZeRO-2 (+ gradients), ZeRO-3 / FSDP (+ parameters).
Standard data-parallel: every rank holds full copies of params, gradients, and optimizer state.
ZeRO partitions these across the data-parallel group:
- ZeRO-1: shards the optimizer states.
- ZeRO-2: additionally shards the gradients.
- ZeRO-3 / FSDP: additionally shards the parameters themselves.
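As a rough illustration of what each stage saves, the back-of-envelope sketch below uses the usual mixed-precision Adam accounting of 2 bytes of weights + 2 bytes of gradients + 12 bytes of fp32 optimizer state (master weights, momentum, variance) per parameter. The 7B model size and 8-way DP group are illustrative numbers, not anything curry-train prescribes.

```python
# Back-of-envelope per-rank memory for weights / grads / optimizer state only
# (activations and temporary buffers excluded).
def per_rank_gb(n_params: float, dp_ranks: int, stage: str) -> float:
    weights = 2 * n_params   # bf16/fp16 weights
    grads = 2 * n_params     # bf16/fp16 gradients
    opt = 12 * n_params      # fp32 master weights + momentum + variance
    if stage == "zero1":     # shard optimizer states
        total = weights + grads + opt / dp_ranks
    elif stage == "zero2":   # + shard gradients
        total = weights + grads / dp_ranks + opt / dp_ranks
    elif stage == "zero3":   # + shard parameters (FSDP full shard)
        total = (weights + grads + opt) / dp_ranks
    else:                    # plain DDP: everything replicated
        total = weights + grads + opt
    return total / 1e9

for stage in ("ddp", "zero1", "zero2", "zero3"):
    print(f"{stage}: {per_rank_gb(7e9, dp_ranks=8, stage=stage):.1f} GB per rank")
```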
PyTorch's FSDP is the canonical implementation today.
```python
import torch

from curry_train.primitives import DistributedOptimizer

# Wrap the model
model = DistributedOptimizer.wrap(
    model,
    sharding="full",                   # one of: "no_shard", "shard_grad_op" (ZeRO-2), "full" (ZeRO-3)
    mixed_precision="bf16",            # "bf16", "fp16", or "fp32"
    cpu_offload=False,
    backward_prefetch="backward_pre",  # FSDP all-gather scheduling
    use_orig_params=True,
)

# Build the optimizer over the (sharded) parameters
optimizer = torch.optim.AdamW(model.parameters(), lr=cfg.lr)
```
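For reference, here is a minimal sketch of the same wrapping done directly with PyTorch FSDP, the canonical implementation mentioned above. The toy model, dtypes, and learning rate are illustrative assumptions; a real setup would also pass an auto-wrap policy so each transformer block becomes its own FSDP unit.

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    ShardingStrategy,
    MixedPrecision,
    BackwardPrefetch,
    CPUOffload,
)

# Assumes torch.distributed is already initialized (e.g. launched via torchrun).
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # "full" -> ZeRO-3
    mixed_precision=MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    ),
    cpu_offload=CPUOffload(offload_params=False),
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
    use_orig_params=True,
)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
```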
The cost: one all-gather + one reduce-scatter per layer per training step. On an NVLink-connected single node this is fast; across nodes it becomes the dominant communication cost.
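For a sense of scale, a quick back-of-envelope under exactly that assumption (one bf16 all-gather of the parameters plus one bf16 reduce-scatter of the gradients per step); the 7B size is an illustrative number:

```python
n_params = 7e9
bytes_per_elem = 2  # bf16
all_gather = n_params * bytes_per_elem      # gather the full parameter set once per step
reduce_scatter = n_params * bytes_per_elem  # reduce-scatter the gradients once per step
print(f"~{(all_gather + reduce_scatter) / 1e9:.0f} GB of inter-rank traffic per step")  # ~28 GB
```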
Configuration guidance:
- sharding="shard_grad_op" (ZeRO-2): a reasonable middle ground; shards optimizer state and gradients.
- sharding="full" (ZeRO-3): the default for any model that doesn't fit unsharded. 2026 community consensus is FSDP for 7B–70B fine-tuning on a single node.
- Set use_orig_params=True to avoid conflicts.
- Checkpoint FSDP-wrapped models with primitive-dcp.

Pitfalls:
- "full" adds a per-layer communication overhead; for small models it can be slower than plain DDP (a minimal DDP baseline is sketched after this list). Don't use it preemptively.
- use_orig_params=False is faster but interacts badly with TP and complex models. Default to True.

Implementation status: V1 is a stub at template/curry_train/primitives/distributed_optimizer.py. PyTorch's torch.distributed.fsdp.FullyShardedDataParallel is the canonical implementation; the DeepSpeed ZeRO optimizer is the alternative.
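For comparison, the plain-DDP baseline referenced above: everything is replicated on every rank, so the only collective is a gradient all-reduce per step. A minimal sketch; the toy model and learning rate are placeholders, not curry-train API.

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes torch.distributed is already initialized (e.g. launched via torchrun).
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024)).cuda()
model = DDP(model, device_ids=[torch.cuda.current_device()])
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
```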
Related skills:
- skills/primitive-parallel-state — the DP group used for sharding.
- skills/primitive-dcp — required to checkpoint an FSDP-wrapped model.
- skills/stage4-parallel-primitive-intro — when to add this.