Decide which parallelism primitive (DP, ZeRO, TP, PP, EP, CP) to introduce next based on what bottleneck appears at the current model size. Activate when the user asks "do I need tensor parallelism", "OOM at scale", "training too slow", "should I add pipeline parallel", "how to scale beyond N GPUs", or after capacity-sweep when single-GPU runs no longer fit.
Install:

npx claudepluginhub curryfromuestc/curry-train --plugin curry-train

This skill uses the workspace's default tool permissions.
A decision guide for adding the right parallelism in the right order. The wrong order makes simple problems much harder; the right order makes scaling almost mechanical.
"What is the actual bottleneck right now — memory, throughput, or both — and which primitive directly addresses it?"
DP (Data Parallel)
├── add: ZeRO-1 (optimizer state sharding)
├── add: ZeRO-2 (optimizer + gradient sharding)
├── add: ZeRO-3 / FSDP (fully sharded params)
├── add: Activation Recompute (memory)
├── add: TP (Tensor Parallel) — only when single-step memory exceeds one device
├── add: PP (Pipeline Parallel) — only at multiple nodes or when comms saturate
├── add: EP (Expert Parallel) — only for MoE models
└── add: CP (Context Parallel) — only when sequence length × hidden dim breaks attention memory
The order matters: DP + FSDP solves most problems. TP, PP, EP, CP are progressively heavier and should not be added preemptively.
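The escalation order above can be read as a lookup from bottleneck to remedy. A minimal sketch of that mapping — the `Bottleneck` enum and rule table here are illustrative, not part of the skill's API:

```python
from enum import Enum, auto

class Bottleneck(Enum):
    OPTIMIZER_MEMORY = auto()    # optimizer state dominates device memory
    PARAM_MEMORY = auto()        # params + grads + optimizer don't fit
    SINGLE_STEP_MEMORY = auto()  # one layer's step exceeds a single device
    INTER_NODE_COMMS = auto()    # collectives saturate across nodes
    MOE_EXPERTS = auto()         # MoE model with many experts
    SEQUENCE_LENGTH = auto()     # attention memory blows up with seq len

# Mirrors the escalation order in the tree: cheapest remedy first.
NEXT_PRIMITIVE = {
    Bottleneck.OPTIMIZER_MEMORY: "ZeRO-1/2",
    Bottleneck.PARAM_MEMORY: "ZeRO-3 / FSDP + activation recompute",
    Bottleneck.SINGLE_STEP_MEMORY: "TP",
    Bottleneck.INTER_NODE_COMMS: "PP",
    Bottleneck.MOE_EXPERTS: "EP",
    Bottleneck.SEQUENCE_LENGTH: "CP",
}

def next_primitive(b: Bottleneck) -> str:
    return NEXT_PRIMITIVE[b]
```

The point of encoding it this way: each primitive answers exactly one bottleneck, so "which do I add next" should never require weighing more than one row.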
Get a working single-GPU run. Confirm bench produces a finite loss.
If single-GPU OOMs: add primitive-recompute (activation checkpointing) before reaching for parallelism.
Multi-GPU? Use DDP (torch.nn.parallel.DistributedDataParallel) first. Then layer in ZeRO via FSDP:
- SHARD_GRAD_OP for moderate models.
- FULL_SHARD when model + optimizer state doesn't fit unsharded.

This gets you to 7B–70B fine-tuning on a single 8×A100 node (community consensus as of 2026).
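The SHARD_GRAD_OP vs FULL_SHARD choice is just memory arithmetic. A hedged sketch — `fsdp_strategy` and its thresholds are hypothetical helpers for intuition, not a real API; the returned strings correspond to PyTorch's `torch.distributed.fsdp.ShardingStrategy` members, and a real decision should come from a bench run:

```python
def fsdp_strategy(param_gib: float, optim_gib: float,
                  device_gib: float = 80.0) -> str:
    """Pick an FSDP sharding strategy from rough memory arithmetic.

    param_gib: params + grads on one replica; optim_gib: optimizer state.
    Thresholds (half of device memory, leaving headroom for activations)
    are illustrative only.
    """
    if param_gib + optim_gib < 0.5 * device_gib:
        return "NO_SHARD"       # plain DDP is enough
    if param_gib < 0.5 * device_gib:
        return "SHARD_GRAD_OP"  # ZeRO-2: shard grads + optimizer state
    return "FULL_SHARD"         # ZeRO-3: shard the params too

# e.g. 7B params in bf16 ≈ 13 GiB, Adam state in fp32 ≈ 56 GiB:
# params fit, optimizer state doesn't → SHARD_GRAD_OP.
```

The example numbers (2 bytes/param for bf16 weights, 8 bytes/param for fp32 Adam moments) are standard back-of-envelope figures; adjust for your dtype and optimizer.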
Add TP only when single-step memory — one layer's parameters plus its activations — exceeds a single device even after FULL_SHARD and activation recompute.
TP introduces collective communications inside every forward/backward pass; the overhead is real (10–30% on intra-node, more across nodes). See primitive-tp-linear.
Add PP only when training spans multiple nodes, or when DP/TP collective traffic saturates the interconnect.
PP requires careful schedule design (1F1B, interleaved 1F1B). Bubbles in the schedule waste compute. See primitive-pipeline-schedule.
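The cost of those bubbles has a standard closed form: with p pipeline stages and m microbatches, the idle fraction of a GPipe or non-interleaved 1F1B schedule is (p − 1)/(m + p − 1). A quick sketch of that formula:

```python
def bubble_fraction(stages: int, microbatches: int) -> float:
    """Idle fraction of a GPipe / non-interleaved 1F1B schedule:
    (p - 1) / (m + p - 1)."""
    p, m = stages, microbatches
    return (p - 1) / (m + p - 1)

# 8 stages, 8 microbatches: ~47% of compute idle — far too few microbatches.
# 8 stages, 64 microbatches: ~10% idle. Interleaved 1F1B shrinks this further.
```

The practical rule it encodes: PP only pays off when you can feed it many more microbatches than stages.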
For MoE models with many experts, sharding experts across devices via all-to-all communication is mandatory at scale. See primitive-experts and primitive-topk-router.
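Where the all-to-all sizes come from: the router's top-k assignment determines how many tokens each expert (and therefore each device) receives. A minimal numpy sketch of a top-k router — variable names and dimensions here are illustrative, not from the primitive-topk-router skill:

```python
import numpy as np

rng = np.random.default_rng(0)
tokens, d, n_experts, k = 8, 16, 4, 2

h = rng.standard_normal((tokens, d))        # token hidden states
W_gate = rng.standard_normal((d, n_experts))

logits = h @ W_gate
topk = np.argsort(logits, axis=1)[:, -k:]   # top-k expert ids per token
counts = np.bincount(topk.ravel(), minlength=n_experts)

# `counts` is what EP's all-to-all must move: tokens routed per expert.
# Load imbalance here (one expert hoarding tokens) is what capacity
# factors and auxiliary losses exist to fight.
assert counts.sum() == tokens * k
```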
For sequences longer than what fits in a single GPU's attention memory (typically > 32k tokens), CP shards the sequence dimension and requires distributed attention (Ring Attention or similar). See primitive-context-parallel.
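A rough way to see when that threshold bites: the naive attention score matrix grows quadratically in sequence length. A hedged back-of-envelope helper — `attn_scores_gib` is a hypothetical name, and FlashAttention avoids materializing this matrix, so treat it as an upper-bound smell test rather than a real memory model:

```python
def attn_scores_gib(seq_len: int, heads: int,
                    batch: int = 1, bytes_per: int = 2) -> float:
    """GiB for one layer's naive attention score matrix
    (batch * heads * seq_len^2 * bytes). FlashAttention never
    materializes this, but KV cache and activations still scale
    with seq_len, so the quadratic is a useful warning sign."""
    return batch * heads * seq_len ** 2 * bytes_per / 2 ** 30

# 32k tokens, 32 heads, bf16: ~64 GiB of scores for a single layer.
# CP over 8 ranks leaves each rank seq_len/8 queries, with Ring
# Attention circulating the KV shards.
```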
Confirm stage4-capacity-sweep has identified the target size.
Ask the user what currently fails or is slow — OOM, low throughput, or both — and map the answer onto the decision tree above.
Add one primitive at a time and re-bench (bench skill — ask Claude to smoke-test the runtime). Confirm it produces the expected speed/memory change before stacking.
Validate the parallelism implementation against a non-parallel reference using white-box numerical comparison (template/curry_train/validation/whitebox.py). One-step loss and grad-norm should match within fp32 numerical noise. This is critical — most parallelism bugs produce silently-wrong gradients.
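The check reduces to two scalar comparisons. A hedged sketch of the tolerance logic — the real script lives at template/curry_train/validation/whitebox.py; the function name, dict shape, and tolerances below are illustrative assumptions:

```python
import math

def whitebox_match(ref: dict, par: dict,
                   rtol: float = 1e-5, atol: float = 1e-6) -> bool:
    """Compare one-step loss and grad-norm between the non-parallel
    reference run and the parallel run, within fp32 numerical noise."""
    for key in ("loss", "grad_norm"):
        a, b = ref[key], par[key]
        if not math.isfinite(b) or abs(a - b) > atol + rtol * abs(a):
            return False
    return True

# A silently-wrong gradient typically shows up as a grad_norm mismatch
# far outside fp32 noise, even when the one-step loss still agrees.
```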
Update configs/<name>.yaml to record the parallelism set used; runs-diff will surface any difference.
Delegates to the primitive skills (primitive-tp-linear, primitive-pipeline-schedule, etc.) and launches via torchrun on the user's cluster.

- skills/primitive-recompute — first thing to try for memory.
- skills/primitive-distributed-optimizer — ZeRO / FSDP details.
- skills/primitive-tp-linear, primitive-pipeline-schedule, primitive-experts, primitive-context-parallel.
- template/curry_train/validation/whitebox.py — required validation after introducing each primitive.