Decide which parallelism primitive (DP, ZeRO, TP, PP, EP, CP) to introduce next based on what bottleneck appears at the current model size. Activate when the user asks "do I need tensor parallelism", "OOM at scale", "training too slow", "should I add pipeline parallel", "how to scale beyond N GPUs", or after capacity-sweep when single-GPU runs no longer fit.
Install:

npx claudepluginhub curryfromuestc/curry-train --plugin curry-train

This skill uses the workspace's default tool permissions.
A decision guide for adding the right parallelism in the right order. The wrong order makes simple problems much harder; the right order makes scaling almost mechanical.
"What is the actual bottleneck right now — memory, throughput, or both — and which primitive directly addresses it?"
DP (Data Parallel)
├── add: ZeRO-1 (optimizer state sharding)
├── add: ZeRO-2 (optimizer + gradient sharding)
├── add: ZeRO-3 / FSDP (fully sharded params)
├── add: Activation Recompute (memory)
├── add: TP (Tensor Parallel) — only when single-step memory exceeds one device
├── add: PP (Pipeline Parallel) — only at multiple nodes or when comms saturate
├── add: EP (Expert Parallel) — only for MoE models
└── add: CP (Context Parallel) — only when sequence length × hidden dim breaks attention memory
The order matters: DP + FSDP solves most problems. TP, PP, EP, CP are progressively heavier and should not be added preemptively.
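The escalation order above can be read as a lookup from bottleneck to remedy. A minimal sketch of that mapping — the `Bottleneck` enum and rule table here are illustrative, not part of the skill's API:

```python
from enum import Enum, auto

class Bottleneck(Enum):
    OPTIMIZER_MEMORY = auto()    # optimizer state dominates device memory
    PARAM_MEMORY = auto()        # params + grads + optimizer don't fit
    SINGLE_STEP_MEMORY = auto()  # one layer's step exceeds a single device
    INTER_NODE_COMMS = auto()    # collectives saturate across nodes
    MOE_EXPERTS = auto()         # MoE model with many experts
    SEQUENCE_LENGTH = auto()     # attention memory blows up with seq len

# Mirrors the escalation order in the tree: cheapest remedy first.
NEXT_PRIMITIVE = {
    Bottleneck.OPTIMIZER_MEMORY: "ZeRO-1/2",
    Bottleneck.PARAM_MEMORY: "ZeRO-3 / FSDP + activation recompute",
    Bottleneck.SINGLE_STEP_MEMORY: "TP",
    Bottleneck.INTER_NODE_COMMS: "PP",
    Bottleneck.MOE_EXPERTS: "EP",
    Bottleneck.SEQUENCE_LENGTH: "CP",
}

def next_primitive(b: Bottleneck) -> str:
    return NEXT_PRIMITIVE[b]
```

The point of encoding it this way: each primitive answers exactly one bottleneck, so "which do I add next" should never require weighing more than one row.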
Get a working single-GPU run. Confirm bench produces a finite loss.
If single-GPU OOMs: add primitive-recompute (activation checkpointing) before reaching for parallelism.
Multi-GPU? Use DDP (torch.nn.parallel.DistributedDataParallel) first. Then layer in ZeRO via FSDP:
- SHARD_GRAD_OP for moderate models.
- FULL_SHARD when model + optimizer state doesn't fit unsharded.

This gets you to 7B–70B fine-tuning on a single 8×A100 node (community consensus as of 2026).
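The SHARD_GRAD_OP vs FULL_SHARD choice is just memory arithmetic. A hedged sketch — `fsdp_strategy` and its thresholds are hypothetical helpers for intuition, not a real API; the returned strings correspond to PyTorch's `torch.distributed.fsdp.ShardingStrategy` members, and a real decision should come from a bench run:

```python
def fsdp_strategy(param_gib: float, optim_gib: float,
                  device_gib: float = 80.0) -> str:
    """Pick an FSDP sharding strategy from rough memory arithmetic.

    param_gib: params + grads on one replica; optim_gib: optimizer state.
    Thresholds (half of device memory, leaving headroom for activations)
    are illustrative only.
    """
    if param_gib + optim_gib < 0.5 * device_gib:
        return "NO_SHARD"       # plain DDP is enough
    if param_gib < 0.5 * device_gib:
        return "SHARD_GRAD_OP"  # ZeRO-2: shard grads + optimizer state
    return "FULL_SHARD"         # ZeRO-3: shard the params too

# e.g. 7B params in bf16 ≈ 13 GiB, Adam state in fp32 ≈ 56 GiB:
# params fit, optimizer state doesn't → SHARD_GRAD_OP.
```

The example numbers (2 bytes/param for bf16 weights, 8 bytes/param for fp32 Adam moments) are standard back-of-envelope figures; adjust for your dtype and optimizer.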
Add TP only when single-step memory — one layer's parameters plus its activations — exceeds a single device even after FULL_SHARD and activation recompute.
TP introduces collective communications inside every forward/backward pass; the overhead is real (10–30% on intra-node, more across nodes). See primitive-tp-linear.
Add PP only when training spans multiple nodes, or when DP/TP collective traffic saturates the interconnect.
PP requires careful schedule design (1F1B, interleaved 1F1B). Bubbles in the schedule waste compute. See primitive-pipeline-schedule.
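The cost of those bubbles has a standard closed form: with p pipeline stages and m microbatches, the idle fraction of a GPipe or non-interleaved 1F1B schedule is (p − 1)/(m + p − 1). A quick sketch of that formula:

```python
def bubble_fraction(stages: int, microbatches: int) -> float:
    """Idle fraction of a GPipe / non-interleaved 1F1B schedule:
    (p - 1) / (m + p - 1)."""
    p, m = stages, microbatches
    return (p - 1) / (m + p - 1)

# 8 stages, 8 microbatches: ~47% of compute idle — far too few microbatches.
# 8 stages, 64 microbatches: ~10% idle. Interleaved 1F1B shrinks this further.
```

The practical rule it encodes: PP only pays off when you can feed it many more microbatches than stages.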
For MoE models with many experts, sharding experts across devices via all-to-all communication is mandatory at scale. See primitive-experts and primitive-topk-router.
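Where the all-to-all sizes come from: the router's top-k assignment determines how many tokens each expert (and therefore each device) receives. A minimal numpy sketch of a top-k router — variable names and dimensions here are illustrative, not from the primitive-topk-router skill:

```python
import numpy as np

rng = np.random.default_rng(0)
tokens, d, n_experts, k = 8, 16, 4, 2

h = rng.standard_normal((tokens, d))        # token hidden states
W_gate = rng.standard_normal((d, n_experts))

logits = h @ W_gate
topk = np.argsort(logits, axis=1)[:, -k:]   # top-k expert ids per token
counts = np.bincount(topk.ravel(), minlength=n_experts)

# `counts` is what EP's all-to-all must move: tokens routed per expert.
# Load imbalance here (one expert hoarding tokens) is what capacity
# factors and auxiliary losses exist to fight.
assert counts.sum() == tokens * k
```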
For sequences longer than what fits in a single GPU's attention memory (typically > 32k tokens), CP shards the sequence dimension and requires distributed attention (Ring Attention or similar). See primitive-context-parallel.
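A rough way to see when that threshold bites: the naive attention score matrix grows quadratically in sequence length. A hedged back-of-envelope helper — `attn_scores_gib` is a hypothetical name, and FlashAttention avoids materializing this matrix, so treat it as an upper-bound smell test rather than a real memory model:

```python
def attn_scores_gib(seq_len: int, heads: int,
                    batch: int = 1, bytes_per: int = 2) -> float:
    """GiB for one layer's naive attention score matrix
    (batch * heads * seq_len^2 * bytes). FlashAttention never
    materializes this, but KV cache and activations still scale
    with seq_len, so the quadratic is a useful warning sign."""
    return batch * heads * seq_len ** 2 * bytes_per / 2 ** 30

# 32k tokens, 32 heads, bf16: ~64 GiB of scores for a single layer.
# CP over 8 ranks leaves each rank seq_len/8 queries, with Ring
# Attention circulating the KV shards.
```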
Confirm stage4-capacity-sweep has identified the target size.
Ask the user what currently fails or is slow — OOM, low throughput, or both — and map the answer onto the decision tree above.
Add one primitive at a time and re-bench (bench skill — ask Claude to smoke-test the runtime). Confirm it produces the expected speed/memory change before stacking.
Validate the parallelism implementation against a non-parallel reference using white-box numerical comparison (template/curry_train/validation/whitebox.py). One-step loss and grad-norm should match within fp32 numerical noise. This is critical — most parallelism bugs produce silently-wrong gradients.
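The check reduces to two scalar comparisons. A hedged sketch of the tolerance logic — the real script lives at template/curry_train/validation/whitebox.py; the function name, dict shape, and tolerances below are illustrative assumptions:

```python
import math

def whitebox_match(ref: dict, par: dict,
                   rtol: float = 1e-5, atol: float = 1e-6) -> bool:
    """Compare one-step loss and grad-norm between the non-parallel
    reference run and the parallel run, within fp32 numerical noise."""
    for key in ("loss", "grad_norm"):
        a, b = ref[key], par[key]
        if not math.isfinite(b) or abs(a - b) > atol + rtol * abs(a):
            return False
    return True

# A silently-wrong gradient typically shows up as a grad_norm mismatch
# far outside fp32 noise, even when the one-step loss still agrees.
```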
Update configs/<name>.yaml to record the parallelism set used; runs-diff will surface any difference.
Delegates to the primitive skills (primitive-tp-linear, primitive-pipeline-schedule, etc.) and launches via torchrun on the user's cluster.

- skills/primitive-recompute — first thing to try for memory.
- skills/primitive-distributed-optimizer — ZeRO / FSDP details.
- skills/primitive-tp-linear, primitive-pipeline-schedule, primitive-experts, primitive-context-parallel.
- template/curry_train/validation/whitebox.py — required validation after introducing each primitive.