From curry-train
Pipeline-parallel schedules (1F1B, interleaved 1F1B, GPipe). Manages microbatches flowing through stages on different ranks. Activate when the user asks "pipeline parallel", "PP", "1F1B", "GPipe", "interleaved pipeline", or has more layers than fit on a single node.
`npx claudepluginhub curryfromuestc/curry-train --plugin curry-train`

This skill uses the workspace's default tool permissions.
Coordinates the flow of microbatches through a multi-stage pipeline (each stage on a different rank set). Implements the schedule that determines when each stage does forward, when backward, and when activations are sent across the rank boundary.
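The hand-off across the stage boundary is plain point-to-point communication. A minimal sketch, assuming `torch.distributed` send/recv and that the receiver knows the activation shape in advance (an illustration, not the skill's actual buffer management):

```python
import torch
import torch.distributed as dist

def send_forward(activations: torch.Tensor, next_rank: int, group) -> None:
    # Blocking point-to-point send of this stage's output to the next stage.
    dist.send(activations, dst=next_rank, group=group)

def recv_forward(shape, dtype, device, prev_rank: int, group) -> torch.Tensor:
    # Receive the previous stage's activations into a preallocated buffer;
    # the receiver must know the shape/dtype ahead of time.
    buf = torch.empty(shape, dtype=dtype, device=device)
    dist.recv(buf, src=prev_rank, group=group)
    return buf
```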
Given a model split into P stages, each on a different rank, the main schedules compare as follows:
| Schedule | Bubble | Memory |
|---|---|---|
| GPipe (all-forward, all-backward) | (P-1)/(M+P-1), vanishing as M grows | high: activations for all M microbatches held at once |
| 1F1B (one-forward-one-backward) | same fraction, concentrated at warmup/cooldown | medium: at most P in-flight microbatches per stage |
| Interleaved 1F1B (multiple virtual stages per rank) | smaller: divided by the number of virtual stages | medium |
| ZB-H1 / ZB-H2 (zero-bubble) | near-zero | similar to 1F1B, with a more complex schedule |
For most users, 1F1B is the right default. Interleaved 1F1B is worth the complexity at very large pipeline depths.
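To make the schedules concrete, here is a minimal sketch of the per-stage operation order under 1F1B: warmup forwards to fill the pipeline, a steady state alternating one forward with one backward, then a cooldown of remaining backwards. An illustration, not the skill's internals:

```python
def one_f_one_b(stage: int, P: int, M: int) -> list[str]:
    """Op order for one stage under 1F1B (stage is 0-indexed)."""
    warmup = min(P - stage - 1, M)          # forwards before the first backward
    ops = [f"F{m}" for m in range(warmup)]  # warmup: fill the pipeline
    for m in range(M - warmup):             # steady state: alternate F and B
        ops += [f"F{warmup + m}", f"B{m}"]
    ops += [f"B{m}" for m in range(M - warmup, M)]  # cooldown: drain backwards
    return ops

for s in range(4):                          # P = 4 stages, M = 8 microbatches
    print(f"stage {s}: " + " ".join(one_f_one_b(s, P=4, M=8)))
```

Running it for P=4, M=8 shows the last stage alternating F/B from its first microbatch, while earlier stages queue warmup forwards first; that early-backward pattern is what lets 1F1B free activations sooner than GPipe.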
```python
from curry_train.primitives import PipelineSchedule

schedule = PipelineSchedule(
    kind="1F1B",                 # or "interleaved" or "gpipe"
    n_microbatches=8,
    stages=stage_modules,        # list of nn.Module, one per stage
    pp_group=ps.get_pp_group(),  # ps: parallel state from primitive-parallel-state
)
result = schedule.train_step(microbatches)  # full forward + backward + step
```
Imagine P = 4 stages and M = 8 microbatches: the pipeline needs P - 1 slots to fill and P - 1 slots to drain, so each stage is busy for M + P - 1 slots in total, giving a bubble fraction of (P - 1) / (M + P - 1). For P = 4, M = 8 that is 3/11 ≈ 27%; raising M to 32 shrinks it to 3/35 ≈ 8.6%.
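As a quick sanity check of the arithmetic:

```python
def bubble_fraction(P: int, M: int) -> float:
    """Idle fraction of the M + P - 1 slots each stage occupies."""
    return (P - 1) / (M + P - 1)

print(bubble_fraction(4, 8))   # 0.2727... -> ~27% idle
print(bubble_fraction(4, 32))  # 0.0857... -> ~8.6% idle
```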
If you can avoid PP, do. Within a single node, TP+FSDP is usually preferable. PP adds complexity (the schedule, the bubble, send/recv buffers) whose cost should not be paid unless necessary.
In Megatron-LM the equivalent knob is --pipeline-model-parallel-size; balance stage sizes manually if needed.

V1: stub at template/curry_train/primitives/pipeline_schedule.py. References: Megatron-LM core/pipeline_parallel/schedules.py; DeepSpeed pipeline.PipelineModule.
- skills/primitive-parallel-state — provides the PP group and stage assignment.
- skills/primitive-distributed-optimizer — interacts with sharding inside DP.
- skills/stage4-parallel-primitive-intro — when to add PP.