From curry-train
Pipeline-parallel schedules (1F1B, interleaved 1F1B, GPipe). Manages microbatches flowing through stages on different ranks. Activate when the user asks "pipeline parallel", "PP", "1F1B", "GPipe", "interleaved pipeline", or has more layers than fit on a single node.
`npx claudepluginhub curryfromuestc/curry-train --plugin curry-train`

This skill uses the workspace's default tool permissions.
Coordinates the flow of microbatches through a multi-stage pipeline (each stage on a different rank set). Implements the schedule that determines when each stage does forward, when backward, and when activations are sent across the rank boundary.
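The hand-off across the stage boundary is plain point-to-point communication. A minimal sketch, assuming `torch.distributed` send/recv and that the receiver knows the activation shape in advance (an illustration, not the skill's actual buffer management):

```python
import torch
import torch.distributed as dist

def send_forward(activations: torch.Tensor, next_rank: int, group) -> None:
    # Blocking point-to-point send of this stage's output to the next stage.
    dist.send(activations, dst=next_rank, group=group)

def recv_forward(shape, dtype, device, prev_rank: int, group) -> torch.Tensor:
    # Receive the previous stage's activations into a preallocated buffer;
    # the receiver must know the shape/dtype ahead of time.
    buf = torch.empty(shape, dtype=dtype, device=device)
    dist.recv(buf, src=prev_rank, group=group)
    return buf
```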
Given a model split into P stages, each on a different rank, the main schedules compare as follows:
| Schedule | Bubble | Memory |
|---|---|---|
| GPipe (all-forward, all-backward) | (P-1)/(M+P-1), vanishing as M grows | high: activations for all M microbatches held at once |
| 1F1B (one-forward-one-backward) | same fraction, concentrated at warmup/cooldown | medium: at most P in-flight microbatches per stage |
| Interleaved 1F1B (multiple virtual stages per rank) | smaller: divided by the number of virtual stages | medium |
| ZB-H1 / ZB-H2 (zero-bubble) | near-zero | similar to 1F1B, with a more complex schedule |
For most users, 1F1B is the right default. Interleaved 1F1B is worth the complexity at very large pipeline depths.
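To make the schedules concrete, here is a minimal sketch of the per-stage operation order under 1F1B: warmup forwards to fill the pipeline, a steady state alternating one forward with one backward, then a cooldown of remaining backwards. An illustration, not the skill's internals:

```python
def one_f_one_b(stage: int, P: int, M: int) -> list[str]:
    """Op order for one stage under 1F1B (stage is 0-indexed)."""
    warmup = min(P - stage - 1, M)          # forwards before the first backward
    ops = [f"F{m}" for m in range(warmup)]  # warmup: fill the pipeline
    for m in range(M - warmup):             # steady state: alternate F and B
        ops += [f"F{warmup + m}", f"B{m}"]
    ops += [f"B{m}" for m in range(M - warmup, M)]  # cooldown: drain backwards
    return ops

for s in range(4):                          # P = 4 stages, M = 8 microbatches
    print(f"stage {s}: " + " ".join(one_f_one_b(s, P=4, M=8)))
```

Running it for P=4, M=8 shows the last stage alternating F/B from its first microbatch, while earlier stages queue warmup forwards first; that early-backward pattern is what lets 1F1B free activations sooner than GPipe.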
```python
from curry_train.primitives import PipelineSchedule

schedule = PipelineSchedule(
    kind="1F1B",                 # or "interleaved" or "gpipe"
    n_microbatches=8,
    stages=stage_modules,        # list of nn.Module, one per stage
    pp_group=ps.get_pp_group(),  # ps: parallel state from primitive-parallel-state
)
result = schedule.train_step(microbatches)  # full forward + backward + step
```
Imagine P = 4 stages and M = 8 microbatches: the pipeline needs P - 1 slots to fill and P - 1 slots to drain, so each stage is busy for M + P - 1 slots in total, giving a bubble fraction of (P - 1) / (M + P - 1). For P = 4, M = 8 that is 3/11 ≈ 27%; raising M to 32 shrinks it to 3/35 ≈ 8.6%.
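As a quick sanity check of the arithmetic:

```python
def bubble_fraction(P: int, M: int) -> float:
    """Idle fraction of the M + P - 1 slots each stage occupies."""
    return (P - 1) / (M + P - 1)

print(bubble_fraction(4, 8))   # 0.2727... -> ~27% idle
print(bubble_fraction(4, 32))  # 0.0857... -> ~8.6% idle
```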
If you can avoid PP, do. Within a single node, TP+FSDP is usually preferable. PP adds complexity (the schedule, the bubble, send/recv buffers) whose cost should not be paid unless necessary.
In Megatron-LM the equivalent knob is --pipeline-model-parallel-size; balance stage sizes manually if needed.

V1: stub at template/curry_train/primitives/pipeline_schedule.py. References: Megatron-LM core/pipeline_parallel/schedules.py; DeepSpeed pipeline.PipelineModule.
- skills/primitive-parallel-state — provides the PP group and stage assignment.
- skills/primitive-distributed-optimizer — interacts with sharding inside DP.
- skills/stage4-parallel-primitive-intro — when to add PP.