From curry-train
Centralized state tracking the multi-dimensional parallelism topology — DP rank, TP rank, PP rank, EP rank, CP rank — and the communication groups for each. Activate when the user asks "parallel state", "process group", "rank topology", "world setup", or wires up multi-dim parallelism.
`npx claudepluginhub curryfromuestc/curry-train --plugin curry-train`

This skill uses the workspace's default tool permissions.
A small singleton-like object that knows, for the current process: its DP/TP/PP/EP/CP rank, world size on each axis, and the corresponding `torch.distributed.ProcessGroup` for collectives. Every other parallelism primitive depends on this.
It exposes the per-rank coordinates as `(dp_rank, tp_rank, pp_rank, ep_rank, cp_rank)` and the per-axis communication groups via `get_tp_group()`, `get_pp_group()`, etc.; other code goes through this object rather than calling `torch.distributed.init_process_group` directly.

```python
from curry_train.primitives import ParallelState

# At training start, after torchrun has set RANK / LOCAL_RANK / WORLD_SIZE:
ps = ParallelState.init(
    dp_size=2,
    tp_size=2,
    pp_size=2,
    ep_size=1,
    cp_size=1,
)  # 2 * 2 * 2 = 8 GPUs total
ps.tp_rank              # 0 .. tp_size - 1
ps.dp_rank              # 0 .. dp_size - 1
ps.pp_rank              # 0 .. pp_size - 1

ps.get_tp_group()       # process group for collectives within this TP slice
ps.get_dp_group()       # process group for collectives within this DP slice
ps.get_pp_group()       # process group for collectives within this PP slice

ps.is_first_pp_stage()  # True on the first pipeline stage
ps.is_last_pp_stage()   # True on the last pipeline stage
```
For a world of size W, the parallelism dims partition it: W = dp × tp × pp × ep × cp. The order matters: typical placement is (dp, ep, pp, tp, cp) from outermost to innermost. ParallelState computes the per-rank coordinates from RANK accordingly.
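As an illustration of that placement, here is a minimal sketch (a hypothetical helper, not curry-train's actual code) of recovering the per-axis coordinates from a global rank, peeling off the innermost axis first:

```python
# Hypothetical helper: decompose a global rank into per-axis coordinates,
# assuming the (dp, ep, pp, tp, cp) outermost-to-innermost placement above.
def rank_to_coords(rank, dp, ep, pp, tp, cp):
    sizes = [dp, ep, pp, tp, cp]   # outermost .. innermost
    coords = []
    for size in reversed(sizes):   # innermost axis varies fastest
        coords.append(rank % size)
        rank //= size
    cp_rank, tp_rank, pp_rank, ep_rank, dp_rank = coords
    return dp_rank, ep_rank, pp_rank, tp_rank, cp_rank

# Example: W = 8 with dp=2, tp=2, pp=2 (ep=cp=1).
# Global rank 5 -> dp_rank=1, pp_rank=0, tp_rank=1.
print(rank_to_coords(5, dp=2, ep=1, pp=2, tp=2, cp=1))  # (1, 0, 0, 1, 0)
```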
- torchrun initializes the world: launching processes and setting `RANK` / `WORLD_SIZE` is a torchrun / cluster-side concern, not this primitive's.
- Downstream primitives (primitive-tp-linear, primitive-pipeline-schedule, primitive-experts, primitive-context-parallel) consume groups from this primitive for their collectives (see the sketch below).
- V1: stub at `template/curry_train/primitives/parallel_state.py`.
- See skills/stage4-parallel-primitive-intro for when to actually wire each axis.
- Reference: Megatron-LM's `megatron/core/parallel_state.py` is the canonical implementation; HuggingFace Accelerate's `state.py` is a lighter alternative.
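For a sense of how a downstream primitive consumes these groups, here is a hedged sketch (not the actual primitive-tp-linear implementation) of a row-parallel matmul that sums its partial outputs across the TP group taken from ParallelState:

```python
import torch
import torch.distributed as dist

def row_parallel_matmul(x_shard, weight_shard, ps):
    # Each TP rank multiplies its local shards, producing a partial result ...
    partial = x_shard @ weight_shard
    # ... which is summed across the TP group obtained from ParallelState.
    dist.all_reduce(partial, op=dist.ReduceOp.SUM, group=ps.get_tp_group())
    return partial
```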