Sharding the sequence dimension across ranks (Ring Attention) for very long contexts that don't fit attention memory on a single GPU. Activate when the user asks "context parallel", "CP", "Ring Attention", "long context training", "32k+ sequence", or runs out of memory on attention rather than parameters.
```
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train
```

This skill uses the workspace's default tool permissions.
Shards the sequence dimension `N` across `cp_size` ranks; each rank computes attention over its local sequence chunk while exchanging KV across ranks via a ring schedule. Enables training at sequence lengths far beyond what a single GPU can hold.
For sequence length N split across cp_size ranks, rank `i` owns the token slice `[i * N/cp_size, (i+1) * N/cp_size)`:

```python
from curry_train.primitives import ContextParallel

cp_attn = ContextParallel.wrap(
    inner_attention=GQAttention(...),   # inner attention primitive (skills/primitive-gqattention)
    cp_group=ps.get_cp_group(),         # CP process group from parallel state (skills/primitive-parallel-state)
    pattern="ring",                     # one of: ring, double_ring
    causal=True,
)

# Inside the model's forward
y = cp_attn(x, ...)
```
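Under the hood, `pattern="ring"` means each rank keeps its query chunk local and streams the KV chunks around the ring. The sketch below is a minimal, hedged illustration of that schedule in plain PyTorch, not the curry_train implementation: it assumes the CP group is the default process group and a contiguous chunk layout, and it runs communication and compute sequentially where a real implementation would overlap them.

```python
import torch
import torch.distributed as dist


def ring_pass(t: torch.Tensor) -> torch.Tensor:
    """Send this rank's block to the next rank in the ring; receive from the previous one."""
    rank, world = dist.get_rank(), dist.get_world_size()
    recv = torch.empty_like(t)
    ops = [
        dist.P2POp(dist.isend, t.contiguous(), (rank + 1) % world),
        dist.P2POp(dist.irecv, recv, (rank - 1) % world),
    ]
    for req in dist.batch_isend_irecv(ops):
        req.wait()
    return recv


def ring_attention(q, k, v, causal=True):
    """q, k, v: [B, H, N_local, d], this rank's contiguous sequence chunk.

    Over cp_size steps, attend q against the KV block currently held, then rotate
    the block around the ring. Partial outputs are merged with a running
    log-sum-exp so the result equals attention over the full sequence.
    """
    rank, world = dist.get_rank(), dist.get_world_size()
    scale = q.shape[-1] ** -0.5
    out = torch.zeros_like(q)
    lse = torch.full(q.shape[:-1], float("-inf"), device=q.device)  # running log-sum-exp, [B, H, N_local]

    k_blk, v_blk = k, v
    for step in range(world):
        src = (rank - step) % world  # rank whose KV block we hold at this step
        if not (causal and src > rank):  # blocks entirely in the future contribute nothing
            scores = torch.einsum("bhid,bhjd->bhij", q, k_blk) * scale
            if causal and src == rank:  # triangular mask only on our own diagonal block
                idx = torch.arange(q.shape[-2], device=q.device)
                scores = scores.masked_fill(idx[:, None] < idx[None, :], float("-inf"))
            blk_lse = torch.logsumexp(scores, dim=-1)
            blk_out = torch.einsum("bhij,bhjd->bhid", torch.softmax(scores, dim=-1), v_blk)
            # Online-softmax style merge of the new block into the accumulator.
            new_lse = torch.logaddexp(lse, blk_lse)
            out = (out * torch.exp(lse - new_lse).unsqueeze(-1)
                   + blk_out * torch.exp(blk_lse - new_lse).unsqueeze(-1))
            lse = new_lse
        if step < world - 1:
            k_blk, v_blk = ring_pass(k_blk), ring_pass(v_blk)
    return out
```

With a contiguous layout and causal masking, early ranks skip most KV blocks, so production ring-attention implementations usually rebalance the chunk assignment; the sketch keeps the simple layout for readability.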
Attention memory is O(B * H * N²). Splitting N by cp_size gives, per rank:

- activations of O(B * H * (N/cp_size) * d), plus a one-other-rank buffer for the streaming KV;
- attention scores of O(B * H * (N/cp_size)²).

Net: per-rank attention memory drops from O(N²) to O((N/cp_size)²) plus one ring buffer: roughly linear scaling in cp_size.
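A back-of-the-envelope check of that scaling; the shapes below are hypothetical, and the full-attention figure assumes the N x N score matrix is actually materialized, as the O(N²) estimate does:

```python
# Hypothetical shapes, chosen only to illustrate the scaling above.
B, H, d = 1, 32, 128
N, cp_size = 131_072, 8
bytes_per = 2  # bf16

full_scores = B * H * N * N * bytes_per                 # no CP: materialized N x N scores
cp_scores = B * H * (N // cp_size) ** 2 * bytes_per     # per rank, one (N/cp_size)^2 block at a time
kv_local = 2 * B * H * (N // cp_size) * d * bytes_per   # this rank's K and V
kv_ring_buf = kv_local                                  # one in-flight KV block from a neighbor

print(f"full attention scores:     {full_scores / 2**30:,.0f} GiB")              # ~1024 GiB
print(f"per-rank block scores:     {cp_scores / 2**30:,.0f} GiB")                # ~16 GiB
print(f"per-rank KV + ring buffer: {(kv_local + kv_ring_buf) / 2**30:.2f} GiB")  # ~0.5 GiB
```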
For sequences that fit, do not use CP — the comms overhead is real and unnecessary.
CP shards N, TP shards D; they combine cleanly (see the device-mesh sketch after the references below). Too small an N/cp_size chunk means short messages that don't overlap well; there's a sweet spot.

V1: stub at template/curry_train/primitives/context_parallel.py.

References: Liu et al. 2023, "Ring Attention with Blockwise Transformers"; HuggingFace and Megatron CP implementations.
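A hedged sketch of that orthogonal-axes idea using stock PyTorch device meshes; the sizes and dimension names are hypothetical, and this is not necessarily how curry_train's parallel-state module builds its groups:

```python
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

# Hypothetical 16-GPU layout: 2-way data parallel x 4-way context parallel x 2-way tensor parallel.
# CP shards the sequence dimension N, TP shards the hidden dimension D, so their process
# groups are simply different axes of one device mesh and never overlap.
mesh = init_device_mesh("cuda", (2, 4, 2), mesh_dim_names=("dp", "cp", "tp"))

cp_group = mesh.get_group("cp")  # e.g. what cp_group=... above would be in this layout
tp_group = mesh.get_group("tp")

print(dist.get_rank(), "cp size:", dist.get_world_size(cp_group),
      "tp size:", dist.get_world_size(tp_group))
```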
- skills/primitive-gqattention — wrap the inner attention.
- skills/primitive-parallel-state — CP group.
- skills/stage4-parallel-primitive-intro — last primitive to add (only for very long context).