Sharding the sequence dimension across ranks (Ring Attention) for very long contexts that don't fit attention memory on a single GPU. Activate when the user asks "context parallel", "CP", "Ring Attention", "long context training", "32k+ sequence", or runs out of memory on attention rather than parameters.
```
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train
```

This skill uses the workspace's default tool permissions.
Shards the sequence dimension `N` across `cp_size` ranks; each rank computes attention over its local sequence chunk while exchanging KV across ranks via a ring schedule. Enables training at sequence lengths far beyond what a single GPU can hold.
For sequence length N split across cp_size ranks, rank `i` owns the token slice `[i * N/cp_size, (i+1) * N/cp_size)`:

```python
from curry_train.primitives import ContextParallel

cp_attn = ContextParallel.wrap(
    inner_attention=GQAttention(...),   # inner attention primitive (skills/primitive-gqattention)
    cp_group=ps.get_cp_group(),         # CP process group from parallel state (skills/primitive-parallel-state)
    pattern="ring",                     # one of: ring, double_ring
    causal=True,
)

# Inside the model's forward
y = cp_attn(x, ...)
```
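Under the hood, `pattern="ring"` means each rank keeps its query chunk local and streams the KV chunks around the ring. The sketch below is a minimal, hedged illustration of that schedule in plain PyTorch, not the curry_train implementation: it assumes the CP group is the default process group and a contiguous chunk layout, and it runs communication and compute sequentially where a real implementation would overlap them.

```python
import torch
import torch.distributed as dist


def ring_pass(t: torch.Tensor) -> torch.Tensor:
    """Send this rank's block to the next rank in the ring; receive from the previous one."""
    rank, world = dist.get_rank(), dist.get_world_size()
    recv = torch.empty_like(t)
    ops = [
        dist.P2POp(dist.isend, t.contiguous(), (rank + 1) % world),
        dist.P2POp(dist.irecv, recv, (rank - 1) % world),
    ]
    for req in dist.batch_isend_irecv(ops):
        req.wait()
    return recv


def ring_attention(q, k, v, causal=True):
    """q, k, v: [B, H, N_local, d], this rank's contiguous sequence chunk.

    Over cp_size steps, attend q against the KV block currently held, then rotate
    the block around the ring. Partial outputs are merged with a running
    log-sum-exp so the result equals attention over the full sequence.
    """
    rank, world = dist.get_rank(), dist.get_world_size()
    scale = q.shape[-1] ** -0.5
    out = torch.zeros_like(q)
    lse = torch.full(q.shape[:-1], float("-inf"), device=q.device)  # running log-sum-exp, [B, H, N_local]

    k_blk, v_blk = k, v
    for step in range(world):
        src = (rank - step) % world  # rank whose KV block we hold at this step
        if not (causal and src > rank):  # blocks entirely in the future contribute nothing
            scores = torch.einsum("bhid,bhjd->bhij", q, k_blk) * scale
            if causal and src == rank:  # triangular mask only on our own diagonal block
                idx = torch.arange(q.shape[-2], device=q.device)
                scores = scores.masked_fill(idx[:, None] < idx[None, :], float("-inf"))
            blk_lse = torch.logsumexp(scores, dim=-1)
            blk_out = torch.einsum("bhij,bhjd->bhid", torch.softmax(scores, dim=-1), v_blk)
            # Online-softmax style merge of the new block into the accumulator.
            new_lse = torch.logaddexp(lse, blk_lse)
            out = (out * torch.exp(lse - new_lse).unsqueeze(-1)
                   + blk_out * torch.exp(blk_lse - new_lse).unsqueeze(-1))
            lse = new_lse
        if step < world - 1:
            k_blk, v_blk = ring_pass(k_blk), ring_pass(v_blk)
    return out
```

With a contiguous layout and causal masking, early ranks skip most KV blocks, so production ring-attention implementations usually rebalance the chunk assignment; the sketch keeps the simple layout for readability.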
Attention memory is O(B * H * N²). Splitting N by cp_size gives, per rank:

- activations of O(B * H * (N/cp_size) * d), plus a one-other-rank buffer for the streaming KV;
- attention scores of O(B * H * (N/cp_size)²).

Net: per-rank attention memory drops from O(N²) to O((N/cp_size)²) plus one ring buffer: roughly linear scaling in cp_size.
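A back-of-the-envelope check of that scaling; the shapes below are hypothetical, and the full-attention figure assumes the N x N score matrix is actually materialized, as the O(N²) estimate does:

```python
# Hypothetical shapes, chosen only to illustrate the scaling above.
B, H, d = 1, 32, 128
N, cp_size = 131_072, 8
bytes_per = 2  # bf16

full_scores = B * H * N * N * bytes_per                 # no CP: materialized N x N scores
cp_scores = B * H * (N // cp_size) ** 2 * bytes_per     # per rank, one (N/cp_size)^2 block at a time
kv_local = 2 * B * H * (N // cp_size) * d * bytes_per   # this rank's K and V
kv_ring_buf = kv_local                                  # one in-flight KV block from a neighbor

print(f"full attention scores:     {full_scores / 2**30:,.0f} GiB")              # ~1024 GiB
print(f"per-rank block scores:     {cp_scores / 2**30:,.0f} GiB")                # ~16 GiB
print(f"per-rank KV + ring buffer: {(kv_local + kv_ring_buf) / 2**30:.2f} GiB")  # ~0.5 GiB
```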
For sequences that fit, do not use CP — the comms overhead is real and unnecessary.
CP shards N, TP shards D; they combine cleanly (see the device-mesh sketch after the references below). Too small an N/cp_size chunk means short messages that don't overlap well; there's a sweet spot.

V1: stub at template/curry_train/primitives/context_parallel.py.

References: Liu et al. 2023, "Ring Attention with Blockwise Transformers"; HuggingFace and Megatron CP implementations.
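A hedged sketch of that orthogonal-axes idea using stock PyTorch device meshes; the sizes and dimension names are hypothetical, and this is not necessarily how curry_train's parallel-state module builds its groups:

```python
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

# Hypothetical 16-GPU layout: 2-way data parallel x 4-way context parallel x 2-way tensor parallel.
# CP shards the sequence dimension N, TP shards the hidden dimension D, so their process
# groups are simply different axes of one device mesh and never overlap.
mesh = init_device_mesh("cuda", (2, 4, 2), mesh_dim_names=("dp", "cp", "tp"))

cp_group = mesh.get_group("cp")  # e.g. what cp_group=... above would be in this layout
tp_group = mesh.get_group("tp")

print(dist.get_rank(), "cp size:", dist.get_world_size(cp_group),
      "tp size:", dist.get_world_size(tp_group))
```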
- skills/primitive-gqattention — wrap the inner attention.
- skills/primitive-parallel-state — CP group.
- skills/stage4-parallel-primitive-intro — last primitive to add (only for very long context).