From curry-train
Tensor-parallel linear layers — column-parallel and row-parallel — for splitting matmuls across GPUs along the output or input feature dimension. Activate when the user asks "tensor parallel", "column parallel linear", "row parallel linear", "TP", or "split matmul across GPUs".
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train
This skill uses the workspace's default tool permissions.
The two building blocks that make a transformer tensor-parallel: `ColumnParallelLinear` (split along output dim, gather at end) and `RowParallelLinear` (split along input dim, all-reduce at end). All of attention's QKV/out and MLP's up/down can be expressed with these.
ColumnParallelLinear: splits a Linear(in, out) along the output dimension across tp_size ranks, so each rank holds Linear(in, out/tp_size). If the full output is needed downstream, an AllGather is required (unlike the row-parallel case, which ends in an AllReduce).
RowParallelLinear: splits a Linear(in, out) along the input dimension, so each rank holds Linear(in/tp_size, out). Each rank produces a partial result, and an AllReduce combines them.
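A minimal sketch of the forward-pass mechanics in plain PyTorch (not the curry-train API; the function names and shapes here are illustrative only):

import torch
import torch.distributed as dist

def column_parallel_forward(x, w_shard, tp_group, gather_output=False):
    # w_shard is this rank's slice of the weight: (out_features // tp_size, in_features).
    y_local = x @ w_shard.t()  # partial along the output dimension
    if gather_output:
        shards = [torch.empty_like(y_local) for _ in range(dist.get_world_size(tp_group))]
        dist.all_gather(shards, y_local, group=tp_group)
        return torch.cat(shards, dim=-1)  # reassemble the full output dimension
    return y_local  # keep the per-rank slice (typical inside attention/MLP)

def row_parallel_forward(x_local, w_shard, tp_group):
    # x_local: (..., in_features // tp_size); w_shard: (out_features, in_features // tp_size).
    y_partial = x_local @ w_shard.t()  # full output dimension, but only a partial sum
    dist.all_reduce(y_partial, group=tp_group)  # sum the partials across TP ranks
    return y_partial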
How the two blocks map onto a transformer:
attention QKV = ColumnParallel (no gather; the output slice is fed straight to attention)
attention out = RowParallel (input is per-rank; all-reduce after)
MLP up = ColumnParallel
MLP down = RowParallel
This pattern means two AllReduces per transformer block (one after attention out, one after MLP down). Communication cost is significant but predictable.
from curry_train.primitives import ColumnParallelLinear, RowParallelLinear

# ps is the parallel-state module (see skills/primitive-parallel-state);
# it provides the tensor-parallel process group.
q_proj = ColumnParallelLinear(
    in_features=2048,
    out_features=2048,
    tp_group=ps.get_tp_group(),
    bias=False,
    gather_output=False,  # downstream uses the slice; common in attention
)
o_proj = RowParallelLinear(
    in_features=2048,
    out_features=2048,
    tp_group=ps.get_tp_group(),
    bias=False,
    input_is_parallel=True,  # input is already per-rank
)
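The MLP side follows the same pattern. A sketch assuming the same constructor arguments as the attention example above (the 2048/8192 sizes are placeholders):

up_proj = ColumnParallelLinear(   # each rank keeps its slice of the intermediate dim
    in_features=2048,
    out_features=8192,
    tp_group=ps.get_tp_group(),
    bias=False,
    gather_output=False,
)
down_proj = RowParallelLinear(    # consumes the per-rank slice; all-reduces the result
    in_features=8192,
    out_features=2048,
    tp_group=ps.get_tp_group(),
    bias=False,
    input_is_parallel=True,
)

Together with the attention pair, this accounts for the two AllReduces per block noted above.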
Almost always paired with FSDP on the data-parallel axis: TP shards each layer across the tensor-parallel group, and FSDP shards each rank's TP slice across the data-parallel ranks.
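One way to wire the two axes, sketched with PyTorch's device mesh (the mesh shape and dimension names are assumptions; curry-train's parallel-state module may set this up differently):

from torch.distributed.device_mesh import init_device_mesh

# e.g. 8 GPUs arranged as 2 data-parallel groups x 4 tensor-parallel ranks
mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))

tp_group = mesh["tp"].get_group()  # pass as tp_group to the parallel linear layers
dp_mesh = mesh["dp"]               # hand to FSDP so parameters shard across the dp axis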
A common optimization on top of TP: shard the sequence dimension in the regions between the AllReduces. It requires explicit scatter and gather collectives, but cuts activation memory roughly by the TP factor. See Megatron's "sequence parallel" mode.
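A sketch of the collective swap in raw torch.distributed (shapes assumed [seq, batch, hidden] with the sequence dimension sharded; not Megatron's actual code):

import torch
import torch.distributed as dist

def sp_reduce_scatter(y_partial, tp_group):
    # Replaces the RowParallel AllReduce: sums the partials and shards the sequence
    # dimension, so each rank keeps only seq / tp_size of the activation.
    tp = dist.get_world_size(tp_group)
    out = torch.empty((y_partial.shape[0] // tp, *y_partial.shape[1:]),
                      dtype=y_partial.dtype, device=y_partial.device)
    dist.reduce_scatter_tensor(out, y_partial, group=tp_group)
    return out

def sp_all_gather(x_shard, tp_group):
    # Before the next ColumnParallel matmul, gather the full sequence back.
    tp = dist.get_world_size(tp_group)
    out = torch.empty((x_shard.shape[0] * tp, *x_shard.shape[1:]),
                      dtype=x_shard.dtype, device=x_shard.device)
    dist.all_gather_into_tensor(out, x_shard, group=tp_group)
    return out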
Feature dimensions must be divisible by tp_size (e.g. n_heads % tp_size == 0 for attention).
V1: stub at template/curry_train/primitives/tp_linear.py. References: Megatron-LM core/tensor_parallel/layers.py; PyTorch torch.distributed.tensor.parallel.
Related skills:
skills/primitive-parallel-state — provides the TP group.
skills/primitive-gqattention — uses these inside attention.
skills/stage4-parallel-primitive-intro — when to add TP.