From curry-train
Tensor-parallel linear layers — column-parallel and row-parallel — for splitting matmuls across GPUs along the output or input feature dimension. Activate when the user asks "tensor parallel", "column parallel linear", "row parallel linear", "TP", or "split matmul across GPUs".
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train
This skill uses the workspace's default tool permissions.
The two building blocks that make a transformer tensor-parallel: `ColumnParallelLinear` (split along output dim, gather at end) and `RowParallelLinear` (split along input dim, all-reduce at end). All of attention's QKV/out and MLP's up/down can be expressed with these.
ColumnParallelLinear: splits a Linear(in, out) along the output dimension across tp_size ranks, so each rank holds Linear(in, out/tp_size). If the full output is needed downstream, an AllGather is required (unlike the row-parallel case, which ends in an AllReduce).
RowParallelLinear: splits a Linear(in, out) along the input dimension, so each rank holds Linear(in/tp_size, out). Each rank produces a partial result, and an AllReduce combines them.
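A minimal sketch of the forward-pass mechanics in plain PyTorch (not the curry-train API; the function names and shapes here are illustrative only):

import torch
import torch.distributed as dist

def column_parallel_forward(x, w_shard, tp_group, gather_output=False):
    # w_shard is this rank's slice of the weight: (out_features // tp_size, in_features).
    y_local = x @ w_shard.t()  # partial along the output dimension
    if gather_output:
        shards = [torch.empty_like(y_local) for _ in range(dist.get_world_size(tp_group))]
        dist.all_gather(shards, y_local, group=tp_group)
        return torch.cat(shards, dim=-1)  # reassemble the full output dimension
    return y_local  # keep the per-rank slice (typical inside attention/MLP)

def row_parallel_forward(x_local, w_shard, tp_group):
    # x_local: (..., in_features // tp_size); w_shard: (out_features, in_features // tp_size).
    y_partial = x_local @ w_shard.t()  # full output dimension, but only a partial sum
    dist.all_reduce(y_partial, group=tp_group)  # sum the partials across TP ranks
    return y_partial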
How the two blocks map onto a transformer:
attention QKV = ColumnParallel (no gather; the output slice is fed straight to attention)
attention out = RowParallel (input is per-rank; all-reduce after)
MLP up = ColumnParallel
MLP down = RowParallel
This pattern means two AllReduces per transformer block (one after attention out, one after MLP down). Communication cost is significant but predictable.
from curry_train.primitives import ColumnParallelLinear, RowParallelLinear

# ps is the parallel-state module (see skills/primitive-parallel-state);
# it provides the tensor-parallel process group.
q_proj = ColumnParallelLinear(
    in_features=2048,
    out_features=2048,
    tp_group=ps.get_tp_group(),
    bias=False,
    gather_output=False,  # downstream uses the slice; common in attention
)
o_proj = RowParallelLinear(
    in_features=2048,
    out_features=2048,
    tp_group=ps.get_tp_group(),
    bias=False,
    input_is_parallel=True,  # input is already per-rank
)
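The MLP side follows the same pattern. A sketch assuming the same constructor arguments as the attention example above (the 2048/8192 sizes are placeholders):

up_proj = ColumnParallelLinear(   # each rank keeps its slice of the intermediate dim
    in_features=2048,
    out_features=8192,
    tp_group=ps.get_tp_group(),
    bias=False,
    gather_output=False,
)
down_proj = RowParallelLinear(    # consumes the per-rank slice; all-reduces the result
    in_features=8192,
    out_features=2048,
    tp_group=ps.get_tp_group(),
    bias=False,
    input_is_parallel=True,
)

Together with the attention pair, this accounts for the two AllReduces per block noted above.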
Almost always paired with FSDP on the data-parallel axis: TP shards each layer across the tensor-parallel group, and FSDP shards each rank's TP slice across the data-parallel ranks.
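One way to wire the two axes, sketched with PyTorch's device mesh (the mesh shape and dimension names are assumptions; curry-train's parallel-state module may set this up differently):

from torch.distributed.device_mesh import init_device_mesh

# e.g. 8 GPUs arranged as 2 data-parallel groups x 4 tensor-parallel ranks
mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))

tp_group = mesh["tp"].get_group()  # pass as tp_group to the parallel linear layers
dp_mesh = mesh["dp"]               # hand to FSDP so parameters shard across the dp axis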
A common optimization on top of TP: shard the sequence dimension in the regions between the AllReduces. It requires explicit scatter and gather collectives, but cuts activation memory roughly by the TP factor. See Megatron's "sequence parallel" mode.
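A sketch of the collective swap in raw torch.distributed (shapes assumed [seq, batch, hidden] with the sequence dimension sharded; not Megatron's actual code):

import torch
import torch.distributed as dist

def sp_reduce_scatter(y_partial, tp_group):
    # Replaces the RowParallel AllReduce: sums the partials and shards the sequence
    # dimension, so each rank keeps only seq / tp_size of the activation.
    tp = dist.get_world_size(tp_group)
    out = torch.empty((y_partial.shape[0] // tp, *y_partial.shape[1:]),
                      dtype=y_partial.dtype, device=y_partial.device)
    dist.reduce_scatter_tensor(out, y_partial, group=tp_group)
    return out

def sp_all_gather(x_shard, tp_group):
    # Before the next ColumnParallel matmul, gather the full sequence back.
    tp = dist.get_world_size(tp_group)
    out = torch.empty((x_shard.shape[0] * tp, *x_shard.shape[1:]),
                      dtype=x_shard.dtype, device=x_shard.device)
    dist.all_gather_into_tensor(out, x_shard, group=tp_group)
    return out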
Feature dimensions must be divisible by tp_size (e.g. n_heads % tp_size == 0 for attention).
V1: stub at template/curry_train/primitives/tp_linear.py. References: Megatron-LM core/tensor_parallel/layers.py; PyTorch torch.distributed.tensor.parallel.
Related skills:
skills/primitive-parallel-state — provides the TP group.
skills/primitive-gqattention — uses these inside attention.
skills/stage4-parallel-primitive-intro — when to add TP.