Grouped-query attention with rotary positional embedding (RoPE). Standard component in modern LLMs (Llama-2/3, Qwen2/3, Mistral). Activate when the user mentions "GQA", "grouped query attention", "RoPE", "rotary embedding", or "attention with KV groups", or is building a transformer.
```
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train
```

This skill uses the workspace's default tool permissions.
Grouped-query attention with rotary positional embedding — a standard transformer attention block in 2024+ LLMs.
- Query heads: H_q (e.g. 32 in Llama-2 70B).
- Key/value heads: H_kv (e.g. 8). Each KV head is shared across H_q / H_kv Q heads.
- The KV cache shrinks by H_q / H_kv× during inference, with negligible quality loss.

```python
from curry_train.primitives import GQAttention

attn = GQAttention(
    d_model=2048,
    n_heads=32,         # Q heads
    n_kv_heads=8,       # KV heads
    head_dim=64,
    rope_base=10000,
    rope_scaling=None,  # or LinearScaling(factor=2.0) etc.
    causal=True,
)
out = attn(x, attn_mask=None, kv_cache=None)  # x: (B, N, d_model)
```
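The rope_base and rope_scaling arguments control the rotary embedding applied to Q and K before the attention scores are computed. As a reminder of what that step does, here is a minimal RoPE sketch (interleaved-pair convention; Llama-family code rotates half-blocks instead). The helper names are illustrative, not part of the curry_train API.

```python
import torch

def rope_angles(head_dim: int, n_pos: int, base: float = 10000.0) -> torch.Tensor:
    """Rotation angles per position and channel pair: (n_pos, head_dim // 2)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(n_pos).float()
    return torch.outer(positions, inv_freq)

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate channel pairs of x (B, H, N, head_dim) by position-dependent angles."""
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()      # (N, head_dim // 2), broadcast over (B, H)
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

q = torch.randn(2, 32, 16, 64)                 # (B, H_q, N, head_dim)
q_rot = apply_rope(q, rope_angles(64, 16, base=10000.0))
```

If rope_scaling is set to a linear factor, the usual interpretation is position interpolation: positions are divided by that factor before the angles are computed, stretching the usable context.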
- n_kv_heads = n_heads → standard multi-head attention (MHA).
- n_kv_heads = 1 → multi-query attention (MQA).
- 1 < n_kv_heads < n_heads → GQA.

The backend is F.scaled_dot_product_attention (the default in 2024+). The skill represents the primitive; the backend choice is a Strategy.
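To make the grouping concrete, here is a minimal sketch (shapes and the gqa_sdpa helper are assumptions, not the skill's internals) that expands n_kv_heads KV heads to n_heads by repetition and calls F.scaled_dot_product_attention; n_kv_heads = n_heads reproduces MHA and n_kv_heads = 1 reproduces MQA.

```python
import torch
import torch.nn.functional as F

def gqa_sdpa(q, k, v, causal: bool = True):
    """q: (B, n_heads, N, d); k, v: (B, n_kv_heads, N, d), n_heads % n_kv_heads == 0."""
    groups = q.shape[1] // k.shape[1]           # Q heads served by each KV head
    k = k.repeat_interleave(groups, dim=1)      # (B, n_heads, N, d)
    v = v.repeat_interleave(groups, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=causal)

B, N, d = 2, 16, 64
q = torch.randn(B, 32, N, d)                    # 32 Q heads
k = torch.randn(B, 8, N, d)                     # 8 KV heads -> each shared by 4 Q heads
v = torch.randn(B, 8, N, d)
out = gqa_sdpa(q, k, v)                         # (B, 32, N, d)
```

Newer PyTorch releases also expose an enable_gqa flag on scaled_dot_product_attention that avoids materializing the repeated KV heads.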
GQA splits cleanly across TP ranks: Q heads can be sharded by TP_size; KV heads can be replicated (if TP_size > n_kv_heads) or sharded. See primitive-tp-linear.
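As a rough illustration of how the head counts divide across ranks (the tensor-parallel sizes below are arbitrary examples, not values prescribed by the skill):

```python
# Hypothetical TP sharding arithmetic for the 32/8-head config above (illustration only).
n_heads, n_kv_heads = 32, 8

for tp_size in (4, 8, 16):
    q_per_rank = n_heads // tp_size
    if tp_size <= n_kv_heads:
        kv_per_rank, kv_mode = n_kv_heads // tp_size, "sharded"
    else:
        kv_per_rank, kv_mode = 1, "replicated across ranks"  # TP_size > n_kv_heads
    print(f"TP={tp_size}: {q_per_rank} Q heads/rank, {kv_per_rank} KV head(s)/rank ({kv_mode})")
```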
- For spiking models with a time (T) dimension, use a separate spiking attention primitive — GQA is for continuous tensors (B, N, D).
- For very long sequences, see primitive-context-parallel.
- V1: stub at template/curry_train/primitives/gqattention.py. Reference implementations: HuggingFace transformers/models/llama/modeling_llama.py (LlamaAttention) and the flash_attn library.
- skills/primitive-tp-linear — column/row parallel linears used inside GQA when TP-sharded.
- skills/primitive-rmsnorm — pre-norm before attention.
- skills/primitive-context-parallel — for sequences > 32k.