Grouped-query attention with rotary positional embedding (RoPE). Standard component in modern LLMs (Llama-2/3, Qwen2/3, Mistral). Activate when the user mentions "GQA", "grouped query attention", "RoPE", "rotary embedding", or "attention with KV groups", or is building a transformer.
```
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train
```

This skill uses the workspace's default tool permissions.
Grouped-query attention with rotary positional embedding — a standard transformer attention block in 2024+ LLMs.
- Query heads: H_q (e.g. 32 in Llama-2 70B).
- Key/value heads: H_kv (e.g. 8). Each KV head is shared across H_q / H_kv Q heads.
- The KV cache shrinks by H_q / H_kv× during inference, with negligible quality loss.

```python
from curry_train.primitives import GQAttention

attn = GQAttention(
    d_model=2048,
    n_heads=32,         # Q heads
    n_kv_heads=8,       # KV heads
    head_dim=64,
    rope_base=10000,
    rope_scaling=None,  # or LinearScaling(factor=2.0) etc.
    causal=True,
)
out = attn(x, attn_mask=None, kv_cache=None)  # x: (B, N, d_model)
```
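The rope_base and rope_scaling arguments control the rotary embedding applied to Q and K before the attention scores are computed. As a reminder of what that step does, here is a minimal RoPE sketch (interleaved-pair convention; Llama-family code rotates half-blocks instead). The helper names are illustrative, not part of the curry_train API.

```python
import torch

def rope_angles(head_dim: int, n_pos: int, base: float = 10000.0) -> torch.Tensor:
    """Rotation angles per position and channel pair: (n_pos, head_dim // 2)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(n_pos).float()
    return torch.outer(positions, inv_freq)

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate channel pairs of x (B, H, N, head_dim) by position-dependent angles."""
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()      # (N, head_dim // 2), broadcast over (B, H)
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

q = torch.randn(2, 32, 16, 64)                 # (B, H_q, N, head_dim)
q_rot = apply_rope(q, rope_angles(64, 16, base=10000.0))
```

If rope_scaling is set to a linear factor, the usual interpretation is position interpolation: positions are divided by that factor before the angles are computed, stretching the usable context.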
- n_kv_heads = n_heads → standard multi-head attention (MHA).
- n_kv_heads = 1 → multi-query attention (MQA).
- 1 < n_kv_heads < n_heads → GQA.

The backend is F.scaled_dot_product_attention (the default in 2024+). The skill represents the primitive; the backend choice is a Strategy.
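To make the grouping concrete, here is a minimal sketch (shapes and the gqa_sdpa helper are assumptions, not the skill's internals) that expands n_kv_heads KV heads to n_heads by repetition and calls F.scaled_dot_product_attention; n_kv_heads = n_heads reproduces MHA and n_kv_heads = 1 reproduces MQA.

```python
import torch
import torch.nn.functional as F

def gqa_sdpa(q, k, v, causal: bool = True):
    """q: (B, n_heads, N, d); k, v: (B, n_kv_heads, N, d), n_heads % n_kv_heads == 0."""
    groups = q.shape[1] // k.shape[1]           # Q heads served by each KV head
    k = k.repeat_interleave(groups, dim=1)      # (B, n_heads, N, d)
    v = v.repeat_interleave(groups, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=causal)

B, N, d = 2, 16, 64
q = torch.randn(B, 32, N, d)                    # 32 Q heads
k = torch.randn(B, 8, N, d)                     # 8 KV heads -> each shared by 4 Q heads
v = torch.randn(B, 8, N, d)
out = gqa_sdpa(q, k, v)                         # (B, 32, N, d)
```

Newer PyTorch releases also expose an enable_gqa flag on scaled_dot_product_attention that avoids materializing the repeated KV heads.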
GQA splits cleanly across TP ranks: Q heads can be sharded by TP_size; KV heads can be replicated (if TP_size > n_kv_heads) or sharded. See primitive-tp-linear.
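As a rough illustration of how the head counts divide across ranks (the tensor-parallel sizes below are arbitrary examples, not values prescribed by the skill):

```python
# Hypothetical TP sharding arithmetic for the 32/8-head config above (illustration only).
n_heads, n_kv_heads = 32, 8

for tp_size in (4, 8, 16):
    q_per_rank = n_heads // tp_size
    if tp_size <= n_kv_heads:
        kv_per_rank, kv_mode = n_kv_heads // tp_size, "sharded"
    else:
        kv_per_rank, kv_mode = 1, "replicated across ranks"  # TP_size > n_kv_heads
    print(f"TP={tp_size}: {q_per_rank} Q heads/rank, {kv_per_rank} KV head(s)/rank ({kv_mode})")
```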
- For spiking models with a time (T) dimension, use a separate spiking attention primitive — GQA is for continuous tensors (B, N, D).
- For very long sequences, see primitive-context-parallel.
- V1: stub at template/curry_train/primitives/gqattention.py. Reference implementations: HuggingFace transformers/models/llama/modeling_llama.py (LlamaAttention) and the flash_attn library.
- skills/primitive-tp-linear — column/row parallel linears used inside GQA when TP-sharded.
- skills/primitive-rmsnorm — pre-norm before attention.
- skills/primitive-context-parallel — for sequences > 32k.