Top-K token routing for Mixture-of-Experts — for each token, pick the K experts with highest gating score. Activate when the user asks "MoE routing", "top-k router", "switch transformer routing", "expert choice", or builds an MoE model.
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train

This skill uses the workspace's default tool permissions.
The "who decides which expert" half of an MoE block. Pairs with `primitive-experts` (the actual expert MLPs).
For input tokens x of shape (B, N, D) and E experts:
logits = x @ W_g + b_g, with shape (B, N, E).

```python
from curry_train.primitives import TopKRouter

router = TopKRouter(
    d_model=2048,
    n_experts=8,
    top_k=2,
    capacity_factor=1.25,   # over-provisioning to handle imbalance
    aux_loss_weight=0.01,   # load-balance auxiliary loss weight; 0 to disable
    jitter_noise=0.0,       # token-level noise during routing (training-time)
)
route_info = router(x)
# route_info contains:
#   expert_indices: (B, N, K) which expert each token slot uses
#   expert_weights: (B, N, K) gating weights for combining expert outputs
#   capacity_mask:  (B, N, K) which slots are kept under capacity
#   aux_loss:       scalar load-balance loss (added to main loss if > 0)
```
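As a reference for what the routing step computes, here is a minimal NumPy sketch of standard top-K gating. This is a hypothetical standalone function, not the curry_train implementation; it omits capacity enforcement and the auxiliary loss:

```python
import numpy as np

def topk_route(x, W_g, top_k=2):
    """x: (B, N, D) tokens, W_g: (D, E) gating weights -> (indices, weights)."""
    logits = x @ W_g                                   # (B, N, E) gating logits
    e = np.exp(logits - logits.max(-1, keepdims=True)) # stable softmax over experts
    probs = e / e.sum(-1, keepdims=True)
    # keep the K highest-probability experts per token
    idx = np.argsort(probs, axis=-1)[..., ::-1][..., :top_k]   # (B, N, K)
    w = np.take_along_axis(probs, idx, axis=-1)
    w = w / w.sum(-1, keepdims=True)                   # renormalize kept weights to 1
    return idx, w

x = np.random.randn(2, 4, 16)
W_g = np.random.randn(16, 8)
idx, w = topk_route(x, W_g, top_k=2)   # idx, w both shaped (2, 4, 2)
```

Each token's output is then the weighted sum of its K experts' outputs using `w`, which is what `expert_weights` in `route_info` is for.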
V1 stub supports the standard top-K only; expert choice and no-aux variants are V2+.
Without capacity, one expert can be assigned far more tokens than others — impossible to batch efficiently. capacity_factor > 1 over-provisions space per expert; tokens that exceed capacity are dropped or routed to a fallback. Capacity is a function of expected token count and top_k.
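Concretely, the usual Switch-Transformer-style sizing is `capacity = capacity_factor * n_tokens * top_k / n_experts`: the slot count each expert would need under perfectly balanced routing, scaled by the over-provisioning factor. A hypothetical helper illustrating the arithmetic (not part of the documented API):

```python
def expert_capacity(n_tokens, n_experts, top_k, capacity_factor=1.25):
    # n_tokens * top_k assignments spread over n_experts is the balanced load;
    # capacity_factor > 1 leaves headroom for imbalance before tokens are dropped
    return int(capacity_factor * n_tokens * top_k / n_experts)

expert_capacity(1024, n_experts=8, top_k=2)  # -> 320 slots per expert
```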
Once routing is decided, tokens are dispatched to the device that owns each expert via all-to-all. The router's output (expert indices) drives the all-to-all; see primitive-experts for the dispatch side.
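The dispatch side needs to know how many tokens each expert receives: a per-expert histogram of `expert_indices` is what sizes the per-destination all-to-all splits (and the same counts feed the load-balance loss). A minimal sketch, using a hypothetical helper rather than the curry_train API:

```python
import numpy as np

def tokens_per_expert(expert_indices, n_experts):
    # expert_indices: integer array of routed assignments, any shape.
    # Returns a length-n_experts histogram used to size all-to-all buffers.
    return np.bincount(expert_indices.reshape(-1), minlength=n_experts)

idx = np.array([[0, 1], [1, 2], [2, 2]])      # 3 tokens, top_k=2
counts = tokens_per_expert(idx, n_experts=4)  # -> [1, 2, 3, 0]
```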
With large n_experts, the routing dynamics need careful tuning.

V1: stub at template/curry_train/primitives/topk_router.py. References: Mixtral of Experts (Mistral AI), Switch Transformer (Fedus et al., 2021).
skills/primitive-experts — paired primitive that runs the expert MLPs.
skills/primitive-parallel-state — needed to know rank topology for EP.