Top-K token routing for Mixture-of-Experts — for each token, pick the K experts with highest gating score. Activate when the user asks "MoE routing", "top-k router", "switch transformer routing", "expert choice", or builds an MoE model.
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train

This skill uses the workspace's default tool permissions.
The "who decides which expert" half of an MoE block. Pairs with `primitive-experts` (the actual expert MLPs).
For input tokens x of shape (B, N, D) and E experts:
logits = x @ W_g + b_g, with shape (B, N, E).

```python
from curry_train.primitives import TopKRouter

router = TopKRouter(
    d_model=2048,
    n_experts=8,
    top_k=2,
    capacity_factor=1.25,   # over-provisioning to handle imbalance
    aux_loss_weight=0.01,   # load-balance auxiliary loss weight; 0 to disable
    jitter_noise=0.0,       # token-level noise during routing (training-time)
)
route_info = router(x)
# route_info contains:
#   expert_indices: (B, N, K) which expert each token slot uses
#   expert_weights: (B, N, K) gating weights for combining expert outputs
#   capacity_mask:  (B, N, K) which slots are kept under capacity
#   aux_loss:       scalar load-balance loss (added to main loss if > 0)
```
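As a reference for what the routing step computes, here is a minimal NumPy sketch of standard top-K gating. This is a hypothetical standalone function, not the curry_train implementation; it omits capacity enforcement and the auxiliary loss:

```python
import numpy as np

def topk_route(x, W_g, top_k=2):
    """x: (B, N, D) tokens, W_g: (D, E) gating weights -> (indices, weights)."""
    logits = x @ W_g                                   # (B, N, E) gating logits
    e = np.exp(logits - logits.max(-1, keepdims=True)) # stable softmax over experts
    probs = e / e.sum(-1, keepdims=True)
    # keep the K highest-probability experts per token
    idx = np.argsort(probs, axis=-1)[..., ::-1][..., :top_k]   # (B, N, K)
    w = np.take_along_axis(probs, idx, axis=-1)
    w = w / w.sum(-1, keepdims=True)                   # renormalize kept weights to 1
    return idx, w

x = np.random.randn(2, 4, 16)
W_g = np.random.randn(16, 8)
idx, w = topk_route(x, W_g, top_k=2)   # idx, w both shaped (2, 4, 2)
```

Each token's output is then the weighted sum of its K experts' outputs using `w`, which is what `expert_weights` in `route_info` is for.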
V1 stub supports the standard top-K only; expert choice and no-aux variants are V2+.
Without capacity, one expert can be assigned far more tokens than others — impossible to batch efficiently. capacity_factor > 1 over-provisions space per expert; tokens that exceed capacity are dropped or routed to a fallback. Capacity is a function of expected token count and top_k.
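Concretely, the usual Switch-Transformer-style sizing is `capacity = capacity_factor * n_tokens * top_k / n_experts`: the slot count each expert would need under perfectly balanced routing, scaled by the over-provisioning factor. A hypothetical helper illustrating the arithmetic (not part of the documented API):

```python
def expert_capacity(n_tokens, n_experts, top_k, capacity_factor=1.25):
    # n_tokens * top_k assignments spread over n_experts is the balanced load;
    # capacity_factor > 1 leaves headroom for imbalance before tokens are dropped
    return int(capacity_factor * n_tokens * top_k / n_experts)

expert_capacity(1024, n_experts=8, top_k=2)  # -> 320 slots per expert
```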
Once routing is decided, tokens are dispatched to the device that owns each expert via all-to-all. The router's output (expert indices) drives the all-to-all; see primitive-experts for the dispatch side.
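The dispatch side needs to know how many tokens each expert receives: a per-expert histogram of `expert_indices` is what sizes the per-destination all-to-all splits (and the same counts feed the load-balance loss). A minimal sketch, using a hypothetical helper rather than the curry_train API:

```python
import numpy as np

def tokens_per_expert(expert_indices, n_experts):
    # expert_indices: integer array of routed assignments, any shape.
    # Returns a length-n_experts histogram used to size all-to-all buffers.
    return np.bincount(expert_indices.reshape(-1), minlength=n_experts)

idx = np.array([[0, 1], [1, 2], [2, 2]])      # 3 tokens, top_k=2
counts = tokens_per_expert(idx, n_experts=4)  # -> [1, 2, 3, 0]
```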
With large n_experts, the routing dynamics need careful tuning.

V1: stub at template/curry_train/primitives/topk_router.py. References: Mixtral of Experts (Mistral AI), Switch Transformer (Fedus et al., 2021).
skills/primitive-experts — paired primitive that runs the expert MLPs.
skills/primitive-parallel-state — needed to know rank topology for EP.