From curry-train
A bank of parallel expert MLPs that consume routed tokens from the TopKRouter and return per-token outputs. The "doer" half of an MoE block. Activate when the user mentions "MoE experts", "expert MLPs", "Mixtral", or "expert parallel", or is building an MoE block.
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train

This skill uses the workspace's default tool permissions.
The set of expert MLPs in an MoE block, plus the dispatch/combine logic that gathers tokens from the router and returns per-token outputs.
Given:
- x: (B, N, D) input token activations.
- route_info from primitive-topk-router: per-token expert indices and gating weights.

Steps:
```python
from curry_train.primitives import Experts

experts = Experts(
    d_model=2048,
    d_ff=8192,          # expert-internal hidden dim
    n_experts=8,
    activation="silu",
    glu=True,           # gated linear unit (Llama / Mixtral style)
    drop_path=0.0,
    ep_size=1,          # expert-parallel world size; 1 = no EP
)

y = experts(x, route_info)  # route_info comes from the TopKRouter
# y shape: (B, N, D)
```
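To make the dispatch/combine half concrete, here is a minimal single-device sketch of what a bank of GLU experts does with the router output. It is illustrative only: the helper names and the (B, N, K) layout of the top-k indices and gating weights are assumptions based on the description above, not the actual Experts API (which is a V1 stub).

```python
import torch
import torch.nn.functional as F

def glu_expert(x, w_gate, w_up, w_down):
    # One GLU expert MLP: silu(x @ w_gate) * (x @ w_up), projected back to d_model.
    return (F.silu(x @ w_gate) * (x @ w_up)) @ w_down

def moe_forward(x, topk_idx, topk_weight, expert_weights):
    # x:              (B, N, D) token activations
    # topk_idx:       (B, N, K) expert index per token per top-k slot   (assumed layout)
    # topk_weight:    (B, N, K) gating weight per token per top-k slot  (assumed layout)
    # expert_weights: list of (w_gate, w_up, w_down), one tuple per expert
    B, N, D = x.shape
    K = topk_idx.shape[-1]
    flat_x = x.reshape(-1, D)
    flat_idx = topk_idx.reshape(-1, K)
    flat_w = topk_weight.reshape(-1, K)
    out = torch.zeros_like(flat_x)
    for e, (w_gate, w_up, w_down) in enumerate(expert_weights):
        # Dispatch: find the tokens (and their top-k slots) routed to expert e.
        token_pos, slot = (flat_idx == e).nonzero(as_tuple=True)
        if token_pos.numel() == 0:
            continue
        y = glu_expert(flat_x[token_pos], w_gate, w_up, w_down)
        # Combine: scatter back, scaled by each token's gating weight for this expert.
        out.index_add_(0, token_pos, y * flat_w[token_pos, slot].unsqueeze(-1))
    return out.reshape(B, N, D)
```

A real implementation would batch the per-expert matmuls (grouped GEMM) rather than looping in Python; the loop only makes the gather, expert MLP, and weighted-scatter structure explicit.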
When ep_size > 1, the experts are sharded across expert-parallel ranks and routed tokens must be exchanged between ranks; a minimal sketch of that exchange follows. This requires primitive-parallel-state to know the rank topology and primitive-distributed-optimizer to handle expert-state sharding.
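For intuition, here is a minimal sketch of the token exchange that expert parallelism implies, assuming torch.distributed is initialized and an EP process group of size ep_size already exists (e.g. provided by primitive-parallel-state). The buffer layout and helper name are hypothetical, not the primitive's real interface.

```python
import torch
import torch.distributed as dist

def ep_exchange(send_buf, send_counts, ep_group):
    # send_buf:    tokens grouped by destination EP rank, concatenated along dim 0
    # send_counts: list[int], how many tokens this rank sends to each EP rank
    # Exchange the counts first so every rank can size its receive buffer.
    send_counts_t = torch.tensor(send_counts, device=send_buf.device)
    recv_counts_t = torch.empty_like(send_counts_t)
    dist.all_to_all_single(recv_counts_t, send_counts_t, group=ep_group)
    recv_counts = recv_counts_t.tolist()

    # All-to-all the tokens themselves, with per-rank split sizes.
    recv_buf = send_buf.new_empty((sum(recv_counts), send_buf.shape[-1]))
    dist.all_to_all_single(
        recv_buf, send_buf,
        output_split_sizes=recv_counts,
        input_split_sizes=send_counts,
        group=ep_group,
    )
    return recv_buf, recv_counts
```

Each rank then runs its locally held experts on recv_buf, and the same call with the send/receive roles swapped returns each token's output to its origin rank, where it is combined with the gating weights.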
- With n_experts = 256 and d_ff = 8192, expert weights dominate the parameter count; sharding across EP ranks is essential at scale.
- Load balancing (aux_loss_weight in the router) significantly affects comms cost.
- V1: stub at template/curry_train/primitives/experts.py.
- References: Mixtral (Jiang et al. 2024), DeepSeek-MoE (Dai et al. 2024).
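A back-of-the-envelope count behind the first point, assuming the three-matrix GLU expert shape implied by glu=True and d_model = 2048 from the example above:

```python
d_model, d_ff, n_experts = 2048, 8192, 256

glu_expert_params = 3 * d_model * d_ff             # w_gate, w_up, w_down per expert
experts_per_layer = n_experts * glu_expert_params  # ~12.9B parameters per MoE layer
attn_per_layer = 4 * d_model * d_model             # Q/K/V/O projections, ~16.8M (rough)

print(f"experts: {experts_per_layer / 1e9:.1f}B, attention: {attn_per_layer / 1e6:.1f}M per layer")
```

Per layer the experts alone are roughly 770x the attention projections, which is why expert weights dominate and must be sharded across EP ranks at this scale.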
- skills/primitive-topk-router — the upstream half.
- skills/primitive-parallel-state — rank topology for EP.
- skills/primitive-distributed-optimizer — sharded optimizer state for expert weights.
- skills/stage4-parallel-primitive-intro — when to add EP.