From curry-train
A bank of parallel expert MLPs that consume routed tokens from the TopKRouter and return per-token outputs. The "doer" half of an MoE block. Activate when the user mentions "MoE experts", "expert MLPs", "Mixtral", or "expert parallel", or is building an MoE block.
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train

This skill uses the workspace's default tool permissions.
The set of expert MLPs in an MoE block, plus the dispatch/combine logic that gathers tokens from the router and returns per-token outputs.
Given:
- x: (B, N, D) input token activations.
- route_info from primitive-topk-router: per-token expert indices and gating weights.

Steps:
```python
from curry_train.primitives import Experts

experts = Experts(
    d_model=2048,
    d_ff=8192,          # expert-internal hidden dim
    n_experts=8,
    activation="silu",
    glu=True,           # gated linear unit (Llama / Mixtral style)
    drop_path=0.0,
    ep_size=1,          # expert-parallel world size; 1 = no EP
)

y = experts(x, route_info)  # route_info comes from the TopKRouter
# y shape: (B, N, D)
```
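To make the dispatch/combine half concrete, here is a minimal single-device sketch of what a bank of GLU experts does with the router output. It is illustrative only: the helper names and the (B, N, K) layout of the top-k indices and gating weights are assumptions based on the description above, not the actual Experts API (which is a V1 stub).

```python
import torch
import torch.nn.functional as F

def glu_expert(x, w_gate, w_up, w_down):
    # One GLU expert MLP: silu(x @ w_gate) * (x @ w_up), projected back to d_model.
    return (F.silu(x @ w_gate) * (x @ w_up)) @ w_down

def moe_forward(x, topk_idx, topk_weight, expert_weights):
    # x:              (B, N, D) token activations
    # topk_idx:       (B, N, K) expert index per token per top-k slot   (assumed layout)
    # topk_weight:    (B, N, K) gating weight per token per top-k slot  (assumed layout)
    # expert_weights: list of (w_gate, w_up, w_down), one tuple per expert
    B, N, D = x.shape
    K = topk_idx.shape[-1]
    flat_x = x.reshape(-1, D)
    flat_idx = topk_idx.reshape(-1, K)
    flat_w = topk_weight.reshape(-1, K)
    out = torch.zeros_like(flat_x)
    for e, (w_gate, w_up, w_down) in enumerate(expert_weights):
        # Dispatch: find the tokens (and their top-k slots) routed to expert e.
        token_pos, slot = (flat_idx == e).nonzero(as_tuple=True)
        if token_pos.numel() == 0:
            continue
        y = glu_expert(flat_x[token_pos], w_gate, w_up, w_down)
        # Combine: scatter back, scaled by each token's gating weight for this expert.
        out.index_add_(0, token_pos, y * flat_w[token_pos, slot].unsqueeze(-1))
    return out.reshape(B, N, D)
```

A real implementation would batch the per-expert matmuls (grouped GEMM) rather than looping in Python; the loop only makes the gather, expert MLP, and weighted-scatter structure explicit.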
When ep_size > 1, the experts are sharded across expert-parallel ranks and routed tokens must be exchanged between ranks; a minimal sketch of that exchange follows. This requires primitive-parallel-state to know the rank topology and primitive-distributed-optimizer to handle expert-state sharding.
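For intuition, here is a minimal sketch of the token exchange that expert parallelism implies, assuming torch.distributed is initialized and an EP process group of size ep_size already exists (e.g. provided by primitive-parallel-state). The buffer layout and helper name are hypothetical, not the primitive's real interface.

```python
import torch
import torch.distributed as dist

def ep_exchange(send_buf, send_counts, ep_group):
    # send_buf:    tokens grouped by destination EP rank, concatenated along dim 0
    # send_counts: list[int], how many tokens this rank sends to each EP rank
    # Exchange the counts first so every rank can size its receive buffer.
    send_counts_t = torch.tensor(send_counts, device=send_buf.device)
    recv_counts_t = torch.empty_like(send_counts_t)
    dist.all_to_all_single(recv_counts_t, send_counts_t, group=ep_group)
    recv_counts = recv_counts_t.tolist()

    # All-to-all the tokens themselves, with per-rank split sizes.
    recv_buf = send_buf.new_empty((sum(recv_counts), send_buf.shape[-1]))
    dist.all_to_all_single(
        recv_buf, send_buf,
        output_split_sizes=recv_counts,
        input_split_sizes=send_counts,
        group=ep_group,
    )
    return recv_buf, recv_counts
```

Each rank then runs its locally held experts on recv_buf, and the same call with the send/receive roles swapped returns each token's output to its origin rank, where it is combined with the gating weights.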
- With n_experts = 256 and d_ff = 8192, expert weights dominate the parameter count; sharding across EP ranks is essential at scale.
- Load balancing (aux_loss_weight in the router) significantly affects comms cost.
- V1: stub at template/curry_train/primitives/experts.py.
- References: Mixtral (Jiang et al. 2024), DeepSeek-MoE (Dai et al. 2024).
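A back-of-the-envelope count behind the first point, assuming the three-matrix GLU expert shape implied by glu=True and d_model = 2048 from the example above:

```python
d_model, d_ff, n_experts = 2048, 8192, 256

glu_expert_params = 3 * d_model * d_ff             # w_gate, w_up, w_down per expert
experts_per_layer = n_experts * glu_expert_params  # ~12.9B parameters per MoE layer
attn_per_layer = 4 * d_model * d_model             # Q/K/V/O projections, ~16.8M (rough)

print(f"experts: {experts_per_layer / 1e9:.1f}B, attention: {attn_per_layer / 1e6:.1f}M per layer")
```

Per layer the experts alone are roughly 770x the attention projections, which is why expert weights dominate and must be sharded across EP ranks at this scale.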
- skills/primitive-topk-router — the upstream half.
- skills/primitive-parallel-state — rank topology for EP.
- skills/primitive-distributed-optimizer — sharded optimizer state for expert weights.
- skills/stage4-parallel-primitive-intro — when to add EP.