Mixture of Experts
An architecture where a gating network routes each input to a small subset of specialized sub-models (experts), enabling very large total parameter counts at a modest per-token compute cost.
Mixture of Experts (MoE) models split their parameters into many "expert" subnetworks. A gating network picks a few experts to activate for each input token, so the full parameter count is huge but only a fraction is active per forward pass.
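The routing step can be sketched in a few lines. This is a minimal top-k gating sketch, not any particular library's implementation; the function names and shapes are illustrative assumptions:

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Route one token's activations to the top-k experts.

    x: (d,) token activations; gate_w: (d, num_experts) gating weights;
    experts: list of callables, each mapping (d,) -> (d,). All names here
    are hypothetical, for illustration only.
    """
    # Gating network: a linear layer producing one logit per expert.
    logits = x @ gate_w                              # (num_experts,)
    # Keep only the top-k experts; the rest are never evaluated.
    topk = np.argsort(logits)[-k:]
    # Softmax over just the selected logits gives mixing weights.
    w = np.exp(logits[topk] - logits[topk].max())
    w /= w.sum()
    # Only k expert forward passes run, regardless of how many exist.
    return sum(wi * experts[i](x) for i, wi in zip(topk, w))
```

Because only `k` of the expert callables execute, per-token FLOPs scale with `k`, not with the total number of experts.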
This enables models with hundreds of billions of total parameters that run at the speed of much smaller dense models. Mixtral and DeepSeek's recent models use MoE architectures, and GPT-4 is widely reported (though not officially confirmed) to as well.
Example: Mixtral 8x7B has about 47B total parameters, but only ~13B are active per token (2 of 8 experts per layer, plus shared layers). This gives quality comparable to a much larger dense model at a fraction of the per-token compute cost.
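The split between shared and per-expert parameters follows from simple arithmetic, assuming the commonly cited figures of roughly 46.7B total and 12.9B active parameters and 2-of-8 routing:

```python
total_params = 46.7e9   # shared layers + all 8 experts (approximate)
active_params = 12.9e9  # shared layers + 2 active experts (approximate)

# total  = shared + 8 * per_expert
# active = shared + 2 * per_expert
# subtracting: total - active = 6 * per_expert
per_expert = (total_params - active_params) / 6
shared = total_params - 8 * per_expert

print(round(per_expert / 1e9, 1))  # per-expert params, in billions
print(round(shared / 1e9, 1))      # shared (attention etc.), in billions
```

The shared parameters (attention, embeddings) are always active, which is why the active count is well above 2/8 of the total.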
MoE's main tradeoff is memory: all experts must be held in memory even though only a few run per token. Training MoE models also requires careful load balancing so that traffic spreads across experts rather than collapsing onto a favored few.
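One common load-balancing approach, in the style of the Switch Transformer auxiliary loss, penalizes the product of each expert's hard token share and its mean router probability; the loss is minimized when both are uniform. A sketch under those assumptions:

```python
import numpy as np

def load_balancing_loss(router_probs, assignments, num_experts):
    """Auxiliary loss encouraging uniform expert usage.

    router_probs: (num_tokens, num_experts) softmax outputs of the gate.
    assignments: (num_tokens,) index of the expert each token was sent to.
    """
    # f[i]: fraction of tokens actually routed to expert i (hard counts).
    f = np.bincount(assignments, minlength=num_experts) / len(assignments)
    # p[i]: mean router probability assigned to expert i (soft averages).
    p = router_probs.mean(axis=0)
    # Scaled so that perfectly uniform routing yields a loss of 1.0.
    return num_experts * float(np.sum(f * p))
```

Adding a small multiple of this loss to the training objective pushes the gate away from degenerate routing where a handful of experts absorb all tokens.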