Mixture of Experts
An architecture where a gating network routes each input to a small subset of specialized sub-models (experts), enabling very large total parameter counts at a modest per-token compute cost.
Mixture of Experts (MoE) models split their parameters into many "expert" subnetworks. A gating network picks a few experts to activate for each input token, so the full parameter count is huge but only a fraction is active per forward pass.
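The routing step can be sketched in a few lines. This is a minimal top-k gating sketch, not any particular library's implementation; the function names and shapes are illustrative assumptions:

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Route one token's activations to the top-k experts.

    x: (d,) token activations; gate_w: (d, num_experts) gating weights;
    experts: list of callables, each mapping (d,) -> (d,). All names here
    are hypothetical, for illustration only.
    """
    # Gating network: a linear layer producing one logit per expert.
    logits = x @ gate_w                              # (num_experts,)
    # Keep only the top-k experts; the rest are never evaluated.
    topk = np.argsort(logits)[-k:]
    # Softmax over just the selected logits gives mixing weights.
    w = np.exp(logits[topk] - logits[topk].max())
    w /= w.sum()
    # Only k expert forward passes run, regardless of how many exist.
    return sum(wi * experts[i](x) for i, wi in zip(topk, w))
```

Because only `k` of the expert callables execute, per-token FLOPs scale with `k`, not with the total number of experts.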
This enables models with hundreds of billions of total parameters that run at the speed of much smaller dense models. Mixtral and DeepSeek's recent models use MoE architectures, and GPT-4 is widely reported (though not officially confirmed) to as well.
Example: Mixtral 8x7B has about 47B total parameters, but only ~13B are active per token (2 of 8 experts per layer, plus shared layers). This gives quality comparable to a much larger dense model at a fraction of the per-token compute cost.
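The split between shared and per-expert parameters follows from simple arithmetic, assuming the commonly cited figures of roughly 46.7B total and 12.9B active parameters and 2-of-8 routing:

```python
total_params = 46.7e9   # shared layers + all 8 experts (approximate)
active_params = 12.9e9  # shared layers + 2 active experts (approximate)

# total  = shared + 8 * per_expert
# active = shared + 2 * per_expert
# subtracting: total - active = 6 * per_expert
per_expert = (total_params - active_params) / 6
shared = total_params - 8 * per_expert

print(round(per_expert / 1e9, 1))  # per-expert params, in billions
print(round(shared / 1e9, 1))      # shared (attention etc.), in billions
```

The shared parameters (attention, embeddings) are always active, which is why the active count is well above 2/8 of the total.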
MoE's main tradeoff is memory: all experts must be held in memory even though only a few run per token. Training MoE models also requires careful load balancing so that traffic spreads across experts rather than collapsing onto a favored few.
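One common load-balancing approach, in the style of the Switch Transformer auxiliary loss, penalizes the product of each expert's hard token share and its mean router probability; the loss is minimized when both are uniform. A sketch under those assumptions:

```python
import numpy as np

def load_balancing_loss(router_probs, assignments, num_experts):
    """Auxiliary loss encouraging uniform expert usage.

    router_probs: (num_tokens, num_experts) softmax outputs of the gate.
    assignments: (num_tokens,) index of the expert each token was sent to.
    """
    # f[i]: fraction of tokens actually routed to expert i (hard counts).
    f = np.bincount(assignments, minlength=num_experts) / len(assignments)
    # p[i]: mean router probability assigned to expert i (soft averages).
    p = router_probs.mean(axis=0)
    # Scaled so that perfectly uniform routing yields a loss of 1.0.
    return num_experts * float(np.sum(f * p))
```

Adding a small multiple of this loss to the training objective pushes the gate away from degenerate routing where a handful of experts absorb all tokens.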