Mixture of Experts (MoE) is a neural network architecture in which only a fraction of the model's parameters are active for any given input: each token is routed to the subset of 'expert' networks most relevant to it. Instead of every token passing through the full feed-forward block in each layer, a small router network decides which experts to activate. Mixtral uses an MoE architecture, and GPT-4 is widely reported to as well.
The advantage is parameter efficiency: you can have a model with 1 trillion total parameters, but only 50 billion active for any given inference. This gives you the capacity of a massive model at the computational cost of a smaller one.
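The routing idea can be sketched in a few lines. This is a minimal, illustrative top-k MoE layer in pure Python, not any production implementation: the expert count, top-k value, dimension, and random weights are all hypothetical, and real systems use batched GPU kernels rather than per-token loops.

```python
import math
import random

random.seed(0)

NUM_EXPERTS = 8   # total experts (hypothetical, for illustration)
TOP_K = 2         # experts actually run per token
DIM = 4           # hidden dimension

# Each expert is a small weight matrix; the router is one more matrix
# that scores every expert for a given token.
experts = [[[random.gauss(0, 1) for _ in range(DIM)] for _ in range(DIM)]
           for _ in range(NUM_EXPERTS)]
router = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def softmax(xs):
    mx = max(xs)
    exps = [math.exp(x - mx) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(token):
    """Route a token to its top-k experts and mix their outputs."""
    probs = softmax(matvec(router, token))    # one probability per expert
    top = sorted(range(NUM_EXPERTS), key=lambda i: probs[i], reverse=True)[:TOP_K]
    norm = sum(probs[i] for i in top)         # renormalize over chosen experts
    out = [0.0] * DIM
    for i in top:
        y = matvec(experts[i], token)         # only TOP_K of NUM_EXPERTS run
        out = [o + (probs[i] / norm) * yj for o, yj in zip(out, y)]
    return out, top

token = [random.gauss(0, 1) for _ in range(DIM)]
out, chosen = moe_forward(token)
print(chosen)  # indices of the 2 experts that were executed for this token
```

With TOP_K = 2 and NUM_EXPERTS = 8, each token touches only a quarter of the expert parameters, which is the source of the compute savings described above.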
The trade-off is memory: all expert parameters must be loaded into memory even though only some are used, so an MoE model needs more GPU memory than a dense model with the same number of active parameters. Training MoE models is also harder: load balancing between experts is a persistent challenge, as the router tends to over-route to a few experts and underuse the rest.
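The usual mitigation for the load-balancing problem is an auxiliary loss that penalizes skewed routing. As a rough sketch, one common formulation (in the style of Switch Transformer) multiplies, per expert, the fraction of tokens routed to it by its mean router probability; the function name and toy inputs here are hypothetical:

```python
import math

def softmax(xs):
    mx = max(xs)
    exps = [math.exp(x - mx) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def load_balance_loss(router_logits):
    """Auxiliary load-balancing loss over a batch of tokens.

    router_logits: one list of per-expert scores per token.
    f_i = fraction of tokens whose top-1 expert is i
    P_i = mean router probability assigned to expert i
    loss = N * sum_i f_i * P_i, minimized (value 1.0) at uniform routing.
    """
    n_experts = len(router_logits[0])
    n_tokens = len(router_logits)
    probs = [softmax(l) for l in router_logits]
    counts = [0] * n_experts
    for p in probs:
        counts[p.index(max(p))] += 1          # top-1 expert per token
    f = [c / n_tokens for c in counts]
    P = [sum(p[i] for p in probs) / n_tokens for i in range(n_experts)]
    return n_experts * sum(fi * Pi for fi, Pi in zip(f, P))

balanced = load_balance_loss([[0.0, 0.0], [0.0, 0.0]])  # uniform routing
skewed = load_balance_loss([[5.0, 0.0], [5.0, 0.0]])    # everything to expert 0
print(balanced, skewed)  # skewed routing yields the larger loss
```

Adding this term to the training objective nudges the router toward spreading tokens across experts instead of collapsing onto a few.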
But for inference at scale, MoE is increasingly dominant because it delivers high capability at manageable compute cost.
[Interactive visualizer: Mixture of Experts (MoE). Clicking a token shows the router network activating only the most relevant of six expert networks (Grammar: articles, pronouns; Animals: animal-related words; Actions: verbs, adverbs; Objects: nouns, things; Descriptors: adjectives; Spatial: location, direction), illustrating that only a small fraction of model parameters is active per token.]