A feedforward network in the transformer architecture is a simple two-layer neural network applied independently and identically to each position in the sequence after the attention mechanism has mixed information across positions.
The typical structure is a linear projection from the hidden dimension d to a larger intermediate dimension (often 4d), a nonlinear activation function such as GELU, then a linear projection back to d. This expand-activate-contract pattern provides nonlinearity and increases the model's capacity to learn complex functions.
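A minimal sketch of this block in PyTorch (the module name and the dimensions d_model=512, d_ff=2048 are illustrative choices, not taken from any particular model):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise feedforward block: expand, activate, contract."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.expand = nn.Linear(d_model, d_ff)     # d -> 4d
        self.act = nn.GELU()                       # nonlinearity
        self.contract = nn.Linear(d_ff, d_model)   # 4d -> d

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); the linear layers act on the last
        # dimension, so every position is transformed by the same weights.
        return self.contract(self.act(self.expand(x)))

ffn = FeedForward(d_model=512, d_ff=2048)   # 4x expansion
x = torch.randn(2, 16, 512)
print(ffn(x).shape)                         # torch.Size([2, 16, 512])
```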
Every position passes through the same feedforward weights, preserving the transformer's position-independent processing. The attention mechanism handles information mixing between positions; the feedforward network handles nonlinear transformation within each position.
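To make this position independence concrete, here is a small check (again with illustrative dimensions): running the block on one position in isolation matches slicing that position out of the full-sequence output.

```python
import torch
import torch.nn as nn

# The same weights serve every position, so processing one position alone
# gives the same result as slicing it out of the full-sequence output.
ffn = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
x = torch.randn(2, 16, 512)                  # (batch, seq_len, d_model)

out_full = ffn(x)                            # all positions at once
out_pos5 = ffn(x[:, 5])                      # position 5 on its own
print(torch.allclose(out_full[:, 5], out_pos5, atol=1e-6))  # True
```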
Research suggests feedforward layers store factual knowledge: specific neurons activate for specific concepts, and editing these neurons can update the model's knowledge. The feedforward layers also hold the majority of a transformer's parameters, often about two-thirds of the per-layer (non-embedding) total.
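A back-of-envelope count shows where the two-thirds figure comes from, assuming the standard 4x expansion and ignoring biases, norms, and embeddings:

```python
d = 4096                       # hidden dimension (illustrative)
ffn_params = 2 * d * (4 * d)   # expand (d x 4d) plus contract (4d x d) = 8d^2
attn_params = 4 * d * d        # Q, K, V, and output projections      = 4d^2
print(ffn_params / (ffn_params + attn_params))  # 8/12, roughly 0.67
```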
Modern architectures experiment with larger feedforward expansions, gated variants like SwiGLU that improve performance, and mixture-of-experts feedforward layers where different experts specialize in different inputs. Understanding the feedforward network's role clarifies the division of labor in transformers.
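As one example of the gated variants mentioned above, here is a sketch of a SwiGLU-style feedforward block, in which a SiLU-activated gate branch elementwise-multiplies a second linear branch before the contraction (module and attribute names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Gated feedforward sketch in the SwiGLU style."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)  # gate branch
        self.up = nn.Linear(d_model, d_ff, bias=False)    # value branch
        self.down = nn.Linear(d_ff, d_model, bias=False)  # contraction

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU(gate(x)) scales up(x) elementwise, then contract back to d.
        return self.down(F.silu(self.gate(x)) * self.up(x))
```

Because the gate adds a third weight matrix, implementations often shrink d_ff to roughly two-thirds of 4d to keep the parameter count comparable to the ungated block.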
Interactive Visualizer

Feedforward Network Visualizer: click positions in the sequence and watch the expand-activate-contract pattern.
Key Properties
- Applied identically to each position
- Expansion ratio typically 4x the hidden dimension
- Provides computational capacity and nonlinearity
- No interaction between sequence positions