A feedforward network in the transformer architecture is a simple two-layer neural network applied independently and identically to each position in the sequence after the attention mechanism has mixed information across positions.
The typical structure is a linear projection from the hidden dimension d to a larger intermediate dimension (often 4d), a nonlinear activation function such as GELU, then a linear projection back to d. This expand-activate-contract pattern provides nonlinearity and increases the model's capacity to learn complex functions.
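A minimal sketch of this block in PyTorch (the module name and the dimensions d_model=512, d_ff=2048 are illustrative choices, not taken from any particular model):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise feedforward block: expand, activate, contract."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.expand = nn.Linear(d_model, d_ff)     # d -> 4d
        self.act = nn.GELU()                       # nonlinearity
        self.contract = nn.Linear(d_ff, d_model)   # 4d -> d

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); the linear layers act on the last
        # dimension, so every position is transformed by the same weights.
        return self.contract(self.act(self.expand(x)))

ffn = FeedForward(d_model=512, d_ff=2048)   # 4x expansion
x = torch.randn(2, 16, 512)
print(ffn(x).shape)                         # torch.Size([2, 16, 512])
```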
Every position passes through the same feedforward weights, preserving the transformer's position-independent processing. The attention mechanism handles information mixing between positions; the feedforward network handles nonlinear transformation within each position.
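To make this position independence concrete, here is a small check (again with illustrative dimensions): running the block on one position in isolation matches slicing that position out of the full-sequence output.

```python
import torch
import torch.nn as nn

# The same weights serve every position, so processing one position alone
# gives the same result as slicing it out of the full-sequence output.
ffn = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
x = torch.randn(2, 16, 512)                  # (batch, seq_len, d_model)

out_full = ffn(x)                            # all positions at once
out_pos5 = ffn(x[:, 5])                      # position 5 on its own
print(torch.allclose(out_full[:, 5], out_pos5, atol=1e-6))  # True
```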
Research suggests feedforward layers store factual knowledge: specific neurons activate for specific concepts, and editing these neurons can update the model's knowledge. The feedforward layers also hold the majority of a transformer's parameters, often about two-thirds of the per-layer (non-embedding) total.
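A back-of-envelope count shows where the two-thirds figure comes from, assuming the standard 4x expansion and ignoring biases, norms, and embeddings:

```python
d = 4096                       # hidden dimension (illustrative)
ffn_params = 2 * d * (4 * d)   # expand (d x 4d) plus contract (4d x d) = 8d^2
attn_params = 4 * d * d        # Q, K, V, and output projections      = 4d^2
print(ffn_params / (ffn_params + attn_params))  # 8/12, roughly 0.67
```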
Modern architectures experiment with larger feedforward expansions, gated variants like SwiGLU that improve performance, and mixture-of-experts feedforward layers where different experts specialize in different inputs. Understanding the feedforward network's role clarifies the division of labor in transformers.
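As one example of the gated variants mentioned above, here is a sketch of a SwiGLU-style feedforward block, in which a SiLU-activated gate branch elementwise-multiplies a second linear branch before the contraction (module and attribute names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Gated feedforward sketch in the SwiGLU style."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)  # gate branch
        self.up = nn.Linear(d_model, d_ff, bias=False)    # value branch
        self.down = nn.Linear(d_ff, d_model, bias=False)  # contraction

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU(gate(x)) scales up(x) elementwise, then contract back to d.
        return self.down(F.silu(self.gate(x)) * self.up(x))
```

Because the gate adds a third weight matrix, implementations often shrink d_ff to roughly two-thirds of 4d to keep the parameter count comparable to the ungated block.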
Interactive Visualizer

Feedforward Network Visualizer: click positions in the sequence and watch the expand-activate-contract pattern.
Key Properties
- Applied identically to each position
- Expansion ratio typically 4x the hidden dimension
- Provides computational capacity and nonlinearity
- No interaction between sequence positions