
Attention Head


An attention head is one of multiple parallel attention mechanisms within a transformer layer, each independently learning different types of relationships between tokens in a sequence. In multi-head attention, the model doesn't compute attention just once; it computes it multiple times simultaneously through separate heads.

Each head has its own Query, Key, and Value projection matrices, allowing it to specialize in different patterns. One head might track syntactic dependencies like subject-verb agreement. Another might learn semantic relationships like pronoun coreference. A third might focus on positional patterns or long-range dependencies.
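As a sketch of what a single head computes (illustrative NumPy, not any particular library's API): the head projects the input with its own Query, Key, and Value matrices, scores every query against every key, and returns a weighted mix of value vectors.

```python
import numpy as np

def attention_head(X, W_q, W_k, W_v):
    """One head: project with its own Q/K/V matrices, then scaled dot-product."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v           # each (seq_len, d_head)
    d_head = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_head)            # (seq_len, seq_len) similarities
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)     # softmax: each row sums to 1
    return weights @ V                            # weighted mix of value vectors

# toy usage: 6 tokens, model dim 8, head dim 4 (all sizes illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
out = attention_head(X, *(rng.normal(size=(8, 4)) for _ in range(3)))
print(out.shape)  # (6, 4)
```

Because each head owns its projection matrices, gradient descent is free to push different heads toward different scoring patterns.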

This diverse specialization emerges naturally during training, without explicit programming. After computing attention independently, the outputs of all heads are concatenated and projected through a linear layer. This aggregation lets the model combine multiple relationship types into a unified representation.
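The split-attend-concatenate-project pipeline can be sketched end to end in NumPy (a minimal illustration, assuming the common convention that the model dimension is split evenly across heads):

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Split projections into heads, attend per head, concatenate, project."""
    seq, d_model = X.shape
    d_head = d_model // n_heads
    # project once, then split the feature dim into heads: (n_heads, seq, d_head)
    def split(W):
        return (X @ W).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(W_q), split(W_k), split(W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq, seq)
    e = np.exp(scores - scores.max(-1, keepdims=True))
    weights = e / e.sum(-1, keepdims=True)               # softmax per query row
    out = weights @ V                                    # (n_heads, seq, d_head)
    concat = out.transpose(1, 0, 2).reshape(seq, d_model)  # concatenate heads
    return concat @ W_o                                  # final linear projection

# toy usage with made-up sizes: 6 tokens, d_model 8, 2 heads
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 8))
Ws = [rng.normal(size=(8, 8)) for _ in range(4)]  # W_q, W_k, W_v, W_o
print(multi_head_attention(X, *Ws, n_heads=2).shape)  # (6, 8)
```

Note that the output has the same width as the input, so the layer can be stacked: the concatenation restores the model dimension and the final projection mixes information across heads.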

The number of attention heads scales with model size: GPT-2 uses 12 to 25 heads per layer depending on the variant, and GPT-3 uses 96 heads per layer. More heads increase representational capacity but also computational cost. Research has shown that attention heads are often interpretable: visualizations reveal meaningful patterns corresponding to linguistic phenomena.

Some heads become specialized for specific tasks while others remain general-purpose.

Interactive Visualizer

Multi-Head Attention Visualization

Explore how different attention heads in a transformer learn specialized patterns. Each head focuses on different types of relationships between words.

Subject-Verb Relations
Example sentence, tokenized with positions 0-5: "The" (0), "cat" (1), "sat" (2), "on" (3), "the" (4), "mat" (5).
Attention Weights for Head 1 (rows = query token, columns = key token, values in %):

|     | The | cat | sat | on | the | mat |
|-----|-----|-----|-----|----|-----|-----|
| The | 10  | 80  | 5   | 2  | 2   | 1   |
| cat | 30  | 20  | 40  | 5  | 3   | 2   |
| sat | 5   | 70  | 15  | 5  | 3   | 2   |
| on  | 2   | 5   | 10  | 10 | 60  | 13  |
| the | 2   | 3   | 5   | 30 | 10  | 50  |
| mat | 1   | 2   | 2   | 15 | 30  | 50  |
Current Pattern Analysis
This head focuses on connecting subjects with their verbs (e.g., 'cat' → 'sat').
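The Head 1 weights shown above can be checked programmatically: each row is a probability distribution over key tokens (here expressed in percent, so rows sum to 100), and the strongest entry in each row is the token the query attends to most.

```python
import numpy as np

tokens = ["The", "cat", "sat", "on", "the", "mat"]
# Head 1 weights from the visualizer above, in percent
W = np.array([
    [10, 80,  5,  2,  2,  1],
    [30, 20, 40,  5,  3,  2],
    [ 5, 70, 15,  5,  3,  2],
    [ 2,  5, 10, 10, 60, 13],
    [ 2,  3,  5, 30, 10, 50],
    [ 1,  2,  2, 15, 30, 50],
])
assert (W.sum(axis=1) == 100).all()  # each row is a distribution over keys

# strongest attention target for each query token
for tok, row in zip(tokens, W):
    print(f"{tok:>3} -> {tokens[row.argmax()]}")
```

Note how "sat" attends most strongly to "cat" (70%), which is exactly the subject-verb pattern the analysis describes.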