
Transformer

The Transformer is the foundation of modern LLMs. Before Transformers, neural networks processed sequences with RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory networks). These architectures process data sequentially, one token at a time, which is slow and struggles to preserve long-range dependencies. Transformers replaced this with the attention mechanism.

Instead of processing sequentially, attention lets the model look at all positions in the input simultaneously and determine which positions are most relevant to each other. This makes the computation parallelizable: massive batches can be processed at once, and long-range dependencies are captured directly rather than relayed step by step. An attention head looks at a word and asks which other words are most relevant to understanding it.
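The mechanism described above can be sketched as scaled dot-product attention. This is a minimal NumPy illustration, not a production implementation: every query scores every key at once, a softmax turns the scores into weights, and the output is a weighted mix of values. All positions are computed in a single matrix multiply, which is where the parallelism comes from.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attend over all positions at once: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq, seq) relevance of every pair
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V, weights                     # weighted mix of value vectors

# Toy example: 4 positions, 8-dimensional keys/queries/values.
rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))
V = rng.standard_normal((seq_len, d_k))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)          # one output vector per position
print(w.sum(axis=-1))     # each position's weights sum to 1
```

Row `i` of the weight matrix answers the question in the text: for word `i`, how relevant is every other word to understanding it?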

Different attention heads specialize in different patterns. Some look at subject-verb relationships. Some track pronouns. Some identify topics. All in parallel. Transformers replaced RNNs and LSTMs almost entirely because they're faster to train and more capable. Every modern LLM is built on Transformers, and the architecture is almost unchanged since the original 2017 paper, "Attention Is All You Need."
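The "different heads in parallel" idea is multi-head attention: the model dimension is split into slices, each head runs its own attention over its slice, and the results are concatenated and projected back. The weight names below (`Wq`, `Wk`, `Wv`, `Wo`) are illustrative placeholders, and this sketch omits masking, biases, and batching.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Run n_heads independent attentions over slices of the model dimension."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    # Project, then split into heads: (heads, seq, d_head).
    Q = (X @ Wq).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    K = (X @ Wk).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    V = (X @ Wv).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    heads = softmax(scores) @ V                          # each head attends separately
    # Concatenate heads back to (seq, d_model), then mix with output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

rng = np.random.default_rng(1)
d_model, seq_len, n_heads = 16, 5, 4
X = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads)
print(out.shape)  # same shape as the input: one vector per position
```

Because each head gets its own query/key/value slices, nothing forces the heads to agree, which is why they can specialize in different relationships.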

The improvements have come from scale: more parameters, more data, and longer training, not from architectural innovation. This stability is interesting. It suggests we've found something fundamental about how to process sequences.