Beam search is a decoding algorithm for language models that maintains multiple candidate sequences in parallel, exploring the k most promising options at each generation step rather than greedily committing to the single best token.
The algorithm works by expanding each current candidate with every possible next token, scoring each extended sequence by its cumulative log-probability, and keeping only the top k candidates (the 'beam width') for the next step. A beam width of 1 is equivalent to greedy decoding: it always selects the single highest-probability token.
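As a concrete illustration, here is a minimal, self-contained sketch of that loop in Python. The toy model (`next_token_logprobs`, its five-word vocabulary, and the hand-picked probability table) is a hypothetical stand-in for a real language model's forward pass:

```python
import math

# Hypothetical stand-in for a real model: a tiny vocabulary and a
# hand-picked conditional probability table keyed on the previous token.
VOCAB = ["the", "cat", "sat", "mat", "<eos>"]
PROBS = {
    None:  [0.50, 0.20, 0.10, 0.10, 0.10],
    "the": [0.05, 0.50, 0.10, 0.30, 0.05],
    "cat": [0.10, 0.05, 0.60, 0.05, 0.20],
    "sat": [0.40, 0.10, 0.05, 0.05, 0.40],
    "mat": [0.05, 0.05, 0.05, 0.05, 0.80],
}

def next_token_logprobs(prefix):
    """Return (token, log-probability) pairs for every possible next token."""
    last = prefix[-1] if prefix else None
    return [(tok, math.log(p)) for tok, p in zip(VOCAB, PROBS[last])]

def beam_search(beam_width=3, max_len=6):
    # Each beam is (token list, cumulative log-probability).
    beams = [([], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            # Finished sequences are carried forward unchanged.
            if tokens and tokens[-1] == "<eos>":
                candidates.append((tokens, score))
                continue
            # Expand this beam with every possible next token.
            for tok, lp in next_token_logprobs(tokens):
                candidates.append((tokens + [tok], score + lp))
        # Prune: keep only the top-k candidates by cumulative score.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams
```

A real decoder would expand only the model's top tokens rather than the full vocabulary, but with a five-word vocabulary the exhaustive expansion keeps the sketch faithful to the description above.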
Wider beams explore more of the output space, often finding higher-probability complete sequences that greedy decoding misses because they pass through lower-probability intermediate tokens. Beam search became the standard decoding method for neural machine translation, where greedy decoding often produced mediocre translations while beam search found substantially better ones.
The computational cost scales linearly with beam width: a beam width of 10 requires roughly 10x the compute of greedy decoding. Variants include length normalization (counteracting the bias toward shorter sequences that summed log-probabilities create), diverse beam search (encouraging exploration of structurally different outputs), and nucleus-constrained beam search.
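As a sketch of the length-normalization variant under one common formulation (the function name and the exponent `alpha` here are illustrative, not any particular library's API), candidates are re-ranked by cumulative log-probability divided by sequence length raised to a tunable power:

```python
def length_normalized_score(logprob_sum, length, alpha=0.6):
    # Dividing by length**alpha (values around 0.6-0.7 are a common choice
    # in NMT) offsets the penalty that summed log-probabilities impose on
    # longer sequences; alpha=0 recovers the raw cumulative score.
    return logprob_sum / (length ** alpha)
```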
For modern LLMs used in chat applications, beam search is less common than sampling methods because it tends to produce generic, repetitive text. But for tasks where deterministic, high-likelihood output is desired, such as translation or structured generation, it remains valuable.
[Interactive visualizer: Beam Search Decoding. Watch how language models explore multiple word sequences simultaneously.]
Beam width = 3 means the model keeps the top 3 most promising sequences at each step. More beams explore more possibilities but use more memory and computation.
Beam search strikes a balance between greedy decoding (always picking the top token) and exhaustive search (scoring every possible sequence). It keeps the top-k most promising sequences at each step, pruning unlikely paths while still exploring multiple possibilities. Wider beams find better sequences but cost more computation.
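Using the `beam_search` sketch from earlier, the visualizer's setting corresponds to `beam_width=3`; comparing it against width 1 (greedy) shows the trade-off directly (the exact outputs depend on the toy probability table):

```python
for width in (1, 3):
    best_tokens, best_score = max(beam_search(beam_width=width),
                                  key=lambda c: c[1])
    print(f"beam width {width}: {' '.join(best_tokens)} "
          f"(log-prob {best_score:.2f})")
```

In this toy table, greedy decoding commits to 'cat' after 'the' (probability 0.5), while the wider beam keeps the lower-probability 'mat' continuation (0.3) alive long enough to discover its high-probability ending, yielding a better-scoring complete sequence.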