veda.ng

The KV cache (Key-Value cache) stores the Key and Value vectors computed for previous tokens during autoregressive generation, avoiding redundant computation and greatly speeding up inference. In transformer attention, generating token N requires computing attention against all previous tokens 0 through N-1.

Without caching, every generation step would recompute the Keys and Values for all previous tokens, giving quadratic total work as the sequence grows. KV caching stores each token's Key and Value vectors after computing them once. When generating token N, you compute the Query, Key, and Value only for the new token, then attend over the cached Keys and Values from all previous tokens.
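A minimal sketch of one cached decode step, using NumPy and single-head attention for clarity (the function name and cache layout are illustrative, not any particular library's API):

```python
import numpy as np

def attention_step(q, k_new, v_new, cache):
    """One decode step: append the new token's K/V to the cache,
    then attend the new query over all cached keys and values."""
    cache["K"].append(k_new)
    cache["V"].append(v_new)
    K = np.stack(cache["K"])                 # (seq_len, d)
    V = np.stack(cache["V"])                 # (seq_len, d)
    scores = K @ q / np.sqrt(q.shape[-1])    # (seq_len,)
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ V                       # (d,) context vector

rng = np.random.default_rng(0)
d = 8
cache = {"K": [], "V": []}
for step in range(4):                        # generate 4 tokens
    q, k, v = rng.normal(size=(3, d))        # Q/K/V for the new token only
    out = attention_step(q, k, v, cache)
```

Each step computes projections for exactly one token; the loop never revisits earlier tokens, which is the whole point of the cache.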

This reduces per-step complexity from quadratic to linear in sequence length. The memory cost is significant: the KV cache grows as sequence length times number of layers times number of attention heads times head dimension times 2 (for K and V), times the bytes per element. For large models with long contexts, the KV cache can consume tens of gigabytes.
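The formula above is easy to turn into a back-of-the-envelope calculator. The shapes below are assumed figures for a hypothetical large model, not any specific released one:

```python
def kv_cache_bytes(seq_len, n_layers, n_heads, head_dim, bytes_per_elem=2):
    """Size of the KV cache for one sequence: 2 tensors (K and V) per
    layer, each of shape (seq_len, n_heads, head_dim)."""
    return 2 * n_layers * seq_len * n_heads * head_dim * bytes_per_elem

# Assumed shapes for illustration: 80 layers, 64 heads of dim 128,
# a 32K context, fp16 (2 bytes per element).
gib = kv_cache_bytes(32_768, n_layers=80, n_heads=64, head_dim=128) / 2**30
print(gib)  # → 80.0 GiB for a single sequence
```

Note that models using grouped-query or multi-query attention keep far fewer KV heads than query heads, which shrinks this figure proportionally.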

Memory optimization techniques include KV cache quantization (storing in lower precision), paged attention (virtual-memory-style management of the KV cache), and sliding window attention (caching only recent tokens). Understanding the KV cache is essential for deploying LLMs efficiently; it is often the primary memory bottleneck during inference with long contexts.
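Of these, sliding window attention has the simplest cache policy to sketch: keep a fixed-size window and evict the oldest entries. A toy version of that eviction behavior (class name and layout are illustrative):

```python
from collections import deque

class SlidingWindowKVCache:
    """Keeps only the most recent `window` tokens' K/V entries;
    appending beyond the window silently evicts the oldest."""
    def __init__(self, window):
        self.K = deque(maxlen=window)
        self.V = deque(maxlen=window)

    def append(self, k, v):
        self.K.append(k)
        self.V.append(v)

cache = SlidingWindowKVCache(window=4)
for t in range(10):                      # generate 10 tokens...
    cache.append(("k", t), ("v", t))
# ...but only the last 4 tokens' K/V remain cached.
```

This caps memory at O(window) instead of O(sequence length), at the cost of the model no longer attending to tokens outside the window.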

Interactive Visualizer

KV Cache Visualization

See how caching Key-Value vectors reduces computation during autoregressive generation
