Throughput measures the rate at which a system processes work over time, typically expressed as requests per second, tokens per second, or transactions per minute. It's distinct from but related to latency: latency measures how long each individual request takes, while throughput measures how many requests complete in a given period.
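The distinction is easy to see with a small calculation (hypothetical numbers): throughput is computed over a time window, while latency belongs to each individual request.

```python
# Toy illustration with made-up numbers: latency and throughput
# measure different things and are computed separately.

# Suppose 8 requests complete within a 2-second window,
# and each request takes 0.25 s end to end.
num_requests = 8
window_seconds = 2.0
per_request_latency = 0.25  # seconds

throughput = num_requests / window_seconds  # requests per second
print(f"Throughput: {throughput} req/s")
print(f"Latency:    {per_request_latency} s per request")
```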
A system can have low latency but low throughput if it processes requests sequentially. Batching (grouping multiple requests and processing them together) is the primary technique for improving throughput. GPUs are highly parallel, so processing 32 requests together often takes only slightly longer than processing one, greatly increasing throughput.
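A simple cost model makes the effect concrete. The numbers below are assumptions for illustration: a batch of one takes 100 ms, and each additional request in the batch adds only 2 ms of marginal work.

```python
# Hypothetical GPU cost model: a batch of 1 takes 100 ms;
# each extra request adds only 2 ms of marginal compute.
T_SINGLE = 0.100    # seconds for a batch of 1
T_MARGINAL = 0.002  # extra seconds per additional request

def batch_time(n):
    """Time to process a batch of n requests under this model."""
    return T_SINGLE + (n - 1) * T_MARGINAL

def throughput(n):
    """Requests completed per second at batch size n."""
    return n / batch_time(n)

print(throughput(1))   # ~10 req/s
print(throughput(32))  # ~198 req/s: ~20x the throughput for ~1.6x the batch time
```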
However, batching increases latency for individual requests because each must wait for the batch to complete. This creates a core tension: optimizing for throughput (large batches, high parallelism) conflicts with optimizing for latency (small batches, immediate response).
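Extending the same toy model shows the tension numerically. If we also assume requests arrive 10 ms apart, the first request in a batch waits for the batch to fill before any compute happens, so its latency grows with batch size even as throughput improves.

```python
# Same hypothetical cost model as above, plus an assumed arrival rate.
ARRIVAL_GAP = 0.010  # a new request arrives every 10 ms (assumed)
T_SINGLE, T_MARGINAL = 0.100, 0.002

def worst_case_latency(n):
    """Latency of the first request in a batch of n: it waits for
    n-1 later arrivals, then the whole batch computes together."""
    fill_wait = (n - 1) * ARRIVAL_GAP
    compute = T_SINGLE + (n - 1) * T_MARGINAL
    return fill_wait + compute

for n in (1, 8, 32):
    print(f"batch={n:2d}  worst-case latency={worst_case_latency(n):.3f} s")
```

Under these assumptions, batch size 32 delivers roughly 20x the throughput but more than quadruples the first request's latency, which is exactly the trade-off described above.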
Production systems typically segment traffic: interactive users get low-latency processing with small batches, while batch workloads maximize throughput with large batches. Throughput also depends on hardware utilization: a system achieving only 50% GPU utilization has headroom to roughly double its throughput.
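The headroom claim is simple back-of-envelope arithmetic. Assuming throughput scales roughly linearly with utilization (a simplification, since contention and memory bandwidth can intervene), the ceiling is the observed rate divided by the utilization fraction; the numbers here are hypothetical.

```python
# Back-of-envelope headroom estimate (hypothetical numbers).
current_throughput = 500  # req/s observed
gpu_utilization = 0.50    # GPU busy 50% of the time

# If throughput scales roughly linearly with utilization,
# driving the GPU to 100% would roughly double throughput.
estimated_ceiling = current_throughput / gpu_utilization
print(f"Estimated ceiling: {estimated_ceiling} req/s")
```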
Continuous batching and speculative decoding are advanced techniques that maintain high throughput while keeping latency acceptable.
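The core idea of continuous batching can be sketched with a toy scheduler. This is an illustrative simplification, not a real serving engine: each "step" stands in for one decoding iteration, and the key property is that finished sequences free their batch slots immediately for waiting requests, instead of the whole batch draining before new work is admitted.

```python
from collections import deque

MAX_BATCH = 4  # assumed slot count for this toy example

def run(requests, max_batch=MAX_BATCH):
    """Toy continuous-batching loop.

    requests: list of (name, tokens_to_generate) pairs, in arrival order.
    Returns (name, finish_step) pairs in completion order.
    """
    waiting = deque(requests)
    active = {}  # name -> tokens still to generate
    completed = []
    step = 0
    while waiting or active:
        # Admit waiting requests into any free slots. This is the key
        # difference from static batching, which waits for the whole
        # batch to finish before starting the next one.
        while waiting and len(active) < max_batch:
            name, tokens = waiting.popleft()
            active[name] = tokens
        # One decoding step: every active sequence emits one token.
        for name in list(active):
            active[name] -= 1
            if active[name] == 0:
                del active[name]
                completed.append((name, step))
        step += 1
    return completed

# Short requests finish and leave early; "e" slips into the freed slot
# without waiting for the long request "b" to drain.
print(run([("a", 2), ("b", 5), ("c", 1), ("d", 3), ("e", 2)]))
```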
Interactive Visualizer
Throughput vs Latency
Compare sequential vs batched processing to understand how throughput differs from latency
Sequential processing handles one request at a time, while batched processing groups multiple requests together. Notice how batching improves throughput while keeping per-request latency nearly the same!