Inference at scale

§1 KV cache + PagedAttention
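During autoregressive generation, each layer caches the key (K) and value (V) vectors of every past token. The cache grows linearly with sequence length and with the number of concurrent requests, so it is the main memory cost that scales with load. Pre-vLLM systems reserved a FIXED, contiguous max-context buffer per request, so short requests left most of their reservation unused; the vLLM paper measured only about 20-38% of KV-cache memory holding actual token state in such systems. PagedAttention (Kwon et al. 2023, vLLM) borrows the paging idea from operating-system virtual memory: split the cache into fixed-size BLOCKS, allocate them on demand, and keep a per-request block table that maps logical token positions to physical blocks. Waste is then limited to the last, partially filled block of each sequence, which lets far more requests share one GPU; the paper reports 2-4× higher throughput at comparable latency than prior serving systems.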
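The block-table bookkeeping can be illustrated with a minimal sketch. It tracks only which physical block and offset each token's K/V would occupy; no tensors are stored. The names `BlockAllocator`, `Request`, and `BLOCK_SIZE` are hypothetical and chosen for illustration, not taken from vLLM's code.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

BLOCK_SIZE = 16  # tokens per KV block (illustrative value, an assumption)


class BlockAllocator:
    """Hands out physical block ids from a fixed-size pool."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks: List[int] = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("KV cache pool exhausted")
        return self.free_blocks.pop()

    def release(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


@dataclass
class Request:
    """Tracks one sequence's logical-to-physical block mapping."""

    block_table: List[int] = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self, allocator: BlockAllocator) -> Tuple[int, int]:
        # A new physical block is needed only when the last one is full.
        offset = self.num_tokens % BLOCK_SIZE
        if offset == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1
        # (physical block, offset) where this token's K/V would be written.
        return self.block_table[-1], offset

    def finish(self, allocator: BlockAllocator) -> None:
        # Return every block to the pool so other requests can reuse it.
        for block_id in self.block_table:
            allocator.release(block_id)
        self.block_table.clear()
        self.num_tokens = 0
```

Under these assumptions, a 20-token request holds two blocks (32 slots) instead of a full max-context reservation, and those blocks go back to the pool the moment the request finishes.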
§2 Continuous batching + prefill/decode disaggregation
With PagedAttention addressing the memory problem, the next bottleneck is wasted compute. Static batching holds every request in a batch until the slowest one finishes, so slots freed by short requests sit idle while new arrivals wait in the queue. Continuous batching (Orca, then vLLM) lets requests enter and exit the batch DYNAMICALLY: at every iteration, finished requests leave and waiting ones take their place. Prefill/decode disaggregation goes a step further and runs the two phases on different GPUs, so compute-heavy prefill does not stall latency-sensitive decode. Together, these techniques are a large part of what makes commercial LLM serving viable.
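The difference between the two scheduling policies can be seen in a toy simulation that counts decode iterations. It is a sketch under simplifying assumptions: each request is reduced to the number of tokens it must generate, prefill cost is ignored, and every iteration takes the same time regardless of batch occupancy. The function names are hypothetical.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Each batch runs until its longest request has finished."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Free slots are refilled before every iteration."""
    waiting = deque(lengths)
    running: List[int] = []  # remaining decode steps per active request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode iteration: every active request emits one token.
        running = [remaining - 1 for remaining in running]
        # Finished requests leave the batch immediately.
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps


lengths = [2, 8, 2, 8]  # tokens to generate per request (made-up workload)
print(static_batching_steps(lengths, batch_size=2))      # 16
print(continuous_batching_steps(lengths, batch_size=2))  # 12
```

In this made-up workload the short requests no longer pin a slot for eight iterations, so the same four requests finish in 12 iterations instead of 16. Real gains depend on the length distribution and arrival pattern.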
§3 Speculative decoding — production deep dive
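Decode is memory-bandwidth-bound (Ch.21): the bottleneck is loading model weights from HBM, not the arithmetic itself. Speculative decoding (Leviathan 2022, Chen 2023) exploits this: a small draft model proposes K tokens, and the target model VERIFIES all K in parallel in ONE forward pass, reusing the weight load it would have paid for a single token anyway. Draft tokens are kept up to the first rejection, where the target supplies a replacement, so the output distribution matches that of the target model alone. Acceptance rates of roughly 70-90% and effective speedups of roughly 2-3× are typical, though both depend on how closely the draft model tracks the target. Ch.S is referenced throughout the book; this section is its production framing.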
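The verification step and its payoff can be sketched in plain Python. The acceptance rule below follows the rejection-sampling scheme described by Leviathan et al. and Chen et al.: accept a draft token with probability min(1, p/q), otherwise resample from the normalized residual max(0, p - q). Distributions are plain lists of probabilities over a toy vocabulary, and the function names are hypothetical. The expected-token formula assumes each draft token is accepted independently with the same probability `alpha`, which is a simplification.

```python
import random
from typing import List, Sequence


def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass for draft length k."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def verify_draft(
    draft_tokens: Sequence[int],
    q_dists: Sequence[Sequence[float]],  # K draft distributions
    p_dists: Sequence[Sequence[float]],  # K+1 target distributions, one pass
    rng: random.Random,
) -> List[int]:
    """Return the accepted prefix plus one token chosen by the target."""
    out: List[int] = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_dists[i], q_dists[i]
        # Accept with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q), then stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(rng.choices(range(len(p)), weights=residual)[0])
        return out
    # All K accepted: the extra target distribution yields a bonus token.
    last = p_dists[-1]
    out.append(rng.choices(range(len(last)), weights=last)[0])
    return out


print(round(expected_tokens_per_pass(alpha=0.8, k=4), 2))  # 3.36
```

With an 80% acceptance rate and four draft tokens, each target pass yields about 3.36 tokens on average under the independence assumption. The wall-clock speedup is lower than that because running the draft model is not free, which is consistent with the 2-3× range quoted above.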
