INFERENCE AT SCALE
Section 23.2
02

Continuous batching + prefill/decode disaggregation

PagedAttention solved the memory problem. The next bottleneck is compute. Static batching — collect N requests, run them as a batch, wait for all to finish — wastes massive compute because requests have different lengths. The slowest request gates the whole batch; faster requests finish and sit idle. Continuous batching (Yu 2022 “Orca”) solves this by letting requests join and leave the batch at each token-generation step. Combined with prefill/decode disaggregation — running the two different phases on different GPUs — this is the architecture that powers every commercial LLM API in 2025.

Why static batching wastes compute

Imagine three requests in a static batch:

Request A: prompt 50 tokens, generates 200 tokens. Finishes step 200. Request B: prompt 100 tokens, generates 50 tokens. Finishes step 50. Request C: prompt 500 tokens, generates 1000 tokens. Finishes step 1000. Static batching pseudocode: while not all done: for each request in batch: generate one token (or pad with a NULL token for finished requests) step += 1 After step 50: B is done but stays in the batch (consuming a slot). After step 200: A is done; still in the batch. After step 1000: C finishes; batch can release all. Wasted compute: B's contribution after step 50: 950 steps of NULL processing = wasted. A's contribution after step 200: 800 steps wasted. C runs until 1000: not wasted, but A and B's slots are wasted for most of those. Net: ~half the per-step compute is wasted on padding for typical workloads.

The waste is structural. Different requests have different generation lengths; the slow ones force fast ones to wait. Static batching only works well when all requests are similar length, which is rarely true.

Continuous batching — requests enter and leave dynamically

Yu 2022 (“Orca: A Distributed Serving System for Transformer-Based Generative Models”) introduced the fix: at each token-generation step, REORGANISE the batch:

Continuous batching pseudocode: active_requests = [empty] while True: # Form the batch dynamically at each step. for each request in active_requests: compute one new token (using its current KV cache). # Check for completed requests: for each request that generated EOS or hit max_tokens: remove from active_requests free its KV cache blocks return the response to the user # Add new requests if room: while there's capacity (memory + compute): pull next request from pending queue prefill its prompt (more on this in §23.3) add to active_requests The key property: the BATCH COMPOSITION CHANGES at each step. No padding; no waiting; finished requests leave; new ones enter.

Continuous batching dramatically improves GPU utilisation. The compute that would have been wasted on padding is reallocated to new requests joining the batch.

The challenge: at each step, you’re computing one new token for many active requests, each with a DIFFERENT context length and DIFFERENT KV cache. The attention kernel has to handle variable-length contexts within a single batch. This is where PagedAttention’s block tables become essential — each request’s KV cache is a list of blocks; the kernel just iterates through each request’s blocks.

Prefill vs decode — two distinct phases

For each request, generation has two phases with COMPLETELY DIFFERENT compute profiles:

PREFILL (process the input prompt): Input: prompt of N tokens. Work: forward pass over all N tokens at once. - All N positions compute their Q, K, V at every layer. - Attention is QKᵀ over N×N (or FlashAttention equivalent). Output: produces the FIRST output token + populates the full KV cache. Cost per request: O(N · d²) matmul + O(N² · d) attention. Compute regime: COMPUTE-BOUND (large matmuls on the N×d activations). Latency: 50-500 ms for a typical prompt. DECODE (generate one token at a time): Input: one new token (and the KV cache built during prefill). Work: forward pass for ONE position. - Compute Q, K, V for the new position only. - Attention: new Q against all cached K, V. Output: one new token; KV cache grows by one entry. Cost per token: O(d²) matmul + O(N · d) attention against cache. Compute regime: MEMORY-BOUND (small ops, dominated by reading model weights and KV cache). Latency per token: 10-50 ms.

Prefill and decode are fundamentally different from a compute perspective. Prefill is compute-bound (you can use full GPU FLOPs); decode is memory-bound (you’re limited by HBM bandwidth pulling weights).

— think, then check —

Data structures that change at each step:

1. Active request list: a dynamic list (vector / deque) tracking currently-batched requests. At each step:

  • Remove requests that finished (EOS or max_tokens).
  • Add new requests from the pending queue if memory + compute capacity allows.

2. Per-request state: each active request has:

  • Current generated tokens (a list).
  • Block table (pointers to its KV cache blocks).
  • Logit-bias parameters, sampling temperature, etc.
  • Generation step counter, max_tokens limit.

3. Pending queue: incoming requests waiting for a slot. Sorted by priority or FIFO.

Kernel changes required:

1. Variable-length attention: the attention kernel must handle a batch where each request has different context length AND different KV cache layout (block table). FlashAttention 2 has a “varlen” variant specifically for this. The kernel iterates per-request, reading from the request’s block-table-indexed memory.

2. Per-request sampling: after computing logits for all active requests in parallel, sample each request’s next token with its OWN parameters (different temperatures, top-p, etc.). This is a per-request branch in the sampling kernel.

3. Memory management: request finishes → block table freed → blocks returned to pool. Request starts → blocks allocated. Allocation has to be fast (often a free-list-based pool).

4. Batch packing: at each step, the active requests’ inputs and KV caches need to be efficiently laid out for the attention kernel. vLLM uses a “block table tensor” that the kernel reads to find each request’s blocks.

Scheduling logic:

1. When to add new requests: balance throughput (more requests = more parallelism) vs latency (more requests = larger batch = slower per-step).

2. How to handle preemption: if a long request blocks newer faster ones, can you “pause” it (save state, free blocks, resume later)? vLLM supports this for fairness.

3. Prefill vs decode interleaving: should new requests be prefilled IMMEDIATELY (interrupting decode), or batched with other prefills? Different systems make different choices (§23.3 covers disaggregation).

The implementation is non-trivial — vLLM’s scheduler alone is ~5K lines of code. But the benefit (2-5× throughput) is so large that every serious inference engine has built one.

Disaggregated serving — prefill on big GPUs, decode on small

The natural extension: since prefill and decode have different compute profiles, run them on different hardware.

Disaggregated serving (DistServe 2024, vLLM v0.6, others): Prefill cluster: - Few HIGH-FLOPs GPUs (H100s, B200s). - Designed for batched matmuls. - Each request: handle prefill, get first token + KV cache. - Stream the KV cache to a decode worker. Decode cluster: - Many MEMORY-RICH GPUs (could be cheaper hardware with more HBM). - Designed for memory-bound decoding. - Receives KV cache from prefill; continues generation token-by-token. Why disaggregation wins: - Prefill needs FLOPs (compute-bound). Concentrate that compute. - Decode needs HBM bandwidth (memory-bound). Use cheaper / more HBM hardware. - Different optimisation targets for different hardware. Implementation cost: KV cache transfer between prefill and decode is real. For a 32K-context prompt: ~10 GB of KV state to ship. Done via fast inter-GPU networking (NVLink, InfiniBand). Adds 10-50 ms latency to first-token-time. When it's worth it: - Mixed workload with long prefills (RAG, document QA): yes. - All-short workload (chat): often NOT worth it; the overhead exceeds the gain. - Cost-sensitive serving: yes; lets you use cheaper decode hardware.

Disaggregated serving is the latest evolution. Not all systems use it yet, but it’s the direction high-end commercial LLM serving is heading.

— think, then check —

Why prefill is compute-bound:

Prefill processes N input tokens at once. The matmuls involved are:

  • QKV projection: X (N × d) · W (d × d) = ON × d) matmul. With N=4K and d=4096: m=4K, k=4K → arithmetic intensity ~4096. Far above ridge.
  • FFN: (N × d) · (d × F): m=N=4K, k=d=4K → high AI.
  • Attention QK^T: (N × d) · (d × N): N²d = 67 GFLOPs at N=4K. Also high AI.

All operations are large enough to saturate tensor cores. H100 utilisation: 50-70% of peak FLOPs (~500 TFLOPs of bf16). The GPU is being used for what it’s designed for.

Why decode is memory-bound:

Decode processes ONE token at a time. The matmuls involved are:

  • QKV projection: x (1 × d) · W (d × d) — this is a GEMV, not GEMM. m=1, k=d → AI ≈ 1 FLOP/byte. Far below ridge.
  • FFN: (1 × d) · (d × F): m=1 → AI ≈ 1. Memory-bound.
  • Attention against KV cache: q · K_cache (1 × N×d) — reads the KV cache from HBM. AI ≈ 1.

All operations are tiny. The actual FLOPs are minimal; the cost is loading weights and KV cache from HBM. H100 utilisation in decode: 5-15% of peak FLOPs. The compute units sit idle most of the time waiting for HBM.

Bandwidth math for decode:

Per generated token, need to load: ~70 GB of model weights (Llama 70B in fp16) + ~5-10 GB of KV cache (depending on context).

At 3.35 TB/s HBM: ~25 ms per token from bandwidth alone.

This explains why decode latency is ~30-50 ms per token on H100 for Llama 70B: it’s bottlenecked by HBM, not compute.

What this means for hardware:

  • Prefill cluster: H100 / B200 / Trainium — high FLOP density. Tensor cores doing 1 PFLOPs of work.
  • Decode cluster: could be cheaper hardware. AMD MI300X (more HBM, less FLOPs) might be better. Apple M2 Ultra (huge bandwidth, less FLOPs) is competitive. Cost-per-decoded-token is the metric.

The disaggregation thesis: spend FLOP money where it pays (prefill); spend bandwidth money where it pays (decode). Don’t pay for both on every GPU.

Continuous batching in practice — vLLM scheduler

vLLM’s serving loop is the canonical example:

vLLM main serving loop (simplified): while True: # 1. SCHEDULE: determine which requests run this step. scheduled_requests = [] # Add prefill candidates (constrained by memory). while pending_queue and have_memory_for_prefill(): req = pending_queue.pop() allocate KV blocks for its prompt scheduled_requests.append(req in 'prefill' state) # Add decode candidates (all active requests). for req in active_decode_requests: scheduled_requests.append(req in 'decode' state) # 2. EXECUTE: run the batch. # For prefill requests: full prompt processing. # For decode requests: one token each. # The model forward handles the heterogeneous batch (some prefill, some decode). logits = model.forward(scheduled_requests) # 3. SAMPLE: per-request sampling. next_tokens = [] for req, logit in zip(scheduled_requests, logits): token = sample(logit, req.sampling_params) next_tokens.append(token) # 4. UPDATE: append tokens, check completion. for req, token in zip(scheduled_requests, next_tokens): req.append_token(token) if req.is_complete(): free_kv_blocks(req) return_response(req) active_decode_requests.remove(req) else: if req.was_prefilling: req.transition_to_decode() active_decode_requests.add(req) # Loop: ~5-50 ms per iteration, depending on batch size and model.
— think, then check —

(a) Chat (100-token prompt, 200-token response):

Prefill: 100 tokens × ~30 GFLOPs/token = ~3 TFLOPs total. Takes ~3 ms on H100. Tiny.

Decode: 200 tokens × ~30 ms/token = 6 seconds.

Time profile: 99.95% of work is in decode.

Disaggregation overhead: KV cache transfer is ~100 tokens × 320 KB = 32 MB. Transfer takes ~1 ms. Small.

BUT: the per-request prefill is so cheap that the prefill cluster gets crushed by request volume. Need many prefill GPUs to handle the request rate.

Verdict: PROBABLY NOT WORTH IT. Static unified serving with continuous batching is fine. Disaggregation adds complexity without strong benefit.

(b) RAG (16K-token prompt, 500-token response):

Prefill: 16K tokens × ~30 GFLOPs/token = ~500 TFLOPs total. Takes ~500 ms on H100 (compute-bound).

Decode: 500 tokens × ~50 ms/token = 25 seconds.

Time profile: ~2% prefill, ~98% decode. Prefill is small fraction but ABSOLUTE prefill time (500ms) is noticeable.

Disaggregation overhead: KV cache transfer is 16K × 320 KB = 5 GB. At 50 GB/s InfiniBand: ~100 ms. Significant but acceptable.

Win: prefill happens on FLOP-rich hardware (fast); decode happens on bandwidth-rich hardware (cheaper); KV transfer is amortised across a long generation.

Verdict: WORTH IT. Especially if you have many concurrent RAG requests (the prefill compute is the binding constraint).

(c) Code completion (4K-token prompt, 2K-token response):

Prefill: 4K tokens × ~30 GFLOPs/token = ~120 TFLOPs. Takes ~120 ms.

Decode: 2K tokens × ~30 ms/token = 60 seconds.

Time profile: 0.2% prefill, 99.8% decode.

Disaggregation overhead: 4K × 320 KB = 1.3 GB. Takes ~25 ms.

Verdict: MARGINAL. The transfer overhead is significant relative to prefill time. If the workload is consistent, disaggregation pays; if mixed with chat, the architecture complexity may not be worth it.

The deeper rule:

Disaggregation wins when:

  • Prefill is a significant absolute cost (long prompts).
  • The prefill compute is the binding constraint (you’d need to add more H100s otherwise).
  • The decode infrastructure can use cheaper hardware.
  • The KV transfer overhead is amortised across a long generation.

For most “chat-only” workloads: unified serving + continuous batching + PagedAttention is enough. For mixed RAG/long-context workloads: disaggregation is the next step.

The trend: as context lengths grow (1M tokens becoming common), disaggregation will become more universal because long-prefill workloads benefit so much.

Next: §23.3 — Speculative decoding deep dive. The third major inference optimisation: a draft model proposes K tokens, the target verifies them in parallel. 2-3× throughput at no quality cost.