Continuous batching + prefill/decode disaggregation

Section 23.2

Continuous batching + prefill/decode disaggregation

PagedAttention solved the memory problem. The next bottleneck is compute. Static batching — collect N requests, run them as a batch, wait for all to finish — wastes massive compute because requests have different lengths. The slowest request gates the whole batch; faster requests finish and sit idle. Continuous batching (Yu 2022 “Orca”) solves this by letting requests join and leave the batch at each token-generation step. Combined with prefill/decode disaggregation — running the two different phases on different GPUs — this is the architecture that powers every commercial LLM API in 2025.

Why static batching wastes compute

Imagine three requests in a static batch:

Request A: prompt 50 tokens, generates 200 tokens. Finishes step 200. Request B: prompt 100 tokens, generates 50 tokens. Finishes step 50. Request C: prompt 500 tokens, generates 1000 tokens. Finishes step 1000. Static batching pseudocode: while not all done: for each request in batch: generate one token (or pad with a NULL token for finished requests) step += 1 After step 50: B is done but stays in the batch (consuming a slot). After step 200: A is done; still in the batch. After step 1000: C finishes; batch can release all. Wasted compute: B's contribution after step 50: 950 steps of NULL processing = wasted. A's contribution after step 200: 800 steps wasted. C runs until 1000: not wasted, but A and B's slots are wasted for most of those. Net: ~half the per-step compute is wasted on padding for typical workloads.

The waste is structural. Different requests have different generation lengths; the slow ones force fast ones to wait. Static batching only works well when all requests are similar length, which is rarely true.

Continuous batching — requests enter and leave dynamically

Yu 2022 (“Orca: A Distributed Serving System for Transformer-Based Generative Models”) introduced the fix: at each token-generation step, REORGANISE the batch:

Continuous batching pseudocode: active_requests = [empty] while True: # Form the batch dynamically at each step. for each request in active_requests: compute one new token (using its current KV cache). # Check for completed requests: for each request that generated EOS or hit max_tokens: remove from active_requests free its KV cache blocks return the response to the user # Add new requests if room: while there's capacity (memory + compute): pull next request from pending queue prefill its prompt (more on this in §23.3) add to active_requests The key property: the BATCH COMPOSITION CHANGES at each step. No padding; no waiting; finished requests leave; new ones enter.

Continuous batching serving technique An LLM serving technique (Yu 2022, 'Orca') where the batch composition changes at every token-generation step. Finished requests are removed and free up their compute slots; new requests are added to fill open slots. Combined with PagedAttention (which makes the variable KV cache sizes tractable), it allows GPU utilisation to stay near 100% even with highly variable request lengths. The default in vLLM, SGLang, TensorRT-LLM. Replaces 'static batching' (which padded all requests to the longest length, wasting compute). dramatically improves GPU utilisation. The compute that would have been wasted on padding is reallocated to new requests joining the batch.

The challenge: at each step, you’re computing one new token for many active requests, each with a DIFFERENT context length and DIFFERENT KV cache. The attention kernel has to handle variable-length contexts within a single batch. This is where PagedAttention’s block tables become essential — each request’s KV cache is a list of blocks; the kernel just iterates through each request’s blocks.

Prefill vs decode — two distinct phases

For each request, generation has two phases with COMPLETELY DIFFERENT compute profiles:

PREFILL (process the input prompt): Input: prompt of N tokens. Work: forward pass over all N tokens at once. - All N positions compute their Q, K, V at every layer. - Attention is QKᵀ over N×N (or FlashAttention equivalent). Output: produces the FIRST output token + populates the full KV cache. Cost per request: O(N · d²) matmul + O(N² · d) attention. Compute regime: COMPUTE-BOUND (large matmuls on the N×d activations). Latency: 50-500 ms for a typical prompt. DECODE (generate one token at a time): Input: one new token (and the KV cache built during prefill). Work: forward pass for ONE position. - Compute Q, K, V for the new position only. - Attention: new Q against all cached K, V. Output: one new token; KV cache grows by one entry. Cost per token: O(d²) matmul + O(N · d) attention against cache. Compute regime: MEMORY-BOUND (small ops, dominated by reading model weights and KV cache). Latency per token: 10-50 ms.

Prefill LLM inference phase The first phase of LLM generation: process the user's input prompt and populate the KV cache. Cost per request: O(N · d²) compute + O(N² · d) attention where N is prompt length. Generally compute-bound because the matmuls are large (m = N tokens at once). Latency: 50-500ms for typical prompts. Produces the first output token. Distinct from the subsequent decode phase, which generates tokens one-at-a-time. and decode LLM inference phase The phase of LLM generation where output tokens are produced one-at-a-time, autoregressively. Each step: compute Q, K, V for the new position only; attend against the cached K, V from prefill (and prior decode steps); produce the next token. Per-token cost: O(d²) matmul + O(N·d) attention against cache. Generally memory-bound (small matmuls, dominated by HBM bandwidth for weight loading). Latency: 10-50ms per token. Continues until EOS or max_tokens reached. are fundamentally different from a compute perspective. Prefill is compute-bound (you can use full GPU FLOPs); decode is memory-bound (you’re limited by HBM bandwidth pulling weights).

— think, then check —

Data structures that change at each step:

1. Active request list: a dynamic list (vector / deque) tracking currently-batched requests. At each step:

Remove requests that finished (EOS or max_tokens).
Add new requests from the pending queue if memory + compute capacity allows.

2. Per-request state: each active request has:

Current generated tokens (a list).
Block table (pointers to its KV cache blocks).
Logit-bias parameters, sampling temperature, etc.
Generation step counter, max_tokens limit.

3. Pending queue: incoming requests waiting for a slot. Sorted by priority or FIFO.

Kernel changes required:

1. Variable-length attention: the attention kernel must handle a batch where each request has different context length AND different KV cache layout (block table). FlashAttention 2 has a “varlen” variant specifically for this. The kernel iterates per-request, reading from the request’s block-table-indexed memory.

2. Per-request sampling: after computing logits for all active requests in parallel, sample each request’s next token with its OWN parameters (different temperatures, top-p, etc.). This is a per-request branch in the sampling kernel.

3. Memory management: request finishes → block table freed → blocks returned to pool. Request starts → blocks allocated. Allocation has to be fast (often a free-list-based pool).

4. Batch packing: at each step, the active requests’ inputs and KV caches need to be efficiently laid out for the attention kernel. vLLM uses a “block table tensor” that the kernel reads to find each request’s blocks.

Scheduling logic:

1. When to add new requests: balance throughput (more requests = more parallelism) vs latency (more requests = larger batch = slower per-step).

2. How to handle preemption: if a long request blocks newer faster ones, can you “pause” it (save state, free blocks, resume later)? vLLM supports this for fairness.

3. Prefill vs decode interleaving: should new requests be prefilled IMMEDIATELY (interrupting decode), or batched with other prefills? Different systems make different choices (§23.3 covers disaggregation).

The implementation is non-trivial — vLLM’s scheduler alone is ~5K lines of code. But the benefit (2-5× throughput) is so large that every serious inference engine has built one.

↳ §23.2 + vLLM internals

Disaggregated serving — prefill on big GPUs, decode on small

The natural extension: since prefill and decode have different compute profiles, run them on different hardware.

Disaggregated serving (DistServe 2024, vLLM v0.6, others): Prefill cluster: - Few HIGH-FLOPs GPUs (H100s, B200s). - Designed for batched matmuls. - Each request: handle prefill, get first token + KV cache. - Stream the KV cache to a decode worker. Decode cluster: - Many MEMORY-RICH GPUs (could be cheaper hardware with more HBM). - Designed for memory-bound decoding. - Receives KV cache from prefill; continues generation token-by-token. Why disaggregation wins: - Prefill needs FLOPs (compute-bound). Concentrate that compute. - Decode needs HBM bandwidth (memory-bound). Use cheaper / more HBM hardware. - Different optimisation targets for different hardware. Implementation cost: KV cache transfer between prefill and decode is real. For a 32K-context prompt: ~10 GB of KV state to ship. Done via fast inter-GPU networking (NVLink, InfiniBand). Adds 10-50 ms latency to first-token-time. When it's worth it: - Mixed workload with long prefills (RAG, document QA): yes. - All-short workload (chat): often NOT worth it; the overhead exceeds the gain. - Cost-sensitive serving: yes; lets you use cheaper decode hardware.

Disaggregated serving serving architecture An LLM serving architecture that runs the prefill phase and decode phase on SEPARATE GPU clusters. Prefill (compute-bound) runs on FLOP-rich hardware; decode (memory-bound) runs on bandwidth-rich hardware. KV cache is transferred between them after prefill completes. Introduced by DistServe (2024); supported in vLLM v0.6+. Wins on mixed workloads with significant prefill (long prompts) by matching workload characteristics to hardware. Adds latency for the KV cache transfer (~10-50 ms) which is acceptable for long prompts but problematic for short chats. is the latest evolution. Not all systems use it yet, but it’s the direction high-end commercial LLM serving is heading.

— think, then check —

Why prefill is compute-bound:

Prefill processes N input tokens at once. The matmuls involved are:

QKV projection: X (N × d) · W (d × d) = ON × d) matmul. With N=4K and d=4096: m=4K, k=4K → arithmetic intensity ~4096. Far above ridge.
FFN: (N × d) · (d × F): m=N=4K, k=d=4K → high AI.
Attention QK^T: (N × d) · (d × N): N²d = 67 GFLOPs at N=4K. Also high AI.

All operations are large enough to saturate tensor cores. H100 utilisation: 50-70% of peak FLOPs (~500 TFLOPs of bf16). The GPU is being used for what it’s designed for.

Why decode is memory-bound:

Decode processes ONE token at a time. The matmuls involved are:

QKV projection: x (1 × d) · W (d × d) — this is a GEMV, not GEMM. m=1, k=d → AI ≈ 1 FLOP/byte. Far below ridge.
FFN: (1 × d) · (d × F): m=1 → AI ≈ 1. Memory-bound.
Attention against KV cache: q · K_cache (1 × N×d) — reads the KV cache from HBM. AI ≈ 1.

All operations are tiny. The actual FLOPs are minimal; the cost is loading weights and KV cache from HBM. H100 utilisation in decode: 5-15% of peak FLOPs. The compute units sit idle most of the time waiting for HBM.

Bandwidth math for decode:

Per generated token, need to load: ~70 GB of model weights (Llama 70B in fp16) + ~5-10 GB of KV cache (depending on context).

At 3.35 TB/s HBM: ~25 ms per token from bandwidth alone.

This explains why decode latency is ~30-50 ms per token on H100 for Llama 70B: it’s bottlenecked by HBM, not compute.

What this means for hardware:

Prefill cluster: H100 / B200 / Trainium — high FLOP density. Tensor cores doing 1 PFLOPs of work.
Decode cluster: could be cheaper hardware. AMD MI300X (more HBM, less FLOPs) might be better. Apple M2 Ultra (huge bandwidth, less FLOPs) is competitive. Cost-per-decoded-token is the metric.

The disaggregation thesis: spend FLOP money where it pays (prefill); spend bandwidth money where it pays (decode). Don’t pay for both on every GPU.

↳ §23.2 + LLM inference profiling

Continuous batching in practice — vLLM scheduler

vLLM’s serving loop is the canonical example:

vLLM main serving loop (simplified): while True: # 1. SCHEDULE: determine which requests run this step. scheduled_requests = [] # Add prefill candidates (constrained by memory). while pending_queue and have_memory_for_prefill(): req = pending_queue.pop() allocate KV blocks for its prompt scheduled_requests.append(req in 'prefill' state) # Add decode candidates (all active requests). for req in active_decode_requests: scheduled_requests.append(req in 'decode' state) # 2. EXECUTE: run the batch. # For prefill requests: full prompt processing. # For decode requests: one token each. # The model forward handles the heterogeneous batch (some prefill, some decode). logits = model.forward(scheduled_requests) # 3. SAMPLE: per-request sampling. next_tokens = [] for req, logit in zip(scheduled_requests, logits): token = sample(logit, req.sampling_params) next_tokens.append(token) # 4. UPDATE: append tokens, check completion. for req, token in zip(scheduled_requests, next_tokens): req.append_token(token) if req.is_complete(): free_kv_blocks(req) return_response(req) active_decode_requests.remove(req) else: if req.was_prefilling: req.transition_to_decode() active_decode_requests.add(req) # Loop: ~5-50 ms per iteration, depending on batch size and model.

— think, then check —

(a) Chat (100-token prompt, 200-token response):

Prefill: 100 tokens × ~30 GFLOPs/token = ~3 TFLOPs total. Takes ~3 ms on H100. Tiny.

Decode: 200 tokens × ~30 ms/token = 6 seconds.

Time profile: 99.95% of work is in decode.

Disaggregation overhead: KV cache transfer is ~100 tokens × 320 KB = 32 MB. Transfer takes ~1 ms. Small.

BUT: the per-request prefill is so cheap that the prefill cluster gets crushed by request volume. Need many prefill GPUs to handle the request rate.

Verdict: PROBABLY NOT WORTH IT. Static unified serving with continuous batching is fine. Disaggregation adds complexity without strong benefit.

(b) RAG (16K-token prompt, 500-token response):

Prefill: 16K tokens × ~30 GFLOPs/token = ~500 TFLOPs total. Takes ~500 ms on H100 (compute-bound).

Decode: 500 tokens × ~50 ms/token = 25 seconds.

Time profile: ~2% prefill, ~98% decode. Prefill is small fraction but ABSOLUTE prefill time (500ms) is noticeable.

Disaggregation overhead: KV cache transfer is 16K × 320 KB = 5 GB. At 50 GB/s InfiniBand: ~100 ms. Significant but acceptable.

Win: prefill happens on FLOP-rich hardware (fast); decode happens on bandwidth-rich hardware (cheaper); KV transfer is amortised across a long generation.

Verdict: WORTH IT. Especially if you have many concurrent RAG requests (the prefill compute is the binding constraint).

(c) Code completion (4K-token prompt, 2K-token response):

Prefill: 4K tokens × ~30 GFLOPs/token = ~120 TFLOPs. Takes ~120 ms.

Decode: 2K tokens × ~30 ms/token = 60 seconds.

Time profile: 0.2% prefill, 99.8% decode.

Disaggregation overhead: 4K × 320 KB = 1.3 GB. Takes ~25 ms.

Verdict: MARGINAL. The transfer overhead is significant relative to prefill time. If the workload is consistent, disaggregation pays; if mixed with chat, the architecture complexity may not be worth it.

The deeper rule:

Disaggregation wins when:

Prefill is a significant absolute cost (long prompts).
The prefill compute is the binding constraint (you’d need to add more H100s otherwise).
The decode infrastructure can use cheaper hardware.
The KV transfer overhead is amortised across a long generation.

For most “chat-only” workloads: unified serving + continuous batching + PagedAttention is enough. For mixed RAG/long-context workloads: disaggregation is the next step.

The trend: as context lengths grow (1M tokens becoming common), disaggregation will become more universal because long-prefill workloads benefit so much.

↳ §23.2 + DistServe + vLLM v0.6

Next: §23.3 — Speculative decoding deep dive. The third major inference optimisation: a draft model proposes K tokens, the target verifies them in parallel. 2-3× throughput at no quality cost.