The autoregressive bottleneck & speculative decoding
If you’ve ever watched a large language model “stream” its output a word at a time, you’ve watched the autoregressive bottleneck happen. Every token requires a complete forward pass through tens of billions of parameters. Those weights have to be loaded from HBM (Ch.3 §2; Ch.20 will revisit) into the GPU’s compute units once per token, which means the bottleneck is memory bandwidth — not compute. The GPU’s matrix-multiply units sit roughly idle while waiting for weights. Speculative decoding is the inference-time trick that breaks this bottleneck: have a small fast “draft” model propose several tokens in parallel, then have the big “verifier” model check them all in a single forward pass. When the draft agrees with the verifier, you get multiple tokens per pass — and the math guarantees the output distribution is identical to plain decoding. The headline result, Leviathan, Kalman & Matias 2023 (“Fast Inference from Transformers via Speculative Decoding,” ICML), is a clean 2–3× speedup with no model changes. Every modern inference stack (vLLM, TensorRT-LLM, MLX, llama.cpp) implements it. This section is the math; the next is the modern variants (Medusa, EAGLE, MTP); §3 walks llama.cpp’s PR #22673 in context.
The bottleneck, exactly
A transformer LLM with P parameters needs to read P floats from memory per generated token (we’ll re-derive this in Ch.22). For Qwen3.6-27B in 4-bit quantization that’s ~14 GB per token. Apple Silicon M-class chips have ~800 GB/s of memory bandwidth; an H100 has ~3 TB/s. The wall-clock per token is approximately:
The arithmetic itself takes microseconds. The bottleneck is moving weights from HBM (or unified memory, on Apple Silicon) into the registers that compute the matrix multiply for each layer. Past 30B parameters, you’re bandwidth-bound on essentially any consumer hardware, and the only way to get more tokens per second is to do more useful work per memory read.
That observation is the core of every speedup technique in modern inference:
- Batching (Ch.22): serve many users at once, each token-step reuses the loaded weights for B users → throughput scales nearly linearly with batch size, up to compute saturation.
- Quantization (Ch.3 §3): make the weights smaller — 4-bit halves the bytes moved → almost halves the wall-clock per token.
- Speculative decoding (this section): produce more tokens per weight-loading round → ~2–3× without retraining.
Speculative decoding is special because the speedup is essentially free: same model, same hardware, same output distribution. You can stack it on top of quantization and batching for a multiplicative effect. (vLLM benchmarks routinely report 6–10× over naïve inference when all three are combined.)
An LLM is autoregressive — each token’s prediction depends on all previous tokens. So token t+1 can’t be computed until token t is known; you can’t parallelize across positions during generation (you can during training, where all positions are known).
Per token, the GPU must load every weight at least once from HBM into compute units. For 14 GB of weights on 800 GB/s memory: 14 GB / 800 GB/s = 17.5 ms per token = ~57 tokens/sec. The actual arithmetic (~2P FLOPs per token = ~28 GFLOPs for 14B param model) takes microseconds on a chip with TFLOPS-class throughput. So weight movement dominates wall-clock by ~3 orders of magnitude — the kernel is memory-bound, not compute-bound.
This is why every inference speedup either (a) reduces bytes moved per token (quantization), (b) reuses loaded bytes across more useful work (batching, speculative decoding), or (c) skips moving some weights entirely (mixture-of-experts).
The speculative-decoding algorithm
The setup: you have a target model M_q (the big one whose output you actually want — Qwen3.6-27B, GPT-5, etc.) and a draft model M_p (smaller, faster, often a smaller version of the same model family — e.g., Qwen2.5-0.5B drafting for Qwen3.6-27B).
The core idea is one round of speculation followed by one round of verification:
- Draft. Starting from the current context, the draft model autoregressively generates K candidate tokens t_1, t_2, …, t_K. This costs K small forward passes.
- Verify. The target model runs one forward pass over the original context plus all K draft tokens, computing logits at positions 1, 2, …, K, K+1 simultaneously (this works because at training time the model already processes all positions in parallel — only generation is sequential).
- Accept or reject each draft token in order, using the rejection-sampling rule below. Stop at the first rejection.
- Bonus token. After the accepted prefix, the verifier’s own logits at the rejection point (or at position K+1 if all were accepted) are sampled to produce one more token.
- Append the accepted prefix plus the bonus token to the output and loop back to step 1.
The result of step 5 is at least 1 token (the bonus) and at most K + 1 tokens (full acceptance) per round, at a wall-clock cost of one verifier pass plus K draft passes. When the draft and verifier agree often, the per-token cost drops dramatically.
Drag the per-token accept probability up. Below p ≈ 0.5 there’s barely a win; above p ≈ 0.7 you start seeing the published 2× speedups; at p ≈ 0.95 (the regime good MTP heads operate in, §S.2) you can get 4×+.
The acceptance rule — why this preserves the output distribution
Here is where speculative decoding gets clever, and the reason it doesn’t change the model’s behaviour. For each draft token t_i, let p(t_i) be the target model’s probability for t_i at that position, and q(t_i) be the draft model’s probability. The acceptance rule is:
This is the Leviathan rejection rule. Its critical property: the final token generated at any position has the same probability distribution it would have had under standard sampling from the target model.
The one-line proof: by the rejection-sampling identity, the probability of finally emitting token t at any position is:
Both terms add up to exactly the target distribution. So sampling speculatively is sampling-equivalent to sampling from the target, just often cheaper. The model behaves the same way; you just got there faster.
The conceptual unlock. Speculative decoding is not an approximation. It is exact sampling from the target model, dressed up as a clever Monte-Carlo trick. Anywhere you see “lossless speedup” in the literature — that’s what it means. Quantization is approximate (Ch.3 §3 trades precision for speed); speculative decoding isn’t (it trades draft compute for verifier wall-clock with mathematical equality on the output distribution).
’Lossless’ means the speculatively-generated tokens follow the same probability distribution as tokens generated by plain sampling from the target model. Not ‘approximately the same’; exactly the same.
One-line argument: at any position, the probability of finally emitting token t is q(t)·min(1, p(t)/q(t)) + (1−A)·p_rep(t), where q is the draft’s probability, p is the target’s, A is the acceptance probability, and p_rep is the residual distribution sampled on rejection. Adding the two terms cancels through to exactly p(t).
So the model behaves identically; speculative decoding is purely a wall-clock optimisation. This is why no quality regression is observed in practice — and why it can be safely combined with quantization (which IS lossy) without compounding the quality loss.
Acceptance, math, expected speedup
Let p be the per-token acceptance probability (the expected value of min(1, p(t)/q(t)) averaged over the draft’s distribution). For a draft of length K with i.i.d. acceptance:
Plus the bonus token, the expected output per round is p(1 − pᴷ)/(1 − p) + 1. If the draft costs c_d verifier-passes per draft token, the cost per round is 1 + K · c_d. So the speedup over plain decoding (which produces 1 token per verifier pass) is:
For a typical setup c_d = 0.1 (draft is 10× faster than verifier):
Speculative decoding: expected vs simulated tokens per round
c_draft = 0.10, 100,000 Monte-Carlo rounds
p = 0.70
K=1 E[tokens]=1.70 sim=1.70 speedup=1.55x
K=4 E[tokens]=2.77 sim=2.77 speedup=1.98x
K=6 E[tokens]=2.94 sim=3.06 speedup=1.84x
→ optimal K = 4–5, speedup ~2.0x
p = 0.95
K=4 E[tokens]=4.52 sim=4.52 speedup=3.23x
K=8 E[tokens]=7.40 sim=7.40 speedup=4.11x
→ optimal K = 8, speedup 4.1x
Theory and simulation agree to three decimals at every grid point. Two operational rules fall out:
- Increase K until draft overhead bites. Each additional draft token costs c_d deterministically but contributes diminishing expected tokens (p^k shrinks fast). The optimal K is where the marginal gain crosses the marginal cost — typically K = 4–8.
- The acceptance probability p is the entire game. At p = 0.5, best speedup is ~2.1×. At p = 0.95, it’s ~4.1×. The whole research line of better drafts (Medusa, EAGLE, MTP — §S.2) is essentially: raise p.
* Returns mean tokens per round (Monte-Carlo). */
static double simulate(double p, int K, long rounds) {
long total_tokens = 0;
for (long r = 0; r < rounds; r++) {
int accepted = 0;
for (int i = 0; i < K; i++) {
if (urand() < p) accepted += 1;
else break;
}
total_tokens += accepted + 1; /* +1 = verifier's own continuation */
}The “i.i.d. acceptance” assumption is a simplification. In real LLMs, acceptance probabilities correlate: if the first draft token agrees, the remaining ones are more likely to agree (because the draft is on the right track). Empirical run-length distributions are slightly more bimodal than the geometric assumed above. The Leviathan paper has a more careful analysis; for engineering purposes the geometric model predicts within ~10% of measured speedup.
How much of this is in production right now
| Stack | Speculative decoding support | Default draft strategy |
|---|---|---|
| vLLM | Yes (since v0.3) | Separate draft model; n-gram speculation; MLP-Speculator |
| TensorRT-LLM | Yes | Draft target pair; supports Medusa heads, EAGLE, MTP |
| MLX | Yes (via mlx-lm) | Separate draft model |
| llama.cpp | Yes — and now MTP via PR #22673 (§S.3) | Draft model OR MTP heads |
| HuggingFace TGI | Yes | Draft target; Medusa heads |
By late 2025, speculative decoding is no longer optional in production-quality LLM serving. The interesting question — addressed by §S.2 — is what kind of speculation to use. The original draft+verifier scheme needs you to maintain two models; the modern alternative (Medusa, EAGLE, MTP) bakes the “draft” into auxiliary heads of the target model itself, eliminating the second-model overhead entirely.
Per round: cost = 1 (verifier pass) + K · c_d (K draft passes at c_d cost each). Yield = E[accepted prefix] + 1 (bonus). E[accepted prefix] = Σ over k=1..K of P(length ≥ k) = Σ p^k = p(1 − pᴷ)/(1 − p).
So per-round cost = 1 + K · c_d, per-round yield = p(1 − pᴷ)/(1 − p) + 1.
Speedup over naïve decoding (which yields 1 token per verifier pass = 1 per unit cost):
speedup(p, K, c_d) = [p(1 − pᴷ)/(1 − p) + 1] / [1 + K · c_d]
Optimal K is where ∂speedup/∂K = 0 — typically 4–8 in practice. Past optimal K, the linear K·c_d cost in the denominator dominates the diminishing exponential gain p^k in the numerator.
The whole subsequent research line on Medusa / EAGLE / MTP is about pushing p higher (better drafts) and c_d lower (cheaper drafts — ideally zero, achieved by sharing the trunk with the verifier).
END OF §S.1 — The autoregressive bottleneck & speculative decoding.
Built: SpecDecode viz (animate the draft+verify rounds, watch tokens stream out); spec_sim.c confirms theory matches Monte-Carlo to 3 decimals across the (p, K) grid; speedup at p=0.7 / K=4 is ~2×, at p=0.95 / K=8 is ~4×. Three recall items: easy (the bandwidth bottleneck), medium (rejection-sampling correctness proof), hard (deriving the full speedup formula).
Coming next: §S.2 — Self-speculative variants. Medusa heads, EAGLE features, MTP — how to push p higher and c_d to zero by baking the draft into the target.