BPE tokenisation
The book up to here has worked with vectors and matrices — quantities whose entries are real numbers. But the inputs LLMs actually take are strings of text. The bridge is tokenisation: a deterministic procedure that splits text into a sequence of tokens, each of which has an integer ID drawn from a fixed vocabulary. The model never sees the original characters; it sees the sequence of IDs and looks up a vector for each (§11.2). For a returning systems engineer the analogy that fits best is “tokenisation is to text what UTF-8 encoding is to Unicode” — a fixed-rule mapping between an abstract representation (characters / scalar values) and a concrete representation (token IDs / byte sequences) the runtime can actually process. The choice of tokeniser is one of the load-bearing design decisions in any LLM stack, and it’s almost always BPE or a close variant.
Three things tokenisation has to balance
A good tokenisation has to navigate three tensions:
- Vocabulary size. Bigger vocab = more parameters in the embedding table (Ch.11 §2) and more rows for the model to learn. Smaller vocab = longer sequences (every word splits into more tokens). For a 4096-dim transformer with vocab 128K, the embedding table is 128K × 4096 = 524 M parameters — comparable to several attention layers’ worth.
- Coverage. Every possible input string must encode to some sequence of valid tokens. Byte-level fallback (every byte 0–255 is a token) handles this robustly; older tokenisers used `<UNK>` for unknown subwords and silently lost information.
- Compression ratio. Common English words like “the” should be one token; technical terms or non-English text might need several. The compression ratio (characters per token) directly determines context length usage: at a 4-character-per-token average, a 128K context window holds ~500K characters of text.
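The two figures quoted in the list can be reproduced with a couple of lines of arithmetic. This is only a sketch: the function names are illustrative, and "K" is assumed to mean 1,000 (with K = 1,024 the results shift by a few percent).

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Number of weights in a vocab_size x d_model embedding table."""
    return vocab_size * d_model


def context_capacity_chars(context_tokens: int, chars_per_token: float) -> float:
    """Approximate number of characters that fit in a context window."""
    return context_tokens * chars_per_token


# Assumption: "128K" is read as 128,000.
print(embedding_params(128_000, 4096))       # 524288000
print(context_capacity_chars(128_000, 4.0))  # 512000.0
```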
Two obvious extremes fail in different ways, and subword tokenisation sits between them:
| | Pros | Cons |
|---|---|---|
| Word-level | One token per word; small sequences | OOV words can’t be represented; vocab explodes with rare words; can’t handle non-space-separated languages (Chinese, Japanese) |
| Character-level | Tiny vocab; perfect coverage | Sequences are 4–5× longer; attention cost grows quadratically (Ch.13); model must learn word-level structure from scratch |
| Subword (BPE) | Common words one token; rare words split into pieces; bounded vocab | More complex; vocab size is a tuning knob |
BPE — invented by Gage 1994 as a data-compression algorithm and adapted to NLP by Sennrich, Haddow & Birch 2016 (“Neural Machine Translation of Rare Words with Subword Units,” ACL) — solves the three-way trade. Almost every production LLM uses BPE or a structural cousin.
The algorithm
BPE training, in pseudocode:
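The training loop can also be written as runnable Python. This is a minimal, unoptimised sketch under stated assumptions: the corpus is split on whitespace and starts from characters rather than bytes, ties between equally frequent pairs go to the pair seen first, and `train_bpe` and `merge_pair` are illustrative names rather than any library's API.

```python
from collections import Counter


def merge_pair(symbols: tuple[str, ...], pair: tuple[str, str]) -> tuple[str, ...]:
    """Fuse every left-to-right occurrence of `pair` into a single symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus: str, num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merges; returns them in priority order."""
    # Each distinct word is a tuple of symbols (initially characters) with a count.
    words = Counter(tuple(word) for word in corpus.split())
    merges: list[tuple[str, str]] = []

    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts: Counter = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol

        # Pick the most frequent pair (first seen wins on ties) and record it.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)

        # Rewrite the corpus with that pair fused into one new symbol.
        new_words: Counter = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words

    return merges


print(train_bpe("abab abab abc", 10))
# [('a', 'b'), ('ab', 'ab'), ('ab', 'c')]
```

The final vocabulary is the base symbols plus one entry per merge, so a target vocabulary size translates directly into a merge count. Production trainers differ in the details (byte-level base alphabet, regex pre-tokenisation, incremental pair counting), but the loop has this shape: count pairs, fuse the most frequent, repeat.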
[Interactive visualisation of the encoding step: type any text, then click “next” to apply one merge at a time or “play all” to run them in sequence.] Each step finds the highest-priority pair in the merge list and combines it. Common subwords (the, qu, brown) emerge as single tokens after several merges, and eventually the, ␣quick and ␣brown become single tokens; that is BPE’s compression in action. The number of merges determines the vocabulary size; modern LLMs use 32K–128K tokens.
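A minimal sketch of the encoding step described above, in Python. `bpe_encode` is an illustrative name, and `EXAMPLE_MERGES` is a hand-written, hypothetical merge list rather than one learned from a real corpus; earlier entries have higher priority.

```python
def bpe_encode(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Split `word` into characters, then repeatedly apply the best-ranked merge."""
    rank = {pair: i for i, pair in enumerate(merges)}  # lower index = higher priority
    symbols = list(word)
    while len(symbols) > 1:
        # Collect (rank, position) for every adjacent pair that is in the merge list.
        candidates = [
            (rank[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in rank
        ]
        if not candidates:
            break  # no known pair left; the word is fully encoded
        _, i = min(candidates)  # best rank, leftmost occurrence
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


# Hypothetical merge list, highest priority first; "␣" marks a leading space.
EXAMPLE_MERGES = [("t", "h"), ("th", "e"), ("i", "n"), ("␣", "the")]

print(bpe_encode("␣then", EXAMPLE_MERGES))  # ['␣the', 'n']
print(bpe_encode("thin", EXAMPLE_MERGES))   # ['th', 'in']
print(bpe_encode("xyz", EXAMPLE_MERGES))    # ['x', 'y', 'z']
```

Input with no applicable merges simply stays as single characters, which is the fallback behaviour that keeps rare strings representable.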
BPE training starts with character-level tokens and repeatedly merges the most-frequent adjacent token pair in the training corpus, adding the merged pair as a new vocab entry, until the vocab reaches a target size.
Structural property exploited: language has strong adjacent-pair statistics. Common letter pairs (th, in, er, on), common subwords (’ the’, ’ and’), common morphemes (-ing, -ed), and common whole words occur far more often than chance. By greedily merging the most-frequent pairs, BPE builds a vocabulary that gives one token to anything common and multiple tokens to anything rare — automatically matching the vocab to the data’s frequency distribution.
The result is that the average sequence length (in tokens) is much shorter than at the character level, but rare/novel inputs (technical jargon, non-English text, code identifiers) can still be represented by falling back to smaller pieces. Best of both worlds, modulo the vocabulary-size knob.
How big should the vocabulary be?
The standard sizes you’ll see in production models:
| Model family | Vocabulary size |
|---|---|
| GPT-2 (2019) | 50,257 |
| GPT-3 (2020) | 50,257 |
| Llama-1 (2023) | 32,000 |
| Llama-2 (2023) | 32,000 |
| Llama-3 (2024) | 128,256 |
| GPT-4 (2023) | ~100K |
| Qwen3 (2025) | 152,064 |
| DeepSeek-V3 (2024) | 129,280 |
The trend is upward — Llama-3’s 4× increase over Llama-2 reflected the design choice to handle multilingual and code data better. The cost: 4× the parameters in the embedding table (and the output projection, which is the same shape transposed). For a 4096-dim model, going from 32K vocab to 128K vocab adds 4096 · (128K − 32K) · 2 = 786M parameters to the model — comparable to a few attention layers.
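The parameter arithmetic above can be checked with a short Python sketch. `vocab_param_delta` is an illustrative name, "K" is read as 1,000, and the `tied` flag is an added assumption covering models that share one matrix between the embedding and the output projection, in which case the increase is halved.

```python
def vocab_param_delta(d_model: int, old_vocab: int, new_vocab: int, tied: bool = False) -> int:
    """Extra weights from growing the vocabulary.

    Counts the input embedding table and, unless `tied`, a separate
    output projection of the same shape.
    """
    matrices = 1 if tied else 2
    return d_model * (new_vocab - old_vocab) * matrices


print(vocab_param_delta(4096, 32_000, 128_000))             # 786432000
print(vocab_param_delta(4096, 32_000, 128_000, tied=True))  # 393216000
```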
The benefit: shorter sequences for the same text, so attention (Ch.13) costs less. For a transformer trained on internet text, vocab 128K typically gives ~4.5 characters per token vs vocab 32K’s ~3.5 characters per token, which is roughly 20–25% shorter sequences for the same content. Given attention’s quadratic cost, that is roughly 40% less attention compute (about a 1.7× saving) on long sequences. Larger vocab buys compute savings at inference but costs more memory at every step; the tradeoff has favoured “bigger” as memory becomes more abundant and context windows grow.
Larger vocab = shorter sequences for the same content. Going from 32K to 128K vocab typically raises the characters-per-token average from ~3.5 to ~4.5, a roughly 20–25% sequence-length reduction on typical English / code / multilingual text.
Cost saved at inference time: attention compute is O(N²) in sequence length (FlashAttention reduces memory traffic, not the FLOP count), so a 25% length reduction is a ~44% compute reduction on the attention sublayer. For long-context models (Llama-3.1’s 128K context, Gemini’s 1M context), attention cost can dominate everything else; sequence-length reductions at the tokeniser level translate directly into large inference cost savings.
For a 70B model running at 128K context, ~25% shorter sequences means roughly 44% lower attention cost per request, and correspondingly more throughput on the same hardware to the extent that attention dominates the cost. That’s a much better trade than the roughly 1–2% extra parameter memory the bigger embedding table costs. As context windows grow, larger vocabularies become more valuable, which is why new frontier models since 2024 have tended to push vocab size up.
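The link between a length reduction and the attention saving is simple squaring, sketched below in Python. The function name is illustrative, and the model is deliberately crude: it assumes attention cost is exactly proportional to N² and ignores every other part of the forward pass.

```python
def attention_cost_ratio(length_reduction: float) -> float:
    """Relative attention compute after shortening a sequence.

    Assumes cost proportional to N**2; `length_reduction` is the
    fraction of tokens removed (0.25 means 25% shorter).
    """
    return (1.0 - length_reduction) ** 2


ratio = attention_cost_ratio(0.25)
print(f"remaining cost: {ratio:.4f}")  # remaining cost: 0.5625
print(f"saving: {1 - ratio:.0%}")      # saving: 44%
print(f"speed-up: {1 / ratio:.2f}x")   # speed-up: 1.78x
```

The end-to-end saving for a whole request is smaller than this, because the feed-forward layers and projections scale linearly with sequence length rather than quadratically.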
Edge cases the tokeniser handles (and sometimes botches)
A non-exhaustive list of where tokenisers get interesting:
- Whitespace. GPT-style tokenisers prepend `␣` (a space placeholder) to most words, so `"hello"` and `" hello"` are different tokens. This is why models occasionally generate text with weird spacing artifacts.
- Non-Latin scripts. Trained on mostly-English corpora, BPE allocates few merges to Chinese/Japanese/Korean characters, so non-Latin text gets tokenised per-character or per-byte. Llama-3’s vocab increase was largely about better non-English coverage.
- Code. Programming code has dense tokens (identifiers, operators, special characters). Modern tokenisers like Qwen3’s allocate vocabulary specifically for code patterns; older general-purpose vocabularies tokenise code wastefully.
- Numbers. Left to learned merges, tokenisers split numbers into irregular multi-digit pieces (how `12345` is split depends on which digit sequences were frequent in training), which limits arithmetic capability. Recent models regularise this: Llama-1 and Llama-2 split every number into single digits, while Llama-3’s pre-tokeniser limits digit groups to at most three digits.
- Byte-level vs character-level fallback. GPT-2’s byte-level BPE means every possible byte sequence is representable; older character-level BPE could fail on unusual Unicode. Modern LLMs are byte-level, or fall back to bytes, for robustness.
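The byte-level fallback in the last item can be illustrated with plain UTF-8 encoding. The sketch below covers only the base layer, before any merges are applied; the function names are illustrative and do not come from a tokeniser library.

```python
def to_byte_tokens(text: str) -> list[int]:
    """Base tokenisation for byte-level BPE: one token ID (0-255) per UTF-8 byte."""
    return list(text.encode("utf-8"))


def from_byte_tokens(ids: list[int]) -> str:
    """Inverse mapping; lossless for any sequence produced by to_byte_tokens."""
    return bytes(ids).decode("utf-8")


for sample in ["hello", " hello", "é", "日本語"]:
    ids = to_byte_tokens(sample)
    assert from_byte_tokens(ids) == sample  # round trip never loses information
    print(repr(sample), len(sample), "chars ->", len(ids), "byte tokens")
# 'hello' 5 chars -> 5 byte tokens
# ' hello' 6 chars -> 6 byte tokens
# 'é' 1 chars -> 2 byte tokens
# '日本語' 3 chars -> 9 byte tokens
```

Every string has a valid encoding, so no `<UNK>` token is needed. The cost is visible in the last line: a script with few learned merges pays about three base tokens per character.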
Character-level tokenisation is not a workable alternative for production LLMs, for three reasons.
(1) Sequence length explodes. Character-level tokenisation gives ~4–5× longer sequences than BPE. For a 128K-token context window in BPE terms, character-level would need 512K–640K tokens for the same content. Attention compute is O(N²) in sequence length; FlashAttention brings memory down to O(N) but leaves compute at O(N²), so the attention cost grows by 25× for a 5× longer sequence. Same context coverage, 25× more attention FLOPs per forward pass.
(2) The model wastes capacity learning word-level structure. BPE gives the model ‘the’ as one token; character-level requires the model to learn that t-h-e is a word. Empirically, character-level models need much more training data to reach the same downstream performance — recent research (ByT5, Xue et al. 2022) shows ~2-3× more training compute for comparable scores on standard benchmarks. The model is rediscovering tokenisation as part of its training.
(3) Throughput on long-context retrieval, RAG, code is decimated. Code and structured documents are MORE sensitive to sequence-length amplification than prose — code has lots of long identifiers and structural tokens that BPE handles cleanly but character-level inflates. A 32K-line code review fits in BPE’s 128K context but exceeds character-level’s effective capacity, even with the wider context window.
The “character-level avoids tokeniser complexity” argument is appealing in research papers (no tokeniser-induced biases, perfect coverage) but operationally untenable for production LLMs serving users. BPE’s complexity is small in implementation (~500 lines of code in tiktoken or tokenizers) and the wins are large. The trade is settled for production; character-level survives in specific niches (low-resource languages, code understanding) where its uniformity outweighs the inefficiency.
Coming next: §11.2 — The embedding table. Token IDs become dense vectors via row lookup.