BPE tokenisation

Section 11.1

BPE tokenisation

The book up to here has worked with vectors and matrices — quantities whose entries are real numbers. But the inputs LLMs actually take are strings of text. The bridge is tokenisation: a deterministic procedure that splits text into a sequence of tokens, each of which has an integer ID drawn from a fixed vocabulary. The model never sees the original characters; it sees the sequence of IDs and looks up a vector for each (§11.2). For a returning systems engineer the analogy that fits best is “tokenisation is to text what UTF-8 encoding is to Unicode” — a fixed-rule mapping between an abstract representation (characters / scalar values) and a concrete representation (token IDs / byte sequences) the runtime can actually process. The choice of tokeniser is one of the load-bearing design decisions in any LLM stack, and it’s almost always BPE algorithm BPE — a subword tokenisation algorithm. Starts with each character as its own token. Repeatedly finds the most-frequent adjacent token pair in a large training corpus and merges them into a new single token. Continues for k merges, producing k new tokens added to the vocabulary. Originally a data-compression algorithm (Gage 1994); adapted for NLP by Sennrich et al. 2016 ('Neural Machine Translation of Rare Words with Subword Units'); standard in GPT-2, GPT-3, GPT-4, Llama, and most modern transformers. or a close variant.

Three things tokenisation has to balance

A good tokenisation has to navigate three tensions:

Vocabulary size. Bigger vocab = more parameters in the embedding table (Ch.11 §2) and more rows for the model to learn. Smaller vocab = longer sequences (every word splits into more tokens). For a 4096-dim transformer with vocab 128K, the embedding table is 128K × 4096 = 524 M parameters — comparable to several attention layers’ worth.
Coverage. Every possible input string must encode to some sequence of valid tokens. Byte-level fallback (every byte 0–255 is a token) handles this robustly; older tokenisers used <UNK> for unknown subwords and silently lost information.
Compression ratio. Common English words like “the” should be one token; technical terms or non-English text might need several. The compression ratio (characters per token) directly determines context length usage — at a 4-character-per-token average, a 128K context window holds ~500K characters of text.

Three obvious extreme choices fail in different ways:

	Pros	Cons
Word-level	One token per word; small sequences	OOV words can’t be represented; vocab explodes with rare words; can’t handle non-space-separated languages (Chinese, Japanese)
Character-level	Tiny vocab; perfect coverage	Sequences are 4–5× longer; attention cost grows quadratically (Ch.13); model must learn word-level structure from scratch
Subword (BPE)	Common words one token; rare words split into pieces; bounded vocab	More complex; vocab size is a tuning knob

BPE — invented by Gage 1994 as a data-compression algorithm and adapted to NLP by Sennrich, Haddow & Birch 2016 (“Neural Machine Translation of Rare Words with Subword Units,” ACL) — solves the three-way trade. Almost every production LLM uses BPE or a structural cousin.

The algorithm

BPE training, in pseudocode:

Input: corpus (large text), target vocabulary size V Output: a merge list (an ordered sequence of pairs) 1. Initialise: every character is its own token. Vocab = {all characters}. 2. While |vocab| < V: a. Count the frequency of every adjacent token pair in the corpus. b. Find the most-frequent pair (left, right). c. Add the merged pair to the vocab as a new token. d. Append (left, right) → merged to the merge list. 3. Return the merge list. Encoding new text at INFERENCE time: 1. Split input into characters. 2. While any pair in the current token list is in the merge list: a. Find the merge with the EARLIEST learned priority. b. Apply it to the leftmost matching pair. 3. Return the resulting token sequence.

The viz below shows the encoding step at work. Type any text; click “next” to apply one merge at a time, or “play all” to run them in sequence:

input text

step 0 / 8

step 0 — initial state (1 char per token)

tokens: 25 final: 17

the␣quick␣brown␣fox␣jumps

BPE starts by treating every character as a separate token. At each step it finds the most-frequently-co-occurring adjacent pair (learned during training) and merges it into a single new token. After enough merges, common subwords like the, ␣quick, ␣brown become single tokens. The number of merges determines the vocabulary size; modern LLMs use 32K–128K tokens.

Click next or play all. Each step applies the highest-priority merge from a tiny hand-crafted merge table (a real BPE trainer learns these from billions of tokens). Watch the 25-character input collapse to a small handful of subword tokens — that's the encode side of every tokenizer.

Each step finds the highest-priority pair in the merge list and combines it. Watch how common subwords (the, qu, brown) emerge as single tokens after several merges — that’s BPE’s compression in action.

— think, then check —

BPE training starts with character-level tokens and repeatedly merges the most-frequent adjacent token pair in the training corpus, adding the merged pair as a new vocab entry, until the vocab reaches a target size.

Structural property exploited: language has strong adjacent-pair statistics. Common letter pairs (th, in, er, on), common subwords (’ the’, ’ and’), common morphemes (-ing, -ed), and common whole words occur far more often than chance. By greedily merging the most-frequent pairs, BPE builds a vocabulary that gives one token to anything common and multiple tokens to anything rare — automatically matching the vocab to the data’s frequency distribution.

The result is that the average sequence length (in tokens) is much shorter than at the character level, but rare/novel inputs (technical jargon, non-English text, code identifiers) can still be represented by falling back to smaller pieces. Best of both worlds, modulo the vocabulary-size knob.

↳ §11.1 BPE algorithm

How big should the vocabulary be?

The standard sizes you’ll see in production models:

Model family	Vocabulary size
GPT-2 (2019)	50,257
GPT-3 (2020)	50,257
Llama-1 (2023)	32,000
Llama-2 (2023)	32,000
Llama-3 (2024)	128,256
GPT-4 (2023)	~100K
Qwen3 (2025)	152,064
DeepSeek-V3 (2024)	129,280

The trend is upward — Llama-3’s 4× increase over Llama-2 reflected the design choice to handle multilingual and code data better. The cost: 4× the parameters in the embedding table (and the output projection, which is the same shape transposed). For a 4096-dim model, going from 32K vocab to 128K vocab adds 4096 · (128K − 32K) · 2 = 786M parameters to the model — comparable to a few attention layers.

The benefit: shorter sequences for the same text, so attention (Ch.13) costs less. For a transformer trained on internet text, vocab 128K typically gives ~3.5 characters per token vs vocab 32K’s ~4.5 characters per token — about 25% shorter sequences for the same content. Given attention’s quadratic cost, that’s a 1.5× compute savings on long sequences. Larger vocab buys compute savings at inference but costs more memory at every step; the tradeoff has favoured “bigger” as memory becomes more abundant and context windows grow.

— think, then check —

Larger vocab = shorter sequences for the same content. At 4096-dim, going from 32K to 128K vocab typically reduces the characters-per-token average from ~4.5 to ~3.5, a 25% sequence-length reduction on typical English / code / multilingual text.

Cost saved at inference time: attention is O(N²) in sequence length (without FlashAttention’s tricks), so a 25% length reduction is a ~44% compute reduction on the attention sublayer. For long-context models (Llama-3’s 128K context, Gemini’s 1M context), attention cost dominates everything else; sequence-length reductions at the tokeniser level translate directly to massive inference cost savings.

For a 70B model running at 128K context, ~25% shorter sequences = roughly 35% lower attention cost per request = 35% more throughput for the same hardware. That’s a much better trade than the ~1% extra parameter memory the bigger embedding table costs. As context windows grow, larger vocabularies become MORE valuable — which is why every new frontier model since 2024 has been pushing vocab size up.

↳ §11.1 vocab size

Edge cases the tokeniser handles (and sometimes botches)

A non-exhaustive list of where tokenisers get interesting:

Whitespace. GPT-style tokenisers prepend ␣ (a space placeholder) to most words — so "hello" and " hello" are different tokens. This is why models occasionally generate text with weird spacing artifacts.
Non-Latin scripts. Trained on mostly-English corpora, BPE allocates few merges to Chinese/Japanese/Korean characters, so non-Latin text gets tokenised per-character or per-byte. Llama-3’s vocab increase was largely about better non-English coverage.
Code. Programming code has dense tokens (identifiers, operators, special characters). Modern tokenisers like Qwen3’s allocate vocabulary specifically for code patterns; older general-purpose vocabularies tokenise code wastefully.
Numbers. Many tokenisers split numbers into per-digit tokens (12345 → 1 23 45 or similar), which limits arithmetic capability. Recent models tune this — e.g., Llama-3 tokenises numbers per-digit for consistency.
Byte-level vs character-level fallback. GPT-2’s byte-level BPE means every possible byte sequence is representable; older character-level BPE could fail on unusual Unicode. Modern LLMs are all byte-level for robustness.

— think, then check —

(1) Sequence length explodes. Character-level tokenisation gives ~4-5x longer sequences than BPE. For a 128K-token context window in BPE terms, character-level would need 512K-640K tokens for the same content. Attention is O(N²) in sequence length without FlashAttention tricks; even with FlashAttention’s O(N) memory + O(N²) compute, the compute cost grows by 25× for a 5× longer sequence. Same context coverage, 25× more flops per forward pass.

(2) The model wastes capacity learning word-level structure. BPE gives the model ‘the’ as one token; character-level requires the model to learn that t-h-e is a word. Empirically, character-level models need much more training data to reach the same downstream performance — recent research (ByT5, Xue et al. 2022) shows ~2-3× more training compute for comparable scores on standard benchmarks. The model is rediscovering tokenisation as part of its training.

(3) Throughput on long-context retrieval, RAG, code is decimated. Code and structured documents are MORE sensitive to sequence-length amplification than prose — code has lots of long identifiers and structural tokens that BPE handles cleanly but character-level inflates. A 32K-line code review fits in BPE’s 128K context but exceeds character-level’s effective capacity, even with the wider context window.

The “character-level avoids tokeniser complexity” argument is appealing in research papers (no tokeniser-induced biases, perfect coverage) but operationally untenable for production LLMs serving users. BPE’s complexity is small in implementation (~500 lines of code in tiktoken or tokenizers) and the wins are large. The trade is settled for production; character-level survives in specific niches (low-resource languages, code understanding) where its uniformity outweighs the inefficiency.

↳ §11.1 tokenisation tradeoffs + Ch.13 attention preview

END OF CH.11 §1 — BPE tokenisation.
Built: BPETokenize viz (interactive: type text, step through merges, see characters collapse into subword tokens). Three recall items: easy (algorithm + structural property), medium (vocab-size tradeoff with quantitative cost analysis), hard (character-level alternative critique).
Coming next: §11.2 — The embedding table. Token IDs become dense vectors via row lookup.