The embedding table
§11.1 took text and produced a sequence of integer token IDs. This section makes them vectors. The mechanism is a single learned matrix W₍ₑₘᵦₑ𝒹₎ of shape (vocab × d), where each row is the dense vector representation of one token. Look up token ID k → take row k. Done. The conceptual move is small but consequential: the model’s understanding of every concept lives in this matrix’s rows. Vocabulary frequency, semantic similarity, syntactic role, language and modality — all encoded as positions in the 4096-dim (or higher) embedding space, learned during training. This section nails down the mechanics, the cost (this one matrix is often 5–15% of a model’s parameters), and the standard “tied embedding” trick that halves the parameter cost.
Look up by row
The mathematics is one of the simplest operations in the book:
For a batch of sequences with token IDs B × N, the embedded result is B × N × d — the standard transformer input shape. Every later layer reads tokens as vectors, not as IDs.
The viz shows five tokens being looked up from a tiny vocab-12 embedding table. The selected rows (highlighted) become the output sequence on the right. Same operation, just one row per token, no math beyond pointer indexing.
The “matmul” formulation — multiplying a one-hot vector by W₍ₑₘᵦₑ𝒹₎ — is mathematically correct but operationally absurd: a 128K-dim one-hot times a (128K × 4096) matrix is 524M multiplications to produce a single row that we could have just read. Every framework implements embedding as gather:
// embedding forward (PyTorch's nn.Embedding, JAX's jax.nn.embed, etc.):
for (int b = 0; b < B; b++)
for (int t = 0; t < N; t++)
memcpy(&output[b * N * d + t * d], &W[token_ids[b * N + t] * d], d * sizeof(float));
One d-element memcpy per token. On modern hardware (CUDA, Apple Silicon, AVX), this is a hardware-accelerated gather that runs at memory bandwidth — no matmul kernel needed.
For vocab V = 128K and embedding dim d = 4096: one_hot @ W_embed requires V·d = ~500M multiplications, almost all by zero. Even though the result is the same as just reading row k of W_embed, the matmul wastes 99.9…% of its compute on the zeros of the one-hot vector.
Implemented as a gather (memcpy of d floats from row k), the cost is O(d) memory accesses instead of O(V·d) flops. On modern hardware this is 1 cache-line-aligned memcpy per token, running at memory-bandwidth speed.
The gather formulation is also more cache-friendly: only the rows of W_embed that the batch actually needs are touched, so for a sequence of 1000 tokens we touch ~1000 rows of the 128K-row table — not the whole thing. Random sparse access patterns like this are exactly what GPU memory hierarchies are good at.
Tied embeddings — half the parameters for free
The same vocabulary appears twice in a typical LLM: once at the input (the embedding table that turns token IDs into vectors) and once at the output (the projection from hidden state back to vocabulary logits, used by softmax + cross-entropy from Ch.9 §1). Both matrices are shape (vocab × d) — they’re the same shape, just used in opposite directions.
Tied embeddings share the parameters between the two matrices:
The intuition: input embeddings map “token ID → semantic vector” and output projections map “semantic vector → distribution over token IDs.” Both directions cross the same vocabulary-vs-semantics boundary, so it’s natural to use the same parameters. Empirically, Press & Wolf 2017 (“Using the Output Embedding to Improve Language Models”) showed loss-equivalent results between tied and untied at the same total parameter count — tied gets the same loss with half the parameters allocated to embeddings, freeing those parameters for additional transformer layers.
Modern LLMs from GPT-2 onward almost all use tied embeddings. The exception is when the input and output spaces differ — e.g., multilingual translation where the source and target languages have different vocabularies.
The input embedding matrix W_embed and the output projection matrix W_out are shape (vocab × d) and (d × vocab) respectively. ‘Tied’ means they share parameters: W_out = W_embed^T. The model has ONE (vocab × d) matrix, used in two places.
Why no quality loss: both matrices are doing the same job in different directions — mapping between the vocabulary space (V-dim) and the model’s semantic space (d-dim). The input direction asks ‘what’s the vector for this token ID?’ and the output direction asks ‘how compatible is this hidden state with each token in the vocab?’ Both are dot-products of token vectors with semantic vectors. Empirically (Press & Wolf 2017) the constraint W_out = W_embed^T doesn’t hurt; the matrices were already learning related representations under the gradient.
Operational benefit: halves embedding-related parameters. For a 4096-dim model with vocab 128K, untied embeddings cost 1.05B params (524M each); tied costs 524M total. Saves 524M parameters that can be spent on more transformer layers or wider hidden states — typically a better use of the budget.
Cost analysis at frontier scale
What does the embedding table actually cost?
The embedding table itself is a small fraction of total parameters at frontier scale, but it dominates two operations:
- Token-by-token decoding cost. Every generated token requires reading the embedding row for the previous token (cheap) AND running the output projection across the full vocab (~500M params × 2 bytes = 1 GB) to compute the next-token logits. The latter is a real bottleneck — it’s the matmul behind every “predict the next token” forward pass.
- Initial loading. At model load time, the embedding table is one of the largest single tensors. Memory-mapping the GGUF (Ch.S §3) means we can start inference before the whole table is paged in — but the first few tokens hit cold-cache embedding reads.
This is also why LoRA typically doesn’t adapt the embedding table — there are usually too many rows (the full vocab) and the gradient signal per row is too sparse to make a low-rank update effective. LoRA usually targets the attention and MLP weights, leaving embeddings frozen at their pretrained values.
The output projection (lm_head) computes logits for every vocabulary entry at every token position:
logits = hidden_state · W_lm_head^T → shape (seq_len, V).
Concretely: at each generated token, we multiply the (1 × 4096) final hidden state by the (4096 × 128K) tied embedding matrix to get a (1 × 128K) logit vector. That’s ~500M multiply-adds per generated token, just for the lm_head.
Compare to a single attention forward at seq_len 1 in the same model: ~128M flops. The lm_head is ~4× more expensive than one attention layer’s forward pass at decode time — and it runs once per token.
Why tied embeddings don’t help: tying saves the PARAMETER count, but the COMPUTATION at the lm_head step is unchanged. We still have to multiply the hidden state by all V·d = 524M weights to get all V logits. The matmul is the same; we’ve just stored the weights once and used them in two places.
The real fixes for output-projection cost are (a) smaller vocab where possible, (b) speculative decoding (Ch.S §1) which amortises the lm_head over multiple speculative tokens per verifier pass, (c) hierarchical or adaptive softmax variants that compute only a subset of logits. Production inference stacks combine all three at different scales.
END OF CH.11 §2 — The embedding table.
Built: EmbeddingLookup viz (5 tokens, 12-row table, see the gather + the resulting embedded sequence). Three recall items: easy (why gather not matmul), medium (tied embeddings), hard (output-projection cost analysis and why tying doesn’t fix it).
Coming next: §11.3 — Positional encoding → RoPE. How the model knows token order.