TOKENS & EMBEDDINGS
Section 11.2
02

The embedding table

§11.1 took text and produced a sequence of integer token IDs. This section makes them vectors. The mechanism is a single learned matrix W₍ₑₘᵦₑ𝒹₎ of shape (vocab × d), where each row is the dense vector representation of one token. Look up token ID k → take row k. Done. The conceptual move is small but consequential: the model’s understanding of every concept lives in this matrix’s rows. Vocabulary frequency, semantic similarity, syntactic role, language and modality — all encoded as positions in the 4096-dim (or higher) embedding space, learned during training. This section nails down the mechanics, the cost (this one matrix is often 5–15% of a model’s parameters), and the standard “tied embedding” trick that halves the parameter cost.

Look up by row

The mathematics is one of the simplest operations in the book:

x_t = W_embed [ token_id_t , : ] ← row token_id_t of the embedding matrix equivalently: x_t = one_hot(token_id_t)ᵀ · W_embed (a matmul) but implemented as a GATHER, not a matmul: x_t = W_embed[token_id_t] ← one memory access, O(d) bytes

For a batch of sequences with token IDs B × N, the embedded result is B × N × d — the standard transformer input shape. Every later layer reads tokens as vectors, not as IDs.

tokens (IDs)
▁the 2
▁quick 5
▁brown 7
▁fox 11
. 0
row lookup
W_embed (vocab=12, d=8)
0
-0.26-0.930.42-0.170.16-0.180.11-0.03
1
-0.440.360.30-0.460.11-0.960.90-0.80
2
-0.550.34-0.100.030.81-0.980.54-0.77
3
-0.730.94-0.380.11-0.80-0.180.79-0.98
4
-0.180.54-0.090.27-0.810.090.38-0.93
5
-0.250.57-0.310.97-0.320.26-0.250.59
6
-0.500.57-0.880.43-0.070.14-0.940.87
7
-0.480.57-0.160.00-0.330.67-0.160.55
8
0.52-0.43-0.940.50-0.020.42-0.060.08
9
0.02-0.620.290.58-0.130.74-0.850.61
10
0.59-0.080.74-0.26-0.470.84-0.090.23
11
0.61-0.190.63-0.370.92-0.30-0.900.49
output: (seq_len=5, d=8) — one row per token
▁the
-0.550.34-0.100.030.81-0.980.54-0.77
▁quick
-0.250.57-0.310.97-0.320.26-0.250.59
▁brown
-0.480.57-0.160.00-0.330.67-0.160.55
▁fox
0.61-0.190.63-0.370.92-0.30-0.900.49
.
-0.26-0.930.42-0.170.16-0.180.11-0.03
Each token ID indexes a row of the embedding matrix. The lookup is implemented as a gather, not as a matrix multiply against a one-hot vector (though the two are mathematically equivalent). Cost: one cache-aligned memory access per token instead of O(vocab · d) ops. For vocab 128K and d 4096, the embedding table is 524 MB in float32 / 1 GB in bfloat16 — comparable to a full attention layer of parameters.
Embedding lookup: token ID → row of a learned matrix. The math is "select row i" — equivalent to a one-hot eᵢᵀ · W_embed but implemented as a gather. Modern LLMs spend ~5-15% of parameters here (and another 5-15% on the tied output projection).

The viz shows five tokens being looked up from a tiny vocab-12 embedding table. The selected rows (highlighted) become the output sequence on the right. Same operation, just one row per token, no math beyond pointer indexing.

The “matmul” formulation — multiplying a one-hot vector by W₍ₑₘᵦₑ𝒹₎ — is mathematically correct but operationally absurd: a 128K-dim one-hot times a (128K × 4096) matrix is 524M multiplications to produce a single row that we could have just read. Every framework implements embedding as gather:

// embedding forward (PyTorch's nn.Embedding, JAX's jax.nn.embed, etc.):
for (int b = 0; b < B; b++)
  for (int t = 0; t < N; t++)
    memcpy(&output[b * N * d + t * d], &W[token_ids[b * N + t] * d], d * sizeof(float));

One d-element memcpy per token. On modern hardware (CUDA, Apple Silicon, AVX), this is a hardware-accelerated gather that runs at memory bandwidth — no matmul kernel needed.

— think, then check —

For vocab V = 128K and embedding dim d = 4096: one_hot @ W_embed requires V·d = ~500M multiplications, almost all by zero. Even though the result is the same as just reading row k of W_embed, the matmul wastes 99.9…% of its compute on the zeros of the one-hot vector.

Implemented as a gather (memcpy of d floats from row k), the cost is O(d) memory accesses instead of O(V·d) flops. On modern hardware this is 1 cache-line-aligned memcpy per token, running at memory-bandwidth speed.

The gather formulation is also more cache-friendly: only the rows of W_embed that the batch actually needs are touched, so for a sequence of 1000 tokens we touch ~1000 rows of the 128K-row table — not the whole thing. Random sparse access patterns like this are exactly what GPU memory hierarchies are good at.

Tied embeddings — half the parameters for free

The same vocabulary appears twice in a typical LLM: once at the input (the embedding table that turns token IDs into vectors) and once at the output (the projection from hidden state back to vocabulary logits, used by softmax + cross-entropy from Ch.9 §1). Both matrices are shape (vocab × d) — they’re the same shape, just used in opposite directions.

Tied embeddings share the parameters between the two matrices:

Tied: W_input ≡ W_outputᵀ ← one set of parameters, used both ways Untied: W_input separate from W_output ← two matrices, 2× the cost Forward pass (tied): embedded_input = W_input[token_id] ← gather, embedding side ... final_logits = hidden_state · W_inputᵀ ← matmul, output side

The intuition: input embeddings map “token ID → semantic vector” and output projections map “semantic vector → distribution over token IDs.” Both directions cross the same vocabulary-vs-semantics boundary, so it’s natural to use the same parameters. Empirically, Press & Wolf 2017 (“Using the Output Embedding to Improve Language Models”) showed loss-equivalent results between tied and untied at the same total parameter count — tied gets the same loss with half the parameters allocated to embeddings, freeing those parameters for additional transformer layers.

Modern LLMs from GPT-2 onward almost all use tied embeddings. The exception is when the input and output spaces differ — e.g., multilingual translation where the source and target languages have different vocabularies.

— think, then check —

The input embedding matrix W_embed and the output projection matrix W_out are shape (vocab × d) and (d × vocab) respectively. ‘Tied’ means they share parameters: W_out = W_embed^T. The model has ONE (vocab × d) matrix, used in two places.

Why no quality loss: both matrices are doing the same job in different directions — mapping between the vocabulary space (V-dim) and the model’s semantic space (d-dim). The input direction asks ‘what’s the vector for this token ID?’ and the output direction asks ‘how compatible is this hidden state with each token in the vocab?’ Both are dot-products of token vectors with semantic vectors. Empirically (Press & Wolf 2017) the constraint W_out = W_embed^T doesn’t hurt; the matrices were already learning related representations under the gradient.

Operational benefit: halves embedding-related parameters. For a 4096-dim model with vocab 128K, untied embeddings cost 1.05B params (524M each); tied costs 524M total. Saves 524M parameters that can be spent on more transformer layers or wider hidden states — typically a better use of the budget.

Cost analysis at frontier scale

What does the embedding table actually cost?

embedding_params = V · d output_params = V · d (untied) or 0 (tied) For d = 4096, V = 128K: embedding only: 524M params embedding + output: 1.05B params (untied) or 524M (tied) In bytes (bfloat16): 524M params × 2 bytes = ~1 GB For a 70B model: embedding + output (tied) ≈ 524M / 70B = ~0.75% of total params ...but ~5–10% of memory bandwidth budget (gather-heavy), and the OUTPUT projection is one of the slowest ops in inference because it's a (d × V) matmul against the final hidden state — V is 128K wide.

The embedding table itself is a small fraction of total parameters at frontier scale, but it dominates two operations:

This is also why LoRA typically doesn’t adapt the embedding table — there are usually too many rows (the full vocab) and the gradient signal per row is too sparse to make a low-rank update effective. LoRA usually targets the attention and MLP weights, leaving embeddings frozen at their pretrained values.

— think, then check —

The output projection (lm_head) computes logits for every vocabulary entry at every token position:

logits = hidden_state · W_lm_head^T → shape (seq_len, V).

Concretely: at each generated token, we multiply the (1 × 4096) final hidden state by the (4096 × 128K) tied embedding matrix to get a (1 × 128K) logit vector. That’s ~500M multiply-adds per generated token, just for the lm_head.

Compare to a single attention forward at seq_len 1 in the same model: ~128M flops. The lm_head is ~4× more expensive than one attention layer’s forward pass at decode time — and it runs once per token.

Why tied embeddings don’t help: tying saves the PARAMETER count, but the COMPUTATION at the lm_head step is unchanged. We still have to multiply the hidden state by all V·d = 524M weights to get all V logits. The matmul is the same; we’ve just stored the weights once and used them in two places.

The real fixes for output-projection cost are (a) smaller vocab where possible, (b) speculative decoding (Ch.S §1) which amortises the lm_head over multiple speculative tokens per verifier pass, (c) hierarchical or adaptive softmax variants that compute only a subset of logits. Production inference stacks combine all three at different scales.

END OF CH.11 §2 — The embedding table.
Built: EmbeddingLookup viz (5 tokens, 12-row table, see the gather + the resulting embedded sequence). Three recall items: easy (why gather not matmul), medium (tied embeddings), hard (output-projection cost analysis and why tying doesn’t fix it).
Coming next: §11.3 — Positional encoding → RoPE. How the model knows token order.