Product quantization + binary codes

Section 26.2

Product quantization + binary codes

HNSW gives you fast TRAVERSAL but doesn’t reduce per-vector storage or distance-computation cost. At 1B vectors × 1024 dim × fp16: that’s 2 TB of memory just for the embeddings. Untenable on any single machine. Product quantization (PQ) (Jegou 2011) compresses each vector into a small 8-32 byte code, reducing memory by 50-100×. Binary embeddings push further: 1 bit per dimension, distance via popcount. Combined with HNSW or IVF (inverted file with coarse clustering), these form the production billion-scale ANN recipe used in FAISS, Milvus, and every major vector database.

Product Quantization (PQ)

The idea: split each d-dim vector into M subvectors of dim d/M; quantize each subvector to a learned codebook entry.

PQ setup: d-dim vector x = [x_1, x_2, ..., x_M] where x_m ∈ ℝ^{d/M}. For each subvector position m, train a codebook C_m ∈ ℝ^{K × d/M} via k-means: C_m has K (typically 256) "codewords", each a d/M-dim centroid. Compression: for each subvector x_m, find its nearest codeword: code[m] = argmin_k || x_m - C_m[k] ||² Stored: M integers in [0, K). With K = 256: M bytes per vector. (vs original 4·d bytes for fp32) For d = 1024, M = 16, K = 256: 16 bytes per vector. 256× smaller than fp32. Search: Compute query subvectors: q = [q_1, q_2, ..., q_M]. Precompute distance tables: dist_table[m][k] = || q_m - C_m[k] ||² for all m, k. Tiny (M · K = 4096 floats). For each database vector: distance_estimate = sum_m dist_table[m][code[m]] This is the LOOKUP-TABLE trick: per-database-vector cost is M memory reads + M sums. For M = 16: ~16 ops per distance vs original ~2048. ~100× faster. Accuracy: estimated distance is within ~2-5% of exact for typical embeddings.

PQ ANN technique A vector compression and search algorithm (Jegou 2011) that splits each d-dim vector into M subvectors and quantizes each subvector independently via k-means with K codewords. Each vector becomes an M-integer code (M bytes if K=256). Search uses precomputed query-to-codeword distance tables and table lookups per database vector — ~M operations vs O(d) for exact search. 50-100× memory compression, 10-50× speedup. The foundation of FAISS IVF+PQ, Pinecone's PQ tier, ScaNN's compression. Tunable: more M = better recall, less compression; more K = better recall, more table memory. is the workhorse of large-scale ANN. It compresses 50-200× while preserving most of the nearest-neighbor information.

Binary embeddings — 1 bit per dimension

The extreme version: 1 bit per element. Each vector becomes a d-bit binary string.

Binary embedding: For each vector x ∈ ℝ^d: binarise by sign: b[i] = 1 if x[i] > 0 else 0 (or learned threshold) Storage: d/8 bytes per vector. For d = 1024: 128 bytes. Distance: Hamming distance (number of differing bits): hamming(b, q) = popcount(b XOR q) On modern CPUs: POPCNT instruction handles 64 bits per cycle. For d = 1024: 16 popcount ops per pair = 16 cycles = ~5 ns. Compare to fp32 dot product on the same vectors: ~50 ns. ~10× faster. Accuracy: binary embedding loses MOST of the magnitude information. For typical embeddings, recall@10 drops from 95% (exact) to 30-50% (binary). Mitigated by: - Use binary as a coarse filter; re-rank top-K with exact distance. - Use multi-bit codes (2-4 bits per dim) for better accuracy. Best use case: extremely-large-scale retrieval where you can afford recall drop.

— think, then check —

Budget:

100M vectors → 8 GB → 80 bytes per vector. (Going from 400B/vector down to 80B/vector — 5× compression.)

Wait, that’s modest compression. Let me recheck: fp32 = 4 bytes/dim × 1024 dim = 4096 bytes per vector. To fit in 80 bytes per vector: 51× compression.

PQ parameters:

If we use K = 256 codewords per subvector (1 byte per subvector index):

Storage per vector = M bytes (one byte per subvector).

To fit 80 bytes: M ≤ 80. Let’s use M = 64.

Subvector dim = 1024 / 64 = 16 dimensions per subvector.

Codebooks: M × K × d/M = 64 × 256 × 16 = 256K floats × 4 bytes = 1 MB total. Negligible.

Expected accuracy:

With M = 64, K = 256: each 16-dim subvector is quantized to one of 256 representative values. Quantization error per subvector ≈ 16 × σ² / 256 ≈ σ²/16 where σ is the typical subvector value scale.

Per-vector distance estimation error: ~ M × σ²/16 = 4σ². The total distance is roughly d × σ² = 1024σ². So distance error is ~0.4% of typical distance. Very good.

Empirically: PQ at M = 16-32 typically achieves 85-95% recall@10 on standard benchmarks; M = 64 gives 95%+.

Storage:

Per vector: 64 bytes.

Total for 100M: 6.4 GB. Fits in 8 GB budget with room for index structure.

Search cost:

Per query: build distance table (M × K = 16K float subtractions + sums = trivial).

Per database vector: 64 table lookups + 64 additions = ~128 cycles ≈ 40 ns on modern CPU.

For 100M lookups: 4 seconds linear scan. Still slow — need to combine with HNSW or IVF for efficient navigation.

Combining with IVF:

IVF partitions the database into ~10K clusters. Each query probes ~16-64 clusters (the closest to query). Reduces search space by 100-600×.

Combined: IVF + PQ = millions of vectors searched in ~10ms. Industry-standard recipe.

The full recipe (FAISS IVF+PQ):

Train: cluster vectors into N_clusters centroids. Train PQ codebooks.
Index: assign each vector to its nearest cluster. Store its PQ code in the cluster’s inverted list.
Query: find query’s top K_probe nearest clusters. Search each cluster’s inverted list with PQ distance estimation. Return top-K overall.

Tunable: N_clusters (typically √N, e.g., 10K for 100M); K_probe (typically 16-128); M, K for PQ.

↳ §26.2 + Jegou 2011

The lookup-table trick

The PQ distance computation is critical to understand:

Standard distance: d(q, x) = sum_{i=1..d} (q[i] - x[i])² PQ distance with M codes: d_pq(q, x) = sum_{m=1..M} dist_table[m][x_code[m]] where dist_table[m][k] = || q_m - C_m[k] ||² (precomputed per query) Per-database-vector cost: - M table reads (each ~5 ns from L1/L2 cache). - M additions (~1 ns each). - Total: ~6M ns per distance estimate. For M = 16: ~100 ns per distance. For d = 1024 fp32 distance: ~2000 ns. PQ is 20× faster. Memory bandwidth: PQ code: M bytes per vector → memory bandwidth limited by M, not d. For d = 1024 → M = 16: 64× less HBM traffic per database vector. Lookup tables fit in L1/L2 cache → not memory-bound. This is why PQ scales: the bottleneck shifts from "compute distance to d dim vector" to "look up M small numbers per vector" — a fundamentally cheaper operation.

Binary embeddings revisited

Modern advance: quantization-aware embedding training embedding optimization Training the embedding model itself with awareness that it will be quantized at search time. Uses a quantization-aware loss (e.g., contrastive learning with both fp32 and quantized representations) so the trained embedding's quantized form preserves more relational information than naively quantized fp32 embeddings. Cohere's binary embeddings (2024) and OpenAI's text-embedding-3-large with matryoshka encoding both use variants of this technique. aligns the embedding training to be MORE TOLERANT of quantization. The Cohere “binary” and “int8” embedding offerings train with quantization in mind, achieving 80-90% recall at 1 bit/dim compression.

Binary embedding production realities: - Cohere v3 binary: 1024-dim → 128 bytes/vector. 32× compression vs fp32. - Recall@10: 87% on STS-B with binary vs 91% on fp32. - Search throughput: 50× faster than fp32 dot product on CPU. - Hybrid: use binary for the first-pass filter, re-rank top-100 with exact distance. When binary wins: - Extremely high-throughput retrieval (millions of QPS). - Heavy-tailed budgets where memory is the constraint. - When you can re-rank top-K with the original embeddings.

— think, then check —

Per-pair Hamming distance:

For two 1024-bit binary vectors a and b:

1. XOR: a ⊕ b. Bits that differ become 1.

2. Popcount: count the 1 bits in the result.

On modern CPUs with POPCNT instruction:

1024 bits = 16 × 64-bit words.
16 × XOR (1 cycle each).
16 × POPCNT (3 cycles each on Intel).
15 × ADD to sum the popcounts.
Total: ~80 cycles ≈ 25-30 ns per pair.

Why faster than dot product:

fp32 dot product for 1024 dims: 1024 multiplies + 1023 adds = ~2000 FLOPs. Modern CPU: ~500-1000 ns due to memory bandwidth (must load 4 KB of vector data).

Binary Hamming: only 128 bytes of vector data; processed at hardware-popcount speed. ~25 ns.

Speedup: 20-40× for in-cache data.

GPU equivalent:

NVIDIA H100 has dedicated POPC instructions and SIMD popcount:

Each thread can process 32 bits per cycle with __popc.
For 1024-bit vector: 32 threads × 32 bits per cycle = 32 cycles.
At 1.5 GHz: ~21 ns per pair per thread.
With 132 SMs × 64 warps × 32 threads in parallel: billions of distance pairs per second per H100.

Specialised libraries (FAISS GPU, cuVS) use this for high-throughput binary search.

When you’d use binary:

Massive scale (1B+ vectors).
CPU-only deployment (binary is much friendlier to commodity hardware than fp16 PQ).
First-pass filtering: find candidates with binary, re-rank with exact distance.
Storage-constrained systems (mobile, edge).

↳ §26.2 + binary embeddings

FAISS IVF+PQ — the production recipe

FAISS IVF+PQ production ANN The standard billion-scale ANN recipe in FAISS (Facebook AI Similarity Search, 2017). Combines IVF (inverted file index — coarse clustering of vectors into N_clusters cells) with PQ (product quantization — store each vector's PQ code). Search: find K_probe nearest cluster centroids; scan their inverted lists with PQ distance estimation. Used by Pinecone, Milvus, Weaviate, and most production vector databases for billion+-scale collections. Tunable via N_clusters, K_probe, M, K to balance recall vs throughput. is the canonical production recipe, combining IVF (inverted file) with PQ:

FAISS IVF+PQ structure: Train phase: 1. K-means cluster the database into N_clusters centroids (typically √N). 2. For each cluster, train PQ codebooks on its assigned vectors. Index phase: For each database vector x: 1. Find its nearest cluster c. 2. Compute PQ code of x relative to c's residual: residual = x - centroid[c]. 3. Store (c, pq_code) in c's inverted list. Search phase: Given query q: 1. Find K_probe nearest cluster centroids to q. 2. For each chosen cluster: Compute query distance tables. For each (c, pq_code) in cluster's inverted list: distance_estimate = ||q - centroid||² + sum_m dist_table[m][pq_code[m]] 3. Return top-K overall. Performance (FAISS published benchmarks on 1B SIFT vectors): IVF + PQ16: ~5 ms/query at 90% recall@10. IVF + PQ32: ~10 ms/query at 95% recall@10. IVF + Flat (no PQ): ~50 ms/query at 99% recall@10. Memory: PQ16 → 16 bytes/vector → 16 GB for 1B vectors. Fits on one machine. Flat fp32 → 4 KB/vector → 4 TB. Need cluster.

— think, then check —

Setup:

10B vectors, 1024 dim. fp32 raw = 40 TB. Doesn’t fit on one machine.

Targets: 1K QPS, sub-50ms latency, 90%+ recall.

Compression:

PQ with M = 32 codes per vector, K = 256 codewords. Each vector → 32 bytes.

10B × 32 bytes = 320 GB. Fits on a single high-memory CPU server (or 4 H100s).

Expected recall@10 with PQ32: ~92-95%.

IVF clustering:

N_clusters = √N ≈ 100K clusters.

Each cluster has ~100K vectors on average.

K_probe = 64 (probe 64 nearest clusters per query). Searches ~6.4M vectors per query (down from 10B — 1500× reduction).

Per-query work: scan 6.4M PQ codes = 6.4M × 32 byte lookups = ~200 MB of memory accessed. At 100 GB/s memory bandwidth: 2 ms.

Plus cluster centroid search: K_probe nearest among 100K centroids = brute-force with PQ-compressed centroids ~ 1 ms.

Total per query: ~3-5 ms.

Hardware:

1 high-memory server (1 TB RAM) holds the full PQ index.
Or 2-4 GPUs with FAISS GPU backend (faster per-query but smaller capacity per GPU).
1K QPS throughput: ~1 server can handle if per-query ~3 ms (~300 QPS per core × 8 cores).

Replicate for redundancy + horizontal scaling: 3-5 servers serving 1K QPS each.

Refinement (optional):

For top-K candidates from PQ search, re-rank with EXACT distance on uncompressed vectors. This recovers some recall lost to PQ.

If we keep the full 1024-dim vectors on a separate storage tier (NVMe or even compressed), we can do re-ranking on the top 100 candidates per query.

Re-ranking cost: 100 × 1024-dim dot products = trivial. Adds 1 ms per query.

Final recall@10: 95-97% with re-ranking.

What this costs:

Storage: 320 GB PQ index + ~40 TB optional full embeddings (cheap, on NVMe).
Compute: 3-5 servers × $5K/year cloud cost = ~$25K/year.
Throughput: 1K QPS sustainable. 10K QPS achievable with more replicas.

For perspective: this would have been impossible in 2015 (memory and compute). PQ + IVF + HNSW have democratised billion-scale ANN.

↳ §26.2 + production deployment

Next: §26.3 — RaBitQ and TurboQuant. The very recent rotation-based quantization techniques that brought us back to the conversation that started this book.