VECTOR SEARCH & ANN
Section 26.2
02

Product quantization + binary codes

HNSW gives you fast TRAVERSAL but doesn’t reduce per-vector storage or distance-computation cost. At 1B vectors × 1024 dim × fp16: that’s 2 TB of memory just for the embeddings. Untenable on any single machine. Product quantization (PQ) (Jegou 2011) compresses each vector into a small 8-32 byte code, reducing memory by 50-100×. Binary embeddings push further: 1 bit per dimension, distance via popcount. Combined with HNSW or IVF (inverted file with coarse clustering), these form the production billion-scale ANN recipe used in FAISS, Milvus, and every major vector database.

Product Quantization (PQ)

The idea: split each d-dim vector into M subvectors of dim d/M; quantize each subvector to a learned codebook entry.

PQ setup: d-dim vector x = [x_1, x_2, ..., x_M] where x_m ∈ ℝ^{d/M}. For each subvector position m, train a codebook C_m ∈ ℝ^{K × d/M} via k-means: C_m has K (typically 256) "codewords", each a d/M-dim centroid. Compression: for each subvector x_m, find its nearest codeword: code[m] = argmin_k || x_m - C_m[k] ||² Stored: M integers in [0, K). With K = 256: M bytes per vector. (vs original 4·d bytes for fp32) For d = 1024, M = 16, K = 256: 16 bytes per vector. 256× smaller than fp32. Search: Compute query subvectors: q = [q_1, q_2, ..., q_M]. Precompute distance tables: dist_table[m][k] = || q_m - C_m[k] ||² for all m, k. Tiny (M · K = 4096 floats). For each database vector: distance_estimate = sum_m dist_table[m][code[m]] This is the LOOKUP-TABLE trick: per-database-vector cost is M memory reads + M sums. For M = 16: ~16 ops per distance vs original ~2048. ~100× faster. Accuracy: estimated distance is within ~2-5% of exact for typical embeddings.

PQ is the workhorse of large-scale ANN. It compresses 50-200× while preserving most of the nearest-neighbor information.

Binary embeddings — 1 bit per dimension

The extreme version: 1 bit per element. Each vector becomes a d-bit binary string.

Binary embedding: For each vector x ∈ ℝ^d: binarise by sign: b[i] = 1 if x[i] > 0 else 0 (or learned threshold) Storage: d/8 bytes per vector. For d = 1024: 128 bytes. Distance: Hamming distance (number of differing bits): hamming(b, q) = popcount(b XOR q) On modern CPUs: POPCNT instruction handles 64 bits per cycle. For d = 1024: 16 popcount ops per pair = 16 cycles = ~5 ns. Compare to fp32 dot product on the same vectors: ~50 ns. ~10× faster. Accuracy: binary embedding loses MOST of the magnitude information. For typical embeddings, recall@10 drops from 95% (exact) to 30-50% (binary). Mitigated by: - Use binary as a coarse filter; re-rank top-K with exact distance. - Use multi-bit codes (2-4 bits per dim) for better accuracy. Best use case: extremely-large-scale retrieval where you can afford recall drop.
— think, then check —

Budget:

100M vectors → 8 GB → 80 bytes per vector. (Going from 400B/vector down to 80B/vector — 5× compression.)

Wait, that’s modest compression. Let me recheck: fp32 = 4 bytes/dim × 1024 dim = 4096 bytes per vector. To fit in 80 bytes per vector: 51× compression.

PQ parameters:

If we use K = 256 codewords per subvector (1 byte per subvector index):

Storage per vector = M bytes (one byte per subvector).

To fit 80 bytes: M ≤ 80. Let’s use M = 64.

Subvector dim = 1024 / 64 = 16 dimensions per subvector.

Codebooks: M × K × d/M = 64 × 256 × 16 = 256K floats × 4 bytes = 1 MB total. Negligible.

Expected accuracy:

With M = 64, K = 256: each 16-dim subvector is quantized to one of 256 representative values. Quantization error per subvector ≈ 16 × σ² / 256 ≈ σ²/16 where σ is the typical subvector value scale.

Per-vector distance estimation error: ~ M × σ²/16 = 4σ². The total distance is roughly d × σ² = 1024σ². So distance error is ~0.4% of typical distance. Very good.

Empirically: PQ at M = 16-32 typically achieves 85-95% recall@10 on standard benchmarks; M = 64 gives 95%+.

Storage:

Per vector: 64 bytes.

Total for 100M: 6.4 GB. Fits in 8 GB budget with room for index structure.

Search cost:

Per query: build distance table (M × K = 16K float subtractions + sums = trivial).

Per database vector: 64 table lookups + 64 additions = ~128 cycles ≈ 40 ns on modern CPU.

For 100M lookups: 4 seconds linear scan. Still slow — need to combine with HNSW or IVF for efficient navigation.

Combining with IVF:

IVF partitions the database into ~10K clusters. Each query probes ~16-64 clusters (the closest to query). Reduces search space by 100-600×.

Combined: IVF + PQ = millions of vectors searched in ~10ms. Industry-standard recipe.

The full recipe (FAISS IVF+PQ):

  1. Train: cluster vectors into N_clusters centroids. Train PQ codebooks.
  2. Index: assign each vector to its nearest cluster. Store its PQ code in the cluster’s inverted list.
  3. Query: find query’s top K_probe nearest clusters. Search each cluster’s inverted list with PQ distance estimation. Return top-K overall.

Tunable: N_clusters (typically √N, e.g., 10K for 100M); K_probe (typically 16-128); M, K for PQ.

The lookup-table trick

The PQ distance computation is critical to understand:

Standard distance: d(q, x) = sum_{i=1..d} (q[i] - x[i])² PQ distance with M codes: d_pq(q, x) = sum_{m=1..M} dist_table[m][x_code[m]] where dist_table[m][k] = || q_m - C_m[k] ||² (precomputed per query) Per-database-vector cost: - M table reads (each ~5 ns from L1/L2 cache). - M additions (~1 ns each). - Total: ~6M ns per distance estimate. For M = 16: ~100 ns per distance. For d = 1024 fp32 distance: ~2000 ns. PQ is 20× faster. Memory bandwidth: PQ code: M bytes per vector → memory bandwidth limited by M, not d. For d = 1024 → M = 16: 64× less HBM traffic per database vector. Lookup tables fit in L1/L2 cache → not memory-bound. This is why PQ scales: the bottleneck shifts from "compute distance to d dim vector" to "look up M small numbers per vector" — a fundamentally cheaper operation.

Binary embeddings revisited

Modern advance: quantization-aware embedding training aligns the embedding training to be MORE TOLERANT of quantization. The Cohere “binary” and “int8” embedding offerings train with quantization in mind, achieving 80-90% recall at 1 bit/dim compression.

Binary embedding production realities: - Cohere v3 binary: 1024-dim → 128 bytes/vector. 32× compression vs fp32. - Recall@10: 87% on STS-B with binary vs 91% on fp32. - Search throughput: 50× faster than fp32 dot product on CPU. - Hybrid: use binary for the first-pass filter, re-rank top-100 with exact distance. When binary wins: - Extremely high-throughput retrieval (millions of QPS). - Heavy-tailed budgets where memory is the constraint. - When you can re-rank top-K with the original embeddings.
— think, then check —

Per-pair Hamming distance:

For two 1024-bit binary vectors a and b:

1. XOR: a ⊕ b. Bits that differ become 1.

2. Popcount: count the 1 bits in the result.

On modern CPUs with POPCNT instruction:

  • 1024 bits = 16 × 64-bit words.
  • 16 × XOR (1 cycle each).
  • 16 × POPCNT (3 cycles each on Intel).
  • 15 × ADD to sum the popcounts.
  • Total: ~80 cycles ≈ 25-30 ns per pair.

Why faster than dot product:

fp32 dot product for 1024 dims: 1024 multiplies + 1023 adds = ~2000 FLOPs. Modern CPU: ~500-1000 ns due to memory bandwidth (must load 4 KB of vector data).

Binary Hamming: only 128 bytes of vector data; processed at hardware-popcount speed. ~25 ns.

Speedup: 20-40× for in-cache data.

GPU equivalent:

NVIDIA H100 has dedicated POPC instructions and SIMD popcount:

  • Each thread can process 32 bits per cycle with __popc.
  • For 1024-bit vector: 32 threads × 32 bits per cycle = 32 cycles.
  • At 1.5 GHz: ~21 ns per pair per thread.
  • With 132 SMs × 64 warps × 32 threads in parallel: billions of distance pairs per second per H100.

Specialised libraries (FAISS GPU, cuVS) use this for high-throughput binary search.

When you’d use binary:

  • Massive scale (1B+ vectors).
  • CPU-only deployment (binary is much friendlier to commodity hardware than fp16 PQ).
  • First-pass filtering: find candidates with binary, re-rank with exact distance.
  • Storage-constrained systems (mobile, edge).

FAISS IVF+PQ — the production recipe

FAISS IVF+PQ is the canonical production recipe, combining IVF (inverted file) with PQ:

FAISS IVF+PQ structure: Train phase: 1. K-means cluster the database into N_clusters centroids (typically √N). 2. For each cluster, train PQ codebooks on its assigned vectors. Index phase: For each database vector x: 1. Find its nearest cluster c. 2. Compute PQ code of x relative to c's residual: residual = x - centroid[c]. 3. Store (c, pq_code) in c's inverted list. Search phase: Given query q: 1. Find K_probe nearest cluster centroids to q. 2. For each chosen cluster: Compute query distance tables. For each (c, pq_code) in cluster's inverted list: distance_estimate = ||q - centroid||² + sum_m dist_table[m][pq_code[m]] 3. Return top-K overall. Performance (FAISS published benchmarks on 1B SIFT vectors): IVF + PQ16: ~5 ms/query at 90% recall@10. IVF + PQ32: ~10 ms/query at 95% recall@10. IVF + Flat (no PQ): ~50 ms/query at 99% recall@10. Memory: PQ16 → 16 bytes/vector → 16 GB for 1B vectors. Fits on one machine. Flat fp32 → 4 KB/vector → 4 TB. Need cluster.
— think, then check —

Setup:

10B vectors, 1024 dim. fp32 raw = 40 TB. Doesn’t fit on one machine.

Targets: 1K QPS, sub-50ms latency, 90%+ recall.

Compression:

PQ with M = 32 codes per vector, K = 256 codewords. Each vector → 32 bytes.

10B × 32 bytes = 320 GB. Fits on a single high-memory CPU server (or 4 H100s).

Expected recall@10 with PQ32: ~92-95%.

IVF clustering:

N_clusters = √N ≈ 100K clusters.

Each cluster has ~100K vectors on average.

K_probe = 64 (probe 64 nearest clusters per query). Searches ~6.4M vectors per query (down from 10B — 1500× reduction).

Per-query work: scan 6.4M PQ codes = 6.4M × 32 byte lookups = ~200 MB of memory accessed. At 100 GB/s memory bandwidth: 2 ms.

Plus cluster centroid search: K_probe nearest among 100K centroids = brute-force with PQ-compressed centroids ~ 1 ms.

Total per query: ~3-5 ms.

Hardware:

  • 1 high-memory server (1 TB RAM) holds the full PQ index.
  • Or 2-4 GPUs with FAISS GPU backend (faster per-query but smaller capacity per GPU).
  • 1K QPS throughput: ~1 server can handle if per-query ~3 ms (~300 QPS per core × 8 cores).

Replicate for redundancy + horizontal scaling: 3-5 servers serving 1K QPS each.

Refinement (optional):

For top-K candidates from PQ search, re-rank with EXACT distance on uncompressed vectors. This recovers some recall lost to PQ.

If we keep the full 1024-dim vectors on a separate storage tier (NVMe or even compressed), we can do re-ranking on the top 100 candidates per query.

Re-ranking cost: 100 × 1024-dim dot products = trivial. Adds 1 ms per query.

Final recall@10: 95-97% with re-ranking.

What this costs:

  • Storage: 320 GB PQ index + ~40 TB optional full embeddings (cheap, on NVMe).
  • Compute: 3-5 servers × $5K/year cloud cost = ~$25K/year.
  • Throughput: 1K QPS sustainable. 10K QPS achievable with more replicas.

For perspective: this would have been impossible in 2015 (memory and compute). PQ + IVF + HNSW have democratised billion-scale ANN.

Next: §26.3 — RaBitQ and TurboQuant. The very recent rotation-based quantization techniques that brought us back to the conversation that started this book.