Exact kNN → HNSW small-world graphs
Every RAG system you’ve used in 2023+ runs on vector search. The query text becomes an embedding; the database has billions of pre-computed embeddings; you need to find the K nearest. Doing this exactly (compare against every database vector) is O(N · d) per query — for N = 1 billion and d = 1024, that’s a trillion FLOPs per query. Untenable. The field invented approximate nearest neighbor (ANN) algorithms that trade some recall for orders-of-magnitude speedup. Of these, HNSW (Malkov 2018, “Hierarchical Navigable Small World”) is the dominant graph-based approach. It builds a multi-layer graph where higher layers are sparse “highways” and the bottom layer has many short-range connections; search descends from top to bottom via greedy navigation. 10-100× speedup at 90%+ recall.
The exact kNN baseline
For a query q ∈ ℝ^d and database of N vectors:
The brute-force baseline becomes impractical past ~10M vectors. For production systems, approximate methods are mandatory.
The curse of dimensionality (revisited from Ch.6)
Why is high-dim vector search particularly hard? Part of the answer comes from Ch.6 §1:
Why high-d is structurally harder:
In 2D space, a query point has clear “neighbors” — points within distance r form a disk of radius r. The expected number of database points within that disk is small (if data is spread uniformly).
In 1024D, the volume of a “ball” of radius r grows as r^1024. To contain enough data points to be the nearest neighbor, r must be only slightly less than the typical pairwise distance. The “nearest” point is only marginally distinguishable from a random point.
From Ch.6: for random unit vectors in d-dim space, mean pairwise distance is √2 with variance ~1/d. As d grows, distances concentrate; the nearest neighbor signal weakens.
Strategies that work in 2D but fail in high-d:
- kd-trees: recursively split space along axes. Works in 2-10D. In high-d, the tree degenerates: each split barely separates the data; queries traverse most of the tree.
- Hashing into spatial bins: in 2D, divide the plane into a 100×100 grid; each cell is one bin. In 1024D, dividing each axis into 2 cells gives 2¹⁰²⁴ bins — most empty, queries land in one bin that contains nothing or everything.
- R-trees / ball trees: hierarchical bounding boxes. In low-d, work well. In high-d, the bounding boxes are nearly always overlapping; pruning fails.
- Locality-Sensitive Hashing (LSH): approximate. Works at recall ~0.5 in moderate-d but degrades in high-d. Surpassed by graph and quantization methods.
What works in high-d:
- Graph-based methods (HNSW, NSG, DiskANN): the data points themselves form a navigable graph. Search follows edges, not coordinates. Robust to high-d because navigation uses pairwise distances directly.
- Quantization-based methods (PQ, RaBitQ): compress each vector into a small code; compare codes (cheaply); refine with exact distance on top candidates.
- Hybrid (FAISS IVF + PQ): coarse clustering (IVF) + fine quantization (PQ). Industry standard for billion-scale search.
The fundamental shift: in high-d, you can’t EXCLUDE most of the data via geometric reasoning (the geometry is too uniform). You must EXAMINE most of it but cheaply, using compressed representations and intelligent traversal.
HNSW — small-world graph navigation
Malkov 2018 (“Efficient and Robust Approximate Nearest Neighbor Search using Hierarchical Navigable Small World Graphs”) introduced HNSW. The structure:
HNSW’s key insight: small-world graphs have the property that any two nodes are connected by O(log N) hops. The multi-layer structure provides “fast lanes” at the top for quick navigation to the right neighborhood, then short-range edges at the bottom for precise refinement.
Now make it run
The kernel implements a simplified single-layer HNSW (the bottom layer’s greedy graph search) on 10K random 64-dim vectors:
float best_d = INFINITY;
for (int j = 0; j < N; j++) {
if (dist[j] < best_d) { best_d = dist[j]; best = j; }
}
if (best != i) g->neighbors[(size_t)i * M + picked++] = best;
dist[best] = INFINITY;
}
}
free(idx); free(dist);
}
/* Greedy search: maintain a "best-so-far" set of EF candidates ranked by
distance to q. At each step, pop the closest UNEXPLORED candidate and
examine its neighbors. Stop when all EF candidates have been explored. */
static int* search_graph(const graph_t* g, const float* data, const float* q,
int entry, int* found_count)
{
int* visited = calloc((size_t)N, sizeof(int));
/* Sorted candidate list (insertion sort keeps it sorted). */
int candidates[EF];
float cand_dist[EF];
int explored_flag[EF] = {0};
int n_cand = 0;
/* Initial candidate: entry point */
cand_dist[0] = l2_dist(q, data + (size_t)entry * D);
candidates[0] = entry;
n_cand = 1;
visited[entry] = 1;
/* Iterate: pop closest unexplored candidate, expand neighbors. */
for (int iter = 0; iter < EF * 4; iter++) {
/* Find closest unexplored. */
int pop_idx = -1;
float pop_d = INFINITY;
for (int i = 0; i < n_cand; i++) {
if (!explored_flag[i] && cand_dist[i] < pop_d) {
pop_d = cand_dist[i];
pop_idx = i;
}
}
if (pop_idx < 0) break; /* nothing to explore */
explored_flag[pop_idx] = 1;
int node = candidates[pop_idx];
/* Look at neighbors. */
const int* nbrs = g->neighbors + (size_t)node * M;
for (int j = 0; j < M; j++) {
int n = nbrs[j];
if (visited[n]) continue;
visited[n] = 1;
float d = l2_dist(q, data + (size_t)n * D);
/* Insert into candidate list if it's closer than the worst current. */
if (n_cand < EF) {
candidates[n_cand] = n;
cand_dist[n_cand] = d;
explored_flag[n_cand] = 0;
n_cand++;
} else {
/* Find the worst, replace if better. */
int worst_idx = 0;
float worst_d = cand_dist[0];Output:
HNSW vs exact k-NN: N=10000, d=64, M=8, EF=64, K=10
Building HNSW graph...
Graph built in 2.08 seconds.
100 queries:
Exact k-NN: 0.030 seconds (303.0 μs/query)
HNSW search: 0.003 seconds (30.0 μs/query)
Speedup: 10.1x
Recall@10: 52.6% (526 / 1000 nearest neighbors found)
10× speedup at 53% recall. The recall is lower than a full multi-layer HNSW would achieve (real HNSW typically hits 95%+ at similar speedup) — my simplified single-layer version lacks the upper-layer “fast lanes” that quickly route the search to the right neighborhood. The kernel demonstrates the GREEDY GRAPH TRAVERSAL principle; production implementations layer on the hierarchy + tuning.
Brute force feasibility:
Per-query cost: 100M · 768 · 2 (dot product) = 153 GFLOPs.
On H100 at peak FLOPs (1 PFLOPs): 0.15 ms theoretical. But this is decode-mode (memory-bound), realistic ~10-20% utilisation: ~1-2 ms.
Memory: 100M · 768 · 2 bytes (fp16) = 153 GB. Doesn’t fit on one H100. Need 2× H100s with model parallelism, or use cheaper hardware with more RAM.
Brute force IS feasible for 100M at 1ms with H100-class hardware. But not great economics: serving 1000 QPS would need many GPUs just for similarity search.
HNSW:
Per-query: ~log(N) · d · ef ≈ ~30 · 768 · 100 = 2.3 MFLOPs. Trivial compute.
Memory: indices add ~10-20% overhead over raw vectors. ~180 GB.
Latency: ~0.1-1 ms per query. Recall@10: ~95% with good tuning.
Throughput: thousands of QPS per CPU server.
HNSW wins on cost-per-query. For 100M-scale RAG, this is the right answer.
Alternative: FAISS IVF+PQ:
For datasets approaching 1B+, HNSW’s memory overhead becomes prohibitive (it stores all vectors uncompressed). FAISS IVF+PQ adds quantization: store compressed vectors that take ~10× less memory.
At 100M: HNSW alone is fine.
At 1B+: IVF + PQ + HNSW for refinement is typical.
The decision:
Use HNSW. Industry-standard tools (FAISS, Milvus, Pinecone) provide this. Build offline, query online. Sub-100ms latency easily achievable.
For the rare extreme cases (10B+ vectors, billion QPS, ultra-tight latency budgets), you move to specialised hardware (GPUs for batch queries, AWS Inferentia-style accelerators for compressed-vector serving).
Layer assignment:
Each point is assigned a layer i with probability (1/M)^i. So:
- Every point appears in layer 0 (probability 1).
- ~1/M of points appear in layer 1.
- ~1/M² of points appear in layer 2.
- …
- ~1/M^L of points appear in the top layer.
For M = 16 and N = 1M: layer 0 = 1M points, layer 1 = 62K, layer 2 = 4K, layer 3 = 244, layer 4 = 15, layer 5 = 1.
This gives an exponentially-sparse hierarchy.
Connection structure within each layer:
Within layer i, each point is connected to its ~M nearest neighbors among points that are ALSO in layer i. Critically: the nearest neighbors at layer 1 are different from those at layer 0 (because layer 1 has fewer points, so the “nearest” is farther).
Result: layer 0 has dense local connections; layer 1 has sparser, longer-range connections; top layers are sparse highways.
Search dynamics:
- Top layer: a few points; do a brute-force-ish search to find the closest.
- Descend to layer i-1: use the closest point at layer i as the entry point for layer i-1. Greedy search at layer i-1 (hop to neighbors, find closer ones).
- Continue down: each layer’s search has the prior layer’s result as entry. Sparser layers do “long-range” exploration; denser layers do “fine-grained” search.
- At layer 0: the entry point is already near the true answer. The search just needs to refine. Maintain candidate set of EF size; expand greedily until no improvement.
Why this is small-world:
A “small world” graph has the property that diameter (longest shortest path) is O(log N).
HNSW’s hierarchy ensures this by construction: at each layer, the points are spread out (random subset). Long-range connections at higher layers act as “shortcuts” that let the search hop large distances quickly.
Formal analysis (Malkov 2018): search visits O(log N) points at each layer, and there are O(log N) layers. Total: O((log N)²) operations per query. For N = 10⁹: ~900 operations — sub-millisecond.
Insertion (build time):
To insert a new point:
- Sample its top layer (geometric distribution).
- From the entry point, search top-down for the closest layer points.
- At each layer the point will appear in: connect it to its M nearest neighbors at that layer.
- Bidirectional links: also update the neighbors to include the new point.
Insertion cost: O(log N · d · M). Building the full index for N points: O(N · log N · d · M). For 10⁹ vectors at d = 1024 and M = 16: ~16 hours on a single CPU. Done offline; not a serving-time cost.
Tunable parameters:
- M: connectivity per node. Higher M = more memory + more accurate search.
- ef_construction: candidate set during build. Higher = better build quality, slower build.
- ef_search: candidate set during query. Higher = better recall, slower query.
Production tuning: M = 16-64; ef_construction = 200; ef_search = 50-200 depending on recall target.
Next: §26.2 — Product quantization and binary codes. The complementary approach to HNSW: compress each vector into a small code, do search on codes. Combined with HNSW, the production billion-scale recipe.