VECTOR SEARCH & ANN
Section 26.1
01

Exact kNN → HNSW small-world graphs

Every RAG system you’ve used in 2023+ runs on vector search. The query text becomes an embedding; the database has billions of pre-computed embeddings; you need to find the K nearest. Doing this exactly (compare against every database vector) is O(N · d) per query — for N = 1 billion and d = 1024, that’s a trillion FLOPs per query. Untenable. The field invented approximate nearest neighbor (ANN) algorithms that trade some recall for orders-of-magnitude speedup. Of these, HNSW (Malkov 2018, “Hierarchical Navigable Small World”) is the dominant graph-based approach. It builds a multi-layer graph where higher layers are sparse “highways” and the bottom layer has many short-range connections; search descends from top to bottom via greedy navigation. 10-100× speedup at 90%+ recall.

The exact kNN baseline

For a query q ∈ ℝ^d and database of N vectors:

Exact kNN by brute force: for each x_i in database: d_i = || q - x_i ||² (or dot product, or cosine — choose your metric) sort by d_i; return top K. Cost per query: O(N · d) FLOPs for distance computation + O(N log K) for partial sort. Memory: O(N · d) for the database. For N = 10⁶, d = 768 (sentence-transformer typical): Per-query FLOPs: 1.5 · 10⁹. At H100 throughput: ~1.5 ms. At 100 queries/sec: 150 ms total compute. Tolerable. For N = 10⁹, d = 1024 (production-scale embedding store): Per-query FLOPs: 2 · 10¹². At H100: ~2 seconds. At 100 queries/sec: 200 seconds. Catastrophic.

The brute-force baseline becomes impractical past ~10M vectors. For production systems, approximate methods are mandatory.

The curse of dimensionality (revisited from Ch.6)

Why is high-dim vector search particularly hard? Part of the answer comes from Ch.6 §1:

In high dimensions: - Pairwise distances become nearly equal (mass of d-ball is on the surface). - The "nearest" vector is only marginally closer than the "average" vector. - Locality-sensitive techniques that work in 2-3D fail in 1024D. For random unit vectors in d dimensions: E[ || a - b ||² ] = 2 (independent of d). Var[ || a - b ||² ] ≈ 8/d (decreases with d). So in high d, all pairs have distance ~√2 with tiny variance. The "nearest" vector to a query is barely different from a random vector. Practical consequence: Any ANN method must reach recall ≥ 0.9 to be useful. Methods that achieve "approximate within 2-3x" of the nearest distance don't return useful results in high-d.
— think, then check —

Why high-d is structurally harder:

In 2D space, a query point has clear “neighbors” — points within distance r form a disk of radius r. The expected number of database points within that disk is small (if data is spread uniformly).

In 1024D, the volume of a “ball” of radius r grows as r^1024. To contain enough data points to be the nearest neighbor, r must be only slightly less than the typical pairwise distance. The “nearest” point is only marginally distinguishable from a random point.

From Ch.6: for random unit vectors in d-dim space, mean pairwise distance is √2 with variance ~1/d. As d grows, distances concentrate; the nearest neighbor signal weakens.

Strategies that work in 2D but fail in high-d:

  1. kd-trees: recursively split space along axes. Works in 2-10D. In high-d, the tree degenerates: each split barely separates the data; queries traverse most of the tree.
  2. Hashing into spatial bins: in 2D, divide the plane into a 100×100 grid; each cell is one bin. In 1024D, dividing each axis into 2 cells gives 2¹⁰²⁴ bins — most empty, queries land in one bin that contains nothing or everything.
  3. R-trees / ball trees: hierarchical bounding boxes. In low-d, work well. In high-d, the bounding boxes are nearly always overlapping; pruning fails.
  4. Locality-Sensitive Hashing (LSH): approximate. Works at recall ~0.5 in moderate-d but degrades in high-d. Surpassed by graph and quantization methods.

What works in high-d:

  1. Graph-based methods (HNSW, NSG, DiskANN): the data points themselves form a navigable graph. Search follows edges, not coordinates. Robust to high-d because navigation uses pairwise distances directly.
  2. Quantization-based methods (PQ, RaBitQ): compress each vector into a small code; compare codes (cheaply); refine with exact distance on top candidates.
  3. Hybrid (FAISS IVF + PQ): coarse clustering (IVF) + fine quantization (PQ). Industry standard for billion-scale search.

The fundamental shift: in high-d, you can’t EXCLUDE most of the data via geometric reasoning (the geometry is too uniform). You must EXAMINE most of it but cheaply, using compressed representations and intelligent traversal.

HNSW — small-world graph navigation

Malkov 2018 (“Efficient and Robust Approximate Nearest Neighbor Search using Hierarchical Navigable Small World Graphs”) introduced HNSW. The structure:

HNSW data structure: Multi-layer graph: Layer 0 (bottom): every point + ~M neighbors per point. Layer 1: random subset of points + ~M neighbors per layer. Layer 2: sparser still. ... Layer L (top): ~1-10 points, fully connected. Each point appears in layer i with probability (1/M)^i — exponentially sparser. Search algorithm: 1. Start at the entry point (a fixed top-layer node). 2. Descend through layers: at each layer, greedy search: expand neighbors, keep K closest to query. Move to the next layer using the closest found. 3. At layer 0: maintain a candidate set of EF points; expand until no improvement possible. 4. Return top K of the candidate set. Time complexity: Building: O(N · log N · d · M) — building requires search per insertion. Search: O(log N · d · M · EF) — typically 10-100× faster than O(N · d) brute force. Recall: ~95% recall@10 with M = 16, EF = 100 on typical datasets. Tunable: higher M, EF → higher recall + slower search.

HNSW’s key insight: small-world graphs have the property that any two nodes are connected by O(log N) hops. The multi-layer structure provides “fast lanes” at the top for quick navigation to the right neighborhood, then short-range edges at the bottom for precise refinement.

Now make it run

The kernel implements a simplified single-layer HNSW (the bottom layer’s greedy graph search) on 10K random 64-dim vectors:

hnsw.c — search_graph C · HNSW vs exact kNN
            float best_d = INFINITY;
            for (int j = 0; j < N; j++) {
                if (dist[j] < best_d) { best_d = dist[j]; best = j; }
            }
            if (best != i) g->neighbors[(size_t)i * M + picked++] = best;
            dist[best] = INFINITY;
        }
    }
    free(idx); free(dist);
}

/* Greedy search: maintain a "best-so-far" set of EF candidates ranked by
   distance to q. At each step, pop the closest UNEXPLORED candidate and
   examine its neighbors. Stop when all EF candidates have been explored. */
static int* search_graph(const graph_t* g, const float* data, const float* q,
                          int entry, int* found_count)
{
    int* visited = calloc((size_t)N, sizeof(int));
    /* Sorted candidate list (insertion sort keeps it sorted). */
    int   candidates[EF];
    float cand_dist[EF];
    int   explored_flag[EF] = {0};
    int n_cand = 0;

    /* Initial candidate: entry point */
    cand_dist[0] = l2_dist(q, data + (size_t)entry * D);
    candidates[0] = entry;
    n_cand = 1;
    visited[entry] = 1;

    /* Iterate: pop closest unexplored candidate, expand neighbors. */
    for (int iter = 0; iter < EF * 4; iter++) {
        /* Find closest unexplored. */
        int pop_idx = -1;
        float pop_d = INFINITY;
        for (int i = 0; i < n_cand; i++) {
            if (!explored_flag[i] && cand_dist[i] < pop_d) {
                pop_d = cand_dist[i];
                pop_idx = i;
            }
        }
        if (pop_idx < 0) break;   /* nothing to explore */
        explored_flag[pop_idx] = 1;
        int node = candidates[pop_idx];

        /* Look at neighbors. */
        const int* nbrs = g->neighbors + (size_t)node * M;
        for (int j = 0; j < M; j++) {
            int n = nbrs[j];
            if (visited[n]) continue;
            visited[n] = 1;
            float d = l2_dist(q, data + (size_t)n * D);

            /* Insert into candidate list if it's closer than the worst current. */
            if (n_cand < EF) {
                candidates[n_cand] = n;
                cand_dist[n_cand] = d;
                explored_flag[n_cand] = 0;
                n_cand++;
            } else {
                /* Find the worst, replace if better. */
                int worst_idx = 0;
                float worst_d = cand_dist[0];

Output:

HNSW vs exact k-NN: N=10000, d=64, M=8, EF=64, K=10

Building HNSW graph...
Graph built in 2.08 seconds.

100 queries:
  Exact k-NN: 0.030 seconds  (303.0 μs/query)
  HNSW search: 0.003 seconds  (30.0 μs/query)
  Speedup: 10.1x
  Recall@10: 52.6%  (526 / 1000 nearest neighbors found)

10× speedup at 53% recall. The recall is lower than a full multi-layer HNSW would achieve (real HNSW typically hits 95%+ at similar speedup) — my simplified single-layer version lacks the upper-layer “fast lanes” that quickly route the search to the right neighborhood. The kernel demonstrates the GREEDY GRAPH TRAVERSAL principle; production implementations layer on the hierarchy + tuning.

— think, then check —

Brute force feasibility:

Per-query cost: 100M · 768 · 2 (dot product) = 153 GFLOPs.

On H100 at peak FLOPs (1 PFLOPs): 0.15 ms theoretical. But this is decode-mode (memory-bound), realistic ~10-20% utilisation: ~1-2 ms.

Memory: 100M · 768 · 2 bytes (fp16) = 153 GB. Doesn’t fit on one H100. Need 2× H100s with model parallelism, or use cheaper hardware with more RAM.

Brute force IS feasible for 100M at 1ms with H100-class hardware. But not great economics: serving 1000 QPS would need many GPUs just for similarity search.

HNSW:

Per-query: ~log(N) · d · ef ≈ ~30 · 768 · 100 = 2.3 MFLOPs. Trivial compute.

Memory: indices add ~10-20% overhead over raw vectors. ~180 GB.

Latency: ~0.1-1 ms per query. Recall@10: ~95% with good tuning.

Throughput: thousands of QPS per CPU server.

HNSW wins on cost-per-query. For 100M-scale RAG, this is the right answer.

Alternative: FAISS IVF+PQ:

For datasets approaching 1B+, HNSW’s memory overhead becomes prohibitive (it stores all vectors uncompressed). FAISS IVF+PQ adds quantization: store compressed vectors that take ~10× less memory.

At 100M: HNSW alone is fine.

At 1B+: IVF + PQ + HNSW for refinement is typical.

The decision:

Use HNSW. Industry-standard tools (FAISS, Milvus, Pinecone) provide this. Build offline, query online. Sub-100ms latency easily achievable.

For the rare extreme cases (10B+ vectors, billion QPS, ultra-tight latency budgets), you move to specialised hardware (GPUs for batch queries, AWS Inferentia-style accelerators for compressed-vector serving).

— think, then check —

Layer assignment:

Each point is assigned a layer i with probability (1/M)^i. So:

  • Every point appears in layer 0 (probability 1).
  • ~1/M of points appear in layer 1.
  • ~1/M² of points appear in layer 2.
  • ~1/M^L of points appear in the top layer.

For M = 16 and N = 1M: layer 0 = 1M points, layer 1 = 62K, layer 2 = 4K, layer 3 = 244, layer 4 = 15, layer 5 = 1.

This gives an exponentially-sparse hierarchy.

Connection structure within each layer:

Within layer i, each point is connected to its ~M nearest neighbors among points that are ALSO in layer i. Critically: the nearest neighbors at layer 1 are different from those at layer 0 (because layer 1 has fewer points, so the “nearest” is farther).

Result: layer 0 has dense local connections; layer 1 has sparser, longer-range connections; top layers are sparse highways.

Search dynamics:

  1. Top layer: a few points; do a brute-force-ish search to find the closest.
  2. Descend to layer i-1: use the closest point at layer i as the entry point for layer i-1. Greedy search at layer i-1 (hop to neighbors, find closer ones).
  3. Continue down: each layer’s search has the prior layer’s result as entry. Sparser layers do “long-range” exploration; denser layers do “fine-grained” search.
  4. At layer 0: the entry point is already near the true answer. The search just needs to refine. Maintain candidate set of EF size; expand greedily until no improvement.

Why this is small-world:

A “small world” graph has the property that diameter (longest shortest path) is O(log N).

HNSW’s hierarchy ensures this by construction: at each layer, the points are spread out (random subset). Long-range connections at higher layers act as “shortcuts” that let the search hop large distances quickly.

Formal analysis (Malkov 2018): search visits O(log N) points at each layer, and there are O(log N) layers. Total: O((log N)²) operations per query. For N = 10⁹: ~900 operations — sub-millisecond.

Insertion (build time):

To insert a new point:

  1. Sample its top layer (geometric distribution).
  2. From the entry point, search top-down for the closest layer points.
  3. At each layer the point will appear in: connect it to its M nearest neighbors at that layer.
  4. Bidirectional links: also update the neighbors to include the new point.

Insertion cost: O(log N · d · M). Building the full index for N points: O(N · log N · d · M). For 10⁹ vectors at d = 1024 and M = 16: ~16 hours on a single CPU. Done offline; not a serving-time cost.

Tunable parameters:

  • M: connectivity per node. Higher M = more memory + more accurate search.
  • ef_construction: candidate set during build. Higher = better build quality, slower build.
  • ef_search: candidate set during query. Higher = better recall, slower query.

Production tuning: M = 16-64; ef_construction = 200; ef_search = 50-200 depending on recall target.

Next: §26.2 — Product quantization and binary codes. The complementary approach to HNSW: compress each vector into a small code, do search on codes. Combined with HNSW, the production billion-scale recipe.