The Johnson–Lindenstrauss lemma
Ch.6 §1 told the bad story: in high dimensions, distance is broken. Ch.6 §2 told the good story: high-dimensional Gaussians live on a thin shell, and smooth functions of many independent variables concentrate. Now we extract the most useful constructive consequence of the good story: you can take any high-D dataset, project it down through a random linear map to a much lower dimension, and pairwise distances are approximately preserved. The number of projected dimensions you need scales with log N (number of points) and 1/ε² (precision) — but is independent of the original dimension d. That dimension-free claim, due to Johnson & Lindenstrauss 1984, is the entire reason random-projection methods exist. Vector search (Ch.25), QJL (the JL-corrected TurboQuant variant), and many other compression schemes in modern ML are direct applications.
The statement
The JL lemma, in its modern (Achlioptas) constructive form:
Read the conclusion line out loud: the squared distance between any pair of original points, scaled into the projected space, is within a factor of (1 ± ε) of the original squared distance. No distortion exceeds ε. The projection is near-isometric on the dataset.
Three things to register about that k = O(log N / ε²) formula:
- It depends on N, not d. A million points in 10,000 dimensions can be projected to k ≈ 14 / ε² dimensions (since \log 10⁶ ≈ 14). Drop ε to 0.1 and you need ~1,400 projected dims. Even one billion points needs only ~2,100 dims at ε = 0.1. The original d is completely irrelevant to the result.
- It’s logarithmic in N. Doubling the number of points only adds \log 2 / ε² dimensions. A dataset that’s 1000× larger only needs 10/ε² more dimensions.
- The 1/ε² dependence is sharp. Halving the distortion budget costs 4× the dimensions. This is the standard ε² rate seen everywhere concentration-of-measure arguments produce bounds.
Drag d wildly while leaving k alone. The histogram of pairwise-distance ratios stays put — exactly the dimension-free claim. Drag k up and the histogram tightens. The empirical std of the ratio distribution matches 1/√k closely.
For N points in ℝᵈ and any ε > 0, there’s a linear map P : ℝᵈ → ℝᵏ with k = O(log N / ε²) such that all pairwise squared distances are preserved within a (1 ± ε) factor. A random Gaussian or ±1 projection works with high probability.
k depends on N (logarithmically) and ε (as 1/ε²). It is independent of d. That’s the surprising and useful part: you can project 100,000-dim data to 200-dim and preserve pairwise distances to ~5%, regardless of how large the original dimension was.
Why this works — the high-D shell, applied
The proof is one calculation away from Ch.6 §2’s concentration of measure. Here it is in three lines, for a single fixed vector x ∈ ℝᵈ.
Let P ∈ ℝᵏˣᵈ have i.i.d. entries P_ij ∼ N(0, 1/k). Then Px ∈ ℝᵏ. The i-th coordinate of Px is a linear combination of standard normals — itself a normal:
So each coordinate of the projection is a Gaussian with variance proportional to ‖x‖² / k. The squared norm of the projection is a sum of squares of these:
This is a chi-squared random variable scaled by ‖x‖²/k. Its mean is exactly ‖x‖² (the projection preserves squared norm on average) and its variance is 2‖x‖⁴/k. By a standard sub-Gaussian bound:
The probability of more than ε distortion shrinks exponentially in k. To get all N(N−1)/2 ≈ N²/2 pairs preserved simultaneously, union-bound:
That last line is the JL lemma. The structure is exactly Ch.6 §2’s “smooth functions of many independent variables concentrate” — the smooth function here is the squared norm of Px, the many independent variables are P’s entries, and concentration kicks in fast enough that all pairs survive the union bound.
The bound from Ch.6 §3 also explains why “random pairs of unit vectors are nearly orthogonal” — same calculation in disguise. Both are concentration-of-measure results about high-D Gaussians, with the JL bound being a uniform-over-pairs version. (Vempala 2004, “The Random Projection Method,” AMS, ties them together formally.)
Now make it run
The kernel projects 100 random Gaussian points from d ∈ {32, 128, 512, 2048} down to k = 64 and measures the empirical distortion. The dimension-free claim shows up cleanly:
Johnson-Lindenstrauss distortion (N=100 points, k=64 projected dims)
d mean ratio std ratio min ratio max ratio max |1−r|
32 0.9815 0.0916 0.6939 1.2850 0.3061
128 1.0009 0.0872 0.7194 1.3072 0.3072
512 1.0016 0.0893 0.6825 1.3619 0.3619
2048 1.0150 0.0971 0.7041 1.3735 0.3735
Predicted std ≈ 1/√k = 0.1250 for k = 64.
The std hovers at ~0.09 across a 64× change in d. The dimension-free claim is empirically airtight. The worst-case distortion is around 35% — acceptable for many applications, and would shrink to ~10% if k were boosted to 256.
double s = 0;
for (int i = 0; i < n; i++) { double v = a[i] - b[i]; s += v * v; }
return s;
}
int main(void) {
const int N = 100;
int ds[] = { 32, 128, 512, 2048 };
int k = 64;
printf("Johnson-Lindenstrauss distortion (N=%d points, k=%d projected dims)\n", N, k);
printf("Sweep d to check the dimension-free claim:\n\n");
printf("%-8s %-14s %-14s %-14s %-14s %-14s\n",
"d", "mean ratio", "std ratio", "min ratio", "max ratio", "max |1−r|");
for (size_t di = 0; di < sizeof(ds)/sizeof(ds[0]); di++) {
int d = ds[di];
double* X = malloc(sizeof(double) * (size_t)N * d);
double* Y = malloc(sizeof(double) * (size_t)N * k);
for (int i = 0; i < N * d; i++) X[i] = normal();
random_project(X, N, d, Y, k);
double sum = 0, sum2 = 0, lo = INFINITY, hi = 0, maxdev = 0;
long pairs = 0;
for (int i = 0; i < N; i++)
for (int j = i + 1; j < N; j++) {
double dX = sqrt(dist2(X + i * d, X + j * d, d));
double dY = sqrt(dist2(Y + i * k, Y + j * k, k));
double r = dY / dX;
sum += r;
sum2 += r * r;Engineer B is right. The JL lemma’s projected dimension k depends on the number of points N (logarithmically) and the precision tolerance ε (as 1/ε²) — but is completely independent of the original dimension d.
For a typical vector-search workload with N = 10 million points and ε = 0.1: k ≈ log(10⁷) / 0.01 ≈ 1,600 dimensions. So 4096-dim embeddings can be projected to ~1,600 dims and pairwise distances are preserved within 10%. Same projected dim if the originals were 100,000-dim — d doesn’t appear in the bound.
Operationally: random projection is the cheapest, simplest, most universally-applicable dimensionality reduction. The structure of the data doesn’t matter, the original dimension doesn’t matter — only N and ε. This is what makes it the preprocessing step inside QJL, TurboQuant (Ch.25), and many other compression schemes.
Why this is “the rotation result of Ch.6, made constructive”
Ch.6 §3 established that random unit vectors in high dimensions are nearly orthogonal — that their pairwise dot products concentrate sharply at zero with width 1/√d. JL is the constructive version of that fact: a random linear map preserves pairwise dot products and distances with high probability. The same concentration that made the geometry of random vectors predictable in §6.3 is now harnessed as an algorithm.
Stand them next to each other:
- Ch.2 §3 — Orthogonal Q preserves dot products exactly. But orthogonal Q is a square matrix (d × d); no dimension reduction.
- Ch.6 §3 — Random pairs of unit vectors have near-zero dot product. Descriptive, not algorithmic.
- Ch.7 §1 — Random k × d projection (k ≪ d) preserves dot products approximately with high probability. Approximate + dimension-reducing + algorithmic. This is the JL form.
Each is a fact about how randomness and orthogonality interact in high dimensions. The JL form is the one with the most engineering punch — it says “compress by a random matrix, lose almost nothing.”
Why not PCA?
A reasonable question. PCA projects onto the top-k principal components — the directions of largest variance in the data. By construction, PCA is optimal in the sense of minimising reconstruction error (Frobenius norm of the difference). Why not always use PCA?
Three reasons JL is often the better choice:
| JL random projection | PCA | |
|---|---|---|
| Cost to construct | O(dk) (just sample P) | O(d² N) (covariance + eigendecomposition) |
| Data dependence | None (use the same P for any data) | Need the data first |
| New data | Same P works | Need re-decomposition (or projection onto stale axes) |
| Distortion bound | (1 ± ε) on every pair, guaranteed | Optimal in average squared error, but no per-pair guarantee |
| Streaming-friendly | Yes (P is fixed) | No (covariance changes as data comes in) |
| Adversarial robustness | Random projection is unpredictable to adversary | Top-k directions are known/learnable |
For applications where you need a guaranteed worst-case bound across all pairs (ANN search, fairness), JL wins. For applications where average reconstruction error matters (compression of a known dataset), PCA wins. In modern ML, JL-style random projections dominate the inference path (TurboQuant, LSH-based ANN, sketching for kernel methods); PCA dominates the data-preprocessing path (whitening features before training, dimension reduction for visualisation).
(1) Construction cost. PCA needs the covariance of the data, O(d²N) flops, and an eigendecomposition. For N = 10⁹ vectors of d = 4096, that’s ~10¹⁶ flops just to set up the projection. JL: sample a random k×d matrix once — O(dk) ≈ 10⁶ flops for k = 256. Eight orders of magnitude difference.
(2) Freshness. PCA’s principal components depend on the data; if the data distribution drifts (new categories, seasonal patterns), PCA must be redecomposed or it becomes stale. JL is data-independent — the same random P works for any data, forever. No re-fitting.
(3) Worst-case guarantee. PCA minimises average reconstruction error in the Frobenius norm but gives no per-pair distance guarantee — some pairs could be highly distorted even when average reconstruction looks great. JL gives a uniform (1 ± ε) bound on every pair with probability ≥ 1 − N⁻². For ANN search where worst-case rankings matter, this guarantee is what you actually need.
For this workload — large streaming N, distance-preservation guarantees needed, fast setup required — JL is decisively better. PCA fits a different niche (fixed-dataset compression, visualisation, whitening). The two methods aren’t competing; they’re answering different questions about high-D data.
END OF CH.7 §1 — The Johnson-Lindenstrauss lemma.
Built: JLDistortion viz (slide d, watch the distortion histogram stay put; slide k, watch it tighten); jl_distortion.c demonstrates the dimension-free claim across d = 32 to d = 2048 at fixed k = 64. Three recall items: easy (statement), medium (the dimension-free claim and a concrete engineer-vs-engineer scenario), hard (JL vs PCA trade-offs for streaming workloads).
Coming next: §7.2 — The proof in detail and the Achlioptas ±1 construction. §7.3 will cover Fast JL via Hadamard, closing the loop to TurboQuant.