Gaussian, ±1, and sparse projections
§7.1 stated the JL lemma with a Gaussian random projection: Pᵢⱼ ∼ N(0, 1/k). Beautiful theorem, but operationally awkward — you need a Gaussian RNG, you need 32 bits per matrix entry, and the projection step is a dense matmul of float entries. Achlioptas (2003) proved the entries can be ±1/√k uniformly — no multiplications needed in the projection, just additions and subtractions. Li, Hastie & Church (2006) made the projection sparse: 2/3 of the entries can be zero, with the surviving ones scaled up to compensate. Both deliver the same distortion bound as the Gaussian. The lesson — and the reason every production JL kernel uses one of the discrete forms — is that the smoothness of the random entries doesn’t matter; only their independence and second moment do.
Achlioptas’s ±1 projection
Achlioptas, “Database-friendly random projections: Johnson-Lindenstrauss with binary coins” (PODS 2001, JCSS 2003). Replace the Gaussian distribution over projection entries with a discrete one:
The proof tracks the Gaussian proof exactly, with the second moment of a Bernoulli ±1 (= 1) substituting for the Gaussian second moment. Achlioptas’s lemma gives:
The operational difference is dramatic. Computing (Px)_i = Σⱼ ±x_j/√k requires only additions and subtractions (the 1/√k scaling is one multiply per output, amortised). On hardware, that’s:
- No floating-point multipliers needed for the projection. The entire projection is sums-of-signed-floats.
- The projection matrix fits in 1 bit per entry (sign) plus a single global scale. Memory footprint drops 32× from the Gaussian version.
- Hardware sketching circuits (FPGAs, custom ASICs) can build ±1 projections directly without floating-point units.
Sparse projections — Li 2006
Li, Hastie & Church, “Very Sparse Random Projections” (KDD 2006). The next step: most of the projection entries can be zero. The distribution becomes:
The same JL distortion bound holds. The savings:
- Faster projection. Only the nonzero entries contribute. At s = 3, you do 1/3 the additions of the dense ±1 version.
- Sparser matrix storage. Hash-map style: only nonzero positions stored.
- Streaming-friendly. Each input dimension affects only ~1/s of the output dimensions, so updates are local.
Modern sketching libraries (Apache Spark’s MinHash, the various count-sketch / count-min variants) use Li’s sparse form as the default.
Click between the three. The histograms are statistically indistinguishable. Same JL bound, vastly different cost to compute.
The projection (Px)_i = Σⱼ P_ij · x_j becomes Σⱼ (±1/√k) · x_j = (1/√k) Σⱼ ±x_j — entirely sums of signed inputs. No multiplications needed inside the projection (the 1/√k normalisation is one multiply per output, amortised).
Operational consequences:
- Hardware sketching circuits (FPGAs, custom ASICs) can implement the projection with adders only — no float multiplier units.
- The projection matrix is 1 bit per entry (sign) plus one global scale. 32× memory reduction vs Gaussian.
- For SIMD code, the projection becomes a stream of additions and subtractions — even friendlier than the FMA-based dense matmul.
Achlioptas’s paper is the canonical “one constant tweak that changes everything” result in numerical linear algebra. Modern production JL kernels almost universally use ±1 or sparser variants; the Gaussian form survives mainly in textbook proofs.
Now make it run
The kernel runs all three schemes on the same N=80, d=512, k=64 data and reports the empirical distortion statistics. They are statistically identical:
Three random projections compared
N=80 points, d=512 → k=64
scheme mean std min ratio max ratio max |1−r|
Gaussian 1.0029 0.0883 0.7321 1.2734 0.2734
Achlioptas ±1 0.9960 0.0860 0.7275 1.2795 0.2795
Sparse Achlioptas 0.9900 0.0839 0.7349 1.3053 0.3053
Standard deviation of the distortion ratio is 0.084–0.088 across all three schemes — essentially identical. Worst-case distortion ~0.28–0.30 in all three. The sparse variant uses 1/3 the operations of the dense ±1 version with no measurable loss.
}
int main(void) {
const int N = 80, d = 512, k = 64;
double* X = malloc(sizeof(double) * (size_t)N * d);
for (int i = 0; i < N * d; i++) X[i] = normal();
const char* names[] = { "Gaussian", "Achlioptas ±1", "Sparse Achlioptas" };
Scheme schemes[] = { GAUSSIAN, PM1, SPARSE };
printf("Three random projections compared\n");
printf("N=%d points, d=%d → k=%d\n\n", N, d, k);
printf("%-22s %-12s %-12s %-12s %-12s %-14s\n",
"scheme", "mean", "std", "min ratio", "max ratio", "max |1−r|");
for (int si = 0; si < 3; si++) {
double* P = malloc(sizeof(double) * (size_t)k * d);
double* Y = malloc(sizeof(double) * (size_t)N * k);
fill_projection(P, k, d, schemes[si]);
project(X, N, d, P, k, Y);
double sum = 0, sum2 = 0, lo = INFINITY, hi = 0, maxdev = 0;
long pairs = 0;
for (int i = 0; i < N; i++)
for (int j = i + 1; j < N; j++) {
double dX = sqrt(dist2(X + i * d, X + j * d, d));
double dY = sqrt(dist2(Y + i * k, Y + j * k, k));
double r = dY / dX;
sum += r; sum2 += r * r;
if (r < lo) lo = r;
if (r > hi) hi = r;
if (fabs(r - 1.0) > maxdev) maxdev = fabs(r - 1.0);
pairs += 1;
}
double mean = sum / pairs;Why does discrete work as well as continuous?
The high-level intuition: JL only requires the projection’s rows to be high-dimensional random vectors with bounded second moment. The proof of the JL distortion bound doesn’t actually care that the entries are Gaussian — it cares that each row of P, when dotted with a fixed test vector, has the right mean (0) and variance (≈ ‖x‖²/k) and a sub-Gaussian tail.
Achlioptas: ±1 entries have variance 1, so each row dotted with x has variance ‖x‖²/k — same as the Gaussian. The sub-Gaussian tail comes from a Hoeffding-style bound on sums of bounded variables.
Sparse: the surviving entries have higher variance (s/k instead of 1/k) to compensate for the zeros. Total contribution per row is still N(0, ‖x‖²/k) in distribution by CLT.
This is one of the cleanest “the central limit theorem rescues you” stories in ML. Whatever your projection’s entry distribution, as long as it has bounded second moment and the right variance, sums of many of them look Gaussian by Ch.5 §2’s CLT — and the JL bound applies. The Gaussian entries are nice for analysis; the discrete entries are nice for computation; they’re indistinguishable in their preservation properties.
Dense Gaussian: for each of 10M vectors, one matmul P · x with P of shape (256, 4096). Per vector: 256 · 4096 = ~10⁶ multiply-adds. Total: ~10¹³ multiply-adds.
Sparse Achlioptas (s = 3): only 1/3 of P’s entries are nonzero. Per vector: 256 · 4096 / 3 ≈ 350K operations, AND each operation is an add/subtract not a multiply-add. Total: ~3.5 × 10¹² adds — 3× fewer operations, each ~2× cheaper at the silicon level (adds are cheaper than multiply-adds). Combined ~5–6× speedup.
Plus storage: dense P needs 256 · 4096 · 4 = 4 MB (float32). Sparse P stores only nonzero entries with indices — about (256·4096/3) · 8 = ~2.7 MB (4 bytes index + 4 bytes value), but with 1-bit signs and a global scale it can drop to 256·4096/3/8 = 43 KB. ~100× smaller projection matrix.
For production-scale vector search, the sparse variant is the default for these reasons; the Gaussian version mostly survives in textbook proofs and small-scale demos.
The JL proof’s key step is showing that ‖Px‖² concentrates near ‖x‖². Each coordinate (Px)_i is a sum over j of P_ij · x_j — a linear combination of d independent random variables (the P_ij at fixed i).
By the CLT, this linear combination is approximately Gaussian for any base distribution P_ij with bounded variance and reasonable independence — the CLT doesn’t care whether P_ij is itself Gaussian. As long as:
- E[P_ij] = 0 — the projection is unbiased in expectation,
- Var(P_ij) = 1/k — the variance is normalized so E[‖Px‖²] = ‖x‖²,
- P_ij are independent across i, j — no correlations,
then ‖Px‖² has approximately the same distribution (chi-squared scaled by ‖x‖²/k) regardless of whether P_ij is Gaussian, ±1, or sparse. The same sub-Gaussian tail bound holds.
So the JL bound is robust under any “small, independent, mean-zero, fixed-variance” replacement of the projection entries. Achlioptas’s ±1 satisfies this (variance 1 / k after scaling). Li’s sparse satisfies it (entries have higher variance to compensate for the zeros, but total per-coordinate variance still equals ‖x‖²/k). This is one of the cleanest applications of the CLT to a constructive algorithm in numerical linear algebra — and the engineering reason discrete projections dominate production code.
END OF CH.7 §2 — Gaussian, ±1, and sparse projections.
Built: AchlioptasCompare viz (tabs between the three schemes, same histogram every time); achlioptas.c empirically confirms statistical equivalence. Three recall items: easy (the operational ±1 advantage), medium (sparse-vs-dense cost arithmetic), hard (the CLT-based reason discrete projections work).
Coming next: §7.3 — Fast JL via Hadamard transform. The structured rotation that drops projection cost to O(d log d) and closes the loop to TurboQuant (Ch.25).