Gaussian, ±1, and sparse projections

Section 7.2

Gaussian, ±1, and sparse projections

§7.1 stated the JL lemma with a Gaussian random projection: Pᵢⱼ ∼ N(0, 1/k). Beautiful theorem, but operationally awkward — you need a Gaussian RNG, you need 32 bits per matrix entry, and the projection step is a dense matmul of float entries. Achlioptas (2003) proved the entries can be ±1/√k uniformly — no multiplications needed in the projection, just additions and subtractions. Li, Hastie & Church (2006) made the projection sparse: 2/3 of the entries can be zero, with the surviving ones scaled up to compensate. Both deliver the same distortion bound as the Gaussian. The lesson — and the reason every production JL kernel uses one of the discrete forms — is that the smoothness of the random entries doesn’t matter; only their independence and second moment do.

Achlioptas’s ±1 projection

Achlioptas, “Database-friendly random projections: Johnson-Lindenstrauss with binary coins” (PODS 2001, JCSS 2003). Replace the Gaussian distribution over projection entries with a discrete one:

Achlioptas projection: Pᵢⱼ ∈ { +1/√k, −1/√k } uniformly, independent (no multiplications during the projection — every entry just contributes ±x_j / √k)

The proof tracks the Gaussian proof exactly, with the second moment of a Bernoulli ±1 (= 1) substituting for the Gaussian second moment. Achlioptas’s lemma theorem The ±1/√k random projection (entries uniform on {+1/√k, −1/√k}) gives the same JL distortion bound as the Gaussian projection. Discovered by Achlioptas 2001/2003. Operationally crucial because the projection becomes 'sum/subtract' instead of 'multiply-add' — entirely integer-friendly hardware path. Then → now: Achlioptas's paper is one of those rare 'just one constant tweak' results that completely changes the practical landscape. By 2010 every JL kernel had switched to ±1; by 2020 production sketching libraries (count-sketch, AMS, etc.) used the same ±1 trick. The Gaussian version survives mostly in textbook proofs. gives:

Theorem (Achlioptas 2003): the ±1 projection above satisfies the same JL distortion bound as the Gaussian, with the same k = O(log N / ε²) and the same exponential tail probability.

The operational difference is dramatic. Computing (Px)_i = Σⱼ ±x_j/√k requires only additions and subtractions (the 1/√k scaling is one multiply per output, amortised). On hardware, that’s:

No floating-point multipliers needed for the projection. The entire projection is sums-of-signed-floats.
The projection matrix fits in 1 bit per entry (sign) plus a single global scale. Memory footprint drops 32× from the Gaussian version.
Hardware sketching circuits (FPGAs, custom ASICs) can build ±1 projections directly without floating-point units.

Sparse projections — Li 2006

Li, Hastie & Church, “Very Sparse Random Projections” (KDD 2006). The next step: most of the projection entries can be zero. The distribution becomes:

Sparse Achlioptas: Pᵢⱼ = { +√(s/k) with prob 1/(2s) 0 with prob 1 − 1/s −√(s/k) with prob 1/(2s) } The "sparsity parameter" s controls how many entries are zero. For s = 3: 2/3 of entries are zero, surviving entries scaled by √3. For s = √d / log(d): even sparser, with O(√d) nonzero entries per row.

The same JL distortion bound holds. The savings:

Faster projection. Only the nonzero entries contribute. At s = 3, you do 1/3 the additions of the dense ±1 version.
Sparser matrix storage. Hash-map style: only nonzero positions stored.
Streaming-friendly. Each input dimension affects only ~1/s of the output dimensions, so updates are local.

Modern sketching libraries (Apache Spark’s MinHash, the various count-sketch / count-min variants) use Li’s sparse form as the default.

projected dim k k = 64

P_ij ~ N(0, 1/k) — the classical Johnson-Lindenstrauss projection

mean ratio 0.9996

std ratio 0.0841

predicted 1/√k 0.1250

max |1−r| 0.2988

d 400

N points 50

Flip between the three schemes. The histograms are statistically indistinguishable — Achlioptas's {±1} projection (no multiplies!) and Li's sparse variant (most entries zero) both preserve distances as well as the Gaussian. The smoothness of the entries doesn't matter; the independence does. That's why production code uses the discrete versions whenever possible — they're cheaper to apply and friendlier to integer hardware.

Pairwise-distance-ratio histograms for three random-projection schemes, on the same N = 50 points in d = 400. Standard deviation and worst-case distortion are essentially identical. Achlioptas (2003) and Li (2006) proved this rigorously; the result is one of the cleanest "the simple construction is also the optimal one" results in numerical linear algebra.

Click between the three. The histograms are statistically indistinguishable. Same JL bound, vastly different cost to compute.

— think, then check —

The projection (Px)_i = Σⱼ P_ij · x_j becomes Σⱼ (±1/√k) · x_j = (1/√k) Σⱼ ±x_j — entirely sums of signed inputs. No multiplications needed inside the projection (the 1/√k normalisation is one multiply per output, amortised).

Operational consequences:

Hardware sketching circuits (FPGAs, custom ASICs) can implement the projection with adders only — no float multiplier units.
The projection matrix is 1 bit per entry (sign) plus one global scale. 32× memory reduction vs Gaussian.
For SIMD code, the projection becomes a stream of additions and subtractions — even friendlier than the FMA-based dense matmul.

Achlioptas’s paper is the canonical “one constant tweak that changes everything” result in numerical linear algebra. Modern production JL kernels almost universally use ±1 or sparser variants; the Gaussian form survives mainly in textbook proofs.

↳ §7.2 Achlioptas

Now make it run

The kernel runs all three schemes on the same N=80, d=512, k=64 data and reports the empirical distortion statistics. They are statistically identical:

Three random projections compared
N=80 points, d=512 → k=64

scheme                 mean       std        min ratio   max ratio   max |1−r|
Gaussian               1.0029     0.0883     0.7321      1.2734      0.2734
Achlioptas ±1          0.9960     0.0860     0.7275      1.2795      0.2795
Sparse Achlioptas      0.9900     0.0839     0.7349      1.3053      0.3053

Standard deviation of the distortion ratio is 0.084–0.088 across all three schemes — essentially identical. Worst-case distortion ~0.28–0.30 in all three. The sparse variant uses 1/3 the operations of the dense ±1 version with no measurable loss.

achlioptas.c (loop) C · three schemes side by side

}

int main(void) {
    const int N = 80, d = 512, k = 64;
    double* X = malloc(sizeof(double) * (size_t)N * d);
    for (int i = 0; i < N * d; i++) X[i] = normal();

    const char* names[] = { "Gaussian", "Achlioptas ±1", "Sparse Achlioptas" };
    Scheme schemes[] = { GAUSSIAN, PM1, SPARSE };

    printf("Three random projections compared\n");
    printf("N=%d points, d=%d → k=%d\n\n", N, d, k);
    printf("%-22s %-12s %-12s %-12s %-12s %-14s\n",
           "scheme", "mean", "std", "min ratio", "max ratio", "max |1−r|");

    for (int si = 0; si < 3; si++) {
        double* P = malloc(sizeof(double) * (size_t)k * d);
        double* Y = malloc(sizeof(double) * (size_t)N * k);
        fill_projection(P, k, d, schemes[si]);
        project(X, N, d, P, k, Y);

        double sum = 0, sum2 = 0, lo = INFINITY, hi = 0, maxdev = 0;
        long pairs = 0;
        for (int i = 0; i < N; i++)
            for (int j = i + 1; j < N; j++) {
                double dX = sqrt(dist2(X + i * d, X + j * d, d));
                double dY = sqrt(dist2(Y + i * k, Y + j * k, k));
                double r = dY / dX;
                sum += r; sum2 += r * r;
                if (r < lo) lo = r;
                if (r > hi) hi = r;
                if (fabs(r - 1.0) > maxdev) maxdev = fabs(r - 1.0);
                pairs += 1;
            }
        double mean = sum / pairs;

Why does discrete work as well as continuous?

The high-level intuition: JL only requires the projection’s rows to be high-dimensional random vectors with bounded second moment. The proof of the JL distortion bound doesn’t actually care that the entries are Gaussian — it cares that each row of P, when dotted with a fixed test vector, has the right mean (0) and variance (≈ ‖x‖²/k) and a sub-Gaussian tail.

Achlioptas: ±1 entries have variance 1, so each row dotted with x has variance ‖x‖²/k — same as the Gaussian. The sub-Gaussian tail comes from a Hoeffding-style bound on sums of bounded variables.

Sparse: the surviving entries have higher variance (s/k instead of 1/k) to compensate for the zeros. Total contribution per row is still N(0, ‖x‖²/k) in distribution by CLT.

This is one of the cleanest “the central limit theorem rescues you” stories in ML. Whatever your projection’s entry distribution, as long as it has bounded second moment and the right variance, sums of many of them look Gaussian by Ch.5 §2’s CLT — and the JL bound applies. The Gaussian entries are nice for analysis; the discrete entries are nice for computation; they’re indistinguishable in their preservation properties.

— think, then check —

Dense Gaussian: for each of 10M vectors, one matmul P · x with P of shape (256, 4096). Per vector: 256 · 4096 = ~10⁶ multiply-adds. Total: ~10¹³ multiply-adds.

Sparse Achlioptas (s = 3): only 1/3 of P’s entries are nonzero. Per vector: 256 · 4096 / 3 ≈ 350K operations, AND each operation is an add/subtract not a multiply-add. Total: ~3.5 × 10¹² adds — 3× fewer operations, each ~2× cheaper at the silicon level (adds are cheaper than multiply-adds). Combined ~5–6× speedup.

Plus storage: dense P needs 256 · 4096 · 4 = 4 MB (float32). Sparse P stores only nonzero entries with indices — about (256·4096/3) · 8 = ~2.7 MB (4 bytes index + 4 bytes value), but with 1-bit signs and a global scale it can drop to 256·4096/3/8 = 43 KB. ~100× smaller projection matrix.

For production-scale vector search, the sparse variant is the default for these reasons; the Gaussian version mostly survives in textbook proofs and small-scale demos.

↳ §7.2 sparse + §7.1 cost analysis

— think, then check —

The JL proof’s key step is showing that ‖Px‖² concentrates near ‖x‖². Each coordinate (Px)_i is a sum over j of P_ij · x_j — a linear combination of d independent random variables (the P_ij at fixed i).

By the CLT, this linear combination is approximately Gaussian for any base distribution P_ij with bounded variance and reasonable independence — the CLT doesn’t care whether P_ij is itself Gaussian. As long as:

E[P_ij] = 0 — the projection is unbiased in expectation,
Var(P_ij) = 1/k — the variance is normalized so E[‖Px‖²] = ‖x‖²,
P_ij are independent across i, j — no correlations,

then ‖Px‖² has approximately the same distribution (chi-squared scaled by ‖x‖²/k) regardless of whether P_ij is Gaussian, ±1, or sparse. The same sub-Gaussian tail bound holds.

So the JL bound is robust under any “small, independent, mean-zero, fixed-variance” replacement of the projection entries. Achlioptas’s ±1 satisfies this (variance 1 / k after scaling). Li’s sparse satisfies it (entries have higher variance to compensate for the zeros, but total per-coordinate variance still equals ‖x‖²/k). This is one of the cleanest applications of the CLT to a constructive algorithm in numerical linear algebra — and the engineering reason discrete projections dominate production code.

↳ §7.2 + Ch.5 §2 CLT

END OF CH.7 §2 — Gaussian, ±1, and sparse projections.
Built: AchlioptasCompare viz (tabs between the three schemes, same histogram every time); achlioptas.c empirically confirms statistical equivalence. Three recall items: easy (the operational ±1 advantage), medium (sparse-vs-dense cost arithmetic), hard (the CLT-based reason discrete projections work).
Coming next: §7.3 — Fast JL via Hadamard transform. The structured rotation that drops projection cost to O(d log d) and closes the loop to TurboQuant (Ch.25).