The Gaussian and the CLT

Section 5.2

Across the dozens of distributions in any probability text, one shows up obsessively in ML: the Gaussian, also called the normal. It's the default you reach for when you know only a mean and a variance; it's the limit law of standardised sums of independent quantities; it's roughly what your pre-activations look like after good initialisation. Two structural facts explain why: (1) sums of independent Gaussians are Gaussian, and (2) the Central Limit Theorem says that standardised sums of independent finite-variance variables converge to a Gaussian. Combine those and you get a universal attractor: anywhere a quantity is the sum of many small independent contributions, its distribution will be approximately Gaussian-shaped regardless of where the contributions came from. This is also why N(0, 1) is the reference distribution: it is the fixed point of the sum-then-standardise operation.

The normal distribution

N(μ, σ²) is the Gaussian distribution with mean μ and variance σ²: the famous bell curve. The standard normal is N(0, 1). It dates to Gauss's 'normal curve of errors', circa 1809. Modern ML loves it because (a) it's the maximum-entropy distribution for fixed mean and variance, the least biased guess, (b) it's stable under summation and linear maps, and (c) the CLT makes it a universal attractor. N(μ, σ²) has density:

p(x) = ( 1 / (σ √(2π)) ) · exp( −(x − μ)² / (2 σ²) )

E[X] = μ,  Var(X) = σ²

The standard normal N(0, 1) is the reference: any other normal Z ∼ N(μ, σ²) can be standardised by (Z − μ)/σ ∼ N(0, 1), making distributional comparisons direct. That’s the same trick you do when you compute a z-score.

Three structural properties make it special:

Maximum entropy for fixed variance. Among all distributions with mean μ and variance σ², the Gaussian has the highest entropy. In information-theoretic terms, the Gaussian is the least committal assumption you can make about a random quantity whose mean and variance you’ve measured. (Cover & Thomas, “Elements of Information Theory,” ch. 8.) Whenever you’re tempted to say “I’ll model this as Gaussian because I don’t know what else to do,” that’s actually a principled choice — it’s the maximum-entropy prior given what you know.
Stable under summation. If X ∼ N(μ_X, σ_X²) and Y ∼ N(μ_Y, σ_Y²) are independent, then X + Y ∼ N(μ_X + μ_Y, σ_X² + σ_Y²). The means add, the variances add — and the sum is still Gaussian. Few families have this closure property; most distributions don’t preserve their family under addition.
Linear maps stay Gaussian. If X ∼ N(μ, Σ) in multiple dimensions and A is a matrix (a linear function — Ch.2 §1), then AX ∼ N(Aμ, AΣAᵀ). Apply any linear function to a Gaussian, get a Gaussian. The covariance transforms by the conjugation formula AΣAᵀ — which we’ll return to in §5.3 when discussing isotropic vs anisotropic.
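The conjugation formula AΣAᵀ can be checked by hand in two dimensions. The sketch below assumes 2×2 matrices stored row-major in flat arrays; the name transform_cov2 is hypothetical. Choosing A so that its first row is (1, 1) recovers the summation rule as a special case: the top-left entry of the result is Var(X + Y).

```c
#include <assert.h>

/* out = A * S * A^T for 2x2 matrices stored row-major as double[4].
   `out` must not alias A or S. (Hypothetical helper for illustration.) */
void transform_cov2(const double *A, const double *S, double *out) {
    assert(A && S && out && out != A && out != S);
    double AS[4];
    /* AS = A * S */
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            AS[2 * i + j] = A[2 * i + 0] * S[0 + j] + A[2 * i + 1] * S[2 + j];
    /* out = AS * A^T, i.e. out[i][j] = sum_k AS[i][k] * A[j][k] */
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            out[2 * i + j] = AS[2 * i + 0] * A[2 * j + 0]
                           + AS[2 * i + 1] * A[2 * j + 1];
}
```

With independent X and Y of variances 4 and 9 (Σ = diag(4, 9)) and A = [[1, 1], [0, 1]], the output covariance is [[13, 9], [9, 9]]: the variance of X + Y is 4 + 9 = 13, and its covariance with Y is 9.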

Why this matters for ML — the universal-attractor argument

A neural network unit's pre-activation at a given layer is:

activation_i = Σⱼ Wᵢⱼ · input_j + bias_i (before nonlinearity)

A sum over many inputs, weighted by the matrix W. If the inputs are roughly independent and each weighted contribution is small, the sum will look Gaussian by the CLT, even if the individual input_j distributions are wild. This is why Xavier and He initialisation work: they pick the weight variance so that the sum Σⱼ Wᵢⱼ · input_j keeps the variance of its inputs (unit variance for unit-variance inputs), by the variance-of-sums-of-independents formula from §5.1. The CLT then takes care of making the activations Gaussian-shaped.

Xavier initialisation (Glorot & Bengio, 2010) is standard practice: it picks weight variance Var(W) = 2/(fan_in + fan_out) so that activations and gradients keep approximately unit variance through every layer at init. The argument rests on summing many small independent contributions: the variance rule fixes the scale and the CLT fixes the shape. Before Xavier (~2010), networks deeper than ~10 layers were hard to train because activations either vanished or exploded; this variance-preserving choice was the unlock that made training deep nets routine. Related schemes (He, LeCun) use the same argument with different nonlinearity assumptions.
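To see why the weight variance matters so much, it helps to track the variance through a stack of layers. The sketch below is a deliberately simplified model: it assumes zero-mean independent weights and inputs, ignores nonlinearities and biases, and so multiplies the variance by fan_in · Var(W) per layer. The function name is hypothetical.

```c
#include <assert.h>

/* Variance of a pre-activation after `depth` linear layers, under the
   simplifying assumptions above: each layer scales the variance by
   fan_in * var_w. (Illustrative model, not a full forward pass.) */
double variance_after_layers(double var_in, int fan_in, double var_w, int depth) {
    assert(fan_in > 0 && depth >= 0);
    double v = var_in;
    for (int l = 0; l < depth; l++)
        v *= (double)fan_in * var_w;
    return v;
}
```

With fan_in = 256, choosing Var(W) = 1/256 keeps the variance at 1 for any depth. Halving that weight variance shrinks the signal by 2¹⁰ ≈ 1000× after 10 layers, and doubling it inflates the signal by the same factor: the vanishing and exploding regimes described above.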

The Central Limit Theorem

Stated formally:

Let X₁, X₂, …, Xₙ be i.i.d. random variables with E[Xᵢ] = μ and Var(Xᵢ) = σ² < ∞. Then ( (Σᵢ Xᵢ) − nμ ) / (σ √n) → N(0, 1) in distribution, as n → ∞. Equivalently: the sample mean x̄ₙ has (x̄ₙ − μ) · √n / σ → N(0, 1).

The conditions matter:

i.i.d. (independent and identically distributed). Independence rules out cases where samples are strongly correlated. There are CLT variants for independent but non-identically distributed variables (Lyapunov, Lindeberg-Feller) and for weakly dependent sequences (under mixing conditions), but the i.i.d. version is enough for the ML applications you care about.
Finite variance. If the base distribution has infinite variance (e.g., Cauchy, with PDF 1/(π(1 + x²)), whose mean is undefined as well), the CLT does not apply. Sums of independent Cauchys are still Cauchy, not Gaussian. Heavy-tailed phenomena (financial returns, file-size distributions, transformer attention scores) sometimes look more Cauchy than Gaussian, and reasoning about them as if the CLT applied gives wrong answers.

The viz makes the convergence visible. Pick a base distribution; slide k from 1 up. At k = 1 the histogram is the base distribution. By k = 5 the bimodal base is already vaguely bell-shaped. By k = 20, even the aggressive bimodal at ±3 is hard to tell from N(0, 1) by eye.

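[Interactive figure: histogram of the sum of k samples from a chosen base distribution, with a slider for k. Readouts at k = 1: E[sum] = 0.000, Var(sum) = 1.000, σ(sum) = 1.000.]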

For any base distribution with finite variance, the z-scored sum Z_k = (Σ Xᵢ − kμ) / (σ √k) converges to N(0, 1) as k grows. At k = 1 the histogram is the base distribution's shape; by k = 20 most distributions are visually Gaussian; by k = 50 the agreement is tight to the eye even for the aggressive bimodal.

The Central Limit Theorem in motion: sum k independent samples from any finite-variance distribution and the (z-scored) sum is approximately Gaussian for large k. The approximation error shrinks like 1/√k; the more the base distribution differs from a Gaussian (heavier tails, bimodality, skew), the more k you need.

The convergence rate is itself O(1/√k), the same √k that shows up in §5.1's standard-error law. By the Berry-Esseen theorem, if the base distribution also has a finite third absolute moment, the largest gap between the z-scored sum's CDF and Φ(z) (the standard normal CDF), i.e. the Kolmogorov distance, is at most a constant times 1/√k. The bimodal at ±3 converges faster than this worst case because it is symmetric, so the leading skewness correction vanishes; an asymmetric base would need more k to look Gaussian.
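Written out, the Berry-Esseen bound reads as follows. Here F_k denotes the CDF of the z-scored sum Z_k, ρ is the third absolute central moment of the base distribution, and C is an absolute constant (published proofs give values below 0.5).

```latex
\sup_{z \in \mathbb{R}} \bigl| F_k(z) - \Phi(z) \bigr|
  \;\le\; \frac{C\,\rho}{\sigma^{3}\sqrt{k}},
\qquad
\rho = \mathbb{E}\,\lvert X - \mu \rvert^{3} < \infty .
```

As a worked example, take an idealised two-point base that puts mass 1/2 on each of ±3 (the bimodal with its spread set to zero). Then σ = 3 and ρ = 27, so ρ/σ³ = 1 and the bound is C/√k: at k = 64 the CDFs differ by at most 0.5/8 ≈ 0.06 anywhere. This is a worst-case guarantee; symmetric bases like this one typically do much better.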

— think, then check —

(1) Maximum entropy for fixed variance. Among all distributions with given mean and variance, the Gaussian has highest entropy. → ML consequence: it’s the default “I don’t know” prior, justifying its use in variational inference, Bayesian deep learning, and as the noise model in regression.

(2) Stable under summation. Sum of independent Gaussians is Gaussian; means and variances both add. → ML consequence: when you concatenate or sum independent activations (residual streams, skip connections), Gaussian-shaped distributions stay Gaussian-shaped.

(3) Stable under linear maps. AX where X is Gaussian is Gaussian, with covariance AΣAᵀ. → ML consequence: a linear layer applied to a Gaussian input stays Gaussian. This is what makes the activation-distribution analysis underlying Xavier/He initialisation tractable.

↳ §5.2 Gaussian properties

Now make it run

The kernel sums k samples from a deliberately non-Gaussian base (bimodal at ±3 with tiny spread). It computes the excess kurtosis of the sum at increasing values of k and watches it decay toward zero — the Gaussian’s signature.

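clt.c (loop) C · CLT, measured by excess kurtosis

    /* Table header; column widths match the row format used in the loop. */
    printf("%-6s %-12s %-12s %-12s %-12s\n",
           "k", "mean", "var", "skew", "excess kurt");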

    int ks[] = { 1, 2, 4, 8, 16, 32, 64 };
    for (size_t kk = 0; kk < sizeof(ks) / sizeof(ks[0]); kk++) {
        int k = ks[kk];
        for (long i = 0; i < N_SUMS; i++) {
            double s = 0.0;
            for (int j = 0; j < k; j++) s += base_sample();
            sums[i] = s;
        }
        double mean, var, skew, kurt;
        moments(sums, N_SUMS, &mean, &var, &skew, &kurt);
        printf("%-6d %-12.4f %-12.4f %-12.4f %-12.4f\n",
               k, mean, var, skew, kurt);
    }

    printf("\nVar(sum) grows linearly in k (variance of independent sums adds).\n");

The output is striking:

CLT on a bimodal {±3} base distribution
base sample skew ≈ 0 (symmetric); base sample excess kurtosis ≈ -2 (flatter than Gaussian)

k      mean         var          skew         excess kurt
1      0.0036       10.5845      -0.0021      -1.9922
2      -0.0194      21.1606      0.0055       -0.9948
4      0.0040       42.2010      -0.0018      -0.5037
8      0.0298       84.7152      -0.0031      -0.2554
16     -0.0556      169.9084     0.0084       -0.1100
32     -0.0296      343.3307     0.0037       -0.0523
64     0.0370       681.9793     0.0114       -0.0272

Var(sum) grows linearly in k (variance of independent sums adds).
Excess kurtosis decays as 1/k (skew as 1/√k).
By k = 64 the sum is indistinguishable from a Gaussian within sampling noise.

Two things to notice. Variance doubles with each doubling of k (10.58 → 21.16 → 42.20 → …): that's the variance-of-independents-add rule. Excess kurtosis halves with each doubling of k (−1.99 → −0.99 → −0.50 → …), matching the predicted 1/k decay. By k = 64 the excess kurtosis is below 0.03 in magnitude, close to the −1.99/64 ≈ −0.031 that the 1/k law predicts and small enough that the sum is hard to tell from a Gaussian. The bimodal has become approximately Gaussian by being summed.
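The 1/k and 1/√k rates follow from the fact that cumulants of independent variables add. Writing S_k for the sum of k i.i.d. copies of X, κ₃ and κ₄ for the third and fourth cumulants, and γ₁, γ₂ for skewness and excess kurtosis:

```latex
\kappa_3(S_k) = k\,\kappa_3(X), \qquad
\kappa_4(S_k) = k\,\kappa_4(X), \qquad
\operatorname{Var}(S_k) = k\,\sigma^2

\gamma_1(S_k) = \frac{\kappa_3(S_k)}{\operatorname{Var}(S_k)^{3/2}}
             = \frac{k\,\kappa_3(X)}{k^{3/2}\sigma^{3}}
             = \frac{\gamma_1(X)}{\sqrt{k}},
\qquad
\gamma_2(S_k) = \frac{\kappa_4(S_k)}{\operatorname{Var}(S_k)^{2}}
             = \frac{k\,\kappa_4(X)}{k^{2}\sigma^{4}}
             = \frac{\gamma_2(X)}{k}
```

Plugging in the measured base value γ₂(X) ≈ −1.9922 predicts −0.2490 at k = 8 (measured −0.2554) and −0.0311 at k = 64 (measured −0.0272); the small gaps are Monte Carlo noise.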

— think, then check —

Let X₁, X₂, …, Xₙ be independent and identically distributed (i.i.d.) random variables with E[Xᵢ] = μ and Var(Xᵢ) = σ² < ∞. Then the z-scored sum

(Σ Xᵢ − nμ) / (σ √n) → N(0, 1) in distribution, as n → ∞.

The two conditions:

(1) i.i.d. The samples are independent of one another and drawn from the same distribution.

(2) Finite variance. σ² < ∞. (Heavy-tailed distributions like Cauchy violate this and the conclusion fails — sums of Cauchys are Cauchy, not Gaussian.)

Without independence (or some form of weak dependence) the limit can fail entirely. Without finite variance a limit may still exist under a different normalisation, but it is a non-Gaussian stable distribution (the Lévy α-stable family, of which the Cauchy is one member). In practice both conditions hold, at least approximately, for many quantities that come out of an ML pipeline, which is why the CLT gets used so widely; the heavy-tailed cases are the ones to watch.

↳ §5.2 CLT statement

— think, then check —

By the CLT, a sum of many roughly-independent random variables (each with finite variance) is approximately Gaussian. The pre-activation y_i = Σⱼ W_ij · x_j is a weighted sum of fan_in input dimensions, so for any reasonable input distribution (with zero-mean weights and inputs), y_i is approximately N(0, σ_y²) for some σ_y².

By the variance-of-independents-add rule, with zero-mean weights drawn independently of the inputs: Var(y_i) = Σⱼ Var(W_ij · x_j) = Σⱼ Var(W) · Var(x_j) ≈ fan_in · Var(W) · Var(x). Setting this equal to Var(x) keeps the variance stable layer to layer:

fan_in · Var(W) ≈ 1 → Var(W) ≈ 1 / fan_in.

Glorot/Xavier balanced the forward stability condition against the backward stability condition (which gives 1/fan_out) by taking Var(W) = 2/(fan_in + fan_out), the reciprocal of the average of fan_in and fan_out. He initialisation refines this for ReLU activations, which zero out about half of a symmetric pre-activation and so halve its second moment, suggesting Var(W) = 2/fan_in.

The CLT does the work of saying “the linear combination will be Gaussian-shaped”; the variance rule does the work of saying “the spread will be unit if you pick the weight variance correctly.” Together they made training nets deeper than ~10 layers tractable. Both rely directly on §5.1’s foundations — linearity of variance for independents, and the √N / CLT laws.

↳ §5.2 + §5.1 variance of sums

END OF CH.5 §2 — The Gaussian and the CLT.
Built: CLT viz (slide k to watch any base distribution’s sum become Gaussian); clt.c measures excess kurtosis decaying as 1/k under summation, confirming the CLT rate empirically. Three recall items spanning the Gaussian’s special properties, the CLT statement and conditions, and the Xavier-initialisation derivation.
Coming next: §5.3 — Isotropic vs anisotropic distributions. The covariance picture and why rotation-based quantization (Ch.25) works at all.