The Gaussian and the CLT
Across the dozens of distributions in any probability text, one shows up obsessively in ML: the Gaussian, also called the normal. It's the distribution you'd guess if you knew only a mean and a variance; it's the limit law of standardised sums of unrelated quantities; it's roughly what your pre-activations look like after good initialisation. Two structural facts explain why: (1) sums of independent Gaussians are Gaussian, and (2) the Central Limit Theorem says that standardised sums of independent, identically distributed, finite-variance quantities converge in distribution to a Gaussian. Combine those and you get a universal attractor: anywhere a quantity is the sum of many small independent contributions, its distribution will be approximately Gaussian-shaped, largely regardless of where the contributions came from. This is also why N(0, 1) is the reference distribution: it is the fixed point of the sum-then-standardise operation.
The normal distribution
The normal distribution N(μ, σ²) has density:
f(x) = 1 / (σ √(2π)) · exp(−(x − μ)² / (2σ²))
The standard normal N(0, 1) is the reference: any other normal Z ∼ N(μ, σ²) can be standardised by (Z − μ)/σ ∼ N(0, 1), making distributional comparisons direct. That’s the same trick you do when you compute a z-score.
Three structural properties make it special:
- Maximum entropy for fixed variance. Among all distributions with mean μ and variance σ², the Gaussian has the highest entropy. In information-theoretic terms, the Gaussian is the least committal assumption you can make about a random quantity whose mean and variance you’ve measured. (Cover & Thomas, “Elements of Information Theory,” ch. 8.) Whenever you’re tempted to say “I’ll model this as Gaussian because I don’t know what else to do,” that’s actually a principled choice — it’s the maximum-entropy prior given what you know.
- Stable under summation. If X ∼ N(μ_X, σ_X²) and Y ∼ N(μ_Y, σ_Y²) are independent, then X + Y ∼ N(μ_X + μ_Y, σ_X² + σ_Y²). The means add, the variances add — and the sum is still Gaussian. Few families have this closure property; most distributions don’t preserve their family under addition.
- Linear maps stay Gaussian. If X ∼ N(μ, Σ) in multiple dimensions and A is a matrix (a linear function, see Ch.2 §1), then AX ∼ N(Aμ, AΣAᵀ). Apply any linear function to a Gaussian, get a Gaussian. The covariance transforms as AΣAᵀ (strictly a congruence, which coincides with conjugation when A is orthogonal), a formula we'll return to in §5.3 when discussing isotropic vs anisotropic.
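The summation and linear-map properties can be checked together on a small case. The sketch below is illustrative only: `transform_cov2` is a hypothetical helper, and matrices are assumed to be 2×2 and stored row-major as four doubles.

```c
#include <assert.h>
#include <stddef.h>

/* Computes out = A * Sigma * A^T for 2x2 matrices stored row-major
 * as {m00, m01, m10, m11}. `out` must not alias A or Sigma. */
void transform_cov2(const double *A, const double *Sigma, double *out) {
    assert(A != NULL && Sigma != NULL && out != NULL);
    double AS[4]; /* intermediate product A * Sigma */
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            AS[2 * i + j] = A[2 * i] * Sigma[j] + A[2 * i + 1] * Sigma[2 + j];
    /* (AS * A^T)[i][j] = sum over k of AS[i][k] * A[j][k] */
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            out[2 * i + j] = AS[2 * i] * A[2 * j] + AS[2 * i + 1] * A[2 * j + 1];
}
```

Take independent X and Y with variances 4 and 9, so Σ = {4, 0, 0, 9}, and let A = {1, 1, 1, −1} map (X, Y) to (X + Y, X − Y). The result is {13, −5, −5, 13}: both the sum and the difference have variance 4 + 9 = 13, which is the variances-add rule recovered as a special case of AΣAᵀ, and the off-diagonal −5 shows that the two outputs are correlated even though the inputs were not.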
Why this matters for ML — the universal-attractor argument
A neural network's pre-activation at a given layer is roughly:
y_i = Σⱼ Wᵢⱼ · input_j
A sum over many inputs, weighted by the matrix W. If the inputs are roughly independent and each weight contribution is small, the sum will look Gaussian by the CLT, even if the individual input_j distributions are wild. This is why Xavier and He initialisation work: they pick the weight variance so that the sum Σⱼ Wᵢⱼ · input_j ends up with the same variance as its inputs (unit variance when the inputs are standardised), by the variance-of-sums-of-independents formula from §5.1. The CLT then takes care of making the activations Gaussian-shaped.
The Central Limit Theorem
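Stated formally: let X₁, X₂, …, Xₙ be independent and identically distributed (i.i.d.) random variables with E[Xᵢ] = μ and Var(Xᵢ) = σ² < ∞. Then the z-scored sum
(Σ Xᵢ − nμ) / (σ √n) → N(0, 1) in distribution, as n → ∞.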
The conditions matter:
- i.i.d. (independent and identically distributed). Independence rules out cases where samples are strongly correlated. Variants relax each half of the assumption: the Lindeberg-Feller and Lyapunov CLTs cover independent but non-identically distributed terms, and there are CLTs for weakly dependent sequences under mixing conditions. The i.i.d. version (Lindeberg-Lévy) is enough for the ML applications you care about.
- Finite variance. If the base distribution does not have finite variance (e.g., the Cauchy, with PDF 1/(π(1 + x²)), whose mean is undefined and whose second moment is infinite), the CLT does not apply. Sums of independent Cauchys are still Cauchy, not Gaussian, and averaging them does not narrow the distribution at all. Heavy-tailed phenomena (financial returns, file-size distributions, transformer attention scores) sometimes look more Cauchy than Gaussian, and reasoning about them as if the CLT applied gives wrong answers.
The viz makes the convergence visible. Pick a base distribution; slide k from 1 up. At k = 1 the histogram is the base distribution. By k = 5 the bimodal base is already vaguely bell-shaped. By k = 20, even the aggressive bimodal at ±3 is, once z-scored, visually indistinguishable from N(0, 1).
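The convergence rate is itself O(1/√k) in the worst case, the same √k that shows up in §5.1's standard-error law. By the Berry-Esseen theorem, which additionally requires a finite third absolute moment, the Kolmogorov-Smirnov distance (the largest vertical gap) between the z-scored sum's CDF and Φ(z) (the standard normal CDF) is bounded by a constant times 1/√k. The bimodal at ±3 converges fast because it's already symmetric, so the skewness term that usually dominates the error is zero; an asymmetric base would need more k to look Gaussian.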
(1) Maximum entropy for fixed variance. Among all distributions with given mean and variance, the Gaussian has highest entropy. → ML consequence: it’s the default “I don’t know” prior, justifying its use in variational inference, Bayesian deep learning, and as the noise model in regression.
(2) Stable under summation. Sum of independent Gaussians is Gaussian; means and variances both add. → ML consequence: when you concatenate or sum independent activations (residual streams, skip connections), Gaussian-shaped distributions stay Gaussian-shaped.
(3) Stable under linear maps. AX where X is Gaussian is Gaussian, with covariance AΣAᵀ. → ML consequence: a linear layer applied to a Gaussian input stays Gaussian. This is what makes the activation-distribution analysis underlying Xavier/He initialisation tractable.
Now make it run
The kernel sums k samples from a deliberately non-Gaussian base (bimodal at ±3 with tiny spread). It computes the excess kurtosis of the sum at increasing values of k and watches it decay toward zero, the Gaussian's signature.
    /* Header row. The opening of this call was lost in extraction;
       the column widths are assumed to mirror the data rows below. */
    printf("%-6s %-12s %-12s %-12s %-12s\n",
           "k", "mean", "var", "skew", "excess kurt");
    int ks[] = { 1, 2, 4, 8, 16, 32, 64 };
    for (size_t kk = 0; kk < sizeof(ks) / sizeof(ks[0]); kk++) {
        int k = ks[kk];
        /* Each entry of sums[] is the sum of k independent base samples. */
        for (long i = 0; i < N_SUMS; i++) {
            double s = 0.0;
            for (int j = 0; j < k; j++) s += base_sample();
            sums[i] = s;
        }
        /* Empirical mean, variance, skewness and excess kurtosis of the sums. */
        double mean, var, skew, kurt;
        moments(sums, N_SUMS, &mean, &var, &skew, &kurt);
        printf("%-6d %-12.4f %-12.4f %-12.4f %-12.4f\n",
               k, mean, var, skew, kurt);
    }
    printf("\nVar(sum) grows linearly in k (variance of independent sums adds).\n");
The output is striking:
CLT on a bimodal {±3} base distribution
base sample skew ≈ 0 (symmetric); base sample excess kurtosis ≈ -2 (flatter than Gaussian)
k      mean         var          skew         excess kurt
1      0.0036       10.5845      -0.0021      -1.9922
2      -0.0194      21.1606      0.0055       -0.9948
4      0.0040       42.2010      -0.0018      -0.5037
8      0.0298       84.7152      -0.0031      -0.2554
16     -0.0556      169.9084     0.0084       -0.1100
32     -0.0296      343.3307     0.0037       -0.0523
64     0.0370       681.9793     0.0114       -0.0272
Var(sum) grows linearly in k (variance of independent sums adds).
Excess kurtosis decays as 1/k (skew as 1/√k).
By k = 64 the sum is indistinguishable from a Gaussian within sampling noise.
Two things to notice. Variance doubles with each doubling of k (10.58 → 21.16 → 42.20 → …): that's the variance-of-independents-add rule. Excess kurtosis halves with each doubling of k (−1.99 → −0.99 → −0.50 → …), matching the predicted 1/k decay rate. By k = 64 its magnitude is below 0.03, in line with the predicted −1.99/64 ≈ −0.031 and close to the Gaussian value of 0. The bimodal has become Gaussian by being summed.
Let X₁, X₂, …, Xₙ be independent and identically distributed (i.i.d.) random variables with E[Xᵢ] = μ and Var(Xᵢ) = σ² < ∞. Then the z-scored sum
(Σ Xᵢ − nμ) / (σ √n) → N(0, 1) in distribution, as n → ∞.
The two conditions:
(1) i.i.d. The samples are independent of one another and drawn from the same distribution.
(2) Finite variance. σ² < ∞. (Heavy-tailed distributions like Cauchy violate this and the conclusion fails — sums of Cauchys are Cauchy, not Gaussian.)
Without independence (or some form of weak dependence) the limit can fail entirely. Without finite variance, a limit may still exist under a different normalisation, but it is a non-Gaussian Lévy α-stable distribution (the Cauchy is one example). In practice both conditions hold, at least approximately, for many quantities that come out of an ML pipeline, which is why the CLT gets used so widely; the heavy-tailed cases are the ones to check.
By the CLT, a sum of many roughly-independent random variables (each with finite variance) is approximately Gaussian. y_i is a weighted sum of fan_in input dimensions, so for any reasonable input distribution, and with zero-mean weights, y_i ≈ N(0, σ_y²) for some σ_y².
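By the variance-of-independents-add rule: Var(y_i) = Σⱼ W_ij² · Var(x_j) ≈ fan_in · Var(W) · Var(x). The first step treats the weights as fixed; the approximation replaces the average of W_ij² over j by Var(W), which is reasonable for zero-mean weights and large fan_in. Setting this equal to Var(x) keeps the variance stable layer to layer: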
fan_in · Var(W) ≈ 1 → Var(W) ≈ 1 / fan_in.
Glorot/Xavier averaged the forward stability condition with the backward stability condition (which gives 1/fan_out) and picked Var(W) = 2/(fan_in + fan_out). He initialisation refines this for ReLU activations, which zero out half of a symmetric pre-activation distribution and so halve its second moment, suggesting Var(W) = 2/fan_in.
The CLT does the work of saying “the linear combination will be Gaussian-shaped”; the variance rule does the work of saying “the spread will stay the same from layer to layer if you pick the weight variance correctly.” Together they helped make training nets deeper than ~10 layers tractable. Both rely directly on §5.1's foundations: additivity of variance for independents, and the √N / CLT laws.
END OF CH.5 §2 — The Gaussian and the CLT.
Built: CLT viz (slide k to watch any base distribution’s sum become Gaussian); clt.c measures excess kurtosis decaying as 1/k under summation, confirming the CLT rate empirically. Three recall items spanning the Gaussian’s special properties, the CLT statement and conditions, and the Xavier-initialisation derivation.
Coming next: §5.3 — Isotropic vs anisotropic distributions. The covariance picture and why rotation-based quantization (Ch.25) works at all.