Loss functions & empirical risk

Section 8.1

“Training a neural network” sounds elaborate. It isn’t. It’s finding the model parameters that minimise a loss function averaged over training examples. That’s the whole formal procedure; the rest is implementation. The loss function says which mistakes you care about; gradient descent (next section) says how to find the parameters that minimise the average. Get the loss wrong and the model optimises for the wrong thing: the MSE-optimal fit on outlier-contaminated data is still a regression line that misrepresents the bulk of the data; cross-entropy computed naively on softmax outputs can underflow to log 0; a ranking system whose loss isn’t rank-aware misses the point of ranking. The loss function is your contract with the model. This section nails down the standard losses and the empirical-risk-minimisation framework, both refreshed for an audience that learned this material when “regularisation” still mostly meant “ridge regression.”

A loss is a number per example

A loss function takes one prediction and one ground-truth label and returns a non-negative number: how wrong was that prediction? Zero is perfect; bigger is worse. The empirical risk — what you actually minimise during training — is the average loss across your training set:

R̂(θ) = (1/N) Σᵢ L( f_θ(xᵢ), yᵢ )

where:
  f_θ is your model with parameters θ
  (xᵢ, yᵢ) is training example i
  L is your chosen loss function
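To make the formula concrete, here is a minimal sketch in plain Python. The one-parameter model `linear_model` and the toy data are assumptions made up for illustration; they are not from this section.

```python
def squared_error(y_hat, y):
    """Per-example loss: one prediction, one label, one non-negative number."""
    return (y_hat - y) ** 2


def empirical_risk(model, theta, xs, ys, loss):
    """R_hat(theta): the per-example loss averaged over the training set."""
    losses = [loss(model(theta, x), y) for x, y in zip(xs, ys)]
    return sum(losses) / len(losses)


def linear_model(theta, x):
    """Hypothetical one-parameter model f_theta(x) = theta * x."""
    return theta * x


# Toy training set generated by y = 2x (illustrative only).
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

risk_at_true = empirical_risk(linear_model, 2.0, xs, ys, squared_error)
risk_at_guess = empirical_risk(linear_model, 1.5, xs, ys, squared_error)
print(risk_at_true, risk_at_guess)
```

The risk is zero at θ = 2 and positive at θ = 1.5, which is all “training” needs: a number that gets smaller as θ gets better.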

“Training” is: pick a θ that minimises R̂(θ). That’s it. The rest of this chapter (gradient descent, SGD, Adam) is how you find that minimising θ. Almost all of supervised deep learning since 2012 fits this framework; at the theory level it is called empirical risk minimisation (ERM).

The two big losses

Mean Squared Error (MSE / L2):
  L(ŷ, y) = (ŷ − y)²

Cross-entropy (CE):
  L(ŷ, y) = − Σ_c y_c · log ŷ_c
  (y is a one-hot vector, ŷ is the softmax output)

MSE is the default for regression. Three reasons it’s the canonical choice:

Smooth and differentiable. Easy to optimise with gradient methods.
Implicit Gaussian assumption. If you assume your residuals are Gaussian with fixed variance, MSE is the negative log-likelihood up to a positive scale factor and an additive constant, so minimising MSE = maximising likelihood under a Gaussian noise model. The CLT (Ch.5 §2) is the reason this assumption is roughly right surprisingly often.
Strict convexity in the prediction. As a function of ŷ the loss has a single minimum at ŷ = y and no other stationary points, so it is well-behaved with respect to the output. (Any non-convexity in θ comes from the model f_θ, not from the loss.)
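The Gaussian claim can be checked in two lines. This is a sketch under the assumption of additive noise with a fixed variance σ²; the symbols ε and σ are introduced here for the derivation and do not appear elsewhere in the section.

```latex
y = f_\theta(x) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2)

-\log p(y \mid x, \theta)
  = \frac{\bigl(y - f_\theta(x)\bigr)^2}{2\sigma^2}
  + \frac{1}{2}\log\bigl(2\pi\sigma^2\bigr)
```

The second term does not depend on θ, and 1/(2σ²) is a positive constant, so averaging this quantity over the training set and minimising over θ picks out the same θ as minimising MSE.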

The downside: MSE squares the residual, so outliers dominate. One bad data point with residual 10 contributes 100 to the loss; ten typical points with residual 1 each contribute only 10 combined. MAE (mean absolute error) fixes this — its gradient magnitude is constant, so outliers don’t dominate — at the cost of being non-smooth at zero residual.

Huber loss is the practical compromise — quadratic near zero (smooth, fast convergence) and linear in the tails (robust). Default in many robust-regression libraries.
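The three losses are short enough to write out. The sketch below works on a residual r = ŷ − y and reproduces the arithmetic above. The Huber threshold `delta` and the ½r² scaling of its quadratic part are one common convention, assumed here rather than taken from this section.

```python
def mse(r):
    """Squared error on a residual r."""
    return r * r


def mae(r):
    """Absolute error on a residual r."""
    return abs(r)


def huber(r, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond it (delta is an assumed default)."""
    a = abs(r)
    if a <= delta:
        return 0.5 * r * r
    return delta * (a - 0.5 * delta)


typical = [1.0] * 10   # ten typical points, residual 1 each
outlier = 10.0         # one bad point, residual 10

mse_typical = sum(mse(r) for r in typical)
mse_outlier = mse(outlier)
mae_outlier = mae(outlier)
huber_outlier = huber(outlier)
print(mse_typical, mse_outlier, mae_outlier, huber_outlier)
```

Under MSE the single outlier outweighs the ten typical points ten to one; under MAE and Huber its contribution grows only linearly with the residual.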

[Interactive figure: loss as a function of the model parameter w, with a slider for w. Readouts: true w = 2, current w = 1.50, loss(w) = 0.70, optimum w* = 2.19, min loss = 0.26.]

Three different losses, same data. MSE (squared error) is smooth and the gradient grows with the residual — outliers dominate the sum, so the optimum drifts toward them. MAE (absolute error) is robust — gradient magnitude is constant in the residual — so outliers have less pull, but the loss isn't smooth at zero residual. Huber is the practical compromise — quadratic near zero (smooth, fast convergence) and linear in the tails (robust to outliers).

The loss function is your contract with the model — what mistakes you care about and how much. Different losses optimise to different parameters on the same data. Pick the loss with the same care you pick the architecture.

Drag w; flip between MSE, MAE, Huber. Notice the optimum moves — different losses optimise to different parameters on the same data. The dataset has a few outliers (the cluster of points well off the trend line); MSE pulls toward them, MAE/Huber don’t.

— think, then check —

A loss function L(ŷ, y) takes one prediction ŷ and one ground-truth label y and returns a non-negative number — how wrong the prediction was. Training minimises the average of L over the training set.

Calling it a ‘contract’ means: the loss specifies what the model is asked to optimise. Choose MSE and the model will minimise squared error, so outliers will pull it. Choose MAE and the fit is robust to outliers, but the loss has a kink at zero error. Choose cross-entropy and the model is pushed to put high probability on the correct class. The model does exactly what the loss tells it to. If it does the wrong thing, the contract is a common culprit, and the first thing to check before blaming the architecture or the data.

↳ §8.1 loss as contract

Cross-entropy — the loss for classification

For a probabilistic classifier producing softmax outputs ŷ = softmax(z) over C classes (where z are the pre-softmax logits), the standard loss is cross-entropy:

L(ŷ, y) = − Σ_c y_c · log ŷ_c

For one-hot y (single true class c*):

L(ŷ, y) = − log ŷ_{c*}

So the loss is just "the negative log probability the model assigned to the correct class."

Three things to know about cross-entropy:

It’s negative log-likelihood. Minimising cross-entropy = maximising the likelihood the model assigns to the training data. Same identity as MSE-vs-Gaussian-likelihood, just for categorical distributions instead of Gaussian.
Pairs naturally with softmax. The gradient of cross-entropy with respect to the pre-softmax logits is a clean ŷ − y (one-hot subtracted from probabilities) — easy to backprop. Implementations often fuse softmax and cross-entropy into one numerically-stable op (log_softmax + nll) to avoid overflow at large logits.
It heavily penalises confident wrong predictions. If the model assigns 0.001 to the true class, the loss is −log 0.001 ≈ 6.9 (natural log). If it assigns 0.99, the loss is about 0.01. The loss is the negative log of the probability on the true class, so it grows without bound as that probability approaches zero: being wrong-and-confident is severely punished.
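The fused, numerically stable form mentioned above can be sketched in plain Python with the log-sum-exp trick. The function names are illustrative and not any particular library’s API; real frameworks ship their own fused op.

```python
import math


def log_softmax(logits):
    """Stable log-softmax: subtract the max logit before exponentiating."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def cross_entropy_from_logits(logits, true_class):
    """Negative log probability assigned to the correct class."""
    return -log_softmax(logits)[true_class]


def grad_wrt_logits(logits, true_class):
    """Gradient of the loss w.r.t. the logits: softmax(z) minus the one-hot label."""
    probs = [math.exp(lp) for lp in log_softmax(logits)]
    return [p - (1.0 if c == true_class else 0.0) for c, p in enumerate(probs)]


# Extreme logits: a naive softmax would call math.exp(1000.0), which raises
# OverflowError in Python. The shifted version stays finite.
extreme = [1000.0, 0.0, -1000.0]
loss_right = cross_entropy_from_logits(extreme, 0)
loss_wrong = cross_entropy_from_logits(extreme, 2)

grad = grad_wrt_logits([2.0, 1.0, 0.1], 0)
print(loss_right, loss_wrong, grad)
```

The gradient entries sum to zero, and only the true-class entry is negative: a gradient step raises the correct logit and lowers the others in proportion to the probability they were given.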

This makes cross-entropy the natural loss when you want calibrated probabilities: its expected value is minimised by the true class probabilities, so it rewards estimates that are correctly scaled, not just discriminative. In practice there is a caveat: Guo et al. 2017 (“On Calibration of Modern Neural Networks,” ICML) found that large modern networks are typically over-confident even when trained on cross-entropy; calibration techniques like temperature scaling fix this post-hoc.

Why minimise empirical risk?

You’re minimising the average loss on your training data. What you actually care about is the loss on future data — the true risk R(θ) = E[L(f_θ(x), y)] over the actual data distribution. The empirical risk is just the Monte-Carlo estimate of the true risk based on the training set you have.

By the √N law from Ch.5 §1, at any fixed θ the empirical risk estimates the true risk with standard error σ_L/√N. So if your training set is small, your empirical risk is a noisy proxy for what you actually care about, and the parameter that minimises empirical risk is generally not the parameter that minimises true risk. The gap is the generalisation error.

Generalisation error (theory concept): the gap between true risk R(θ) and empirical risk R̂(θ) at a particular θ. Small generalisation error = the model performs on new data roughly as well as it does on training data. Large gap = overfitting. Controlled by model capacity, training set size, regularisation, and (modern view) implicit biases of the optimiser.

Then → now: the bias-variance / VC-dimension theory you learned in 2007 was the dominant framework, built on capacity-controlled bounds. Modern ML (especially deep learning) has weakened this framework: huge over-parameterised networks generalise WELL despite VC dimension predicting otherwise. The modern explanations involve implicit biases of SGD (Ch.8 §2-3), feature learning, and effective model capacity rather than nominal parameter count.
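A small simulation illustrates the σ_L/√N behaviour at a fixed θ. The setup is an assumption chosen for convenience: per-example losses are absolute values of standard Gaussian residuals, so the true risk is √(2/π) ≈ 0.798. The seed is fixed only to make the run repeatable.

```python
import math
import random
import statistics

rng = random.Random(0)


def sample_losses(n):
    """Hypothetical per-example losses: |residual| with residual ~ N(0, 1)."""
    return [abs(rng.gauss(0.0, 1.0)) for _ in range(n)]


def standard_error(losses):
    """Estimated standard error of the empirical risk: s / sqrt(N)."""
    return statistics.stdev(losses) / math.sqrt(len(losses))


small = sample_losses(100)
large = sample_losses(10_000)

true_risk = math.sqrt(2.0 / math.pi)
risk_small = statistics.mean(small)
risk_large = statistics.mean(large)
se_small = standard_error(small)
se_large = standard_error(large)
print(true_risk, risk_small, se_small, risk_large, se_large)
```

With 100 times more data the standard error should come out roughly 10 times smaller, since √(10000/100) = 10.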

This is why every training loop pairs the empirical-risk minimisation with techniques to keep the parameters from overfitting:

A held-out validation set. Compute the loss on data the model wasn’t trained on; stop training when validation loss stops improving.
Weight decay / L2 regularisation. Penalise large parameter values, biasing toward simpler solutions.
Early stopping, dropout, data augmentation, batch normalisation. Each limits, in a different way, how closely the model can fit noise in the training set.
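As a sketch of how a penalty changes the minimiser, take the hypothetical one-parameter model f_θ(x) = θ·x with MSE plus an L2 penalty λθ². Setting the derivative of (1/N) Σ (θxᵢ − yᵢ)² + λθ² to zero gives θ = Σxᵢyᵢ / (Σxᵢ² + Nλ). The model, the data and the values of λ are assumptions for illustration.

```python
def ridge_theta(xs, ys, lam):
    """Minimiser of (1/N) * sum((theta*x - y)**2) + lam * theta**2."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + n * lam)


# Toy data generated by y = 2x (illustrative only).
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

theta_plain = ridge_theta(xs, ys, lam=0.0)   # no penalty
theta_decay = ridge_theta(xs, ys, lam=1.0)   # penalised
print(theta_plain, theta_decay)
```

The penalised θ is pulled toward zero relative to the unpenalised one, which is the “biasing toward simpler solutions” in the list above.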

We’ll see how each interacts with the optimiser in §8.3.

— think, then check —

MSE is the negative log-likelihood under a Gaussian noise model (up to a scale factor and an additive constant). So MSE’s optimum is the maximum-likelihood estimate IF the residuals are Gaussian. When they’re NOT (e.g., when there are heavy-tailed outliers, high-leverage points, or mislabelled examples), the Gaussian assumption fails and the MSE fit is pulled toward those points.

MAE corresponds to a different implicit noise model: the Laplace (double-exponential) distribution, which has heavier tails than Gaussian. Minimising MAE = maximum likelihood under Laplace noise. The Laplace distribution puts more probability on large residuals, so the optimiser is less surprised by outliers — they have less influence on the fit.

Operational rule: if your data is roughly Gaussian, MSE. If you have outliers you don’t want to dominate, MAE or Huber. (Huber gets the best of both: MSE-like smoothness near the optimum, MAE-like robustness in the tails. Default choice in many robust regression libraries.)

This is one of the cleanest connections to make: every loss is implicitly an assumption about the noise distribution on your data. Pick the loss that matches what you believe about the data.
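The simplest case shows the difference directly. For a model that predicts a single constant c, the MSE minimiser is the sample mean and the MAE minimiser is the sample median. The sketch below checks this by brute force on a made-up dataset with one outlier; the data and the grid of candidate values are assumptions for illustration.

```python
import statistics


def total_loss(c, ys, loss):
    """Summed loss when the model predicts the constant c for every example."""
    return sum(loss(c - y) for y in ys)


ys = [1.0, 2.0, 3.0, 4.0, 100.0]   # four typical points and one outlier

# Candidate constants 0.0, 0.1, ..., 100.0.
candidates = [i / 10 for i in range(0, 1001)]

best_mse = min(candidates, key=lambda c: total_loss(c, ys, lambda r: r * r))
best_mae = min(candidates, key=lambda c: total_loss(c, ys, abs))

mean_y = statistics.mean(ys)
median_y = statistics.median(ys)
print(best_mse, mean_y, best_mae, median_y)
```

The outlier drags the MSE optimum far above the four typical points, while the MAE optimum stays at the middle value.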

↳ §8.1 + Ch.5 §2 Gaussian/CLT

Some other losses worth knowing

A non-exhaustive but useful list:

Loss	Use case	Implicit noise assumption
MSE / L2	Regression, default	Gaussian
MAE / L1	Regression with outliers	Laplace
Huber	Robust regression with smoothness	Mixed Gaussian/Laplace
Cross-entropy	Classification with calibrated probabilities	Categorical/multinomial
Hinge	Margin classification (SVM-era)	—
Focal loss	Object detection with class imbalance	—
Triplet loss	Metric learning (anchors, positives, negatives)	—
Contrastive loss	Self-supervised pretraining	—
KL divergence	Distribution matching	—
InfoNCE	Self-supervised contrastive	Categorical with negative sampling

Modern ML pipelines often mix several losses with weighted sums. For example, PPO-style RLHF combines a policy loss, a value loss, and a KL regulariser toward the SFT model. The choice of which losses to combine, and with what weights, is a major lever for shaping behaviour.

— think, then check —

R̂ is a Monte-Carlo estimate of R from N independent samples. By the √N law (Ch.5 §1), the standard error of this estimate is σ_L / √N where σ_L is the standard deviation of the loss across the data distribution.

So |R̂(θ) − R(θ)| is typically O(σ_L / √N) at any FIXED θ.

The catch: when we MINIMISE R̂(θ) over θ, we’re searching for the θ where R̂ happens to be lowest — which may be lower than R at that same θ because of the per-θ noise. The optimal-R̂ θ is biased to over-fit the noise in the empirical loss. This bias-from-optimisation is the source of generalisation error.

Bound: a uniform convergence argument (Vapnik and Chervonenkis 1971) shows that, with high probability and for a bounded loss, sup_θ |R̂(θ) − R(θ)| ≤ O(√(VC-dim · log(N) / N)). The capacity term (VC-dim, model complexity) appears because we’re taking a supremum over a model class, not evaluating a single θ.

Operational consequence: empirical training loss is a noisy proxy for true risk. To get an UNBIASED estimate of R, evaluate the model on a held-out set the optimiser never saw — that’s exactly what a validation/test set is. Early stopping based on validation loss is a practical way to use this independent estimate to avoid over-fitting the empirical loss.
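The optimism of a minimised empirical risk can be seen in a toy simulation. In this sketch the labels are coin flips independent of the input, so every classifier has true error 0.5. The candidate set (200 random lookup tables), the sample sizes and the seed are all assumptions for illustration.

```python
import random

rng = random.Random(0)
NUM_INPUTS = 64        # x takes integer values 0..63
NUM_CANDIDATES = 200   # size of the (finite) model class


def sample_data(n):
    """Draw (x, y) pairs; y is a coin flip independent of x."""
    return [(rng.randrange(NUM_INPUTS), rng.randrange(2)) for _ in range(n)]


def error_rate(table, data):
    """0-1 loss averaged over a dataset for a lookup-table classifier."""
    return sum(table[x] != y for x, y in data) / len(data)


# Each candidate is a fixed random table mapping x to a predicted label.
candidates = [
    [rng.randrange(2) for _ in range(NUM_INPUTS)]
    for _ in range(NUM_CANDIDATES)
]

train = sample_data(50)
held_out = sample_data(10_000)

# "Training": pick the candidate with the lowest empirical risk.
best = min(candidates, key=lambda t: error_rate(t, train))
train_error = error_rate(best, train)
held_out_error = error_rate(best, held_out)
print(train_error, held_out_error)
```

The selected candidate looks clearly better than chance on the training set, purely because it was chosen for that, while its held-out error sits near the true risk of 0.5.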

Modern caveat (DL post-2015): the VC-style bounds dramatically OVER-estimate generalisation error for huge over-parameterised networks. The implicit biases of SGD and feature learning give effective generalisation much better than VC predicts. But validation sets remain the operational ground truth — every production training run validates on held-out data because the theoretical bounds aren’t reliable enough to skip the empirical check.

↳ §8.1 + Ch.5 §1 √N law

Coming next: §8.2 — Gradient descent and stochastic gradient descent. How to actually find the θ that minimises empirical risk.