Loss functions & empirical risk
“Training a neural network” sounds elaborate. It isn’t. It’s finding the model parameters that minimise a loss function averaged over training examples. That’s the whole formal procedure; the rest is implementation. The loss function says which mistakes you care about; gradient descent (next section) says how to find the parameters that minimise the average. Get the loss wrong and the model optimises for the wrong thing: a minimum-MSE fit on outlier-contaminated data still gives a regression line that lies; cross-entropy computed on softmax outputs without the fused, numerically stable form can underflow; a ranking system whose loss isn’t rank-aware misses the point of ranking. The loss function is your contract with the model. This section nails down the standard losses and the empirical-risk-minimisation framework, both refreshed for an audience that learned this material when “regularisation” still mostly meant “ridge regression.”
A loss is a number per example
A loss function takes one prediction and one ground-truth label and returns a non-negative number: how wrong was that prediction? Zero is perfect; bigger is worse. The empirical risk (what you actually minimise during training) is the average loss across your training set of N examples (x_i, y_i), where f_θ is the model with parameters θ:

R̂(θ) = (1/N) Σ_{i=1}^{N} L(f_θ(x_i), y_i)
“Training” is: pick a θ that minimises R̂(θ). That’s it. The rest of this chapter (gradient descent, SGD, Adam) is how you find that minimising θ. Nearly all supervised deep learning since 2012 fits this framework, which the theory literature calls empirical risk minimisation (ERM).
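The definition above can be made concrete in a few lines. This is a minimal sketch, not a training loop: the one-parameter model `linear_model`, the toy data, and the brute-force grid search are all illustrative assumptions standing in for a real network and optimiser.

```python
from typing import Callable, Sequence, Tuple


def squared_error(y_hat: float, y: float) -> float:
    """Per-example loss: zero for an exact prediction, larger otherwise."""
    return (y_hat - y) ** 2


def linear_model(theta: float, x: float) -> float:
    """Hypothetical one-parameter model f_theta(x) = theta * x."""
    return theta * x


def empirical_risk(
    theta: float,
    data: Sequence[Tuple[float, float]],
    model: Callable[[float, float], float],
    loss: Callable[[float, float], float],
) -> float:
    """Average per-example loss over the training set."""
    return sum(loss(model(theta, x), y) for x, y in data) / len(data)


# Toy training set generated by y = 2x (illustrative data only).
train = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

# "Training" by brute force: evaluate the risk on a grid, keep the best theta.
grid = [i / 10 for i in range(0, 41)]
best_theta = min(
    grid, key=lambda t: empirical_risk(t, train, linear_model, squared_error)
)
```

Gradient descent replaces the grid search with something that scales, but the objective being minimised is the same `empirical_risk`.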
The two big losses
MSE is the default for regression. Three reasons it’s the canonical choice:
- Smooth and differentiable. Easy to optimise with gradient methods.
- Implicit Gaussian assumption. If you assume your residuals are Gaussian with fixed variance, MSE is the negative log-likelihood up to a positive scale factor and an additive constant, so minimising MSE = maximising likelihood under a Gaussian noise model. The CLT (Ch.5 §2) is the reason this assumption is roughly right surprisingly often.
- Strict convexity in the prediction. As a function of ŷ the loss has a single minimum at ŷ = y, so the landscape is well-behaved with respect to the output. (For a neural network this does not make the loss convex in θ.)
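The Gaussian claim in the list above can be checked in two steps. The sketch assumes y = ŷ + ε with ε drawn from a zero-mean Gaussian whose standard deviation σ is fixed; σ is notation introduced here, not part of the section's own symbols.

```latex
-\log p(y \mid \hat{y})
  = \frac{(y - \hat{y})^2}{2\sigma^2} + \frac{1}{2}\log\!\left(2\pi\sigma^2\right)
\quad\Longrightarrow\quad
\arg\min_\theta \sum_{i=1}^{N} -\log p\bigl(y_i \mid f_\theta(x_i)\bigr)
  = \arg\min_\theta \frac{1}{N}\sum_{i=1}^{N} \bigl(y_i - f_\theta(x_i)\bigr)^2
```

The second term and the factor 1/(2σ²) do not depend on θ, so they drop out of the arg min.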
The downside: MSE squares the residual, so outliers dominate. One bad data point with residual 10 contributes 100 to the loss; ten typical points with residual 1 each contribute only 10 combined. MAE (mean absolute error) fixes this — its gradient magnitude is constant, so outliers don’t dominate — at the cost of being non-smooth at zero residual.
Huber loss is the practical compromise — quadratic near zero (smooth, fast convergence) and linear in the tails (robust). Default in many robust-regression libraries.
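The outlier arithmetic above, and the shape of Huber, can be sketched as per-example functions of the residual r = ŷ − y. The threshold `delta` and its default of 1.0 are assumptions for illustration, and the 0.5 factor in the quadratic branch follows the common Huber convention, so its scale differs from the plain r² used for MSE here.

```python
def mse_loss(r: float) -> float:
    """Squared residual: large residuals dominate."""
    return r * r


def mae_loss(r: float) -> float:
    """Absolute residual: constant gradient magnitude, kink at zero."""
    return abs(r)


def huber_loss(r: float, delta: float = 1.0) -> float:
    """Quadratic for |r| <= delta, linear beyond it."""
    a = abs(r)
    if a <= delta:
        return 0.5 * r * r
    return delta * (a - 0.5 * delta)


def huber_grad(r: float, delta: float = 1.0) -> float:
    """Derivative of huber_loss with respect to r: clipped to [-delta, delta]."""
    if abs(r) <= delta:
        return r
    return delta if r > 0 else -delta


# One outlier with residual 10 versus ten typical points with residual 1.
outlier = 10.0
typical = [1.0] * 10

mse_outlier, mse_typical = mse_loss(outlier), sum(mse_loss(r) for r in typical)
mae_outlier, mae_typical = mae_loss(outlier), sum(mae_loss(r) for r in typical)
hub_outlier, hub_typical = huber_loss(outlier), sum(huber_loss(r) for r in typical)
```

Under MSE the single outlier outweighs the ten typical points by 100 to 10; under MAE they are level at 10 each; the clipped Huber gradient is what keeps the outlier's pull bounded.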
[Interactive figure: LossLandscape. Drag w and flip between MSE, MAE, and Huber.] Notice the optimum moves: different losses optimise to different parameters on the same data. The dataset has a few outliers (the cluster of points well off the trend line); MSE pulls toward them, MAE/Huber don’t.
A loss function L(ŷ, y) takes one prediction ŷ and one ground-truth label y and returns a non-negative number — how wrong the prediction was. Training minimises the average of L over the training set.
Calling it a ‘contract’ means: the loss specifies what the model is asked to optimise. Choose MSE and the model will minimise squared error, so outliers will pull it. Choose MAE and the fit is robust to outliers, but the loss has a kink at zero error. Choose cross-entropy and the model is pushed toward calibrated probabilities. The model does exactly what the loss tells it to. If it does the wrong thing, the loss is the first suspect, ahead of the architecture or the data.
Cross-entropy — the loss for classification
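For a probabilistic classifier producing softmax outputs ŷ = softmax(z) over C classes (where z are the pre-softmax logits) and a one-hot label y, the standard loss is cross-entropy:

L(ŷ, y) = −Σ_{c=1}^{C} y_c log ŷ_c

With a one-hot y this reduces to −log ŷ_t, where t is the index of the true class.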
Three things to know about cross-entropy:
- It’s negative log-likelihood. Minimising cross-entropy = maximising the likelihood the model assigns to the training data. Same identity as MSE-vs-Gaussian-likelihood, just for categorical distributions instead of Gaussian.
- Pairs naturally with softmax. The gradient of cross-entropy with respect to the pre-softmax logits is a clean ŷ − y (one-hot subtracted from probabilities), which is easy to backprop. Implementations often fuse softmax and cross-entropy into one numerically stable op (log_softmax + nll) to avoid overflow at large logits.
- It heavily penalises confident wrong predictions. If the model assigns 0.001 to the true class, the loss is −log 0.001 ≈ 6.9. If it assigns 0.99, the loss is 0.01. The relationship between probability and loss is logarithmic, so being wrong-and-confident is severely punished.
This makes cross-entropy the natural loss when you want usable probabilities: in principle, models trained on it produce probability estimates that are roughly correctly scaled, not just discriminative. In practice there is a caveat. Guo et al. 2017 (“On Calibration of Modern Neural Networks,” ICML) show that large modern networks are typically over-confident even when trained with cross-entropy; post-hoc calibration techniques such as temperature scaling correct this.
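The fused log_softmax + nll computation and the ŷ − y gradient can be sketched in plain Python. This is an illustrative version using the standard max-subtraction trick, not the implementation of any particular framework; the function names are chosen here for the example.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Log-probabilities, computed stably by subtracting the max logit."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def cross_entropy(logits: List[float], target: int) -> float:
    """Negative log-likelihood of the true class index `target`."""
    return -log_softmax(logits)[target]


def cross_entropy_grad(logits: List[float], target: int) -> List[float]:
    """Gradient w.r.t. the logits: softmax probabilities minus one-hot label."""
    probs = [math.exp(lp) for lp in log_softmax(logits)]
    return [p - (1.0 if c == target else 0.0) for c, p in enumerate(probs)]


# Large logits: a naive math.exp(1000.0) would raise OverflowError,
# while the max-subtracted form stays finite.
big_logits = [1000.0, 0.0, -1000.0]
loss_right = cross_entropy(big_logits, target=0)  # confident and right
loss_wrong = cross_entropy(big_logits, target=1)  # confident and wrong

grad = cross_entropy_grad([2.0, 1.0, 0.1], target=0)
```

The gradient components sum to zero, because both the probabilities and the one-hot label sum to one.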
Why minimise empirical risk?
You’re minimising the average loss on your training data. What you actually care about is the loss on future data — the true risk R(θ) = E[L(f_θ(x), y)] over the actual data distribution. The empirical risk is just the Monte-Carlo estimate of the true risk based on the training set you have.
By the √N law from Ch.5 §1, the empirical risk estimates the true risk with standard error σ_L/√N. So if your training set is small, your empirical risk is a noisy proxy for what you actually care about — and the parameter that minimises empirical risk is generally not the parameter that minimises true risk. The gap is generalisation error.
This is why every training loop pairs the empirical-risk minimisation with techniques to keep the parameters from overfitting:
- A held-out validation set. Compute the loss on data the model wasn’t trained on; stop training when validation loss stops improving.
- Weight decay / L2 regularisation. Penalise large parameter values, biasing toward simpler solutions.
- Dropout, data augmentation, batch normalisation. Each regularises the fit in a different way, alongside the early stopping described above.
We’ll see how each interacts with the optimiser in §8.3.
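Two of the techniques in the list above are simple enough to sketch directly. The patience rule, the `min_delta` threshold, and the unscaled penalty `weight_decay * sum(p²)` are assumptions chosen for illustration; real libraries differ in details such as a factor of 1/2 on the penalty.

```python
from typing import List, Tuple


def early_stopping_epoch(
    val_losses: List[float], patience: int = 2, min_delta: float = 0.0
) -> Tuple[int, float]:
    """Return (best_epoch, best_loss) under patience-based early stopping.

    Scans validation losses in order and stops once `patience` consecutive
    epochs fail to improve on the best loss by more than `min_delta`.
    """
    best_epoch, best_loss, stale = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_epoch, best_loss, stale = epoch, loss, 0
        else:
            stale += 1
            if stale >= patience:
                break  # later epochs are never evaluated
    return best_epoch, best_loss


def l2_penalised_risk(
    risk: float, params: List[float], weight_decay: float
) -> float:
    """Empirical risk plus an L2 penalty on the parameters."""
    return risk + weight_decay * sum(p * p for p in params)


# Hypothetical validation curve: improves, then starts to rise.
curve = [1.0, 0.8, 0.7, 0.72, 0.75, 0.6]
stopped = early_stopping_epoch(curve, patience=2)
patient = early_stopping_epoch(curve, patience=3)
```

With `patience=2` the run stops before it reaches the later dip in the curve, which is the usual trade-off when picking the patience value.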
MSE is the negative log-likelihood under a Gaussian noise model. So MSE’s optimum is the maximum-likelihood estimate IF the residuals are Gaussian. When they’re NOT (heavy-tailed outliers, high-leverage points, mislabelled examples), the Gaussian assumption fails and the MSE fit is dragged toward the offending points.
MAE corresponds to a different implicit noise model: the Laplace (double-exponential) distribution, which has heavier tails than Gaussian. Minimising MAE = maximum likelihood under Laplace noise. The Laplace distribution puts more probability on large residuals, so the optimiser is less surprised by outliers — they have less influence on the fit.
Operational rule: if your data is roughly Gaussian, MSE. If you have outliers you don’t want to dominate, MAE or Huber. (Huber gets the best of both: MSE-like smoothness near the optimum, MAE-like robustness in the tails. Default choice in many robust regression libraries.)
This is one of the cleanest connections to make: every loss is implicitly an assumption about the noise distribution on your data. Pick the loss that matches what you believe about the data.
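The simplest case makes the difference visible: for a constant predictor c, the MSE minimiser is the mean of the labels and the MAE minimiser is the median. The sketch below checks this by grid search on made-up data in which one label has been corrupted from 5 to 100.

```python
import statistics
from typing import Callable, List


def average_loss(c: float, ys: List[float], loss: Callable[[float], float]) -> float:
    """Empirical risk of the constant prediction c."""
    return sum(loss(c - y) for y in ys) / len(ys)


def best_constant(ys: List[float], loss: Callable[[float], float]) -> float:
    """Grid search over c in {0.0, 0.5, ..., 100.0} for the lowest average loss."""
    grid = [i / 2 for i in range(0, 201)]
    return min(grid, key=lambda c: average_loss(c, ys, loss))


# Illustrative labels: the last value was 5.0 before corruption.
ys = [1.0, 2.0, 3.0, 4.0, 100.0]

best_mse = best_constant(ys, lambda r: r * r)  # lands on the mean
best_mae = best_constant(ys, abs)              # lands on the median

mean_y = statistics.mean(ys)
median_y = statistics.median(ys)
```

One corrupted label moves the MSE optimum from 3 to 22, while the MAE optimum stays at 3.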
Some other losses worth knowing
A non-exhaustive but useful list:
| Loss | Use case | Implicit noise assumption |
|---|---|---|
| MSE / L2 | Regression, default | Gaussian |
| MAE / L1 | Regression with outliers | Laplace |
| Huber | Robust regression with smoothness | Mixed Gaussian/Laplace |
| Cross-entropy | Classification with calibrated probabilities | Categorical/multinomial |
| Hinge | Margin classification (SVM-era) | — |
| Focal loss | Object detection with class imbalance | — |
| Triplet loss | Metric learning (anchors, positives, negatives) | — |
| Contrastive loss | Self-supervised pretraining | — |
| KL divergence | Distribution matching | — |
| InfoNCE | Self-supervised contrastive | Categorical with negative sampling |
Modern ML pipelines often mix several losses as a weighted sum. For example, PPO-style RLHF combines a policy loss, a value loss, and a KL regulariser toward the SFT model. The choice of which losses to combine, and with what weights, is a major lever for shaping behaviour.
R̂ is a Monte-Carlo estimate of R from N independent samples. By the √N law (Ch.5 §1), the standard error of this estimate is σ_L / √N where σ_L is the standard deviation of the loss across the data distribution.
So |R̂(θ) − R(θ)| is typically O(σ_L / √N) at any FIXED θ.
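The fixed-θ claim can be checked by simulation. The sketch assumes a hypothetical loss distribution: squared standard-normal residuals, for which the true risk is 1 and σ_L = √2. The sample sizes, trial count, and seed are arbitrary choices.

```python
import math
import random
import statistics
from typing import Dict, Tuple


def empirical_risk_sample(n: int, rng: random.Random) -> float:
    """Average of n i.i.d. per-example losses (squared N(0, 1) residuals)."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)) / n


def spread_of_estimate(n: int, trials: int, seed: int = 0) -> float:
    """Standard deviation of the empirical risk across repeated datasets."""
    rng = random.Random(seed)
    estimates = [empirical_risk_sample(n, rng) for _ in range(trials)]
    return statistics.pstdev(estimates)


SIGMA_L = math.sqrt(2.0)  # std of a squared standard-normal variable

# Map each sample size to (observed spread, predicted sigma_L / sqrt(N)).
results: Dict[int, Tuple[float, float]] = {}
for n in (25, 100, 400):
    observed = spread_of_estimate(n, trials=1000)
    predicted = SIGMA_L / math.sqrt(n)
    results[n] = (observed, predicted)
    print(f"N={n:4d}  observed={observed:.4f}  predicted={predicted:.4f}")
```

Quadrupling N should roughly halve the observed spread, which is the √N law in action.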
The catch: when we MINIMISE R̂(θ) over θ, we’re searching for the θ where R̂ happens to be lowest — which may be lower than R at that same θ because of the per-θ noise. The optimal-R̂ θ is biased to over-fit the noise in the empirical loss. This bias-from-optimisation is the source of generalisation error.
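The selection effect described above shows up even when no candidate is actually better than any other. In this sketch the labels are fair coin flips, so every candidate "model" has true 0-1 risk of exactly 0.5. The candidates are hypothetical fixed pseudo-random labellings, and the sizes and seeds are arbitrary.

```python
import random
from typing import List

N_TRAIN, N_TEST, N_CANDIDATES = 50, 5000, 200


def predictions(candidate: int, n: int) -> List[int]:
    """Hypothetical model: a fixed pseudo-random labelling of inputs 0..n-1."""
    rng = random.Random(candidate)
    return [rng.getrandbits(1) for _ in range(n)]


def zero_one_risk(preds: List[int], labels: List[int]) -> float:
    """Fraction of examples the predictions get wrong."""
    return sum(p != y for p, y in zip(preds, labels)) / len(labels)


# Labels are coin flips, independent of every candidate.
label_rng = random.Random(12345)
labels = [label_rng.getrandbits(1) for _ in range(N_TRAIN + N_TEST)]
train_labels, test_labels = labels[:N_TRAIN], labels[N_TRAIN:]

# "Training": pick the candidate with the lowest empirical risk.
train_risks = [
    zero_one_risk(predictions(k, N_TRAIN), train_labels)
    for k in range(N_CANDIDATES)
]
best = min(range(N_CANDIDATES), key=train_risks.__getitem__)
best_train_risk = train_risks[best]

# Evaluate the selected candidate on inputs it was not selected on.
best_test_risk = zero_one_risk(
    predictions(best, N_TRAIN + N_TEST)[N_TRAIN:], test_labels
)
```

The winning candidate's training risk sits well below 0.5 purely because it was selected for being low, while its held-out risk returns to roughly 0.5.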
Bound: a uniform convergence argument (Vapnik and Chervonenkis 1971) shows that, with high probability, sup_θ |R̂(θ) − R(θ)| ≤ O(√(VC-dim · log(N) / N)). The capacity term (VC-dim, model complexity) appears because we’re taking a supremum over a whole model class, not looking at a single θ.
Operational consequence: empirical training loss is a noisy proxy for true risk. To get an UNBIASED estimate of R, evaluate the model on a held-out set the optimiser never saw — that’s exactly what a validation/test set is. Early stopping based on validation loss is a practical way to use this independent estimate to avoid over-fitting the empirical loss.
Modern caveat (DL post-2015): the VC-style bounds dramatically OVER-estimate generalisation error for huge over-parameterised networks. The implicit biases of SGD and feature learning give effective generalisation much better than VC predicts. But validation sets remain the operational ground truth — every production training run validates on held-out data because the theoretical bounds aren’t reliable enough to skip the empirical check.
Coming next: §8.2 — Gradient descent and stochastic gradient descent. How to actually find the θ that minimises empirical risk.