Derivatives as sensitivity
For most working engineers, the word “derivative” calls up a memory of memorised rules — power rule, chain rule, the derivative of sine is cosine — and then a vague feeling that this all matters because of something to do with optimisation. The “something” is the entire engine of deep learning: every weight in every model gets updated by following derivatives. Before you can think clearly about that, you need to re-own the geometric and operational picture of what a derivative is. The viz below is that picture; the kernel below it is what happens when you ask the floating-point hardware to compute one.
Derivative as sensitivity
For a function f : ℝ → ℝ, the derivative at a point x₀ is defined as a limit:
f’(x₀) = lim h→0 (f(x₀ + h) − f(x₀)) / h
The expression on the right is a slope: rise over run, with the run being the small number h. As h shrinks toward zero, the secant line between the points (x₀, f(x₀)) and (x₀ + h, f(x₀ + h)) approaches the unique tangent line at x₀. The derivative is that tangent’s slope.
Three operational readings of the same number, all worth having ready:
- Slope. The angle the tangent makes with the x-axis. Steep curve → big derivative.
- Sensitivity. “If I move x by a tiny amount, f(x) moves by roughly f’(x₀) · Δx.” This is the framing that matters for gradient descent.
- Linear approximation. Near x₀, f(x) ≈ f(x₀) + f’(x₀)(x − x₀). The right-hand side is a straight line; near x₀ the curve is well-approximated by it. This reading comes back in attention: a statement like “Q is a linear projection of X” describes a map that is exactly linear, so the linear approximation is the entire function rather than a local stand-in.
Toggle to secant, slide h down, and watch the secant line converge to the tangent. That’s the limit, made tactile. Try |x| at x₀ = 0: the limit doesn’t exist, because the slope is −1 from the left and +1 from the right, which is the geometric fact that |x| has no derivative at 0. ReLU has the same kind of kink and the same non-differentiability, and gradient-based training handles it by convention: pick either one-sided derivative as the value at 0, since the inputs where the choice matters form a measure-zero set.
f’(x₀) = lim h→0 (f(x₀+h) − f(x₀)) / h.
Slope: the gradient of the tangent line at x₀ on the curve y = f(x).
Sensitivity: a small change Δx in x near x₀ produces a change of roughly f’(x₀) · Δx in f.
Linear approximation: near x₀, f(x) ≈ f(x₀) + f’(x₀)(x − x₀); the right side is the best straight-line fit through (x₀, f(x₀)) for nearby x.
Numerical differentiation — the U-curve
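If you’re doing math, you compute derivatives analytically: d/dx sin(x) = cos(x) and so on, rule by rule. If you’re a kernel and someone hands you a black-box function, you can also approximate the derivative by evaluating f at two nearby points and taking the slope. Two estimators:
- Forward difference: f’(x) ≈ (f(x + h) − f(x)) / h, with truncation error O(h).
- Central difference: f’(x) ≈ (f(x + h) − f(x − h)) / (2h), with truncation error O(h²).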
The truncation-error orders come from Taylor expansion: forward keeps a first-order error in h, central cancels it and keeps only the second-order term. So in exact arithmetic, shrinking h always improves the estimate.
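For reference, the Taylor-expansion argument can be written out as follows. This is the standard textbook derivation, stated under the assumption that f is smooth enough near x for the expansions to hold:

```latex
\begin{align*}
f(x+h) &= f(x) + h f'(x) + \tfrac{h^2}{2} f''(x) + \tfrac{h^3}{6} f'''(x) + O(h^4) \\
f(x-h) &= f(x) - h f'(x) + \tfrac{h^2}{2} f''(x) - \tfrac{h^3}{6} f'''(x) + O(h^4) \\[4pt]
\frac{f(x+h) - f(x)}{h} &= f'(x) + \tfrac{h}{2} f''(x) + O(h^2) \\
\frac{f(x+h) - f(x-h)}{2h} &= f'(x) + \tfrac{h^2}{6} f'''(x) + O(h^4)
\end{align*}
```

Subtracting the two expansions cancels every even-order term, which is why the central difference loses the O(h) error that the forward difference keeps.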
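In floating-point arithmetic, it doesn’t. As h shrinks, f(x+h) and f(x) become almost equal, and their difference loses bits to catastrophic cancellation. Each rounded function value carries an error on the order of ε · |f(x)|, which survives the subtraction and is then divided by h. So the roundoff component of the error grows as 1/h. The total error is a sum (constant factors of order 1 dropped):
forward: error(h) ≈ h · |f’’(x)| + ε · |f(x)| / h, minimised near h ≈ √ε
central: error(h) ≈ h² · |f’’’(x)| + ε · |f(x)| / h, minimised near h ≈ ε^(1/3)
These minima assume f and its derivatives are of order 1 at x.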
Below those optimal h values, roundoff wins; above them, truncation wins. The classical U-curve.
The kernel sweeps h from 1e-1 down to 1e-15 and prints the error of both estimators. The minima land close to where theory says they should:
best forward error: 1.407e-08 at h ≈ 1e-08 (theory: h ≈ √ε ≈ 1.5e-8)
best central error: 1.114e-11 at h ≈ 1e-05 (theory: h ≈ ε^(1/3) ≈ 6.1e-6)
The “best” choices of h for sin(x) at x = 1 match the analytical predictions to within an order of magnitude — close enough to be confident the model is right.
    #include <math.h>
    #include <stdio.h>

    /* Function under test and its exact derivative. */
    static double f(double x)     { return sin(x); }
    static double dtrue(double x) { return cos(x); }

    int main(void) {
        const double x = 1.0;
        const double exact = dtrue(x);
        printf("numerical differentiation of sin(x) at x = 1\n");
        printf("exact f'(x) = cos(1) = %.15g\n\n", exact);
        printf("%-6s %-18s %-18s\n", "h", "forward error", "central error");

        /* Track the smallest error seen so far and the h that produced it. */
        double best_fwd = 1, best_ctr = 1;
        double best_fwd_h = 0, best_ctr_h = 0;
        for (int e = -1; e >= -15; e--) {
            double h = pow(10.0, (double)e);
            double fwd = (f(x + h) - f(x)) / h;              /* forward difference */
            double ctr = (f(x + h) - f(x - h)) / (2.0 * h);  /* central difference */
            double err_fwd = fabs(fwd - exact);
            double err_ctr = fabs(ctr - exact);
            if (err_fwd < best_fwd) { best_fwd = err_fwd; best_fwd_h = h; }
            if (err_ctr < best_ctr) { best_ctr = err_ctr; best_ctr_h = h; }
            printf("1e%-4d %-18.6e %-18.6e\n", e, err_fwd, err_ctr);
        }
        printf("\nbest forward error: %.3e at h ≈ %.0e (theory: h ≈ sqrt(eps) ≈ 1.5e-8)\n",
               best_fwd, best_fwd_h);
        printf("best central error: %.3e at h ≈ %.0e (theory: h ≈ eps^(1/3) ≈ 6.1e-6)\n",
               best_ctr, best_ctr_h);
        return 0;
    }

The forward-difference minimum at h ≈ √ε ≈ 1.5e-8 is the number production code uses when it has to compute a numerical gradient and there’s no analytical alternative. Nobody picked it by taste; it falls out of IEEE-754 double precision. Three lines of Taylor expansion plus the float64 epsilon (ε ≈ 2.2e-16) and you have it.
Truncation error shrinks as h shrinks — that’s pure math.
But every floating-point subtraction f(x+h) − f(x) loses ~log₂(|f(x)|/|f(x+h) − f(x)|) bits to catastrophic cancellation: when the two numbers are close, their leading bits agree and cancel, leaving the result dominated by the floor of rounding error. That rounding error is ≈ ε·|f(x)|. Dividing by h makes the relative impact of the roundoff component grow as 1/h.
Total error = (truncation) + (roundoff/h). The two terms cross at the U-curve minimum: h ≈ √ε for forward (because truncation is O(h)) and h ≈ ε^(1/3) for central (truncation is O(h²)). This is one of the cleanest “the floating-point reality matters” results in numerics.
Why backprop doesn’t use numerical gradients
A neural network has on the order of 10⁸–10¹² parameters. To compute the gradient of the loss with respect to each one numerically, you would need:
- One extra forward pass per parameter for the forward difference (the baseline f(θ) is shared across parameters), or two for the central difference (f(θ_i + h) and f(θ_i − h)): roughly 5×10⁸ to 10⁹ forward passes per gradient for a 5×10⁸-parameter model.
- To live with the forward-difference error floor of ~10⁻⁸ per parameter, which is fine on its own, while paying around 10⁹× the compute of an ordinary forward pass for every training step.
That’s clearly not how it works. Backprop exploits the chain rule (§4.3) to compute all derivatives in a single backward pass, at a total cost of roughly 2–3× a forward pass, a ratio that does not depend on parameter count. Numerical gradients survive only in two corners:
- Gradient checking. A unit-test trick: pick a random parameter, perturb it, and compare the numerical derivative against the backprop result. It catches bugs in hand-written backward passes and custom autograd code.
- Black-box functions. When the function is genuinely opaque (a call to a simulator, a remote API, a non-differentiable rendering pipeline), numerical gradients or gradient-free methods such as evolutionary search are the only option.
For everything else — i.e., the entire content of Chapters 5–24 — we use the analytical chain rule applied to a known computation graph. The next two sections (Jacobians and the chain rule) are the algebra you need to do that.
Two reasons compound.
(1) Compute cost scales linearly with parameter count. Each numerical partial derivative needs at least one extra function evaluation. For 10⁸ parameters, that’s 10⁸ forward passes per gradient evaluation, which is ~10⁸ times the cost of a single forward pass. The forward pass itself is already expensive.
(2) The forward pass already does the structural work that backprop needs. Computing a scalar loss from 10⁸ parameters means evaluating every intermediate quantity in the graph; backprop’s insight is that the chain rule lets you reuse those intermediates to compute all gradients in a single sweep backwards through the same graph.
The chain rule (§4.3) is what makes backprop possible — it says “if you know the local derivative at each node and you know the gradient flowing in from downstream, you can compute the gradient flowing out to upstream by one multiply.” Repeating that node-by-node walks the gradient from loss back to every parameter in O(forward pass). Without the chain rule there’s no analytic way to back-propagate the loss; with it, there’s no other reasonable way.
END OF CH.4 §1 — Derivatives as sensitivity.
Built: TangentLine viz (drag a point along five functions, toggle tangent vs secant, watch the limit); numdiff.c demonstrates the U-curve in error vs h, with empirical minima matching analytical predictions; three recall items.
Coming next: §4.2 — Partials, the gradient, and the Jacobian.