Train a tiny MLP end-to-end with AdamW

Section 10.3

Train a tiny MLP end-to-end with AdamW

Everything in Part I, Part II, and Part III’s first wave has been pointing here. Ch.4 §3 gave us the chain rule. Ch.8 §3 gave us AdamW. Ch.9 gave us a 250-line vector autograd library. Ch.10 §1 said an MLP is linear layers stacked with nonlinearities. Ch.10 §2 said ReLU is the right nonlinearity. This section glues them together. About 50 additional lines of C — one new file for AdamW state, one for the dataset and training loop — and we train a 2-layer MLP on the two-moons dataset to 100% accuracy in 500 epochs. The whole pipeline runs in 300 ms wall-clock. No PyTorch, no Python, no GPU; just ~300 lines of C built up across two chapters.

The AdamW optimiser

§9.3’s SGD step was four lines: θ ← θ − \eta · θ.grad for each parameter. AdamW (from Ch.8 §3) needs per-parameter state — first moment m, second moment s, and a global step counter t:

adamw.h C · AdamW interface

#include "../ch09/autograd.h"

typedef struct {
    float* m;       /* first moment, same shape as the param */
    float* s;       /* second moment, same shape */
    long t;         /* global step count (shared across all params) */
} AdamWState;

typedef struct {
    Tensor** params;
    AdamWState* states;     /* one per param */
    int n_params;
    float lr;
    float beta1, beta2, eps, weight_decay;
    long t;                 /* current step */
} AdamW;

void adamw_init(AdamW* o, Tensor** params, int n_params,
                float lr, float beta1, float beta2,
                float eps, float weight_decay);

The step function — the most-implemented optimiser in the world, in 15 lines of C:

adamw.c — adamw_step() C · the AdamW step

    o->eps = eps;
    o->weight_decay = weight_decay;
    o->t = 0;
    o->states = calloc((size_t)n_params, sizeof(AdamWState));
    for (int i = 0; i < n_params; i++) {
        size_t n = (size_t)params[i]->rows * params[i]->cols;
        o->states[i].m = calloc(n, sizeof(float));
        o->states[i].s = calloc(n, sizeof(float));
    }
}

void adamw_step(AdamW* o) {
    o->t += 1;
    float bias1 = 1.0f - powf(o->beta1, (float)o->t);
    float bias2 = 1.0f - powf(o->beta2, (float)o->t);

    for (int p = 0; p < o->n_params; p++) {
        Tensor* t = o->params[p];
        if (!t->grad) continue;
        float* m = o->states[p].m;
        float* s = o->states[p].s;
        size_t n = (size_t)t->rows * t->cols;

        for (size_t i = 0; i < n; i++) {
            float g = t->grad[i];
            m[i] = o->beta1 * m[i] + (1 - o->beta1) * g;

The structure mirrors Ch.8 §3’s math exactly. EMA of gradient → EMA of squared gradient → bias correction → Adam update → decoupled weight decay. The decoupled weight decay step (the last t->data[i] -= …) is what makes this AdamW and not vanilla Adam — applied separately from the gradient pipeline so it gives uniform shrinkage regardless of per-coordinate gradient magnitudes.

The two-moons dataset

A classic non-linearly-separable 2D classification problem: two interleaved half-moons, one labeled 0 and one labeled 1, with Gaussian noise. Visualised in §10.1’s decision-boundary viz. The full dataset construction:

train_moons.c — make_moons() C · synthesise two interleaved moons

static void make_moons(float* X, int* Y, float noise) {
    for (int i = 0; i < N / 2; i++) {
        float t = (float)M_PI * urand();
        X[i * D + 0] = cosf(t) + noise * normalf();
        X[i * D + 1] = sinf(t) + noise * normalf();
        Y[i] = 0;
    }
    for (int i = N / 2; i < N; i++) {
        float t = (float)M_PI * urand();
        X[i * D + 0] = 1.0f - cosf(t) + noise * normalf();
        X[i * D + 1] = 0.5f - sinf(t) + noise * normalf();
        Y[i] = 1;
    }
}

static int eval_accuracy(Tensor* X_t, const int* Y,

400 points (200 per class). Noise level 0.1. The point clouds visibly interleave; no straight line separates them.

The training loop

This is where everything from the previous chapters fits in one place. Forward pass uses the autograd ops from Ch.9; backward triggers the tape-walking from Ch.9 §2; AdamW step uses the per-parameter state we just defined. The whole loop:

train_moons.c — training loop C · forward, backward, AdamW step, log

    int epochs[] = { 0, 5, 20, 50, 100, 200, 500 };
    int next_log = 0;

    for (int epoch = 0; epoch <= 500; epoch++) {
        tape_reset();
        for (int i = 0; i < 4; i++) tensor_zero_grad(params[i]);

        /* Full-batch forward */
        Tensor* h1_pre = op_matmul(X_t, W1);
        Tensor* h1_bias = op_add_bias(h1_pre, b1);
        Tensor* h1 = op_relu(h1_bias);
        Tensor* z_pre = op_matmul(h1, W2);
        Tensor* z = op_add_bias(z_pre, b2);
        Tensor* loss = op_softmax_ce(z, Y);

        backward(loss);
        adamw_step(&opt);

        if (next_log < (int)(sizeof(epochs)/sizeof(epochs[0])) &&
            epoch == epochs[next_log]) {
            int correct = eval_accuracy(X_t, Y, W1, b1, W2, b2);
            printf("%-8d %-12.6f %-14d/%d (%.1f%%)\n",
                   epoch, loss->data[0], correct, N, 100.0 * correct / N);
            next_log += 1;
        }

Each epoch:

Reset the tape and zero gradients. Required before each forward pass — the previous step’s tape entries would otherwise pollute the new backward.
Forward pass — 5 autograd ops (matmul + add_bias + relu + matmul + add_bias + softmax_ce). Each appends to the tape.
Backward pass — backward(loss) walks the tape in reverse, applying §9.1’s VJPs at every node, accumulating gradients into the parameter tensors’ .grad slots.
AdamW step — updates each parameter using its m, s, and the current step count t.
Periodically log — re-run forward without backward, compute accuracy from the argmax of the output logits.

The result

Training 2-layer MLP (2 → 32 → 2) with AdamW on two moons
N = 400, lr = 0.010, weight_decay = 0.010

epoch    loss         accuracy
0        0.824733     234/400 (58.5%)
5        0.356073     341/400 (85.2%)
20       0.236873     354/400 (88.5%)
50       0.170579     371/400 (92.8%)
100      0.105301     386/400 (96.5%)
200      0.022406     399/400 (99.8%)
500      0.004384     400/400 (100.0%)

By epoch 100 the model has 96.5% accuracy; by 200 it’s missing one point out of 400; by 500 it’s perfect. The training loss decreased monotonically from 0.82 (random) to 0.004 (essentially saturated). The pipeline works end-to-end. A 2-layer MLP with 162 parameters (2·32 + 32 + 32·2 + 2), trained for 500 full-batch AdamW steps, hits 100% on a problem a single perceptron mathematically cannot solve.

This is the full deep-learning loop, on real silicon, in C. No Python interpreter, no autograd framework, no GPU. The whole training run completes in ~300 ms wall-clock on the Apple Silicon host. We’ve reconstructed the entire stack from matrices are functions (Ch.2 §1) through backprop is reverse-mode autodiff (Ch.4 §3 + Ch.9) through Adam with decoupled weight decay (Ch.8 §3) through ReLU networks can approximate any continuous function (Ch.10 §§1-2). The same five steps — forward, backward, optimizer step, dataset, loss — scaled up roughly 10⁹× would train a frontier LLM. The conceptual gap from here to GPT-5 is small; the engineering gap is enormous, and Parts IV-V are about closing it.

— think, then check —

(1) tape_reset() and zero gradients. Each forward pass appends to a global tape; we have to clear it (and the per-parameter .grad buffers) before each new step or we’d accumulate state from previous iterations.

(2) Forward pass. Run the model — matmul, add_bias, relu, matmul, add_bias, softmax_ce. Each call appends an op to the tape and produces an output tensor.

(3) backward(loss). Seed gradient = 1 at the loss, walk the tape in reverse, apply each op’s VJP, accumulate gradients into the input tensors’ .grad slots.

(4) adamw_step(). Update each parameter using its gradient and its AdamW state (m, s). Apply decoupled weight decay as a separate shrinkage step.

(5) Log / evaluate. Periodically compute accuracy by re-running the forward pass without backward.

This loop is structurally identical to PyTorch’s model.train() idiom: opt.zero_grad(); loss = model(x); loss.backward(); opt.step(). The Python framework adds dispatch, GPU kernels, distributed gradient sync — but the inner loop is exactly the five steps above.

↳ §10.3 training loop

Why AdamW outperforms SGD here

We could have used plain SGD (as in §9.3’s XOR example). Why AdamW?

Two reasons specific to this problem and to MLPs in general:

Per-coordinate adaptive learning rates. Different weights see gradients of very different magnitudes — the first-layer weights see gradient signal modulated by the small 2D input range; the second-layer weights see whatever the hidden layer activations are. A single SGD learning rate has to compromise. AdamW gives each coordinate its own effective learning rate, scaled by its observed gradient history.
Momentum stabilises the trajectory. The full-batch gradient on a small dataset can have direction-flipping noise from the ReLU’s piecewise structure. Momentum’s exponential moving average smooths this — the network “remembers” the direction it was heading and resists getting jerked by a single bad batch.

For a 2-layer MLP on a clean dataset, both SGD and AdamW would converge — but AdamW reaches 99%+ accuracy in ~100 epochs whereas plain SGD on the same problem typically needs 500+. The compute trade is worth it.

— think, then check —

Three things changed:

(1) Dataset size. XOR has 4 points; two-moons has 400. With more diverse gradient signal per batch, momentum (Adam’s first moment) starts to pay off because there’s a meaningful direction to average across.

(2) Hidden width. XOR had 8 hidden units; two-moons has 32. More parameters means more coordinate-specific gradient magnitudes; per-coordinate adaptive learning rates (Adam’s second moment) help match the right step size to each parameter.

(3) Decision boundary complexity. XOR’s boundary is essentially ‘avoid the two corners,’ which any solver can find. Two-moons needs a curving multi-segment boundary that exact gradient signal can navigate but noisy SGD might oscillate around.

For 2-layer toy problems, both SGD and AdamW converge — AdamW just gets there faster (~100 epochs vs ~500). For frontier models, AdamW is the only option that converges reliably at all; plain SGD on a 70B-parameter transformer would never train. The ‘when to use which’ rule is: AdamW always, except in research where you have a specific reason to compare. Cost: ~3× the optimiser memory (m + s per parameter) — that’s the trade Ch.23 partitions across GPUs in modern distributed training.

↳ §10.3 + Ch.8 §3

What the MLP actually learned

The decision-boundary viz from §10.1 shows the trained model’s output across the input space. At 100% accuracy on the training data, the MLP has learned some curving boundary that separates the moons. But it’s not unique — many curves separate the same training points. Which curve did the MLP find?

For our 2-layer ReLU MLP, the answer is: a piecewise-linear function whose “kink lines” are determined by where each ReLU unit’s pre-activation crosses zero. With 32 hidden units, there are up to 32 such kink lines in the 2D input plane. The model’s decision boundary follows segments of these kinks where the post-layer-2 logit crosses zero.

So the model’s decision boundary is a piecewise-linear curve with at most 32 segments — and the optimisation found a configuration of those segments that separates the training data. (Pascanu, Mikolov & Bengio 2014, “On the number of linear regions of deep neural networks” — formalises this for general ReLU networks.)

— think, then check —

Each ReLU unit’s pre-activation is a linear function of the input; it crosses zero along a single line in 2D. Two lines divide the plane into at most 4 regions; three into at most 7; H lines into at most H(H+1)/2 + 1 regions (the ‘arrangement of lines’ result from combinatorial geometry).

So a 2-layer ReLU MLP with H = 32 hidden units defines at most 32·33/2 + 1 = 529 distinct linear regions in 2D. Within each region, the network’s output is a fixed linear function; at region boundaries, the function bends.

Connection to UA: the universal approximation theorem says enough hidden units can approximate any continuous function on a compact domain to arbitrary precision. The mechanism IS this piecewise-linear structure — as H grows, the regions become finer, and a piecewise-linear function with arbitrarily many small pieces can approximate any smooth target. So the UA theorem’s existence claim is constructive: it’s saying “fine-grained piecewise-linear approximation can match any continuous function.” The empirical observation: ReLU MLPs do exactly this in practice.

Deeper networks compose these regions multiplicatively — a 3-layer ReLU MLP can have exponentially more regions per parameter than a 2-layer one. That’s the formal reason depth helps: same parameter budget, exponentially more expressive piecewise-linear partitions of the input space. Telgarsky 2016 proved depth-separation results making this precise.

↳ §10.3 + Ch.10 §1 UA theorem

END OF CH.10 — Neurons, layers, MLPs.
END OF PART III’S FIRST WAVE.

§1 (perceptron → MLP, universal approximation) · §2 (activation functions, sigmoid → SwiGLU) · §3 (train an MLP end-to-end with AdamW on two moons).

The deep-learning pipeline is now assembled and running. train_moons.c + adamw.h / adamw.c + Ch.9’s autograd.h / autograd.c = ~350 lines of C that learn a non-linearly-separable 2D classification problem to 100% accuracy. The conceptual stack is complete; everything later (Ch.11 tokens & embeddings, Ch.12 softmax, Ch.13 attention + FlashAttention, Ch.14 normalisation, Ch.15 the GPT architecture) is about scaling this same pipeline up by adding architecture pieces and engineering.

Coming next: Ch.11 — Tokens & embeddings. How text becomes vectors. BPE tokenisation, embedding tables, and positional encoding → RoPE.