FLOATING POINT, INTEGERS & QUANTIZATION ERROR
Section 3.1

IEEE-754 refreshed

You have used float for two decades. Probably most of what you remember is that it’s 32 bits, has a sign and an exponent and a fraction, and sometimes does unexpected things with 0.1 + 0.2. That’s enough to write a binary search and not enough to understand why bfloat16 won the training-format wars, or why dot products at scale start drifting from their scalar reference, or why “epsilon” is not a single number. This section re-grounds you in the bit-level reality of a float — because every quantization technique in this book is asking how much you can shrink that representation and still get usable arithmetic.

The format

A float32 is 32 bits arranged this way:

┌─┬────────┬───────────────────────┐
│s│  exp   │       fraction        │
└─┴────────┴───────────────────────┘
 1  8 bits         23 bits

value = (-1)^s × (1 + frac / 2²³) × 2^(exp − 127)

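Three pieces, one formula. Each does specific work:

  1. The sign bit s selects + or −.
  2. The 8-bit exponent is stored with a bias of 127, so the raw field can encode both negative and positive powers of two without a separate sign. Raw values 0 and 255 are reserved.
  3. The 23-bit fraction supplies the digits after an implicit leading 1, which is not stored.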

So a float32 is sign + biased exponent + mantissa, and the value is ±(1 + f) × 2^e. Play with the viz below — type any number, watch the bits, watch the reconstruction. The “gap to next float” line is the one to dwell on.

[FloatBits viz, shown for the input 1.5]
sign · 1 bit: 0 (+)
exponent · 8 bits: 01111111 (127 − 127 = 0)
mantissa (fraction) · 23 bits: 10000000000000000000000 (1.1₂)
value: (−1)^0 × (1 + 0.5) × 2^0 = 1.5
category: normal · hex: 0x3FC00000 · gap to next float: 1.19e-7
The viz also lists float16 and bfloat16 next to float32 (sign, exponent, and fraction bits, range, decimal precision).

Figure: A float32 is one sign bit, eight exponent bits, twenty-three fraction bits; the number reconstructs as (-1)^s × (1 + f/2²³) × 2^(exp − 127). Type any value to see the bits. The "gap to next float" line shows how the spacing between representable floats grows with magnitude.
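
The reconstruction formula can be checked by hand in a few lines of C. The following is a minimal sketch, not the viz's actual code: the names `f32_fields`, `f32_split`, `pow2i`, and `f32_value_normal` are made up for illustration, it assumes `float` is a 32-bit IEEE-754 type, and it only handles normal numbers (raw exponent 1 to 254).

```c
#include <stdint.h>
#include <string.h>

/* The three raw fields of a float32 (hypothetical helper type). */
typedef struct { uint32_t sign, exp, frac; } f32_fields;

/* Split a float32 into sign, biased exponent, and fraction. */
f32_fields f32_split(float v) {
    uint32_t bits;
    memcpy(&bits, &v, sizeof bits);            /* read the 32-bit pattern */
    f32_fields f = { bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu };
    return f;
}

/* 2^e by repeated doubling/halving; exact in double for |e| <= 127. */
double pow2i(int e) {
    double r = 1.0;
    for (; e > 0; e--) r *= 2.0;
    for (; e < 0; e++) r /= 2.0;
    return r;
}

/* Rebuild the value of a NORMAL float32 from its fields:
   (-1)^s * (1 + frac / 2^23) * 2^(exp - 127). */
double f32_value_normal(f32_fields f) {
    double m = 1.0 + (double)f.frac / 8388608.0;   /* 8388608 = 2^23 */
    double v = m * pow2i((int)f.exp - 127);
    return f.sign ? -v : v;
}
```

For 1.5f the split gives sign 0, exponent 127, fraction 0x400000, and the formula returns 1.5, matching the viz readout. Subnormals, infinities, and NaN need the special cases described later in this section.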

Precision is proportional to magnitude

Here is the consequence that catches everyone the first time, and the second time. With 23 bits of mantissa, the gap between two adjacent representable floats near a value v lies between v / 2²⁴ and v / 2²³, roughly v × (0.6 to 1.2) × 10⁻⁷. That gap is not constant; it scales with the magnitude:

near v ≈ 0.001 :  gap ≈ 1.16 × 10⁻¹⁰    (≈ 1.16 × 10⁻⁷ × v)
near v ≈ 1     :  gap ≈ 1.19 × 10⁻⁷     (≈ 1.19 × 10⁻⁷ × v)
near v ≈ 1 000 :  gap ≈ 6.10 × 10⁻⁵     (≈ 6.10 × 10⁻⁸ × v)
near v ≈ 10⁶  :  gap ≈ 6.25 × 10⁻²     (≈ 6.25 × 10⁻⁸ × v)

That’s show_bits.c reading the bits and computing the gap to the next representable float. Two takeaways:

  1. You get about 7 decimal digits of precision anywhere in the representable range. Not 7 digits past the decimal point — 7 digits relative to the magnitude. 10⁶ has the same proportional precision as 10⁻³; they just live in different scales.
  2. Near very large numbers, the gap is huge in absolute terms. At 10⁶, consecutive floats are 6 cents apart. At 10⁹ they’re tens of dollars apart. Banks famously don’t use float for money.
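
The gaps in the table can be reproduced by stepping to the next bit pattern. This is a small sketch under stated assumptions: `next_float_up` and `gap_to_next` are illustrative names, `float` is assumed to be 32-bit IEEE-754, and the increment trick is only valid for positive, finite inputs below the largest float.

```c
#include <stdint.h>
#include <string.h>

/* Next representable float32 above v.
   Assumes v is positive, finite, and below FLT_MAX: for such values,
   adding 1 to the bit pattern moves to the adjacent float. */
float next_float_up(float v) {
    uint32_t bits;
    memcpy(&bits, &v, sizeof bits);
    bits += 1u;
    memcpy(&v, &bits, sizeof v);
    return v;
}

/* Spacing between v and the next float above it. */
float gap_to_next(float v) {
    return next_float_up(v) - v;
}
```

Under those assumptions, `gap_to_next(1.0f)` is 2⁻²³ and `gap_to_next(1e6f)` is 0.0625, the two rows of the table nearest to everyday values. For arbitrary inputs (negative values, zero, infinities), the standard `nextafterf` from `<math.h>` is the portable choice.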

Why this matters for ML — the accumulation problem

A dot product of length 10,000 sums 10,000 float multiply-adds into one accumulator. If each add costs you on the order of 10⁻⁷ in relative precision and the errors add up roughly randomly, the accumulated relative error is on the order of √(10⁴) × 10⁻⁷ = 10⁻⁵. Acceptable. If the same kernel runs in float16 (relative precision ~10⁻³), the same calculation gives √(10⁴) × 10⁻³ = 10⁻¹, a drift of ~10%: sometimes okay, sometimes catastrophic, depending on whether the values being summed cancel. This is why fused-multiply-add matters and why production quantized inference uses int32 accumulators: the accumulator needs more precision than the operands.
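
To see the accumulator argument in isolation, one can run the same products through two accumulators of different width. The sketch below is illustrative only (the function names `dot_acc_f32` and `dot_acc_f64` are assumptions, and it uses float32 operands with a float64 accumulator as a stand-in for the general "wider accumulator" idea).

```c
#include <stddef.h>

/* float32 accumulator: every add rounds to 24 significant bits. */
float dot_acc_f32(const float *a, const float *b, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) acc += a[i] * b[i];
    return acc;
}

/* Same operands, but products and running sum are kept in double,
   so the accumulator has more precision than the operands. */
double dot_acc_f64(const float *a, const float *b, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) acc += (double)a[i] * (double)b[i];
    return acc;
}
```

With 10,000 elements where a[i] = 0.1f and b[i] = 1.0f, the double accumulator reproduces 10000 × (double)0.1f exactly, while the float accumulator's rounding errors pile up but stay within the N × ε bound.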

We’ll re-meet this concretely in §3 when we look at the int8 dot product kernel — its inner FMA is int8 × int8 → int16, then widened to int32 for the accumulator. The widening isn’t paranoia; it’s the same accumulation argument applied to integers.
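
As a preview of that integer version, here is a scalar sketch of the widening pattern just described. It is not the kernel from the later section: `dot_i8` is a hypothetical name, and a real kernel would use SIMD instructions rather than a plain loop.

```c
#include <stddef.h>
#include <stdint.h>

/* int8 x int8 -> int16 product, widened into an int32 accumulator.
   Each product has magnitude at most 128 * 128 = 16384, so it fits in int16.
   The int32 accumulator is safe for roughly 130,000 worst-case products. */
int32_t dot_i8(const int8_t *a, const int8_t *b, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        int16_t prod = (int16_t)(a[i] * b[i]);  /* fits: |prod| <= 16384 */
        acc += prod;                            /* widen before accumulating */
    }
    return acc;
}
```

With 1,000 elements all equal to 127, the result is 16,129,000. An int16 accumulator would already overflow on the third product, which is the integer form of the accumulation argument.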

Float16 and bfloat16 — same bits, different bets

Two 16-bit floats coexist in modern ML. The viz compares them:

format     sign  exponent  fraction  range         decimal precision
float32    1     8         23        ±3.4 × 10³⁸   ~7 digits
float16    1     5         10        ±6.5 × 10⁴    ~3 digits
bfloat16   1     8         7         ±3.4 × 10³⁸   ~2 digits

Both 16-bit formats give up 16 bits relative to float32. The question is which 16 bits.

float16 spends them on precision: it keeps 10 fraction bits but drops the exponent to 5. Result: ~3-decimal-digit precision in a much narrower range. Anything below ≈ 6 × 10⁻⁵ drops into the subnormal range and loses precision, and anything below ≈ 6 × 10⁻⁸ rounds to zero (underflow); anything above ≈ 6.5 × 10⁴ rounds to infinity (overflow). For inference on activations that have already been normalized to roughly unit scale, that’s mostly fine. For training, where gradients can be tiny, it’s a constant fight.

bfloat16 spends them differently: it keeps the full 8-bit float32 exponent and amputates the mantissa to 7 bits. The range is identical to float32; the precision drops to ~2 decimal digits. The bet, which paid off, was that for training large models, range matters more than precision. A gradient that’s 10⁻²⁰ in float32 is still representable in bfloat16 (to about 2 significant digits) but underflows to zero in float16. (Kalamkar et al., “A Study of BFLOAT16 for Deep Learning Training,” 2019.)

This is the “bet on range” architectural choice. By 2025, bfloat16 is the default training precision on TPUs, Hopper/Blackwell GPUs, and Apple Silicon’s matmul accelerators.

The 8-bit exponent shared by float32 and bfloat16 is what makes the casts cheap. Truncating a float32 to bfloat16 is just “throw away the bottom 16 bits of the mantissa” — no exponent recalculation, no overflow checks. Hardware does it for free in the load/store path. Float16 → float32 needs an actual exponent rebias. That cost asymmetry alone is a reason kernels mix float32 storage with bfloat16 compute.
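
The truncating cast described above fits in two short functions. This is a sketch under assumptions, not a production conversion: the names `f32_to_bf16_trunc` and `bf16_to_f32` are hypothetical, a bfloat16 is stored as a raw `uint16_t`, and `float` is assumed to be 32-bit IEEE-754.

```c
#include <stdint.h>
#include <string.h>

/* float32 -> bfloat16 by truncation: keep the top 16 bits
   (sign, 8 exponent bits, top 7 fraction bits), drop the rest. */
uint16_t f32_to_bf16_trunc(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (uint16_t)(bits >> 16);
}

/* bfloat16 -> float32: put the 16 bits back on top and zero-fill
   the low 16 fraction bits. This direction is always exact. */
float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Truncation rounds toward zero: 3.14159f becomes 0x4049, which reads back as 3.140625. Conversions that round to nearest-even instead add a small rounding term before the shift, and a careful version also has to special-case NaNs whose only set fraction bits are in the low 16, since plain truncation would turn those into infinity.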

— think, then check —

1 sign bit + 8 exponent bits + 23 mantissa bits.
value = (−1)^s × (1 + frac / 2²³) × 2^(exp − 127).
The leading 1 in the mantissa is implicit (not stored). The exponent is biased by 127 so the raw 8-bit field can encode both negative and positive powers without a separate sign. Raw exponents 0 and 255 are reserved for subnormals/zero and infinities/NaN respectively.

Subnormals, zero, infinity, NaN

When exp = 0, the format gets weird in useful ways. The subnormal range (exp=0, frac≠0) drops the implicit leading 1, letting numbers smaller than 2⁻¹²⁶ still be representable, though with progressively fewer precision bits as you head toward zero. The alternative (“flush to zero”) creates discontinuities; subnormals smooth them out. Modern CPUs let you toggle “flush-to-zero” (FTZ, applied to results) and “denormals-are-zero” (DAZ, applied to inputs) for kernels that prefer the speed of treating subnormals as zero, and many ML kernels use those modes because the precision cost is tiny and the speed bump on x86 is real.

When exp = 255 (all ones) and frac = 0, you get ±∞. When exp = 255 and frac ≠ 0, you get NaN. NaNs propagate aggressively — any arithmetic op with a NaN input produces NaN. This is why ML training loops obsessively check for NaN: once one appears in the gradient, it spreads through the network in one step.

Now make it run — read the bits

The viz computes the bit decomposition in JavaScript; the same calculation in C using the classical union-pun idiom:

show_bits.c (key) C · IEEE-754 via type-punning
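#include <stdint.h>
#include <stdio.h>

/* Union pun: store the float, read the same 32 bits back as an integer. */
typedef union { float f; uint32_t u; } u32f32;

static void show(float v) {
    u32f32 x = { .f = v };
    uint32_t bits = x.u;
    unsigned sign = (bits >> 31) & 1;       /* bit 31      */
    unsigned exp  = (bits >> 23) & 0xFF;    /* bits 30..23 */
    unsigned frac =  bits        & 0x7FFFFFu; /* bits 22..0 */

    /* Render both fields as '0'/'1' strings, most significant bit first. */
    char expbits[9], fracbits[24];
    for (int i = 7; i >= 0; i--)  expbits [7 - i]  = ((exp  >> i) & 1) ? '1' : '0';
    for (int i = 22; i >= 0; i--) fracbits[22 - i] = ((frac >> i) & 1) ? '1' : '0';
    expbits[8] = '\0';
    fracbits[23] = '\0';

    /* Classify by the reserved exponent values 0 and 255. */
    const char* cat = "normal";
    if (exp == 0xFF && frac == 0) cat = (sign ? "-inf" : "+inf");
    else if (exp == 0xFF)         cat = "NaN";
    else if (exp == 0 && frac == 0) cat = (sign ? "-0" : "+0");
    else if (exp == 0)            cat = "subnormal";

    printf("%12g  →  s=%u  exp=%s  frac=%s   [%s]\n",
           (double)v, sign, expbits, fracbits, cat);
}

/* Example driver; these inputs are illustrative, not from the original listing. */
int main(void) {
    show(1.5f);
    show(0.1f);
    show(1e6f);
    return 0;
}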

Read the output side-by-side with the viz — the bit patterns match. You’re now seeing the same number two ways: as a float value and as a 32-bit pattern. Every quantization scheme in this book asks the question “can I get away with fewer bits than this?” — and to answer it you need this picture in your head.

— think, then check —

Each add introduces a rounding error of about 10⁻⁷ × value, i.e. ~10⁻⁷ in relative terms. Errors from independent rounding accumulate roughly as a random walk: total error grows like √N × per-step error.
√(10⁸) × 10⁻⁷ ≈ 10⁻³ — about 0.1% accumulated relative error.
That’s the typical case. The worst case (all errors in the same direction) is closer to N × ε ≈ 10⁸ × 10⁻⁷ = 10, a bound so loose that no digit of the result is guaranteed. Whether your kernel hits the typical or the worst case depends on whether the values are correlated. Sum 10⁸ values that all round the same way and you’re in the worst case. This is why production reductions use compensated summation (Kahan) for high-accuracy work, or simply use a higher-precision accumulator.
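
Compensated summation is short enough to show in full. The following is a textbook-style sketch of Kahan's algorithm next to a naive loop (the names `sum_naive` and `sum_kahan` are illustrative). It assumes the compiler evaluates float expressions in float precision and does not reassociate them, so it should not be built with fast-math style flags.

```c
#include <stddef.h>

/* Plain float32 accumulation: low-order bits of small terms are lost. */
float sum_naive(const float *x, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) s += x[i];
    return s;
}

/* Kahan summation: c carries the rounding error of the previous add
   and feeds it back into the next term. */
float sum_kahan(const float *x, size_t n) {
    float s = 0.0f, c = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float y = x[i] - c;    /* next term, corrected by the stored error */
        float t = s + y;       /* rounded add */
        c = (t - s) - y;       /* what the add actually lost (or gained)   */
        s = t;
    }
    return s;
}
```

A small worst-case input makes the difference visible: summing {16777216, 1, 1, 1, 1} naively stays stuck at 16777216, because 2²⁴ + 1 is not representable and each add rounds back down, while the compensated loop returns the exact 16777220.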

— think, then check —

Two reasons, both rooted in dynamic range.
(1) Gradient magnitudes span enormous scales during deep-network training, from ~10⁻¹⁰ (vanishing) to ~10⁺¹⁰ (exploding). Float16’s normal range only covers ~10⁻⁵ to ~10⁺⁵ (subnormals stretch the bottom to ~10⁻⁸ at reduced precision), so much of the practical gradient distribution either underflows to zero or overflows to ∞. Bfloat16 keeps float32’s full range, so the gradient values that survive in float32 also survive in bfloat16.
(2) Casts to/from float32 are nearly free for bfloat16 (same exponent layout, just truncate the bottom 16 mantissa bits) but require exponent rebiasing for float16. Hardware kernels that mix storage (bfloat16) with accumulation (float32) get this transparently.
Net: bfloat16 trades ~1 decimal digit of precision for matching float32’s range and free conversion. For training, that’s an obviously good trade. For pure inference of well-normalized activations, float16 sometimes still wins on precision.

END OF CH.3 §1 — IEEE-754 refreshed.
Built: FloatBits viz (any number → its IEEE-754 bits, with format comparison); show_bits.c demonstrates the bit decomposition and the gap-grows-with-magnitude fact in C; three recall items spanning anatomy, accumulation error, and the float16-vs-bfloat16 design choice.
Coming next: §3.2 — Integers and fixed-point. Uniform precision, no dynamic range, and a much faster machine path.