Quantization & the int8 dot product

Section 3.3

Quantization & the int8 dot product

§2 had you store one real as scale × integer. That’s fixed-point quantization in its purest form, and it works fine when your data is centered at zero. Real ML tensors aren’t: ReLU outputs are non-negative; attention scores are mostly small with rare large peaks; biases are wherever they are. Affine quantization adds one more parameter — a zero-point — and lets the int8 grid recenter wherever the data wants. That generalisation is the substrate of every modern quantized inference stack. Then we make it run: the int8 dot product kernel that lives inside vLLM, MLX, Qdrant’s TurboQuant, and everything else that compresses dot products. On modern hardware it’s one fused instruction. Worth knowing exactly which one, and why.

Affine quantization, in one formula

real_value ≈ scale × ( quantized_value − zero_point ) scale : positive float, fixed per tensor (or per channel) zero_point : integer in [int8 range], fixed per tensor — recenters the grid

Three operational notes:

Symmetric quantization sets zero_point = 0. Fine for data with roughly zero mean (weights, normalized activations). Drops one parameter to remember per tensor. Most weight quantization in production uses this.
Asymmetric quantization uses a nonzero zero_point. Required when the data isn’t centered — ReLU outputs are in [0, +\infty), so you’d waste half the int8 range on values that can’t occur. Asymmetric quantization shifts the int8 grid so 0 of int8 maps to whatever the data’s actual minimum is.
Per-tensor vs per-channel. One scale/zero-point pair for the whole tensor (cheap, lossy for tensors whose channels span different ranges) versus one pair per row of a weight matrix (more storage — usually one float per row — much less lossy). Production transformers nearly always use per-channel for weights and per-tensor for activations.

Calibration is the whole game

The viz makes it visceral. Slide the signal amplitude and the quant range. Two failure modes:

signal amplitude ±0.80 quant range ± 1.00

scale 0.00787 max abs err 0.0039 rms err 0.0021 clipped 0 / 80

Well-matched — quant range close to signal amplitude, error is just the half-step rounding.

The classic "calibration is the whole game" picture. Choose scale to match the signal's actual range. Too small → clipping at peaks (irreversible info loss). Too large → wasted resolution (every value uses fewer effective bits than int8 allows). The art of quantization is picking the range that minimises both at once.

Range too narrow → clipping. The peaks of the signal exceed ±127 × scale and saturate. Information is lost irreversibly — there’s no calibration trick at inference time that can recover what saturated.
Range too wide → wasted resolution. The 256 int8 values cover a larger range than the signal occupies, so the gap between adjacent representable values is bigger than it needs to be. Every value carries more rounding error than the bit budget required.

The right answer is to match the range to the data — and “the data” usually means “the empirical 99th-percentile of activations on a calibration set.” Production quantizers run a few hundred batches through the model in float32, record per-tensor (or per-channel) value distributions, and pick scale/zero-point to minimize a calibration loss. The calibration tools are torch.ao.quantization, NVIDIA’s TensorRT calibrator, ONNX Runtime’s QDQ tools — they all solve the same one-dimensional optimisation problem this viz visualises.

— think, then check —

real ≈ scale × (quantized − zero_point).

scale: a positive float that sets the size of one int8 step in real units.
zero_point: an integer that says which int8 value maps to real-valued zero.

The zero-point matters whenever the data isn’t centered around zero. For ReLU outputs in [0, M], you’d set zero_point = -128 so the int8 range [-128, 127] maps to real range [0, M]. For symmetric data (weights, normalized activations) zero_point = 0 and the formula reduces to plain fixed-point: real ≈ scale × quantized.

↳ §3.3 affine formula

The accumulator width — again

Now build the dot product. Each int8 × int8 product is at most 127² ≈ 16,000; signed, the range is [-128·127, 128·128] ≈ [-16,256, +16,384]. That’s 16 bits of range per product. Sum a thousand of them and the running total can hit ±16,000,000 — comfortably inside int32’s ±2 × 10⁹ range, far outside int16’s ±32,000.

So every int8 dot product kernel needs to accumulate in int32. The hardware confirms this — the fused intrinsics we’ll meet next all have int32 accumulators and int8 operands. The width gap is the entire reason these instructions exist as a separate ISA family.

The int8 dot product kernel

Two hardware paths, two eras. The pre-VNNI one (maddubs + madd + add) needs three instructions per 32 byte-products and is the fallback on older x86. The modern one (VPDPBUSD on x86, SDOT on ARM) collapses the same work into one instruction.

Path 1 — AVX2 with `_mm_maddubs_epi16`

The classical Intel intrinsic for int8 × int8. Its quirk: one operand must be unsigned (u8), the other signed (s8). That dates back to SSSE3 and an encoding-space accident; production code works around it by shifting one vector into u8 (add 128) and subtracting the bias afterwards.

decoding · _mm256_maddubs_epi16

_mm256

256-bit AVX register

maddubs

multiply-add unsigned-byte × signed-byte

epi16

output is packed s16 lanes

Reads as: “on a 256-bit register, multiply-add pairs of u8×s8, output s16 lanes.” It does 32 byte multiplies, then horizontally adds adjacent pairs, producing 16 s16 lanes — each containing a[2k]·b[2k] + a[2k+1]·b[2k+1]. Already partially reduced.

The full chain widens to int32 with a second instruction:

dot_i8_avx2 (inner) AVX2 + maddubs chain

/* Multiply 32 u8 × 32 s8 → 16 s16 partial products with horizontal pairing,
 * widen to s32, accumulate. */
int dot_i8_avx2(const unsigned char* a, const signed char* b, int n) {
    __m256i acc = _mm256_setzero_si256();
    int i = 0;
    for (; i + 32 <= n; i += 32) {
        __m256i va = _mm256_loadu_si256((const __m256i*)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i*)(b + i));
        /* maddubs: (u8 × s8) → s16, then add adjacent pairs → s16 lanes.
         * Result is 16 int16 values, each = a[2k]*b[2k] + a[2k+1]*b[2k+1]. */
        __m256i s16 = _mm256_maddubs_epi16(va, vb);
        /* madd: (s16 × s16) → s32, then add adjacent pairs → s32 lanes.
         * Using the all-ones vector for one operand turns this into a
         * widening-add-pairs over the 16 int16s, giving 8 int32 partial sums. */
        __m256i ones = _mm256_set1_epi16(1);
        __m256i s32 = _mm256_madd_epi16(s16, ones);

Three operations per 32-byte stride: maddubs, madd_epi16 against an all-ones vector (which just widening-adds adjacent s16 pairs to s32), and an add_epi32 into the accumulator. Tolerable. Still ~3× the throughput of a float32 FMA loop because the lane width is 4×.

Path 2 — VPDPBUSD (AVX-VNNI, 2018+)

Intel added a single instruction that does the whole chain: VPDPBUSD modern intrinsic AVX-VNNI: Vector Packed Dot Product of Bytes (Unsigned × Signed) with Dword accumulator. One instruction reads two 32-byte vectors of int8 operands and a 32-byte accumulator of int32 lanes; writes acc += sum-of-four-byte-products, eight times per lane. Available on Ice Lake / Tiger Lake / Sapphire Rapids / Zen 4+. Introduced in Cascade Lake (server, 2019) as 'AVX-512 VNNI,' then ported to client CPUs as 'AVX-VNNI' in Ice Lake (2019). One instruction does what the 2013-era maddubs+madd+add chain did in three. ML inference stacks made this a hard ISA requirement around 2021. . One fused instruction per 32 byte-products, accumulator stays in the register across iterations:

decoding · _mm256_dpbusd_avx_epi32

_mm256

256-bit AVX register

dpbusd

dot product, bytes (unsigned × signed), dword accumulator

avx

AVX-VNNI encoding (vs AVX-512 VNNI)

epi32

accumulator is s32 lanes

acc = dpbusd(acc, a, b) means acc[i] += a[4i]·b[4i] + a[4i+1]·b[4i+1] + a[4i+2]·b[4i+2] + a[4i+3]·b[4i+3] for each of the 8 s32 lanes. One instruction, 32 byte-products, 8 accumulator updates.

dot_i8_vnni (inner) AVX-VNNI · the modern path


/* ---- AVX-VNNI path (one fused instruction per 32 byte products) ---------- */
/* VPDPBUSD does u8 × s8 → s32 dot-of-4, fused with an accumulator update,
 * in a single instruction. Available on Ice Lake, Tiger Lake, Sapphire Rapids,
 * and Zen 4+. The kernel collapses to: load, load, dpbusd, repeat. */
#if defined(__AVXVNNI__)
int dot_i8_vnni(const unsigned char* a, const signed char* b, int n) {
    __m256i acc = _mm256_setzero_si256();
    int i = 0;
    for (; i + 32 <= n; i += 32) {
        __m256i va = _mm256_loadu_si256((const __m256i*)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i*)(b + i));

Same kernel structure, half the instructions per inner step, no widening dance.

Path 3 — NEON SDOT

ARM’s equivalent, present on Apple Silicon and every ARMv8.2-A core (essentially everything shipping since 2018): SDOT (vdotq_s32):

dot_i8_neon NEON SDOT — one fused instruction

int dot_i8_neon(const signed char* a, const signed char* b, int n) {
    int32x4_t acc = vdupq_n_s32(0);
    int i = 0;
    for (; i + 16 <= n; i += 16) {
        int8x16_t va = vld1q_s8(a + i);
        int8x16_t vb = vld1q_s8(b + i);
        /* SDOT: 16 s8×s8 → 4 s32 (groups of 4 dotted, then accumulated). */
        acc = vdotq_s32(acc, va, vb);
    }
    int result = vaddvq_s32(acc);            /* 4 s32 lanes → 1 scalar */
    for (; i < n; i++) result += (int)a[i] * (int)b[i];
    return result;
}

The kernel is unrecognisably short compared to the AVX2 chain — load, load, sdot, repeat. Same dot4-then-accumulate semantics as VPDPBUSD, signed-signed instead of unsigned-signed.

Now make it run

The test (test_int8.c) generates a 1024-long pair of float vectors, quantizes them, runs the SIMD int8 dot product, and compares to the float32 reference. On Apple Silicon NEON:

int8 dot product: N = 1024
  scale_a = 0.00629921   scale_b = 0.00551181
  float32 reference  = -173.996412
  dot_i8_neon        = -174.029798   abs err = 3.34e-02
  -> single-byte storage at this scale stays within ~1% of the float reference

Relative error is 3.34 \times 10⁻² / 173.99 ≈ 0.02%. At one byte per value — 1/4 the storage of float32 — the dot product agrees with the float reference to four significant figures. That’s the headline number for quantized vector search and quantized weight matmul both: you pay 1/4 the bytes and lose a fraction of a percent of accuracy. For ranking applications (ANN, attention scoring) the loss is invisible at the task level.

This kernel is Qdrant’s hot path. The TurboQuant scoring loop in lib/quantization/src/turboquant/ is structurally the kernel above, with one extra trick on top: the rotated vectors (Ch.2 §3) have equalised per-coordinate variance, so a single scale works for all of them, and the per-vector ⟨q,v⟩ drops out as one VPDPBUSD or SDOT call per database vector. That’s the FlashAttention insight applied to vector search — and it’s why HNSW with TurboQuant is roughly an order of magnitude faster than HNSW with float32 distances on the same hardware. We close this loop in Ch.25.

— think, then check —

Each int8 × int8 product is in [-16384, +16384], roughly 16 bits. Sum 1024 of them — the dot-product length common in attention head dimensions and embedding spaces — and the worst-case sum can reach ±1024 × 16384 ≈ ±1.7 × 10⁷.

int16 range is ±32K, far too small — the accumulator would overflow long before the dot product completed. int32 range is ±2 × 10⁹, comfortably larger than the worst case. The hardware confirms this design: VPDPBUSD and SDOT both have int8 operands and int32 accumulators by spec. The width gap is the structural reason the dpbusd / sdot instructions exist as a separate ISA family.

↳ §3.3 accumulator argument · §3.1 accumulation

— think, then check —

x86 inherited the u8 × s8 convention from SSSE3’s maddubs, where Intel used spare encoding-space cleverness to fit a “byte multiply-add” into one instruction. The unsigned × signed pairing wasn’t a math choice; it was a compatibility choice. VNNI’s VPDPBUSD kept it for backwards compatibility with the older chain.

ARM’s SDOT, designed from scratch in 2018, picked the cleaner s8 × s8 pairing.

Production cross-platform kernels (Qdrant, ggml, MLX, ONNX Runtime) handle the mismatch by shifting one operand to unsigned on x86: store v as (signed) and apply the bias correction at calibration time, so the dot-product math comes out the same. Storage layout stays signed for portability; only the kernel-internal representation differs. (This is the same trick this section’s test_int8.c uses.) The cost is one extra addition per vector during calibration, amortised over millions of queries — invisible.

↳ §3.3 cross-ISA

END OF CH.3 — Floating point, integers, and quantization error.

§1 (float anatomy) · §2 (integers, fixed-point, uniform precision) · §3 (affine quantization + the int8 dot kernel).

All three sections compile and run; the int8 dot product test confirms 0.02% relative error against the float reference at one byte per value. Nine recall items chain back to Ch.1’s dot product and Ch.2’s matmul and forward to Ch.13 (FlashAttention precision discussion) and Ch.25 (TurboQuant scoring). Coming next: Ch.4 — Calculus and gradients refreshed.