Quantization & the int8 dot product
§2 had you store one real as scale × integer. That’s fixed-point quantization in its purest form, and it works fine when your data is centered at zero. Real ML tensors aren’t: ReLU outputs are non-negative; attention scores are mostly small with rare large peaks; biases are wherever they are. Affine quantization adds one more parameter — a zero-point — and lets the int8 grid recenter wherever the data wants. That generalisation is the substrate of every modern quantized inference stack. Then we make it run: the int8 dot product kernel that lives inside vLLM, MLX, Qdrant’s TurboQuant, and everything else that compresses dot products. On modern hardware it’s one fused instruction. Worth knowing exactly which one, and why.
Affine quantization, in one formula
Three operational notes:
- Symmetric quantization sets zero_point = 0. Fine for data with roughly zero mean (weights, normalized activations). Drops one parameter to remember per tensor. Most weight quantization in production uses this.
- Asymmetric quantization uses a nonzero zero_point. Required when the data isn’t centered — ReLU outputs are in [0, +\infty), so you’d waste half the int8 range on values that can’t occur. Asymmetric quantization shifts the int8 grid so 0 of int8 maps to whatever the data’s actual minimum is.
- Per-tensor vs per-channel. One scale/zero-point pair for the whole tensor (cheap, lossy for tensors whose channels span different ranges) versus one pair per row of a weight matrix (more storage — usually one float per row — much less lossy). Production transformers nearly always use per-channel for weights and per-tensor for activations.
Calibration is the whole game
The viz makes it visceral. Slide the signal amplitude and the quant range. Two failure modes:
scale to match
the signal's actual range. Too small → clipping at peaks (irreversible info loss). Too large →
wasted resolution (every value uses fewer effective bits than int8 allows). The art of
quantization is picking the range that minimises both at once.- Range too narrow → clipping. The peaks of the signal exceed ±127 × scale and saturate. Information is lost irreversibly — there’s no calibration trick at inference time that can recover what saturated.
- Range too wide → wasted resolution. The 256 int8 values cover a larger range than the signal occupies, so the gap between adjacent representable values is bigger than it needs to be. Every value carries more rounding error than the bit budget required.
The right answer is to match the range to the data — and “the data” usually means “the empirical 99th-percentile of activations on a calibration set.” Production quantizers run a few hundred batches through the model in float32, record per-tensor (or per-channel) value distributions, and pick scale/zero-point to minimize a calibration loss. The calibration tools are torch.ao.quantization, NVIDIA’s TensorRT calibrator, ONNX Runtime’s QDQ tools — they all solve the same one-dimensional optimisation problem this viz visualises.
real ≈ scale × (quantized − zero_point).
scale: a positive float that sets the size of one int8 step in real units.
zero_point: an integer that says which int8 value maps to real-valued zero.
The zero-point matters whenever the data isn’t centered around zero. For ReLU outputs in [0, M], you’d set zero_point = -128 so the int8 range [-128, 127] maps to real range [0, M]. For symmetric data (weights, normalized activations) zero_point = 0 and the formula reduces to plain fixed-point: real ≈ scale × quantized.
The accumulator width — again
Now build the dot product. Each int8 × int8 product is at most 127² ≈ 16,000; signed, the range is [-128·127, 128·128] ≈ [-16,256, +16,384]. That’s 16 bits of range per product. Sum a thousand of them and the running total can hit ±16,000,000 — comfortably inside int32’s ±2 × 10⁹ range, far outside int16’s ±32,000.
So every int8 dot product kernel needs to accumulate in int32. The hardware confirms this — the fused intrinsics we’ll meet next all have int32 accumulators and int8 operands. The width gap is the entire reason these instructions exist as a separate ISA family.
The int8 dot product kernel
Two hardware paths, two eras. The pre-VNNI one (maddubs + madd + add) needs three instructions per 32 byte-products and is the fallback on older x86. The modern one (VPDPBUSD on x86, SDOT on ARM) collapses the same work into one instruction.
Path 1 — AVX2 with _mm_maddubs_epi16
The classical Intel intrinsic for int8 × int8. Its quirk: one operand must be unsigned (u8), the other signed (s8). That dates back to SSSE3 and an encoding-space accident; production code works around it by shifting one vector into u8 (add 128) and subtracting the bias afterwards.
_mm256_maddubs_epi16Reads as: “on a 256-bit register, multiply-add pairs of u8×s8, output s16 lanes.” It does 32 byte multiplies, then horizontally adds adjacent pairs, producing 16 s16 lanes — each containing a[2k]·b[2k] + a[2k+1]·b[2k+1]. Already partially reduced.
The full chain widens to int32 with a second instruction:
/* Multiply 32 u8 × 32 s8 → 16 s16 partial products with horizontal pairing,
* widen to s32, accumulate. */
int dot_i8_avx2(const unsigned char* a, const signed char* b, int n) {
__m256i acc = _mm256_setzero_si256();
int i = 0;
for (; i + 32 <= n; i += 32) {
__m256i va = _mm256_loadu_si256((const __m256i*)(a + i));
__m256i vb = _mm256_loadu_si256((const __m256i*)(b + i));
/* maddubs: (u8 × s8) → s16, then add adjacent pairs → s16 lanes.
* Result is 16 int16 values, each = a[2k]*b[2k] + a[2k+1]*b[2k+1]. */
__m256i s16 = _mm256_maddubs_epi16(va, vb);
/* madd: (s16 × s16) → s32, then add adjacent pairs → s32 lanes.
* Using the all-ones vector for one operand turns this into a
* widening-add-pairs over the 16 int16s, giving 8 int32 partial sums. */
__m256i ones = _mm256_set1_epi16(1);
__m256i s32 = _mm256_madd_epi16(s16, ones);Three operations per 32-byte stride: maddubs, madd_epi16 against an all-ones vector (which just widening-adds adjacent s16 pairs to s32), and an add_epi32 into the accumulator. Tolerable. Still ~3× the throughput of a float32 FMA loop because the lane width is 4×.
Path 2 — VPDPBUSD (AVX-VNNI, 2018+)
Intel added a single instruction that does the whole chain: VPDPBUSD. One fused instruction per 32 byte-products, accumulator stays in the register across iterations:
_mm256_dpbusd_avx_epi32acc = dpbusd(acc, a, b) means acc[i] += a[4i]·b[4i] + a[4i+1]·b[4i+1] + a[4i+2]·b[4i+2] + a[4i+3]·b[4i+3] for each of the 8 s32 lanes. One instruction, 32 byte-products, 8 accumulator updates.
/* ---- AVX-VNNI path (one fused instruction per 32 byte products) ---------- */
/* VPDPBUSD does u8 × s8 → s32 dot-of-4, fused with an accumulator update,
* in a single instruction. Available on Ice Lake, Tiger Lake, Sapphire Rapids,
* and Zen 4+. The kernel collapses to: load, load, dpbusd, repeat. */
#if defined(__AVXVNNI__)
int dot_i8_vnni(const unsigned char* a, const signed char* b, int n) {
__m256i acc = _mm256_setzero_si256();
int i = 0;
for (; i + 32 <= n; i += 32) {
__m256i va = _mm256_loadu_si256((const __m256i*)(a + i));
__m256i vb = _mm256_loadu_si256((const __m256i*)(b + i));Same kernel structure, half the instructions per inner step, no widening dance.
Path 3 — NEON SDOT
ARM’s equivalent, present on Apple Silicon and every ARMv8.2-A core (essentially everything shipping since 2018): SDOT (vdotq_s32):
int dot_i8_neon(const signed char* a, const signed char* b, int n) {
int32x4_t acc = vdupq_n_s32(0);
int i = 0;
for (; i + 16 <= n; i += 16) {
int8x16_t va = vld1q_s8(a + i);
int8x16_t vb = vld1q_s8(b + i);
/* SDOT: 16 s8×s8 → 4 s32 (groups of 4 dotted, then accumulated). */
acc = vdotq_s32(acc, va, vb);
}
int result = vaddvq_s32(acc); /* 4 s32 lanes → 1 scalar */
for (; i < n; i++) result += (int)a[i] * (int)b[i];
return result;
}The kernel is unrecognisably short compared to the AVX2 chain — load, load, sdot, repeat. Same dot4-then-accumulate semantics as VPDPBUSD, signed-signed instead of unsigned-signed.
Now make it run
The test (test_int8.c) generates a 1024-long pair of float vectors, quantizes them, runs the SIMD int8 dot product, and compares to the float32 reference. On Apple Silicon NEON:
int8 dot product: N = 1024
scale_a = 0.00629921 scale_b = 0.00551181
float32 reference = -173.996412
dot_i8_neon = -174.029798 abs err = 3.34e-02
-> single-byte storage at this scale stays within ~1% of the float reference
Relative error is 3.34 \times 10⁻² / 173.99 ≈ 0.02%. At one byte per value — 1/4 the storage of float32 — the dot product agrees with the float reference to four significant figures. That’s the headline number for quantized vector search and quantized weight matmul both: you pay 1/4 the bytes and lose a fraction of a percent of accuracy. For ranking applications (ANN, attention scoring) the loss is invisible at the task level.
This kernel is Qdrant’s hot path. The TurboQuant scoring loop in lib/quantization/src/turboquant/ is structurally the kernel above, with one extra trick on top: the rotated vectors (Ch.2 §3) have equalised per-coordinate variance, so a single scale works for all of them, and the per-vector ⟨q,v⟩ drops out as one VPDPBUSD or SDOT call per database vector. That’s the FlashAttention insight applied to vector search — and it’s why HNSW with TurboQuant is roughly an order of magnitude faster than HNSW with float32 distances on the same hardware. We close this loop in Ch.25.
Each int8 × int8 product is in [-16384, +16384], roughly 16 bits. Sum 1024 of them — the dot-product length common in attention head dimensions and embedding spaces — and the worst-case sum can reach ±1024 × 16384 ≈ ±1.7 × 10⁷.
int16 range is ±32K, far too small — the accumulator would overflow long before the dot product completed. int32 range is ±2 × 10⁹, comfortably larger than the worst case. The hardware confirms this design: VPDPBUSD and SDOT both have int8 operands and int32 accumulators by spec. The width gap is the structural reason the dpbusd / sdot instructions exist as a separate ISA family.
x86 inherited the u8 × s8 convention from SSSE3’s maddubs, where Intel used spare encoding-space cleverness to fit a “byte multiply-add” into one instruction. The unsigned × signed pairing wasn’t a math choice; it was a compatibility choice. VNNI’s VPDPBUSD kept it for backwards compatibility with the older chain.
ARM’s SDOT, designed from scratch in 2018, picked the cleaner s8 × s8 pairing.
Production cross-platform kernels (Qdrant, ggml, MLX, ONNX Runtime) handle the mismatch by shifting one operand to unsigned on x86: store v as (signed) and apply the bias correction at calibration time, so the dot-product math comes out the same. Storage layout stays signed for portability; only the kernel-internal representation differs. (This is the same trick this section’s test_int8.c uses.) The cost is one extra addition per vector during calibration, amortised over millions of queries — invisible.
END OF CH.3 — Floating point, integers, and quantization error.
§1 (float anatomy) · §2 (integers, fixed-point, uniform precision) · §3 (affine quantization + the int8 dot kernel).
All three sections compile and run; the int8 dot product test confirms 0.02% relative error against the float reference at one byte per value. Nine recall items chain back to Ch.1’s dot product and Ch.2’s matmul and forward to Ch.13 (FlashAttention precision discussion) and Ch.25 (TurboQuant scoring). Coming next: Ch.4 — Calculus and gradients refreshed.