The GGML quantization family — q4_0 to q6_K, and the q4_K_M naming
Every GGUF file on Hugging Face — every Llama, Mistral, Qwen, Gemma, DeepSeek model people actually run on laptops — uses one of about a dozen GGML quantization formats. Their names look cryptic (q4_0, q4_K_M, IQ3_XXS) but they encode a precise tradeoff: bits per weight × accuracy × inference speed. This section walks through the entire family with exact byte-level struct layouts, explains the “K” (super-block) and the “M / S / L” (mixed precision per tensor) naming convention, and covers the imatrix calibration that powers the IQ-quants. By the end you’ll know exactly what file size a q4_K_M Llama 3.3 70B will produce, why q4_K_S is smaller, and what trade-off you’re making.
The legacy formats — q4_0, q4_1, q5_0, q5_1, q8_0
The original llama.cpp formats use a fixed 32-weight block size with a single fp16 scale (and optionally an fp16 min; the 5-bit formats add 4 bytes that hold the fifth bit of each weight). The struct layouts are defined in ggml-common.h.
The trailing “_0” means symmetric (scale only); “_1” means asymmetric (scale + min). The leading number is the integer bit width (4, 5, 8). The 0.5 bpw overhead for q4_0 comes from one fp16 (2 bytes = 16 bits) per 32 weights = 0.5 bpw — exactly the block scale cost.
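The five legacy layouts can be sketched as plain C structs. This is a minimal sketch, not a copy of ggml-common.h: ggml_half is replaced by a raw uint16_t stand-in, QK_LEGACY is a hypothetical name for the shared block size of 32 (ggml defines one constant per format), and bits_per_weight is a helper introduced here only to show the arithmetic.

```c
#include <stdint.h>

/* Assumption: fp16 held as 16 raw bits, standing in for ggml_half. */
typedef uint16_t ggml_half;

/* Hypothetical name for the block size shared by all five legacy formats. */
#define QK_LEGACY 32

typedef struct { ggml_half d;              uint8_t qs[QK_LEGACY / 2]; } block_q4_0; /* scale, 32 nibbles        */
typedef struct { ggml_half d; ggml_half m; uint8_t qs[QK_LEGACY / 2]; } block_q4_1; /* scale, min, 32 nibbles   */
typedef struct { ggml_half d;              uint8_t qh[4]; uint8_t qs[QK_LEGACY / 2]; } block_q5_0; /* qh: 32 fifth bits */
typedef struct { ggml_half d; ggml_half m; uint8_t qh[4]; uint8_t qs[QK_LEGACY / 2]; } block_q5_1;
typedef struct { ggml_half d;              int8_t  qs[QK_LEGACY]; } block_q8_0;     /* scale, 32 int8 values    */

_Static_assert(sizeof(block_q4_0) == 18, "q4_0 block is 18 bytes");
_Static_assert(sizeof(block_q4_1) == 20, "q4_1 block is 20 bytes");
_Static_assert(sizeof(block_q5_0) == 22, "q5_0 block is 22 bytes");
_Static_assert(sizeof(block_q5_1) == 24, "q5_1 block is 24 bytes");
_Static_assert(sizeof(block_q8_0) == 34, "q8_0 block is 34 bytes");

/* Bits per weight for a block of `bytes` bytes covering `weights` weights. */
static double bits_per_weight(unsigned bytes, unsigned weights) {
    return 8.0 * bytes / weights;
}
```

Calling bits_per_weight(sizeof(block_q4_0), QK_LEGACY) gives 4.5, and the same call for q4_1, q5_0, q5_1 and q8_0 gives 5.0, 5.5, 6.0 and 8.5: the named bit width plus 0.5 bpw for each fp16 field in the block.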
q4_0 is the workhorse — simple, symmetric, accurate enough for most use cases. The kernel from this section implements its exact 18-byte layout:
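#include <math.h>    /* fabsf */
#include <stdint.h>  /* uint8_t */

/* ggml_half and fp32_to_fp16() are assumed to be defined earlier in the kernel. */
#define QK4_0 32     /* weights per q4_0 block */

typedef struct {
    ggml_half d;            /* fp16 block scale, 2 bytes */
    uint8_t qs[QK4_0 / 2];  /* 16 bytes, 32 nibbles */
} block_q4_0;

/* Compile-time check */
_Static_assert(sizeof(block_q4_0) == 18, "block_q4_0 must be 18 bytes");

/* q4_0 quantize: one block of 32 floats → one 18-byte block_q4_0 */
static void quantize_block_q4_0(const float* x, block_q4_0* out) {
    /* Find the element with the largest magnitude; imax remembers where it is. */
    float amax = 0;
    int imax = 0;
    for (int i = 0; i < QK4_0; i++) {
        float a = fabsf(x[i]);
        if (a > amax) { amax = a; imax = i; }
    }
    /* ggml convention: d = x[imax] / -8, so d takes the opposite sign of the
       largest-magnitude value. That value therefore always lands on level -8
       (stored nibble 0) and is reconstructed exactly as -8 * d. The opposite
       end of the range stops at level +7 (stored nibble 15). */
    float d = x[imax] / -8.0f;
    if (d == 0) d = 1.0f;   /* all-zero block: any scale reconstructs zeros */
    float id = 1.0f / d;
    out->d = fp32_to_fp16(d);
    /* Pack: nibble k from x[k] goes in low 4 bits of qs[k];
       nibble k from x[k+16] goes in high 4 bits of qs[k]. */
    for (int k = 0; k < QK4_0 / 2; k++) {
        /* Body reconstructed from the packing rule above:
           shift levels -8..+7 to 0..15, round, clamp the top end. */
        int q0 = (int)(x[k] * id + 8.5f);
        int q1 = (int)(x[k + QK4_0 / 2] * id + 8.5f);
        if (q0 > 15) q0 = 15;
        if (q1 > 15) q1 = 15;
        out->qs[k] = (uint8_t)(q0 | (q1 << 4));
    }
}
Output: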
format                             bytes/block   bpw       RMSE
fp32 (baseline)                    n/a           32.0      0.000e+00
q4_0 (18 B / 32 weights)           18            4.5000    2.004e-03
q4_K simplified (162 B / 256 w)    162           5.0625    1.676e-03
The q4_0 RMSE on a realistic weight distribution (zero-mean Gaussian, std 0.02, sparse 10× outliers) is ~2e-3. For weights of typical magnitude ~0.02, this is ~10% relative error per weight — but the dot products that actually matter (W · x for the matmul) accumulate to much lower relative error because the per-weight errors are nearly independent.
Exact layout (18 bytes / 32 weights):
- Bytes 0-1: ggml_half d (fp16 scale)
- Bytes 2-17: uint8_t qs[16] (32 4-bit values, packed; each is stored as an unsigned nibble 0-15, and dequantization subtracts 8 to get the signed level -8..+7)
Why 4.5 bpw, not 4.0: 4.0 bpw would be the integer cost alone. The 0.5 bpw is the fp16 scale: 16 bits / 32 weights = 0.5 bpw. So total = 4 + 0.5 = 4.5.
Nibble layout (the surprising part): in ggml’s q4_0, the LOW nibble of qs[k] holds the quantized value of weight x[k], and the HIGH nibble of qs[k] holds the quantized value of weight x[k+16], not x[k+1]. Each byte pairs a weight with the one 16 positions later rather than with its neighbour.
Why this layout: SIMD dequantization. To dequantize a block on AVX2 / NEON, you load the 16 packed bytes and want 32 int8 values in order. With the split-half layout, a single AND with 0x0F yields weights 0-15 in order, and a single right shift by 4 yields weights 16-31 in order. If the layout were (x[2k], x[2k+1]) (consecutive), the same two operations would yield the even-indexed and the odd-indexed weights separately, and an extra interleave step would be needed to restore the order before the dot product. The “split half” layout avoids that step.
This is a small but important detail — the kernel speed depends on it.
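A scalar version of the unpacking makes the split-half layout concrete. This is a minimal sketch under one assumption: the block scale d is passed in as a float that has already been converted from fp16 elsewhere. The function name dequantize_q4_0_nibbles is introduced here for illustration and is not a ggml function.

```c
#include <stdint.h>

#define QK4_0 32  /* weights per q4_0 block */

/* Unpack the 16 data bytes of one q4_0 block into 32 floats.
   d: block scale, assumed already converted from fp16 to float. */
static void dequantize_q4_0_nibbles(const uint8_t qs[QK4_0 / 2], float d,
                                    float out[QK4_0]) {
    for (int k = 0; k < QK4_0 / 2; k++) {
        int lo = (qs[k] & 0x0F) - 8;   /* level of weight k      */
        int hi = (qs[k] >> 4)   - 8;   /* level of weight k + 16 */
        out[k]             = (float)lo * d;
        out[k + QK4_0 / 2] = (float)hi * d;
    }
}
```

The loop writes the first half of the block from the low nibbles and the second half from the high nibbles, which is the same split a SIMD kernel gets from one AND and one shift over the whole 16-byte load.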
The K-quants — super-blocks with hierarchical scales
The legacy formats have 32-weight blocks with a single per-block scale. The K-quants (added to llama.cpp in mid-2023) use a different structure:
- Super-block of 256 weights, split into sub-blocks: 16 sub-blocks of 16 weights in q2_K, q3_K and q6_K; 8 sub-blocks of 32 weights in q4_K and q5_K.
- One fp16 super-block scale d (plus a second fp16 value, dmin, in the asymmetric types q2_K, q4_K and q5_K), and per-sub-block scales stored as small integers: 4 bits in q2_K, 6 bits in q3_K, q4_K and q5_K, 8 bits in q6_K.
- The integer values use the bit width named (2, 3, 4, 5, 6).
K-quants beat the legacy formats at comparable bpw because the per-sub-block parameters are themselves quantized against the super-block d, so each one costs only a few bits. That makes it affordable to store finer-grained scales (the 16-weight sub-blocks of q2_K, q3_K and q6_K) or both a scale and a min per sub-block (q4_K and q5_K). In q4_K, for example, the 8 sub-block scales and 8 sub-block mins are packed as 6-bit fields into 12 bytes and rescaled by the fp16 d and dmin.
q4_0: 32-weight blocks, 1 fp16 scale per block, 4 bits per weight.
Bytes per block: 2 (scale) + 16 (qs) = 18.
Per-block scale overhead: 16 bits / 32 weights = 0.5 bpw.
q4_K: 256-weight super-block of 8 × 32-weight sub-blocks. Each sub-block has its own 6-bit scale and 6-bit min; the whole super-block has one fp16 d and one fp16 dmin that the sub-scales and sub-mins multiply against.
Bytes per super-block: 4 (d, dmin) + 12 (8 packed 6-bit scales + 8 packed 6-bit mins) + 128 (qs) = 144.
Per-weight cost: 144 · 8 / 256 = 4.5 bpw. Same as q4_0.
What changes:
- The scale granularity does not change: both q4_0 and q4_K store one scale per 32 weights.
- The scales themselves are 6-bit instead of fp16. Each 6-bit scale is multiplied by the fp16 super-block d to reconstruct the per-sub-block scale.
- Both scale AND min are stored per sub-block (q4_K is asymmetric like q4_1), so a weight dequantizes as x ≈ d · sc · q − dmin · m, with q in 0..15 and sc, m the 6-bit scale and min of its sub-block.
Why it’s the same bpw despite the extra min: q4_K trades scale precision (fp16 → 6-bit) for a second per-sub-block parameter. q4_0 spends 16 bits per 32 weights on one fp16 scale. q4_K spends 12 bits per 32 weights on a 6-bit scale and a 6-bit min, plus 32 bits per 256 weights on d and dmin, which again comes to 16 bits per 32 weights. The 6-bit values are quantized, not full-precision; the fp16 d and dmin provide their global range.
Why it’s better quality: an asymmetric grid does not have to be centred on zero. When the weights in a sub-block are skewed to one side, q4_0 spends part of its 16 levels on a range that contains no weights, whereas q4_K shifts the grid with the min and uses all 16 levels on the range that is actually occupied. q4_1 offers the same asymmetry but costs 5.0 bpw; q4_K gets it at 4.5. Empirical perplexity gap on Llama-class models: ~0.1-0.3 in q4_K’s favour, at no extra storage cost.
The kernel’s “q4_K simplified” version keeps the sub-scales in fp16 (no 6-bit packing), which costs 162 B / 256 weights = 5.06 bpw. That size is consistent with 128 B of nibbles, one fp16 d and 16 fp16 sub-scales, which suggests a 16-weight sub-block, finer than the 32-weight sub-block of real q4_K. It achieves slightly lower RMSE than q4_0 in the same demonstration, but it also spends more bits. Real q4_K stays at the same 4.5 bpw as q4_0 by packing its scales and mins into 6-bit fields.
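The 12-byte scale field is the least obvious part of q4_K, so here is a sketch of one way to pack and unpack it. The bit positions follow my recollection of ggml’s get_scale_min_k4 helper and should be read as an illustration rather than a verified copy. ggml_half is again a raw uint16_t stand-in, and pack_scales_mins and unpack_scale_min are names introduced here.

```c
#include <stdint.h>
#include <string.h>

#define QK_K 256          /* weights per super-block        */
#define K_SCALE_SIZE 12   /* bytes holding 8 scales + 8 mins */

typedef uint16_t ggml_half;  /* assumption: fp16 stored as 16 raw bits */

typedef struct {
    ggml_half d;                     /* multiplies the 6-bit sub-block scales */
    ggml_half dmin;                  /* multiplies the 6-bit sub-block mins   */
    uint8_t   scales[K_SCALE_SIZE];  /* 8 scales + 8 mins, 6 bits each        */
    uint8_t   qs[QK_K / 2];          /* 256 nibbles                           */
} block_q4_K;

_Static_assert(sizeof(block_q4_K) == 144, "block_q4_K must be 144 bytes");

/* Pack 8 scales and 8 mins (each 0..63) into 12 bytes. */
static void pack_scales_mins(const uint8_t sc[8], const uint8_t mn[8],
                             uint8_t out[K_SCALE_SIZE]) {
    memset(out, 0, K_SCALE_SIZE);
    for (int j = 0; j < 8; j++) {
        if (j < 4) {
            out[j]     |= sc[j];   /* low 6 bits of bytes 0..3 */
            out[j + 4] |= mn[j];   /* low 6 bits of bytes 4..7 */
        } else {
            /* bytes 8..11: low 4 bits of the scale and of the min */
            out[j + 4] |= (uint8_t)((sc[j] & 0x0F) | ((mn[j] & 0x0F) << 4));
            /* top 2 bits go into the spare bits 6..7 of bytes 0..7 */
            out[j - 4] |= (uint8_t)((sc[j] >> 4) << 6);
            out[j]     |= (uint8_t)((mn[j] >> 4) << 6);
        }
    }
}

/* Recover scale j and min j (0 <= j < 8) from the packed bytes. */
static void unpack_scale_min(int j, const uint8_t q[K_SCALE_SIZE],
                             uint8_t* sc, uint8_t* mn) {
    if (j < 4) {
        *sc = q[j] & 63;
        *mn = q[j + 4] & 63;
    } else {
        *sc = (uint8_t)((q[j + 4] & 0x0F) | ((q[j - 4] >> 6) << 4));
        *mn = (uint8_t)((q[j + 4] >> 4)   | ((q[j]     >> 6) << 4));
    }
}
```

Sixteen 6-bit values need 96 bits, which is exactly 12 bytes. The first four scales and mins sit in the low 6 bits of bytes 0..7, and the remaining four of each are split between bytes 8..11 and the two spare high bits of bytes 0..7.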
The q4_K_M vs q4_K_S naming — mixed precision
This is the convention that confuses everyone. The “_S”, “_M” and “_L” suffixes (small, medium, large) don’t modify the format itself; they’re llama.cpp model-file flavors that mix multiple K-quant formats across a model’s tensors.
The intuition: some tensors matter more than others. The output projection and embedding (which directly affect token logits) and attn_v + ffn_down (which the empirical sensitivity analysis flags as the most error-sensitive) get bumped to a higher precision. The rest stays at the base format.
q4_K_M is the default choice for serious 4-bit deployment. Llama 3.3 70B in q4_K_M is about 42-43 GB; in q4_K_S it’s about 40 GB; in fp16 it’s about 140 GB. The roughly 5% size difference between q4_K_S and q4_K_M typically buys ~0.1-0.2 perplexity, which translates to noticeably better quality on long-form generation.
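The file sizes above follow from parameters × average bits per weight. A minimal sketch of that arithmetic, assuming roughly 70.6 billion parameters for Llama 3.3 70B and ignoring GGUF metadata and tokenizer data (both function names are introduced here for illustration):

```c
/* Estimated file size in GB (1e9 bytes) for n_params weights stored
   at an average of avg_bpw bits per weight. Metadata is ignored. */
static double gguf_size_gb(double n_params, double avg_bpw) {
    return n_params * avg_bpw / 8.0 / 1e9;
}

/* Inverse: the average bits per weight implied by a file size in GB. */
static double implied_bpw(double n_params, double size_gb) {
    return size_gb * 1e9 * 8.0 / n_params;
}
```

With the assumed parameter count, gguf_size_gb(70.6e9, 16.0) is about 141 GB and gguf_size_gb(70.6e9, 4.5) is about 39.7 GB, the size of a pure q4_K file. Working backwards, a 42.5 GB q4_K_M file implies an average of about 4.8 bpw: the mix sits between q4_K and q6_K because only some tensors are upgraded.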
What changes structurally:
q4_K_S uses q4_K (4.5 bpw) for almost every tensor. A small set of “important” tensors in early layers (attn_v + ffn_down) are bumped to q5_K (5.5 bpw).
q4_K_M upgrades MORE tensors to higher precision:
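- attn_v → q6_K (6.5625 bpw) for a significant subset of layers (those where use_more_bits returns true)
- ffn_down → q6_K for the same subset
- output → q6_K (the output projection)
- token_embd → q6_K when it doubles as the output projection (tied embeddings); when the model has a separate output tensor, llama.cpp’s default selection generally leaves token_embd at the base type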
The other tensors (attn_q, attn_k, attn_o, ffn_gate, ffn_up) stay at q4_K.
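The layer subset mentioned above can be sketched in a few lines. The rule below is modelled on llama.cpp’s use_more_bits helper as I understand it (first eighth of the layers, last eighth, and every third layer in between); count_more_bits is a helper introduced here to show how many layers that selects.

```c
#include <stdbool.h>

/* True for the layers whose attn_v / ffn_down get the higher-precision type:
   the first eighth, the last eighth, and every third layer in between. */
static bool use_more_bits(int i_layer, int n_layers) {
    return i_layer < n_layers / 8
        || i_layer >= 7 * n_layers / 8
        || (i_layer - n_layers / 8) % 3 == 2;
}

/* Number of layers selected by the rule in a model with n_layers layers. */
static int count_more_bits(int n_layers) {
    int n = 0;
    for (int i = 0; i < n_layers; i++) {
        if (use_more_bits(i, n_layers)) n++;
    }
    return n;
}
```

For a 32-layer model the rule selects 16 layers, and for an 80-layer model it selects 40, so under this rule about half of the attn_v and ffn_down tensors are upgraded.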
Why these specific tensors: empirical sensitivity analysis. The value projection (attn_v) determines the actual information passed to subsequent layers; the FFN’s down projection determines what gets written to the residual stream; the embeddings determine token-level representation quality. Errors in these compound. The query, key, gate, up projections are more error-tolerant — quantization noise in them often averages out through subsequent matmuls.
The 2 GB cost: ~5% larger file. ~5% higher RAM usage at inference. ~5% slower token generation if you’re memory-bandwidth-bound (because the time per token is roughly proportional to the bytes loaded).
The quality buy: empirically, q4_K_M perplexity is 0.1-0.2 points lower than q4_K_S on Llama 7B benchmarks. For chat use, this typically manifests as fewer logical errors in long-form responses, slightly better adherence to instructions, and more accurate factual recall. For coding it’s ~10-15% fewer subtle bugs.
Worth it? For chat / interactive use: yes, the quality bump is noticeable. For batch / throughput-critical: q4_K_S wins on cost. Most production inference defaults to q4_K_M for human-facing systems.
IQ-quants and the imatrix
The newest generation of GGML formats (introduced in 2024 by ikawrakow) are the IQ-quants: IQ1_S, IQ2_XXS, IQ2_XS, IQ3_XXS, IQ3_S, IQ4_XS, etc. They extend the family down to about 2 bits per weight and below (IQ2_XXS is 2.0625 bpw, IQ1_S is 1.5625 bpw) while keeping quality usable in the 2-3 bit range.
The key innovations:
- Codebook-based quantization. Instead of mapping each weight directly to an integer, IQ-quants map small groups of weights (4 or 8 at a time, depending on the format) to an entry in a fixed codebook that is compiled into ggml. The IQ2_XXS codebook, for example, has 256 entries and the IQ2_XS codebook has 512, each an 8-weight pattern. Each block stores a codebook index per group, plus sign bits and a per-sub-block scale.
- Importance matrix (imatrix) calibration. Optional but recommended: run a small calibration corpus (~100K tokens) through the unquantized model and record per-tensor activation statistics. The quantizer uses these to weight the quantization error so that “important” weight features (those that multiply large activations) are quantized more precisely.
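The codebook idea in the first bullet reduces to a nearest-neighbour search per group. The sketch below uses a 4-entry toy codebook that is entirely made up for illustration; the real ggml grids are much larger and their entries differ. nearest_entry and toy_codebook are hypothetical names.

```c
#include <stddef.h>

#define GROUP 8  /* weights covered by one codebook entry */

/* Toy codebook (assumption: invented values, NOT a ggml grid). */
static const signed char toy_codebook[4][GROUP] = {
    { 1, 1, 1, 1, 1, 1, 1, 1 },
    { 3, 1, 1, 1, 1, 1, 1, 1 },
    { 1, 3, 1, 3, 1, 1, 1, 1 },
    { 3, 3, 3, 1, 1, 1, 1, 3 },
};

/* Index of the entry e that minimises sum_i (x[i] - scale * codebook[e][i])^2. */
static size_t nearest_entry(const float x[GROUP],
                            const signed char (*codebook)[GROUP],
                            size_t n_entries, float scale) {
    size_t best = 0;
    float best_err = -1.0f;  /* negative means "no candidate yet" */
    for (size_t e = 0; e < n_entries; e++) {
        float err = 0.0f;
        for (int i = 0; i < GROUP; i++) {
            float diff = x[i] - scale * (float)codebook[e][i];
            err += diff * diff;
        }
        if (best_err < 0.0f || err < best_err) {
            best_err = err;
            best = e;
        }
    }
    return best;
}
```

Storing one index per 8 weights is what pushes the cost below what a per-weight integer could reach: an 8-bit index into a 256-entry codebook costs 1 bit per weight before signs and scales are added.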
The imatrix file is a simple binary format: for each weight tensor, one float per feature dimension representing E[x²] over the calibration data. Generating it takes a few minutes per model on a single GPU.
The sweet spot for production deployment is q4_K_M for chat, q5_K_M for higher quality at slightly more memory, and IQ3_XXS for very memory-constrained settings. q4_0 still exists but is mostly obsolete; q4_K_M dominates the same regime with better quality. IQ1_S exists but is more of a research curiosity (perplexity degradation is severe).
What the imatrix records:
For each weight tensor W in the model, the imatrix stores a per-feature vector E[x²] — the mean squared value of the activations that multiply that weight tensor, averaged over a small calibration dataset (typically 100K-1M tokens).
Concretely for a matmul Y = X · W where X is (B, d_in) activations and W is (d_in, d_out) weights:
imatrix[i] = mean over calibration data of X[i]² (the squared activation entering input feature i).
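Collecting these statistics is a running sum of squares per input feature. A minimal sketch, assuming row-major activations and a caller-provided accumulator array; imatrix_acc and its functions are names introduced here and do not describe llama.cpp’s imatrix tool or its file format.

```c
#include <stddef.h>

/* Running estimate of E[x_i^2] for one weight tensor. */
typedef struct {
    double* sum_sq;  /* d_in accumulators, allocated and zeroed by the caller */
    size_t  d_in;    /* number of input features                              */
    size_t  n_rows;  /* activation rows (tokens) seen so far                  */
} imatrix_acc;

/* Add a batch X of shape (n_rows, d_in), stored row-major. */
static void imatrix_accumulate(imatrix_acc* acc, const float* X, size_t n_rows) {
    for (size_t r = 0; r < n_rows; r++) {
        for (size_t i = 0; i < acc->d_in; i++) {
            double v = X[r * acc->d_in + i];
            acc->sum_sq[i] += v * v;
        }
    }
    acc->n_rows += n_rows;
}

/* imatrix[i]: mean of x_i^2 over everything accumulated so far. */
static double imatrix_value(const imatrix_acc* acc, size_t i) {
    return acc->n_rows ? acc->sum_sq[i] / (double)acc->n_rows : 0.0;
}
```

Because only sums and a row count are kept, the calibration corpus can be streamed through the model in batches of any size without storing the activations.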
How the quantizer uses it:
The naive quantization objective is “minimise || W − W_dq ||²” — the L2 error of the weight reconstruction.
The imatrix-aware objective is “minimise || X · (W − W_dq) ||²” — the L2 error of the matmul output, which is what actually matters for model quality.
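Expanding, for one output column Δw of ΔW, and treating the input features as uncorrelated (that is, dropping the cross terms E[X_i · X_j] for i ≠ j): E|| X · Δw ||² ≈ Σ_i E[X_i²] · Δw_i². The error in feature i contributes in proportion to E[X_i²], so the quantization error in features with LARGE activations matters more than in features with SMALL activations.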
The imatrix-aware quantizer (used by IQ-quants and optionally by K-quants) chooses quantization grid points and codebook entries to minimise this WEIGHTED error. Features with large E[x²] get tighter quantization (better precision); features with small E[x²] can be quantized more coarsely.
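The effect of the weighting is easiest to see on the block scale. For fixed integer levels q, the scale that minimises the weighted error Σ_i imp_i · (w_i − d · q_i)² has a closed form. The sketch below shows that least-squares identity only; it is not llama.cpp’s quantizer, which also searches over the levels themselves, and both function names are introduced here.

```c
#include <stddef.h>

/* Importance-weighted squared error: sum_i imp[i] * (w[i] - d * q[i])^2. */
static double weighted_error(const float* w, const int* q, const float* imp,
                             size_t n, double d) {
    double err = 0.0;
    for (size_t i = 0; i < n; i++) {
        double diff = (double)w[i] - d * (double)q[i];
        err += (double)imp[i] * diff * diff;
    }
    return err;
}

/* Scale minimising weighted_error for fixed levels q:
   d = sum(imp * w * q) / sum(imp * q * q). Returns 0 if every q is 0. */
static double best_scale(const float* w, const int* q, const float* imp, size_t n) {
    double num = 0.0, den = 0.0;
    for (size_t i = 0; i < n; i++) {
        num += (double)imp[i] * (double)w[i] * (double)q[i];
        den += (double)imp[i] * (double)q[i] * (double)q[i];
    }
    return den > 0.0 ? num / den : 0.0;
}
```

With weights {1.0, 0.1} at levels {2, 1}, uniform importance gives d ≈ 0.42, a compromise between the two weights. Raising the importance of the first feature to 100 moves d to about 0.499, so the first weight is reconstructed almost exactly at the expense of the second.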
Why this matters more at low bit widths:
At 6-bit and above, there are enough quantization levels that even uniform precision is “good enough.” At 4-bit, the gap matters but is small. At 2-3 bit, the difference between “uniform precision” and “activation-weighted precision” is the difference between a working model and a broken one.
Cost: ~5 minutes per model to generate the imatrix. A few MB of disk for the file itself. Free at inference time — the imatrix is consumed only during quantization, not during model use.
The deeper connection to AWQ: AWQ does essentially the same thing for 4-bit weight-only GPU inference (it uses activation statistics to rescale the weight channels that matter most before quantizing them). The imatrix brings this into the on-disk GGML format. Different deployment targets (GPU vs CPU/Mac), same underlying mathematical idea: quantize so the matmul output error is minimised, not the weight error.
Next: §24.3, Quantization-aware training. So far we’ve quantized AFTER training (PTQ). Now we’ll quantize DURING training: the straight-through estimator (STE), Learned Step Size Quantization (LSQ), BitNet’s 1-bit-from-scratch approach, and QLoRA, the technique that lets you fine-tune a 65B model on a single 48 GB GPU by keeping the base in 4-bit and only updating 16-bit LoRA adapters.