Orthogonal & rotation matrices

Section 2.3

Orthogonal & rotation matrices

All linear functions are not created equal. Some of them — the orthogonal ones — have a remarkable property: they preserve every length and every angle they touch. Apply one to two vectors and the two vectors’ norms are unchanged, their dot product is unchanged, the angle between them is unchanged. From the inside of an orthogonal transformation, you cannot tell anything has happened. This sounds like an algebraic curiosity. It is in fact the structural foundation under three things modern ML cannot live without: rotation-based quantization (Ch.25), the Johnson–Lindenstrauss lemma (Ch.7), and the random Hadamard transforms inside QJL and a half-dozen other compression schemes. This section gives you the algebra and the geometric picture both, then runs the invariants in real code.

Definition: Q^T Q = I

A square matrix Q is orthogonal core term A square matrix Q with QᵀQ = I — i.e., its columns (equivalently, its rows) form an orthonormal set. Geometrically: a length- and angle-preserving linear function. Then → now: same definition, much more frequently invoked. Modern ML constantly does 'randomly rotate then compress' (TurboQuant, QJL, random projections), and 'rotate' here always means 'apply some orthogonal Q' — usually a Hadamard, sometimes a Householder product. iff:

QᵀQ = I equivalently: Q⁻¹ = Qᵀ

Read column-by-column, this says Q’s columns are an orthonormal set: each column has length one, and any two distinct columns are perpendicular. The notation packs the whole condition into one equation — every entry of QᵀQ is one inner product between two columns of Q, and the identity matrix demands “1 on the diagonal (each column with itself = unit length), 0 off the diagonal (different columns orthogonal).”

The second form is the one to keep close at hand: for an orthogonal matrix, the inverse is the transpose. Computing Q⁻¹ for general matrices is expensive and numerically delicate; for orthogonal ones it costs zero — you already have Q, you read it sideways. This is why orthogonal matrices are everywhere a kernel needs an invertible linear transformation that’s also cheap.

The two invariants

Here is the structural payoff, in two lines. Pick any vectors x, y and any orthogonal Q:

‖Qx‖ = ‖x‖ // length preserved ⟨Qx, Qy⟩ = ⟨x, y⟩ // dot product preserved

Both follow from QᵀQ = I by one line of algebra:

⟨Qx, Qy⟩ = (Qx)ᵀ (Qy) = xᵀ Qᵀ Q y = xᵀ I y = xᵀ y = ⟨x, y⟩

The length invariant is a special case: ‖Qx‖² = ⟨Qx, Qx⟩ = ⟨x, x⟩ = ‖x‖². So really there is one invariant — the dot product — and length comes along as the dot product of a vector with itself.

That single fact is heavier than it looks. Modern ML scores almost everything with dot products (cosine similarity, attention scores, ANN ranking). Anything you do before the scoring that’s orthogonal leaves the scores untouched. You can rotate every vector in a database by any orthogonal matrix you like — including one you randomly generate — and not a single relative score changes. The ranking you’d get from the rotated database is bit-identical to the ranking from the original. That degree of freedom is what every “rotate then compress” scheme exploits.

See it

The viz below shows two vectors before (dotted) and after (solid) a transformation. The readings on the right report their lengths and dot product — values that stay green are invariant under the chosen transformation. Flip between rotation, reflection, scale, and shear, and watch which numbers move.

apply a

✓ this matrix is orthogonal — Qᵀ Q = I

angle θ 45°

‖a‖ 1.08 → 1.08

‖b‖ 1.04 → 1.04

⟨a, b⟩ 0.700 → 0.700

∠(a, b) 51.50° → 51.50°

Two vectors before (dotted) and after (solid). For orthogonal matrices, every length and every dot product survives — values stay green. For arbitrary matrices they drift. That preservation is the structural reason rotation-based quantization (Ch.25) works at all.

Three things to notice as you flip modes:

Rotation and reflection are orthogonal; scale and shear are not. The status banner at the top reads orthogonal for the first two and not orthogonal for the others. Confirm the math: a rotation’s two columns are (\cos θ, \sin θ) and (-\sin θ, \cos θ) — each unit length, mutually perpendicular. A scale’s columns are (s_x, 0) and (0, s_y) — perpendicular but generally not unit length unless both scales equal 1.
Length-preservation and dot-preservation move together. They never decouple. That’s because the invariants come from the same algebraic fact — QᵀQ = I — and both fail at once when the matrix isn’t orthogonal.
Reflection is orthogonal too. It’s tempting to call orthogonal matrices “rotations” loosely — they aren’t. Reflections preserve every length and dot product just as well; they just flip orientation (the determinant is -1 instead of +1). The “orthogonal group” includes both; the “special orthogonal group” SO(n) is just the rotations.

The slider for “reflection” picks the angle of the mirror line. At 0° it reflects across the x-axis; at 45° across the diagonal. Watch the two vectors flip — yet their pairwise angle is unchanged. Whatever picture you have of “reflection breaks geometry,” update it: reflection only flips orientation, not measurement.

Why this matters for compression

The chain of reasoning the rest of the book leans on:

Scoring is dot products. Cosine similarity, attention, distance ranking — all built from ⟨q, x⟩.
Dot products are invariant under orthogonal transforms. ⟨Qq, Qx⟩ = ⟨q, x⟩ for every Q with QᵀQ = I.
So you can rotate before storing. If every database vector is replaced by Qv and every query is replaced by Qq, the scores are identical. The ranking is unchanged.
But the rotated vectors have nicer per-coordinate statistics. Specifically, a random rotation equalizes the variance across coordinates — turns anisotropic data isotropic. This makes a single, simple quantizer work for every coordinate, instead of needing a per-coordinate codebook.

Steps 1–3 are this section’s content; step 4 is Ch.5 (variance) plus Ch.6 (high-dim concentration). The combined chain is what TurboQuant and RaBitQ are. The license to even try “rotate then quantize” comes from these two preserved invariants.

— think, then check —

Qᵀ Q = I. Read entry-by-entry: the (i, j) entry of QᵀQ is the dot product of column i with column j. The right-hand side I says that’s 1 when i = j (each column has unit length) and 0 when i ≠ j (different columns are orthogonal). So the columns form an orthonormal set — pairwise perpendicular and each of unit length. Equivalently: Q⁻¹ = Qᵀ.

↳ §2.3 definition

Rotations, reflections, and Hadamards

Inside the orthogonal group, three constructions appear over and over in ML:

Rotations (SO(n)): orientation-preserving. The 2D rotation matrix R(θ) from §1 generalises to higher dimensions as products of “plane rotations” — pick two axes, rotate within that 2D plane, leave the rest alone. Composing many such rotations gives any element of SO(n).
Householder reflections: each one flips space across the hyperplane normal to some unit vector v:

H = I − 2 v vᵀ (where v is a unit vector)

Householder reflections are cheap — each one is a rank-1 update, costing O(n) to apply, not O(n²). Compose a handful of them and you have a fully general random orthogonal matrix. This is exactly how the test below builds its Q, and it’s the workhorse of every QR factorisation routine in BLAS.

Hadamard matrix (and its Walsh–Hadamard transform): an orthogonal matrix whose entries are all ± 1/√n. It can be applied in O(n log n) using only additions and subtractions — no multiplies. That property is why it’s the rotation of choice inside fast compression schemes: a random Hadamard transform is structurally the same kind of randomness as a fully general random rotation (in the JL sense), but with one log factor of cost. (Source: Ailon & Chazelle, 2009 — “The Fast Johnson–Lindenstrauss Transform.”)

Now make it run — verify the invariants

test_orthogonal.c (key) construct Q, verify ‖·‖ and ⟨·,·⟩ both survive

    compose(T, H1, H2, N);
    compose(Q, T,  H3, N);

    /* Two random vectors. */
    float *x = malloc(sizeof(float) * N);
    float *y = malloc(sizeof(float) * N);
    for (int i = 0; i < N; i++) { x[i] = frand(&st); y[i] = frand(&st); }

    float nx = norm2(x, N);
    float ny = norm2(y, N);
    float dxy = dot(x, y, N);

    float *Qx = malloc(sizeof(float) * N);
    float *Qy = malloc(sizeof(float) * N);
    GEMV(Q, x, Qx, N, N);
    GEMV(Q, y, Qy, N, N);

    float nQx = norm2(Qx, N);
    float nQy = norm2(Qy, N);
    float dQ  = dot(Qx, Qy, N);

    int ok = 1;
    if (!approx(nx, nQx, 1e-3f)) { fprintf(stderr, "‖Qx‖ ≠ ‖x‖: %g vs %g\n", nQx, nx); ok = 0; }
    if (!approx(ny, nQy, 1e-3f)) { fprintf(stderr, "‖Qy‖ ≠ ‖y‖: %g vs %g\n", nQy, ny); ok = 0; }
    if (!approx(dxy, dQ,  1e-3f)) { fprintf(stderr, "⟨Qx,Qy⟩ ≠ ⟨x,y⟩: %g vs %g\n", dQ, dxy); ok = 0; }
    if (!ok) return 1;

    printf("orthogonal invariants (n=%d):\n", N);
    printf("  ‖x‖     = %.4f   ‖Qx‖    = %.4f\n", nx, nQx);
    printf("  ‖y‖     = %.4f   ‖Qy‖    = %.4f\n", ny, nQy);
    printf("  ⟨x,y⟩  = %.4f   ⟨Qx,Qy⟩ = %.4f\n", dxy, dQ);

The test builds Q as the product of three Householder reflections, applies it to two random 64-dimensional vectors, and checks that lengths and dot products survive to single-float tolerance. The output speaks plainly:

orthogonal invariants (n=64):
  ‖x‖     = 4.4981   ‖Qx‖    = 4.4981
  ‖y‖     = 4.4852   ‖Qy‖    = 4.4852
  ⟨x,y⟩  = 0.2830   ⟨Qx,Qy⟩ = 0.2830
all preserved — Q is orthogonal ✓

Each line is one invariant, confirmed in 64 dimensions, in code. This is the exact computation Qdrant’s TurboQuant kernel relies on: rotate every stored vector by a Hadamard (or some other cheap orthogonal), confident that the rotated scores match the original. The §1.3 distance identity ‖q − v‖² = ‖q‖² + ‖v‖² − 2⟨q,v⟩ becomes ‖Qq − Qv‖² = ‖q‖² + ‖v‖² − 2⟨q,v⟩ — every term on the right is unchanged. The quantizer can do whatever it wants to the rotated coordinates and the ranking survives.

— think, then check —

⟨Qx, Qy⟩ = (Qx)ᵀ(Qy) = xᵀ Qᵀ Q y = xᵀ I y = xᵀ y = ⟨x, y⟩.

The whole identity collapses on the step QᵀQ = I. That’s why this is a one-line derivation and why every invariant of orthogonal matrices ultimately reduces to that one defining equation.

↳ §2.3 invariants derivation

The full preview of TurboQuant

You can now read the heart of TurboQuant — which last week was a sentence in an engineering blog post — and recognise every move:

Pick an orthogonal Q (Hadamard, cheap to apply, structurally a random rotation in JL terms — Ch.7).
For every database vector v, store \tilde v = Qv. The rotated coordinates have equalised per-coordinate variance (Ch.5–6) so a single 1-byte codebook quantizes them all well.
At query time, also rotate: \tilde q = Qq. Then compute ⟨\tilde q, \tilde v⟩.
By the orthogonal invariant, ⟨\tilde q, \tilde v⟩ = ⟨q, v⟩. Same dot product. Same score. Same ranking.

Steps 2 and 4 are §2.3 — exactly what this section proved. Steps 3 and the variance equalisation in step 2 wait for Ch.5–6. The picture is now half-assembled.

— think, then check —

Rotation preserves every dot product exactly: ⟨Qq, Qv⟩ = ⟨q, v⟩. The query-side correction is also a rotation, and the two rotations cancel under the orthogonality identity. Ranking is bit-identical.

Additive noise does not preserve dot products. ⟨q, v + ε⟩ = ⟨q, v⟩ + ⟨q, ε⟩ — the second term is non-zero in general and varies per vector. There’s no global “subtract back” you can do at query time, because each v’s noise ε is different. The ranking would distort by an amount proportional to ‖q‖‖ε‖.

The deeper distinction: orthogonal transforms are an exact change of basis — they’re reversible without information loss, and the same Q applied on both sides cancels. Additive noise is an irreversible perturbation. Compression schemes that depend on perfect score preservation can use the former; the latter is for differential-privacy-style problems where you want the noise to leak through.

↳ §2.3 + §1.3 distance identity

END OF CH.2 §3 — Orthogonal & rotation matrices.
Built: OrthogonalPreserves viz (flip transformations, watch invariants stay green); test_orthogonal.c constructs Q as a product of three Householder reflections and verifies both invariants on random 64-D vectors. Three recall items, including the synthesis question that previews exactly what TurboQuant does.
Coming next: §2.4 — Tiled GEMM microkernel. Why naïve matmul is memory-bound, and what tiling buys.