LoRA — the math and the intrinsic-dim argument
The big LoRA question: can you fine-tune a 70-billion-parameter model by training only 67 million parameters (a rank-8 update)? Empirically, often yes: LoRA at rank 8-32 matches full fine-tuning on many downstream benchmarks and sometimes exceeds it. But WHY does this work? The architecture argument alone (low-rank matrices have fewer parameters) doesn’t explain why low rank is enough; there’s no a priori reason the “true” fine-tuning delta should be low-rank. The structural argument comes from Aghajanyan 2020, “Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning”, which showed empirically that fine-tuning lives on a low-dimensional subspace of parameter space, typically 200-1000 dimensions out of the hundreds of millions of parameters in the models studied. This section walks through the LoRA formulation, the intrinsic-dim experiment that justifies it, and the kernel that demonstrates rank-4 sufficiency on a synthetic task.
The formulation
Standard fine-tuning of a linear layer y = x·W updates W directly: W_FT = W + ΔW, where ΔW is an unconstrained d_in × d_out matrix, so every entry of W is trainable.
LoRA exploits the assumption that the fine-tuning delta ΔW has low effective rank. This assumption is empirical (motivated by Aghajanyan 2020 below), not a priori, but it is consistent with the broad practical success of LoRA.
The intrinsic-dim argument (Aghajanyan 2020)
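The deeper “why does LoRA work” question is answered by Aghajanyan 2020. Their experimental setup restricts the fine-tuning update to a random low-dimensional subspace of parameter space and asks how small that subspace can be while keeping most of the full fine-tuning accuracy. That smallest dimension is the intrinsic dimension.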
Intrinsic dimension formalises the intuition that fine-tuning explores a low-dimensional manifold. The result has two parts:
- The intrinsic dimension is small — usually orders of magnitude smaller than the model’s total parameter count.
- Larger models have smaller intrinsic dimensions — the more pretraining capacity, the less the fine-tuning needs to “move” the model.
The second part is the surprising one. You might expect larger models to need more fine-tuning updates; they actually need FEWER. The pretraining has done more of the work; the fine-tuning just needs to point the right direction.
What an intrinsic dimension of 700 means:
If you constrain the fine-tuning update to lie in a random 700-dimensional subspace of the 340M-dimensional parameter space, you recover at least 90% of full fine-tuning’s task accuracy.
The “subspace” is defined by a random projection: θ_FT = θ_base + P · t, where P is a fixed random 340M × 700 matrix and t is a 700-dimensional trainable vector. You’re literally training only 700 numbers.
Result: the model fine-tunes to at least 90% of full fine-tuning performance using just those 700 trainable scalars. For this task, a good-enough fine-tuning delta LIVES IN a 700-dim subspace of the full parameter space.
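The reparameterisation θ_FT = θ_base + P · t can be sketched at toy scale as follows. This is a minimal illustration, not the paper's implementation: the function names `subspace_apply` and `subspace_grad` are hypothetical, and P is stored as a dense row-major matrix, which is only practical for tiny sizes (a dense 340M × 700 float matrix would need roughly a terabyte).

```c
#include <assert.h>
#include <stddef.h>

/* theta[i] = theta_base[i] + sum_j P[i*d + j] * t[j]
   P is a fixed (never trained) D x d matrix, row-major.
   Only the d entries of t are trainable. */
void subspace_apply(const float *theta_base, const float *P, const float *t,
                    size_t D, size_t d, float *theta)
{
    assert(theta_base && P && t && theta);
    for (size_t i = 0; i < D; i++) {
        float acc = theta_base[i];
        for (size_t j = 0; j < d; j++) acc += P[i * d + j] * t[j];
        theta[i] = acc;
    }
}

/* Chain rule: dL/dt[j] = sum_i P[i*d + j] * dL/dtheta[i],
   i.e. grad_t = P^T * grad_theta. */
void subspace_grad(const float *P, const float *grad_theta,
                   size_t D, size_t d, float *grad_t)
{
    assert(P && grad_theta && grad_t);
    for (size_t j = 0; j < d; j++) {
        float acc = 0.0f;
        for (size_t i = 0; i < D; i++) acc += P[i * d + j] * grad_theta[i];
        grad_t[j] = acc;
    }
}
```

A training loop would call `subspace_apply` before each forward pass, backpropagate to get the full-size gradient, and then use `subspace_grad` to update only `t`.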
Why larger models have smaller intrinsic dim (surprising):
The intuition: at small scale, the pretrained model “knows” only a small amount; fine-tuning has to teach it new things, requiring many parameter changes. At large scale, the pretrained model already knows most of what’s needed; fine-tuning only has to STEER the existing knowledge.
Concretely: imagine the model’s parameter space as a landscape of “useful configurations.” At small scale, the configurations are spread thin; you have to MAKE many changes to find one good for your task. At large scale, the landscape has many densely-packed good configurations; you only need a SMALL perturbation to find one matching your task.
Geometrically: the pretrained model’s “knowledge manifold” is much richer at larger scale, so the new task sits closer to it and fine-tuning has a shorter distance to cover.
Implication for LoRA rank:
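If the intrinsic dimension of fine-tuning a 7B model is ~1000, and the model has, say, 200 linear layers each with d × d = 4096² ≈ 16.8M parameters, then the intrinsic dimension is spread across these layers. A natural distribution: ~5 dimensions of update per layer.
LoRA at rank r trains r · (d_in + d_out) parameters per layer; even r = 1 on a 4096 × 4096 layer gives 8,192 trainable values, far more than ~5. The two counts are not directly comparable (the intrinsic-dim experiment uses a random subspace of the whole model, whereas LoRA uses a structured low-rank subspace in each layer), but the comparison suggests that a small rank, r = 4-8, is plausibly enough per layer.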
Empirically: LoRA at rank 8-16 matches full fine-tuning quality on many benchmarks for 7B+ models. The intrinsic-dim argument offers an explanation for why this rank is enough.
For larger models (70B+), LoRA rank 4-8 often suffices. For smaller models (1B-3B), rank 32-64 may be needed. This matches the prediction: larger pretrained model = lower intrinsic dim per layer = lower LoRA rank needed.
Now make it run
The kernel from this section trains a tiny linear layer two ways: (1) full fine-tuning of the whole d_in × d_out weight matrix, (2) LoRA at ranks 1, 2, 4, 8, 16. The “task delta” is constructed to be intrinsically rank-4, matching the Aghajanyan claim.
/* Train LoRA: W_eff = W_base + (B · A)^T. Only A and B are updated.
   Needs <stdlib.h> (malloc, free) and <string.h> (memset, memcpy).
   D_IN, D_OUT, N_TRAIN, N_EPOCHS, LR, normalf() and forward_mse()
   are defined elsewhere in the kernel. */
static float train_lora(const float* X, const float* Y, const float* W_base,
                        int rank, float* W_eff_out)
{
    float* A         = malloc(rank * D_IN * sizeof(float));   /* A: r × d_in  */
    float* B         = malloc(D_OUT * rank * sizeof(float));  /* B: d_out × r */
    float* gradA     = malloc(rank * D_IN * sizeof(float));
    float* gradB     = malloc(D_OUT * rank * sizeof(float));
    float* gradW_eff = malloc(D_IN * D_OUT * sizeof(float));
    float* W_eff     = malloc(D_IN * D_OUT * sizeof(float));

    /* A initialised Gaussian; B initialised to ZERO (standard LoRA init).
       This ensures the LoRA delta starts at 0: fine-tuning begins exactly at base. */
    for (int i = 0; i < rank * D_IN ; i++) A[i] = 0.1f * normalf();
    for (int i = 0; i < D_OUT * rank; i++) B[i] = 0.0f;

    float loss = 0;
    for (int e = 0; e < N_EPOCHS; e++) {
        /* Compute W_eff[k, j] = W_base[k, j] + sum_r B[j, r] · A[r, k] */
        for (int k = 0; k < D_IN; k++)
            for (int j = 0; j < D_OUT; j++) {
                float delta = 0;
                for (int r = 0; r < rank; r++) delta += B[j * rank + r] * A[r * D_IN + k];
                W_eff[k * D_OUT + j] = W_base[k * D_OUT + j] + delta;
            }
        loss = forward_mse(X, Y, W_eff, gradW_eff, N_TRAIN, 1);

        /* Chain rule:
           ∂L/∂A[r, k] = sum over j of (∂L/∂W_eff[k, j]) · B[j, r]
           ∂L/∂B[j, r] = sum over k of (∂L/∂W_eff[k, j]) · A[r, k] */
        memset(gradA, 0, rank * D_IN * sizeof(float));
        memset(gradB, 0, D_OUT * rank * sizeof(float));
        for (int k = 0; k < D_IN; k++)
            for (int j = 0; j < D_OUT; j++) {
                float g = gradW_eff[k * D_OUT + j];
                for (int r = 0; r < rank; r++) {
                    gradA[r * D_IN + k] += g * B[j * rank + r];
                    gradB[j * rank + r] += g * A[r * D_IN + k];
                }
            }
        for (int i = 0; i < rank * D_IN ; i++) A[i] -= LR * gradA[i];
        for (int i = 0; i < D_OUT * rank; i++) B[i] -= LR * gradB[i];
    }

    /* Compute final effective W and copy out. */
    if (W_eff_out) {
        /* Assumed completion (the source listing is cut off at this point):
           rebuild W_eff from the final A and B, then copy it to the caller. */
        for (int k = 0; k < D_IN; k++)
            for (int j = 0; j < D_OUT; j++) {
                float delta = 0;
                for (int r = 0; r < rank; r++) delta += B[j * rank + r] * A[r * D_IN + k];
                W_eff[k * D_OUT + j] = W_base[k * D_OUT + j] + delta;
            }
        memcpy(W_eff_out, W_eff, D_IN * D_OUT * sizeof(float));
    }
    free(A); free(B); free(gradA); free(gradB); free(gradW_eff); free(W_eff);
    return loss;  /* loss measured at the start of the last epoch */
}

Output:
LoRA vs Full Fine-Tuning on a synthetic linear task
d_in=32, d_out=16, N=256, 200 epochs

scheme          rank   trainable params   loss after FT
base (no FT)    -      0                  134.76645
full FT         -      512                  3.98511
LoRA r=1        1      48                  77.36503
LoRA r=2        2      96                  28.79549
LoRA r=4        4      192                  0.05982
LoRA r=8        8      384                  0.04649
LoRA r=16       16     768                  0.04294
Two observations:
- LoRA r=4 captures the task essentially perfectly (loss 0.06) — because the true task delta is rank 4. LoRA at the matching rank recovers the truth.
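- LoRA r ≥ 4 BEATS full FT (0.06 vs 3.99). Why? Full FT has 512 trainable parameters but the task only needs 192 (= 4 · (32+16)). One reading is that the extra 320 free parameters in full FT fit noise in the small N=256 training set, while LoRA at the right rank acts as a regulariser. That cannot be the whole story, though: r=16 is full rank for a 32 × 16 matrix and has MORE trainable parameters (768) than full FT, yet it also reaches 0.04. Part of the gap therefore likely comes from optimisation dynamics (the same learning rate and 200-epoch budget behave differently under the B · A parameterisation), not from the rank limit alone.
This is the practical takeaway: LoRA isn’t just “cheaper full FT”; on small fine-tuning datasets it can match or beat full FT, and the rank constraint is one plausible reason, since it limits how far the update can overfit.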
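The listing above does not show how the rank-4 task delta is built. One plausible construction, offered as an assumption and not as the kernel's actual code, multiplies a d_in × k factor by a k × d_out factor, which bounds the rank of the product by k. The helper names `make_low_rank_delta` and `numerical_rank` are hypothetical; the rank check uses plain Gaussian elimination with a caller-chosen tolerance.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static float absf(float v) { return v < 0.0f ? -v : v; }

/* delta[k*d_out + j] = sum_q U[k*rank + q] * V[q*d_out + j]
   U is d_in x rank, V is rank x d_out, so rank(delta) <= rank. */
void make_low_rank_delta(const float *U, const float *V,
                         int d_in, int d_out, int rank, float *delta)
{
    for (int k = 0; k < d_in; k++)
        for (int j = 0; j < d_out; j++) {
            float acc = 0.0f;
            for (int q = 0; q < rank; q++)
                acc += U[k * rank + q] * V[q * d_out + j];
            delta[k * d_out + j] = acc;
        }
}

/* Numerical rank via Gaussian elimination with partial pivoting.
   Works on a private copy, so M is left untouched. */
int numerical_rank(const float *M, int rows, int cols, float tol)
{
    size_t bytes = (size_t)rows * (size_t)cols * sizeof(float);
    float *a = malloc(bytes);
    assert(a);
    memcpy(a, M, bytes);
    int rank = 0;
    for (int c = 0; c < cols && rank < rows; c++) {
        int piv = rank;                              /* largest entry in column c */
        for (int i = rank + 1; i < rows; i++)
            if (absf(a[i * cols + c]) > absf(a[piv * cols + c])) piv = i;
        if (absf(a[piv * cols + c]) <= tol) continue; /* no pivot in this column */
        for (int j = 0; j < cols; j++) {             /* swap rows piv and rank */
            float tmp = a[piv * cols + j];
            a[piv * cols + j] = a[rank * cols + j];
            a[rank * cols + j] = tmp;
        }
        for (int i = rank + 1; i < rows; i++) {      /* eliminate below the pivot */
            float f = a[i * cols + c] / a[rank * cols + c];
            for (int j = c; j < cols; j++)
                a[i * cols + j] -= f * a[rank * cols + j];
        }
        rank++;
    }
    free(a);
    return rank;
}
```

With Gaussian-random 32 × 4 and 4 × 16 factors, the product would have rank 4 with probability one, and adding it to the base weight would give a fine-tuning target whose delta is exactly rank 4.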
LoRA forward pass:
y = x · W_base + x · (B · A)^T, where:
- W_base ∈ ℝ^(d_in × d_out): the original weight, FROZEN.
- A ∈ ℝ^(r × d_in): first low-rank matrix, TRAINABLE.
- B ∈ ℝ^(d_out × r): second low-rank matrix, TRAINABLE.
- r ≪ d_in, d_out: the LoRA rank (typically 4-64).
The effective weight is W_eff = W_base + (B · A)^T. The product B · A has shape d_out × d_in, so it is transposed to match the d_in × d_out layout of W_base used here. (Note: ordering conventions vary between papers; the rank-r factorisation is the key.)
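The forward pass does not require building W_eff at all. The sketch below, a minimal illustration with the hypothetical name `lora_forward`, follows the kernel's memory layout (W_base stored d_in × d_out, A as r × d_in, B as d_out × r, all row-major) and first projects x down to r values before projecting back up.

```c
#include <assert.h>
#include <stdlib.h>

/* y[j] = sum_k x[k] * W_base[k,j]  +  sum_q B[j,q] * h[q]
   where h[q] = sum_k A[q,k] * x[k] is the rank-r bottleneck.
   Extra cost over the base layer: r * (d_in + d_out) multiply-adds. */
void lora_forward(const float *x, const float *W_base,
                  const float *A, const float *B,
                  int d_in, int d_out, int r, float *y)
{
    assert(r > 0 && d_in > 0 && d_out > 0);
    float *h = malloc((size_t)r * sizeof(float));
    assert(h);

    for (int q = 0; q < r; q++) {                  /* down-projection: h = A x */
        float acc = 0.0f;
        for (int k = 0; k < d_in; k++) acc += A[q * d_in + k] * x[k];
        h[q] = acc;
    }
    for (int j = 0; j < d_out; j++) {
        float acc = 0.0f;
        for (int k = 0; k < d_in; k++)             /* frozen base path */
            acc += x[k] * W_base[k * d_out + j];
        for (int q = 0; q < r; q++)                /* low-rank path: B h */
            acc += B[j * r + q] * h[q];
        y[j] = acc;
    }
    free(h);
}
```

Keeping the two paths separate is what lets W_base stay frozen (and, in practice, shared or quantised) while only A and B receive gradients.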
Trainable parameter count:
Full FT: d_in · d_out parameters (the entire W).
LoRA: r · d_in + d_out · r = r · (d_in + d_out) parameters.
Ratio: r · (d_in + d_out) / (d_in · d_out). For d_in = d_out = d:
Ratio = 2r/d.
For d = 4096 (Llama 2 7B) and r = 8: ratio = 16/4096 = 0.004 ≈ 0.4%. LoRA uses 0.4% of the parameters that full FT uses for that linear layer.
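The per-layer count and ratio are easy to check in code. The function names below are hypothetical; the formulas are the ones given above.

```c
#include <assert.h>

/* Trainable parameters LoRA adds to one d_in x d_out linear layer:
   A contributes r * d_in, B contributes d_out * r. */
long lora_layer_params(long r, long d_in, long d_out)
{
    assert(r > 0 && d_in > 0 && d_out > 0);
    return r * (d_in + d_out);
}

/* Fraction of the layer's own weights that LoRA trains. */
double lora_layer_ratio(long r, long d_in, long d_out)
{
    return (double)lora_layer_params(r, d_in, d_out)
         / ((double)d_in * (double)d_out);
}
```

For example, `lora_layer_params(8, 4096, 4096)` gives 65,536 against 16,777,216 weights in the layer, and `lora_layer_params(4, 32, 16)` gives the 192 parameters listed for LoRA r=4 in the kernel output.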
Per-Llama-block savings:
Each attention block has 4 linear layers (W_Q, W_K, W_V, W_O); each FFN has 3 (gate, up, down). LoRA typically applies to all of them (LoRA-Q+K+V+O+gate+up+down). The per-layer percentages do not add up across layers: each adapted layer trains about 0.4% of its own weights (less for the wider FFN layers), so adapting all 7 layers per block leaves roughly 0.3% of the backbone parameters trainable.
For Llama 2 7B (6.7B params, 32 blocks, d = 4096, FFN width 11008), this is ~20M trainable parameters in LoRA at rank 8, versus 6.7B for full FT: over 300× fewer trainable params.
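Tallying the per-layer counts over a whole model is a useful cross-check on estimates like this one. The sketch below assumes Llama 2 7B's published shapes (hidden width 4096, FFN width 11008, 32 decoder blocks, no grouped-query attention), none of which are stated in the text above; `ModelShape` and `lora_total_params` are hypothetical names.

```c
#include <assert.h>

/* Shapes of a decoder-only model; values are supplied by the caller. */
typedef struct {
    long hidden;    /* model width d */
    long ffn;       /* FFN intermediate width */
    long n_blocks;  /* number of decoder blocks */
} ModelShape;

/* LoRA parameters for one d_in x d_out layer: r * (d_in + d_out). */
static long layer_params(long r, long d_in, long d_out)
{
    return r * (d_in + d_out);
}

/* Total trainable parameters with LoRA on Q, K, V, O, gate, up, down.
   Assumes Q, K, V and O are all hidden x hidden. */
long lora_total_params(ModelShape m, long r)
{
    assert(r > 0 && m.hidden > 0 && m.ffn > 0 && m.n_blocks > 0);
    long attn = 4 * layer_params(r, m.hidden, m.hidden); /* W_Q, W_K, W_V, W_O */
    long ffn  = 2 * layer_params(r, m.hidden, m.ffn)     /* gate, up: d -> ffn */
              +     layer_params(r, m.ffn, m.hidden);    /* down: ffn -> d */
    return m.n_blocks * (attn + ffn);
}
```

Under these assumptions, rank 8 comes to 624,640 parameters per block and 19,988,480 in total, about 0.3% of a 6.7B-parameter backbone.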
Initialisation — why B starts at zero
A small but important detail in the LoRA paper: B is initialised to zero, and A is initialised to a standard random distribution (e.g., Gaussian or Kaiming).
The motivation: at initialisation, you want the model to behave EXACTLY like the base. If both A and B were random, the LoRA delta B · A would be a random perturbation at step 0 — degrading the base model before any training. By setting B = 0, the LoRA delta is exactly zero at initialisation, and the model behaves identically to the base until the first gradient step. Fine-tuning then “discovers” the delta from a zero starting point.
The asymmetric init:
A: standard random init (Gaussian or Kaiming).
B: exactly zero.
Result at step 0: B · A = 0 · A = 0. The LoRA contribution to the forward pass is zero. The model is EXACTLY the base.
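The zero-delta property at step 0 can be verified directly. In this minimal sketch the names `lora_init` and `lora_delta_max_abs` are hypothetical, and A is drawn from a uniform distribution via `rand()` purely to keep the example short; the kernel in this section uses a Gaussian.

```c
#include <assert.h>
#include <stdlib.h>

/* Asymmetric init: A gets small random values, B is exactly zero.
   A is r x d_in, B is d_out x r, both row-major. */
void lora_init(float *A, float *B, int d_in, int d_out, int r,
               float scale, unsigned seed)
{
    assert(A && B && r > 0 && d_in > 0 && d_out > 0);
    srand(seed);
    for (int i = 0; i < r * d_in; i++) {
        float u = (float)rand() / (float)RAND_MAX;  /* uniform in [0, 1] */
        A[i] = scale * (2.0f * u - 1.0f);           /* uniform in [-scale, scale] */
    }
    for (int i = 0; i < d_out * r; i++) B[i] = 0.0f;
}

/* Largest |(B A)[j,k]| over the whole d_out x d_in delta. */
float lora_delta_max_abs(const float *A, const float *B,
                         int d_in, int d_out, int r)
{
    float worst = 0.0f;
    for (int j = 0; j < d_out; j++)
        for (int k = 0; k < d_in; k++) {
            float acc = 0.0f;
            for (int q = 0; q < r; q++) acc += B[j * r + q] * A[q * d_in + k];
            if (acc < 0.0f) acc = -acc;
            if (acc > worst) worst = acc;
        }
    return worst;
}
```

Right after `lora_init`, `lora_delta_max_abs` returns exactly 0 whatever values A holds, so the effective weight equals the base weight.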
Why this matters:
The first few gradient steps are critical. If the model’s behaviour at step 0 is already degraded (which random init of both A and B would cause), the initial gradients are computed from a “broken” starting point. Fine-tuning then has to first repair the degradation, then learn the task. The repair phase is wasted compute and can destabilise training.
Starting from B = 0 means: step 0 forward pass = base model. Step 1 forward pass = base + tiny correction. Fine-tuning gradually grows the correction in the right direction.
What goes wrong with symmetric small random init:
If A and B are both ~N(0, σ²) with small σ:
- B · A has entries of typical size O(√r · σ²), since each entry is a sum of r products of independent zero-mean terms: small but non-zero.
- The LoRA contribution at step 0 is a random perturbation to the base.
- The base’s pretrained representations are perturbed before any learning happens.
- Initial loss is higher than the base model’s loss; fine-tuning has to recover the base, THEN learn the task.
Why not initialise both to zero:
If both A and B are zero, gradients are also zero:
∂L/∂A ∝ B = 0; ∂L/∂B ∝ A = 0.
The LoRA module is STUCK at zero and can’t learn. A standard problem in neural net init.
The fix: ASYMMETRIC init.
A ~ Gaussian (provides non-zero “directions” for B to explore).
B = 0 (ensures forward pass starts identical to base).
Gradient at step 0: ∂L/∂A ∝ B = 0 (A doesn’t move at step 0). ∂L/∂B = (∂L/∂W_eff)^T · A^T ≠ 0 (B starts learning immediately). The transpose on ∂L/∂W_eff appears because W_eff is stored d_in × d_out while B is d_out × r.
After a few steps, B has non-zero entries, and A starts learning too. Both matrices then co-evolve.
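The step-0 behaviour of the two gradients can be checked numerically. The sketch below reuses the chain rule from the kernel in this section under the hypothetical name `lora_grads`; G stands for ∂L/∂W_eff and is supplied by the caller.

```c
#include <assert.h>
#include <string.h>

/* G is dL/dW_eff (d_in x d_out, row-major), where
   W_eff[k,j] = W_base[k,j] + sum_q B[j,q] * A[q,k].
     gradA[q,k] = sum_j G[k,j] * B[j,q]   (vanishes when B == 0)
     gradB[j,q] = sum_k G[k,j] * A[q,k]   (vanishes when A == 0) */
void lora_grads(const float *G, const float *A, const float *B,
                int d_in, int d_out, int r, float *gradA, float *gradB)
{
    assert(G && A && B && gradA && gradB);
    memset(gradA, 0, (size_t)r * (size_t)d_in * sizeof(float));
    memset(gradB, 0, (size_t)d_out * (size_t)r * sizeof(float));
    for (int k = 0; k < d_in; k++)
        for (int j = 0; j < d_out; j++) {
            float g = G[k * d_out + j];
            for (int q = 0; q < r; q++) {
                gradA[q * d_in + k] += g * B[j * r + q];
                gradB[j * r + q]    += g * A[q * d_in + k];
            }
        }
}
```

With B = 0 and a non-zero A, `gradA` comes out all zeros while `gradB` does not; with both at zero, both gradients vanish and the module cannot move, which is the stuck case described above.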
The asymmetric init is one of those “small detail that matters” tricks in modern deep learning: the LoRA paper uses it as its default initialisation, and it remains the standard choice in later LoRA implementations.
Next: §19.3 — The PEFT family. Adapters (Houlsby 2019), prefix / prompt tuning, IA³, and the LoRA variants (LoRA+, DoRA, AdaLoRA). How LoRA + QLoRA came to dominate.