Routing + sparse activation
Until 2023, almost every widely deployed LLM was dense: every parameter participated in every token’s forward pass. This is the simplest design but also the most expensive: inference cost scales linearly with parameter count, and there’s no way around it. Mixture-of-Experts (MoE) breaks this coupling.
The intuition: not every token needs every parameter. The token “the” doesn’t need the math-specialised parameters. The token “integral” doesn’t need the code-specialised ones. MoE makes this concrete: the FFN sublayer is replaced with N small FFNs (experts) plus a small router. The router picks the top-k experts per token (typically k=1 or 2 out of N=8 or 32); only those experts run. Parameters scale with N. Compute scales with k.
Mixtral 8x7B (Jiang 2024) was the breakout proof: 47B total parameters, 13B active per forward pass, performance close to dense 70B models. By 2024-2025, most frontier models with publicly documented architectures are MoE.
Replace the FFN with N experts + a router
Recall the Llama-style transformer block from Ch.15: pre-norm → attention → residual → pre-norm → FFN (SwiGLU) → residual. The FFN is the bulk of the parameters (~67% of the block; §15.1). MoE keeps everything else the same and replaces the FFN with N expert FFNs plus a router.
Mixture of Experts is the routing + sparse-activation pattern. Every modern MoE follows this same template; the variation is in (N, k, expert size, routing details).
The MoE kernel from this chapter implements the routing logic — softmax over gate logits, top-k selection, weighted sum of expert outputs, plus the load-balancing auxiliary loss (§17.3):
For one token x ∈ ℝ^d:
- Router computes gate logits: g = x · W_router, where W_router ∈ ℝ^(d × N). Output: N gate scores, one per expert.
- Softmax: p = softmax(g), a probability distribution over N experts.
- Top-k selection: pick the indices of the k highest probabilities. Renormalise their probabilities to sum to 1.
- Sparse compute: only the chosen k experts run. out = Σ_(i in top-k) p[i] · Expert_i(x).
- Result: a token-specific weighted combination of k expert outputs, instead of one fixed FFN’s output.
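The routing steps above can be sketched in NumPy for a single token. This is a minimal illustration under stated assumptions, not the chapter's kernel: the function names (`swiglu_expert`, `moe_forward`), the tiny dimensions, and the random weights are all hypothetical, and the load-balancing loss is left out.

```python
import numpy as np

def swiglu_expert(x, w_gate, w_up, w_down):
    """One SwiGLU FFN expert: (silu(x @ w_gate) * (x @ w_up)) @ w_down."""
    a = x @ w_gate
    silu = a / (1.0 + np.exp(-a))          # silu(a) = a * sigmoid(a)
    return (silu * (x @ w_up)) @ w_down

def moe_forward(x, w_router, experts, k=2):
    """Route one token x (shape [d]) through the top-k of N experts."""
    g = x @ w_router                        # gate logits, shape [N]
    p = np.exp(g - g.max())
    p /= p.sum()                            # softmax over all N experts
    top = np.argsort(p)[-k:]                # indices of the k largest probabilities
    weights = p[top] / p[top].sum()         # renormalise so the k weights sum to 1
    out = np.zeros_like(x)
    for idx, w in zip(top, weights):        # only the chosen k experts run
        out += w * swiglu_expert(x, *experts[idx])
    return out, top, weights

# Toy sizes for illustration only (Mixtral uses d=4096, F=14336, N=8).
rng = np.random.default_rng(0)
d, F, N = 16, 32, 8
w_router = rng.normal(size=(d, N))
experts = [
    (0.1 * rng.normal(size=(d, F)),         # w_gate
     0.1 * rng.normal(size=(d, F)),         # w_up
     0.1 * rng.normal(size=(F, d)))         # w_down
    for _ in range(N)
]
x = rng.normal(size=d)
out, top, weights = moe_forward(x, w_router, experts, k=2)
```

A production implementation would batch tokens and group them by expert rather than loop per token; the per-token loop is used here only because it mirrors the steps one-to-one.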
Parameter counts:
- Router: d × N parameters. For Mixtral (d=4096, N=8): 32,768 params per router per layer. Tiny.
- Each expert: 3 · d · F = 3 · 4096 · 14336 ≈ 176M params. For Mixtral 8x7B: 8 experts × 176M = 1.4B params per layer’s experts.
So the router is essentially free (0.002% of the expert parameter count). All the capacity is in the experts; the router just decides which ones to use.
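The arithmetic behind these counts is short enough to check directly. The variable names below are illustrative; the values are the Mixtral dimensions quoted above.

```python
d, F, N = 4096, 14336, 8             # model dim, FFN hidden dim, experts per layer

router_params = d * N                # W_router is d x N
expert_params = 3 * d * F            # W_gate, W_up, W_down of one SwiGLU expert
layer_expert_params = N * expert_params
router_share = router_params / layer_expert_params

print(router_params)                 # 32768
print(expert_params)                 # 176160768
print(layer_expert_params)           # 1409286144
print(f"{router_share:.4%}")         # 0.0023%
```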
Mixtral 8x7B — the canonical numbers
Mixtral 8x7B (Mistral AI, December 2023) was the model that proved MoE works at chat-quality scale outside of Google’s labs. Its config is the reference point for every subsequent open-source MoE.
The headline: 47B parameters of capacity for the inference cost of 13B. Roughly a 3.5× capacity-to-compute ratio. This is the central reason MoE has taken over: at fixed inference budget, you get a more capable model.
Active parameters is the right metric for inference cost. Mixtral 8x7B’s inference cost per token is close to a 13B dense model’s, despite the model file being 47B’s worth.
Mixtral parameter breakdown:
- Attention parameters (W_Q, W_K, W_V, W_O): not gated by the router, so ALL of them compute on every token. Mixtral uses grouped-query attention (32 query heads, 8 KV heads), so W_K and W_V are a quarter the size of W_Q and W_O: ~42M per layer, ~1.3B total.
- Expert parameters (W_gate, W_up, W_down × 8 experts × 32 layers): only the active experts compute. Per layer: 8 experts × 176M = 1.4B; total experts: ~45B.
- Embeddings (input table and output projection, untied in Mixtral): ~131M params each, ~0.26B together. The input side is a single row lookup per token; the output projection is a full matrix multiply.
Active params per token:
- Attention (all layers): ~1.3B, ALL of it active.
- Experts (top-2 of 8): 2/8 × 45.1B ≈ 11.3B active.
- Embeddings: ~0.26B, almost all of the compute being the output projection.
Sum: 1.3B + 11.3B + 0.26B ≈ 12.9B active, which is the “13B” in the headline.
Why not 47/4 = 11.75B:
Because attention is NOT MoE-fied. The ~1.3B of attention parameters are dense: they all compute on every token, as does the output projection. Only the FFN sublayer is replaced by the MoE experts. So the gating reduction (factor of 4) only applies to the experts, not to the whole model.
If attention were also routed, the active count would sit closer to 11.75B. Mainstream MoE designs leave attention dense because attention parameters are much smaller (~42M per layer vs 1.4B for experts) and the MoE benefit is concentrated on the bigger sublayer.
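The same accounting can be written as a function of the model config. This is a sketch: the `MoEConfig` class and its field names are hypothetical, while the default values are Mixtral 8x7B's published hyperparameters. It assumes grouped-query attention with 8 KV heads and separate input and output embedding matrices, so its attention and embedding terms come out different from a hand estimate that assumes full multi-head attention or tied embeddings. Normalisation weights are ignored.

```python
from dataclasses import dataclass

@dataclass
class MoEConfig:
    d: int = 4096              # model dimension
    ffn_dim: int = 14336       # hidden dimension of each expert
    n_layers: int = 32
    n_heads: int = 32          # query heads
    n_kv_heads: int = 8        # key/value heads (grouped-query attention)
    head_dim: int = 128
    n_experts: int = 8
    top_k: int = 2
    vocab: int = 32000
    tied_embeddings: bool = False

def count_params(c: MoEConfig) -> dict:
    """Total and per-token active parameter counts (norm weights ignored)."""
    q_and_o = 2 * c.d * c.n_heads * c.head_dim       # W_Q and W_O
    k_and_v = 2 * c.d * c.n_kv_heads * c.head_dim    # W_K and W_V
    attn = c.n_layers * (q_and_o + k_and_v)          # dense: always active
    expert = 3 * c.d * c.ffn_dim                     # W_gate, W_up, W_down
    experts_total = c.n_layers * c.n_experts * expert
    experts_active = c.n_layers * c.top_k * expert   # only top-k experts run
    router = c.n_layers * c.d * c.n_experts
    # By convention the embedding matrices are counted as active parameters.
    embed = c.vocab * c.d * (1 if c.tied_embeddings else 2)
    return {
        "attention": attn,
        "experts_total": experts_total,
        "experts_active": experts_active,
        "total": attn + experts_total + router + embed,
        "active": attn + experts_active + router + embed,
    }

counts = count_params(MoEConfig())
for name, value in counts.items():
    print(f"{name:>15}: {value / 1e9:6.2f}B")
```

Under these assumptions the totals land at roughly 46.7B total and 12.9B active, in line with the 47B / 13B headline figures. Changing `top_k` or `n_experts` shows directly how capacity and per-token compute move independently.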
Why this works at all
The deep question: why doesn’t sparse computation hurt quality?
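The empirical answer (Shazeer 2017, Fedus 2021): specialisation. Different experts learn to handle different patterns of input. After training, you can probe what each expert “does” by feeding tokens and watching which expert gets routed. The idealised picture (real splits are messier; Mixtral’s own routing analysis found experts specialising more by syntax than by topic):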
- One expert often handles math-like tokens (numbers, equations).
- Another handles code-like tokens.
- Another handles dialogue/conversational structure.
- Many handle overlapping but related domains.
The router learns to ROUTE tokens to the “right” expert for their content. The total capacity of N experts is much larger than any single expert; but for any specific token, only the relevant 1-2 experts matter. The model gets the benefit of a 47B-parameter knowledge base while only paying the compute cost of 13B.
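The probing procedure described above amounts to a routing histogram: for each category of token, count which expert the router ranks first. The sketch below uses synthetic data throughout. The hidden states, the category names, and the router weights are constructed by hand so that the pattern is visible; a real probe would take hidden states and router weights from a trained model, and the function name `routing_histogram` is hypothetical.

```python
import numpy as np
from collections import Counter

def routing_histogram(hidden_by_category, w_router):
    """Per category, count how often each expert is the router's top-1 choice."""
    hist = {}
    for category, hidden in hidden_by_category.items():
        logits = hidden @ w_router                 # [n_tokens, n_experts]
        # Softmax is monotonic, so argmax over logits equals argmax over probabilities.
        top1 = np.argmax(logits, axis=-1)
        hist[category] = Counter(top1.tolist())
    return hist

rng = np.random.default_rng(0)
d, n_experts, n_tokens = 8, 4, 200
categories = ["math", "code", "dialogue"]

# Synthetic hidden states: each category clusters around its own direction.
centres = 5.0 * np.eye(d)[: len(categories)]
hidden_by_category = {
    c: centres[i] + 0.1 * rng.normal(size=(n_tokens, d))
    for i, c in enumerate(categories)
}

# Hand-built router: expert i responds to category i's direction; expert 3 is unused.
w_router = np.zeros((d, n_experts))
w_router[: len(categories), : len(categories)] = np.eye(len(categories))

hist = routing_histogram(hidden_by_category, w_router)
for category, counter in hist.items():
    print(category, dict(counter))
```

With a trained model the histograms are rarely this clean: several experts usually share a category, and the same expert often serves more than one.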
This is closely related to the “feature direction” framing of mechanistic interpretability (Ch.14 §3): the residual stream encodes many features in superposition. MoE makes the same idea explicit and ARCHITECTURAL — different parameter subsets are dedicated to different feature spaces, and the router learns the mapping. The depth direction (32+ layers) still does most of the work; MoE adds a width direction (8+ experts per layer) of specialisation.
Dense 70B advantages:
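- Smaller memory footprint per unit of compute: a dense model stores only the parameters it actually uses, whereas Mixtral must keep all ~47B parameters resident (~94 GB in fp16, INCLUDING the unused experts) to run ~13B per token. In absolute terms a dense 70B is still larger, at ~140 GB in fp16.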
- Simpler implementation: no router, no load balancing, no all-to-all communication for expert parallelism.
- Better at batch=1 latency: every parameter is used, so there’s no “wasted load” of unused experts.
- Easier to fine-tune: all params see all tokens.
MoE 8x7B advantages:
- Roughly 4-5× less compute per token at comparable quality (~13B active parameters against 70B dense).
- Comparable quality at much lower active-compute cost.
- Scales better with capacity (you can add experts more cheaply than scaling dense).
When dense wins:
- Memory-constrained deployment (mobile, edge): MoE total params must fit in RAM/VRAM; the unused-expert overhead is dead weight.
- Fine-tuning research: dense is easier to LoRA, easier to interpret, easier to debug.
- Batch-1 ultra-low-latency: MoE has overhead from routing decisions and potentially from expert-parallelism communication that hurts batch-1 throughput.
When MoE wins:
- Batch inference at scale: routing overhead amortises across the batch; expert parallelism uses idle GPUs effectively.
- Cost-per-token serving: 4× inference savings compound.
- Pre-training if you have the compute: MoE pre-training is more sample-efficient per parameter activated.
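The trade-off in the lists above comes down to two numbers per model: memory follows total parameters, compute follows active parameters. A rough sketch, assuming 2 bytes per parameter in fp16 and the common rule of thumb of about 2 FLOPs per active parameter per token for the forward pass (attention-score compute and KV cache are ignored); the function names and the "dense 13B" comparison row are illustrative.

```python
def fp16_gb(total_params: int) -> float:
    """Weight memory in GB at 2 bytes per parameter."""
    return total_params * 2 / 1e9

def forward_flops_per_token(active_params: int) -> float:
    """Rule-of-thumb forward-pass cost: ~2 FLOPs per active parameter."""
    return 2.0 * active_params

# (total parameters, active parameters per token)
models = {
    "dense 70B":    (70_000_000_000, 70_000_000_000),
    "Mixtral 8x7B": (47_000_000_000, 13_000_000_000),
    "dense 13B":    (13_000_000_000, 13_000_000_000),
}

for name, (total, active) in models.items():
    print(f"{name:>12}: {fp16_gb(total):6.1f} GB weights, "
          f"{forward_flops_per_token(active) / 1e9:6.1f} GFLOPs/token")

compute_ratio = models["dense 70B"][1] / models["Mixtral 8x7B"][1]
memory_ratio = fp16_gb(models["Mixtral 8x7B"][0]) / fp16_gb(models["dense 13B"][0])
```

Under these assumptions Mixtral needs about 5.4× less compute per token than the dense 70B but about 3.6× more weight memory than a dense model with the same active size. Fewer FLOPs do not convert one-to-one into latency, since routing and memory bandwidth also matter.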
Inflection point: ~7-13B active params. Below this, dense is simpler and good enough. Above this, MoE becomes the cost-effective choice. Empirically: many high-quality models from 2024 onward above 10B active parameters use MoE (DeepSeek V3 671B/A37B, Mixtral 8x22B, Llama 4 Scout/Maverick; GPT-4 is widely reported, though not officially confirmed, to be MoE). Dense exceptions remain at this scale, such as Llama 3 70B and 405B. Below 10B (Llama 3 8B, Mistral 7B, Qwen 2.5 7B): dense.
Next: §17.2 — Capacity vs compute. The asymmetric scaling that makes MoE economically dominant: total parameters scale fast, active parameters scale slowly. DeepSeek V3 (671B / A37B), and why this changes inference economics.