Routing + sparse activation

Section 17.1

Until 2023, nearly every publicly documented commercial LLM was dense: every parameter participated in every token’s forward pass. This is the simplest design but also the most expensive: inference cost per token scales linearly with parameter count, and there’s no way around it.

Mixture-of-Experts (MoE) breaks this coupling. The intuition: not every token needs every parameter. The token “the” doesn’t need the math-specialised parameters. The token “integral” doesn’t need the code-specialised ones. MoE makes this concrete: the FFN sublayer is replaced with N small FFNs (experts) plus a small router. The router picks the top-k experts per token (typically k=1 or 2 out of N=8 to 32); only those experts run. Parameters scale with N. Compute scales with k.

Mixtral 8x7B (Jiang 2024) was the breakout proof: 47B total parameters, 13B active per token, performance close to dense 70B models. By 2024-2025, many of the largest openly documented frontier models are MoE.

Replace the FFN with N experts + a router

Recall the Llama-style transformer block from Ch.15: pre-norm → attention → residual → pre-norm → FFN (SwiGLU) → residual. The FFN is the bulk of the parameters (~67% of the block; §15.1). MoE keeps everything else the same and replaces the FFN with:

# Dense FFN (the baseline)
out = FFN(x)    where FFN is one SwiGLU with weights W_gate, W_up, W_down.
Cost per token: O(d · F + F · d) = O(d · F) FLOPs.    # F ≈ 2.7·d

# MoE FFN
N experts: Expert_1, Expert_2, ..., Expert_N    (each = small SwiGLU)
Router: W_router ∈ ℝ^{d × N}    (the gate network)

For each token x:
    logits = x · W_router                    # shape (N,)
    probs = softmax(logits)                  # shape (N,)
    top_k_idx, top_k_w = top_k(probs, k)     # k indices, k weights
                                             # weights renormalised to sum to 1
    out = Σ_{i ∈ top_k_idx} top_k_w[i] · Expert_i(x)

Cost per token: O(d · N) for the router + k · O(d · F) for the experts.
Memory required: N · O(d · F) for all experts (in HBM).
Compute per token: ~k/N of the dense-equivalent cost.
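The k/N compute claim can be checked with a rough FLOP count. The sketch below is illustrative only: it assumes 2 FLOPs per multiply-add, counts only the three SwiGLU matmuls and the router matmul (activations, softmax, and top-k are ignored), and plugs in Mixtral's d = 4096, F = 14336, N = 8, k = 2 from the configuration quoted later in this section. The function names are made up for this example.

```python
def swiglu_flops(d, f):
    # Three matmuls (gate, up, down), each d*f multiply-adds, 2 FLOPs per multiply-add.
    return 2 * 3 * d * f


def moe_ffn_flops(d, f, n_experts, k):
    # Router matmul (d x N) plus k expert FFNs; softmax and top-k are ignored.
    router = 2 * d * n_experts
    return router + k * swiglu_flops(d, f)


d, f, n_experts, k = 4096, 14336, 8, 2

moe = moe_ffn_flops(d, f, n_experts, k)
# "Dense-equivalent": a hypothetical layer that runs all N experts for every token.
dense_equivalent = n_experts * swiglu_flops(d, f)
ratio = moe / dense_equivalent

print(f"MoE FFN FLOPs per token:   {moe:,}")
print(f"All-experts FLOPs/token:   {dense_equivalent:,}")
print(f"ratio: {ratio:.5f}  (k/N = {k / n_experts})")
```

Under these assumptions the ratio sits a hair above k/N; the excess is the router's share, which is why the router is usually treated as free.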

Mixture of Experts is the routing + sparse-activation pattern: a transformer variant that replaces each FFN sublayer with N small FFNs (experts) plus a router network. The router computes per-token gating probabilities; only the top-k experts (typically k=1 or 2) compute for each token. Total parameter count scales with N (capacity); per-token compute scales with k (much smaller). Shazeer 2017 introduced sparsely-gated MoE; Fedus 2021's Switch Transformer simplified to k=1; Mixtral 8x7B (2024) was the breakout commercial demonstration. Every modern MoE follows this same template; the variation is in (N, k, expert size, routing details).

The MoE kernel from this chapter implements the routing logic — softmax over gate logits, top-k selection, weighted sum of expert outputs, plus the load-balancing auxiliary loss (§17.3):

For one token x ∈ ℝ^d:

Router computes gate logits: g = x · W_router where W_router ∈ ℝ^(d × N). Output: N gate scores, one per expert.
Softmax: p = softmax(g), a probability distribution over N experts.
Top-k selection: pick the indices of the k highest probabilities. Renormalise their probabilities to sum to 1, giving weights w[i].
Sparse compute: only the chosen k experts run. out = Σ_(i in top-k) w[i] · Expert_i(x), where w[i] is the renormalised weight from the previous step.
Result: a token-specific weighted combination of k expert outputs, instead of one fixed FFN’s output.
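The five steps above can be sketched in plain Python. This is a minimal illustration, not the chapter's kernel: the router weights are random, the "experts" are hypothetical stand-ins that just scale their input (a real expert is a SwiGLU FFN), and the names `route`, `moe_ffn`, and `make_toy_expert` are invented for this example. The load-balancing loss (§17.3) is not included.

```python
import math
import random


def softmax(logits):
    # Subtract the max for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(x, w_router, k):
    """Return [(expert_index, weight)] for the top-k experts of one token x."""
    n_experts = len(w_router[0])
    # Step 1: gate logits g = x . W_router, one score per expert.
    logits = [sum(x[i] * w_router[i][j] for i in range(len(x)))
              for j in range(n_experts)]
    # Step 2: softmax over the N experts.
    probs = softmax(logits)
    # Step 3: keep the k most probable experts and renormalise their weights.
    top = sorted(range(n_experts), key=lambda j: probs[j], reverse=True)[:k]
    norm = sum(probs[j] for j in top)
    return [(j, probs[j] / norm) for j in top]


def moe_ffn(x, w_router, experts, k):
    # Step 4: only the chosen experts run; step 5: weighted sum of their outputs.
    out = [0.0] * len(x)
    for j, w in route(x, w_router, k):
        y = experts[j](x)
        out = [o + w * v for o, v in zip(out, y)]
    return out


def make_toy_expert(scale):
    # Hypothetical stand-in for a SwiGLU expert: multiplies the input by a constant.
    return lambda x: [scale * v for v in x]


random.seed(0)
d, n_experts, k = 4, 8, 2
w_router = [[random.gauss(0, 1) for _ in range(n_experts)] for _ in range(d)]
experts = [make_toy_expert(float(j + 1)) for j in range(n_experts)]

x = [0.5, -1.0, 0.25, 2.0]
chosen = route(x, w_router, k)
y = moe_ffn(x, w_router, experts, k)
print("chosen experts and weights:", chosen)
print("output:", y)
```

Because each toy expert only rescales x, the output is x times the weighted average of the two chosen scales, which makes the sparse weighted sum easy to verify by hand.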

Parameter counts:

Router: d × N parameters. For Mixtral (d=4096, N=8): 32,768 params per router per layer. Tiny.
Each expert: 3 · d · F = 3 · 4096 · 14336 ≈ 176M params. For Mixtral 8x7B: 8 experts × 176M = 1.4B params per layer’s experts.

So the router is essentially free (0.002% of the expert parameter count). All the capacity is in the experts; the router just decides which ones to use.
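The arithmetic behind these counts is short enough to reproduce directly. The snippet below is a worked check using the Mixtral sizes given above (d = 4096, F = 14336, N = 8); the variable names are chosen for this example.

```python
d, f, n_experts = 4096, 14336, 8

router_params = d * n_experts                    # W_router is d x N
expert_params = 3 * d * f                        # W_gate, W_up, W_down
layer_expert_params = n_experts * expert_params  # all experts in one layer
router_share_pct = 100 * router_params / layer_expert_params

print(f"router params per layer:  {router_params:,}")
print(f"params per expert:        {expert_params:,}")
print(f"expert params per layer:  {layer_expert_params:,}")
print(f"router share of experts:  {router_share_pct:.4f}%")
```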

Mixtral 8x7B — the canonical numbers

Mixtral 8x7B (Mistral AI, December 2023) was the model that proved MoE works at chat-quality scale outside of Google’s labs. Its config is the reference point for every subsequent open-source MoE.

Mixtral 8x7B configuration:
    d_model       = 4096
    d_ffn (F)     = 14336     ← per expert
    n_layers      = 32
    n_heads       = 32        (multi-head attention, not MoE-d)
    n_kv_heads    = 8         (GQA)
    vocab         = 32000
    context_len   = 32768     (with sliding window attention)

# MoE specifics
    n_experts     = 8         ← per layer
    top_k         = 2         ← routing: top-2 experts active per token
    router_params = d · N · n_layers = 4096 · 8 · 32 ≈ 1M    (effectively free)

# Parameter accounting per layer:
    attention:     4 · d² ≈ 67M            (simplification: treats W_K and W_V as d × d;
                                            with 8 KV heads they are d × 1024, giving ≈ 42M)
    experts:       8 · 3 · d · F ≈ 1.4B    ← per layer's experts
    norm + router: negligible

# Total parameters (whole model):
    Attention all layers:  67M · 32 ≈ 2.1B
    Experts all layers:    1.4B · 32 ≈ 45B
    Embedding (tied):      ~131M
    GRAND TOTAL:           ≈ 47B parameters

# ACTIVE parameters per forward pass per token:
    Attention all layers:             2.1B
    Active experts (top-2):           2 · 176M · 32 ≈ 11.3B
    Embedding (lookup, not compute):  small
    TOTAL ACTIVE:                     ≈ 13B parameters

Inference cost: ~13B model
Memory cost:    ~47B model
Quality:        comparable to dense 70B

The headline: 47B parameters of capacity for the inference cost of 13B. Roughly a 3.5× capacity-to-compute ratio. This is the central reason MoE has taken over: at fixed inference budget, you get a more capable model.

Active parameters is the right metric for inference cost: the number of parameters that actually compute per forward pass per token in an MoE model. For top-k routing among N experts, it's (router + attention params) + (k/N · expert params). For Mixtral 8x7B: 13B active out of 47B total, a 3.5× capacity/compute ratio. This is the key number for comparing MoE to dense models: MoE compute cost ≈ dense model of size 'active params'; MoE quality ≈ dense model of size 'total params'. Mixtral 8x7B’s inference cost per token is close to a 13B dense model’s, despite the model file being 47B’s worth.

Mixtral parameter breakdown:

Attention parameters (W_Q, W_K, W_V, W_O): not gated by router — ALL of them compute every token. ~2.1B total.
Expert parameters (W_gate, W_up, W_down × 8 experts × 32 layers): only the active experts compute. Per layer: 8 experts × 176M = 1.4B; total experts: ~45B.
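Embedding (tied input/output): ~131M params. The input side is a single row lookup per token (effectively zero compute); the output projection back to the vocabulary is a real d × vocab matmul, but at ~131M it is small next to the ~13B computed elsewhere.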

Active params per token:

Attention (all layers): 2.1B — ALL of it active.
Experts (top-2 of 8): 2/8 × 45B = 11.25B active.
Embedding: lookup, ~0.

Sum: 2.1B + 11.25B ≈ 13.35B active.

Why not 47/4 = 11.75B:

Because attention is NOT MoE-fied. The 2.1B of attention parameters are dense — they all compute every token. Only the FFN sublayer is replaced by the MoE experts. So the gating reduction (factor of 4) only applies to the experts, not to the whole model.

If attention were also MoE’d, the active count would be closer to 11.75B. As of 2024, mainstream MoE leaves attention dense because attention parameters are smaller (per-layer 67M vs 1.4B for experts) and the MoE benefit is concentrated on the bigger sublayer. Newer models such as DeepSeek V3 do modify attention, but in other ways: its multi-head latent attention compresses the KV cache and does not route tokens among attention experts.
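The total-versus-active accounting above can be written as one small function. This is a sketch under the same simplifications the section uses: norm parameters are ignored, the embedding counts toward the total but not toward active compute, and `n_kv_heads = n_heads` reproduces the 4 · d² attention estimate. Passing `n_kv_heads = 8` shows the effect of GQA on the attention term. The function name and signature are invented for this example.

```python
def moe_param_counts(d, f, n_layers, n_experts, k, vocab, n_heads, n_kv_heads):
    """Return (total, active) parameter counts for a Mixtral-style model."""
    head_dim = d // n_heads
    kv_dim = head_dim * n_kv_heads
    # W_Q and W_O are d x d; W_K and W_V are d x kv_dim.
    attention = n_layers * (2 * d * d + 2 * d * kv_dim)
    experts = n_layers * n_experts * 3 * d * f
    router = n_layers * d * n_experts
    embedding = vocab * d  # counted once (tied), as in the text
    total = attention + experts + router + embedding
    # Attention and router are dense; only k of N experts run per token.
    active = attention + router + experts * k // n_experts
    return total, active


cfg = dict(d=4096, f=14336, n_layers=32, n_experts=8, k=2, vocab=32000, n_heads=32)

# The section's simplification: all four attention matrices are d x d.
total_simple, active_simple = moe_param_counts(n_kv_heads=32, **cfg)
# With grouped-query attention (8 KV heads), W_K and W_V shrink.
total_gqa, active_gqa = moe_param_counts(n_kv_heads=8, **cfg)

print(f"4*d^2 attention: total {total_simple / 1e9:.1f}B, active {active_simple / 1e9:.1f}B")
print(f"GQA attention:   total {total_gqa / 1e9:.1f}B, active {active_gqa / 1e9:.1f}B")
print(f"naive total/4:   {total_simple / 4 / 1e9:.1f}B")
```

Either way the active count lands well above total/4, because the attention term is never divided by N/k.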

Why this works at all

The deep question: why doesn’t sparse computation hurt quality?

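The empirical answer (Shazeer 2017, Fedus 2021): specialisation. Different experts learn to handle different patterns of input. After training, you can probe what each expert “does” by feeding tokens and recording which experts the router selects. The idealised picture, which real models match only loosely (the Mixtral paper's own routing analysis found experts tracking syntax more than topic):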

One expert often handles math-like tokens (numbers, equations).
Another handles code-like tokens.
Another handles dialogue/conversational structure.
Many handle overlapping but related domains.
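Probing of this kind amounts to tallying routing decisions by token category. The sketch below shows one way to do the tally; it assumes you have already logged, for each token, a category label and the expert indices the router chose. The log here is made up purely to exercise the function and is not measured from any model, and `expert_usage_by_label` is a hypothetical helper name.

```python
from collections import Counter, defaultdict


def expert_usage_by_label(records):
    """records: iterable of (label, expert_indices).

    Returns {label: {expert_index: fraction of that label's routing slots}}.
    """
    counts = defaultdict(Counter)
    for label, expert_indices in records:
        counts[label].update(expert_indices)
    usage = {}
    for label, counter in counts.items():
        total = sum(counter.values())
        usage[label] = {e: n / total for e, n in sorted(counter.items())}
    return usage


# Made-up routing log (top-2 routing), for illustration only.
toy_log = [
    ("math", (3, 5)),
    ("math", (3, 1)),
    ("code", (6, 5)),
    ("code", (6, 2)),
]

usage = expert_usage_by_label(toy_log)
for label, shares in usage.items():
    print(label, shares)
```

A sharply peaked distribution for a label would suggest specialisation; a flat one would suggest the router is splitting on something other than that label.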

The router learns to ROUTE tokens to the “right” expert for their content. The total capacity of N experts is much larger than any single expert; but for any specific token, only the relevant 1-2 experts matter. The model gets the benefit of a 47B-parameter knowledge base while only paying the compute cost of 13B.

This is closely related to the “feature direction” framing of mechanistic interpretability (Ch.14 §3): the residual stream encodes many features in superposition. MoE makes the same idea explicit and ARCHITECTURAL — different parameter subsets are dedicated to different feature spaces, and the router learns the mapping. The depth direction (32+ layers) still does most of the work; MoE adds a width direction (8+ experts per layer) of specialisation.

Dense 70B advantages:

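Memory: no advantage at this size. Dense 70B needs ~140 GB in fp16 against Mixtral’s ~94 GB. Dense wins on memory only at equal compute: a dense 13B needs ~26 GB, while Mixtral’s ~94 GB INCLUDES the experts that sit unused for any given token.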
Simpler implementation: no router, no load balancing, no all-to-all communication for expert parallelism.
Better at batch=1 latency: every parameter is used, so there’s no “wasted load” of unused experts.
Easier to fine-tune: all params see all tokens.

MoE 8x7B advantages:

~5× less compute per token than dense 70B (13B vs 70B active parameters) at comparable quality.
Comparable quality at much lower active-compute cost.
Scales better with capacity (you can add experts more cheaply than scaling dense).

When dense wins:

Memory-constrained deployment (mobile, edge): MoE total params must fit in RAM/VRAM; the unused-expert overhead is dead weight.
Fine-tuning research: dense is easier to LoRA, easier to interpret, easier to debug.
Batch-1 ultra-low-latency: MoE has overhead from routing decisions and potentially from expert-parallelism communication that hurts batch-1 throughput.

When MoE wins:

Batch inference at scale: routing overhead amortises across the batch; expert parallelism uses idle GPUs effectively.
Cost-per-token serving: the per-token compute savings (only k/N = 1/4 of the expert FLOPs run for Mixtral) compound across every request.
Pre-training if you have the compute: MoE pre-training is more sample-efficient per parameter activated.

Inflection point: ~7-13B active params. Below this, dense is simpler and good enough. Above this, MoE becomes the cost-effective choice. Empirically: many high-quality 2024+ models above 10B active parameters use MoE (DeepSeek V3 671B/A37B, Mixtral 8x22B, Llama 4 Scout/Maverick; GPT-4 is widely reported, but not confirmed, to be MoE). The pattern is not universal: Llama 3 70B and 405B are dense. Below 10B (Llama 3 8B, Mistral 7B, Qwen 2.5 7B): dense.
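The fp16 footprints quoted in the comparison above follow from two bytes per parameter. The helper below is a back-of-envelope sketch: it counts weights only (no KV cache, activations, or framework overhead) and uses 1 GB = 10^9 bytes. The function name is invented for this example.

```python
def weight_memory_gb(n_params, bytes_per_param=2):
    # Weights only; fp16/bf16 uses 2 bytes per parameter.
    return n_params * bytes_per_param / 1e9


dense_70b = weight_memory_gb(70e9)
mixtral_total = weight_memory_gb(47e9)  # all experts must stay resident
dense_13b = weight_memory_gb(13e9)      # dense model at Mixtral's active size

print(f"dense 70B:     {dense_70b:.0f} GB")
print(f"Mixtral 8x7B:  {mixtral_total:.0f} GB")
print(f"dense 13B:     {dense_13b:.0f} GB")
```

The gap between the last two lines is the memory cost of MoE at equal compute: every expert has to be loaded even though only k of N run for any one token.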

Next: §17.2 — Capacity vs compute. The asymmetric scaling that makes MoE economically dominant: total parameters scale fast, active parameters scale slowly. DeepSeek V3 (671B / A37B), and why this changes inference economics.