MIXTURE-OF-EXPERTS
Section 17.2

Capacity vs compute — the asymmetric scaling

Two parameter counts now matter for any MoE model: total parameters (everything in the model file, including all experts) and active parameters (what actually computes per token). Total dictates memory cost. Active dictates FLOPs per token. The ratio between them — active/total — is the central design knob of modern frontier models. Mixtral 8x7B is ~28% (13/47); DeepSeek V3 671B-A37B is ~5.5% (37/671); some MoE variants push as low as ~3%. The implication is profound: you can build a model with the knowledge capacity of a 671B model while paying the per-token inference cost of a 37B model. This section walks the math, the DeepSeek V3 case study, and the surprising follow-on: at extreme MoE ratios, the bottleneck shifts from compute to memory bandwidth — a fundamentally different optimisation problem from dense LLMs.

The two parameter counts

MoE parameter accounting (one layer):

    Total params per layer:    N · expert_size + attention + router + norm
                               (N · expert_size is nearly all of it; the rest is "the constant cost")
    Active params per layer:   k · expert_size + attention + router + norm
                               (k · expert_size is the gated cost)

    Total params >> Active params; the expert term alone differs by a factor of N/k.

    Memory cost (HBM):         total · (bytes per param)
    Compute cost (FLOPs/tok):  2 · active for forward; backward adds another 4 · active
    KV cache size:             same as a dense model with the same attention dimensions
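The per-layer accounting above can be sketched in a few lines of Python. The function name and the example sizes (176M-parameter experts, 42M of attention, and so on) are illustrative assumptions, not figures taken from any specific model.

```python
def layer_params(n_experts, top_k, expert_size, attention, router, norm):
    """Return (total, active) parameter counts for one MoE layer."""
    constant = attention + router + norm          # paid by every token
    total = n_experts * expert_size + constant    # what must sit in memory
    active = top_k * expert_size + constant       # what computes per token
    return total, active


# Hypothetical layer: 8 experts of 176M params each, top-2 routing.
total, active = layer_params(
    n_experts=8, top_k=2, expert_size=176e6,
    attention=42e6, router=32_768, norm=8_192,
)
print(f"total  = {total / 1e9:.2f}B params held in memory")
print(f"active = {active / 1e9:.2f}B params used per token")
print(f"forward FLOPs/token ~ {2 * active / 1e9:.2f} GFLOPs")
print(f"active/total = {active / total:.1%}")
```

Because the constant term appears in both counts, the overall active/total ratio is always somewhat higher than the routed fraction k/N.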

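The split is crucial because the two parameter counts trade against TOTALLY DIFFERENT hardware resources: total parameters consume memory capacity (HBM), while active parameters consume per-token FLOPs and memory bandwidth.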

DeepSeek V3 — the extreme MoE case

DeepSeek V3 (2024) is the headline MoE result of the post-Mixtral era. The configuration:

DeepSeek V3 configuration:

    d_model              = 7168
    n_layers             = 61       (fewer than Llama 3 70B's 80; dense and MoE layers are mixed)
    context_len          = 131072   (128K via YaRN extrapolation)

    # MoE specifics, different from Mixtral
    n_experts (routed)   = 256      ← far more than Mixtral's 8
    n_experts (shared)   = 1        ← always-active expert per layer
    top_k                = 8        ← top-8 of 256 routed experts active per token
    expert_size          = ~2.3B params each, summed over all MoE layers (smaller experts, more of them)
    router_params        = d · N_routed = 7168 · 256 ≈ 1.8M per MoE layer

    # The first 3 layers are DENSE (no MoE), an empirical finding.
    # The remaining 58 layers are MoE layers.

    # Parameter accounting:
    Total params:          671 B
    Active params/token:   37 B    (~5.5% of total)
    Active/Total ratio:    ~1 / 18

The numbers: 671B parameters total, 37B active. That’s an 18× capacity-to-compute ratio, far beyond Mixtral’s roughly 4×. Per-token inference FLOPs are similar to a dense 37B model. Quality on benchmarks is GPT-4 class (~88% MMLU, ~89% GSM8K).

What design choices enable the 18× ratio?

  1. More, smaller experts. DeepSeek V3 uses 256 experts of ~2.3B each instead of Mixtral’s 8 of ~5.5B. More experts = finer specialisation, but each individual expert is smaller, so even at top-8 you’re using only 1/32 of expert capacity per layer.
  2. Shared expert. One expert that ALWAYS runs (no routing decision), providing a “general purpose” base. The routed experts add specialisation on top.
  3. Auxiliary-loss-free balancing. Standard load-balancing aux losses (§17.3) can degrade model quality, because the balancing gradient competes with the language-modelling objective; DeepSeek introduced a bias-update scheme that balances without an explicit loss term, keeping the routing more specialised.
  4. MLA (Multi-head Latent Attention). Attention is also factorised to reduce KV cache cost, which matters more at higher MoE total counts.

Shared experts are a recurring innovation in 2024 MoE designs. They guarantee that EVERY token has access to some baseline FFN capacity, taking the pressure off the routing decision to be perfect.
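A minimal sketch of how a shared expert and bias-adjusted top-k routing might fit together. It assumes positive per-expert affinity scores, a bias that influences only which experts are selected (not their gate weights), and a fixed-step bias update. All function names and the step size are hypothetical; this illustrates the idea and is not DeepSeek's implementation.

```python
def route_token(scores, bias, k):
    """Select top-k experts by score + bias; gate weights use raw scores only."""
    ranked = sorted(range(len(scores)),
                    key=lambda e: scores[e] + bias[e], reverse=True)
    chosen = ranked[:k]
    norm = sum(scores[e] for e in chosen)
    return {e: scores[e] / norm for e in chosen}


def moe_forward(x, shared_expert, routed_experts, gates):
    """The shared expert always runs; routed experts are added, gate-weighted."""
    return shared_expert(x) + sum(w * routed_experts[e](x)
                                  for e, w in gates.items())


def update_bias(bias, load, step):
    """Lower the bias of overloaded experts, raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - step)
        elif n < mean_load:
            new_bias.append(b + step)
        else:
            new_bias.append(b)
    return new_bias


# Toy example: 4 routed experts, top-2, scalar "activations".
scores = [0.9, 0.8, 0.3, 0.2]            # assumed positive affinities
experts = [lambda x, m=m: m * x for m in (1.0, 2.0, 3.0, 4.0)]
gates = route_token(scores, bias=[0.0] * 4, k=2)
y = moe_forward(1.0, shared_expert=lambda x: x,
                routed_experts=experts, gates=gates)
print(sorted(gates), round(y, 4))

# If experts 0 and 1 took all the traffic, their selection bias drops.
print(update_bias([0.0] * 4, load=[4, 4, 0, 0], step=0.1))
```

Keeping the bias out of the gate weights means balancing changes which experts are picked without distorting how their outputs are mixed.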

— think, then check —

Mixtral 8x7B: 8 experts, top-2 routing. Active = (2/8) · expert_params + attention = roughly 1/4 of total = ~13B active out of 47B.

DeepSeek V3: 256 routed experts + 1 shared, top-8 routing. Active = (8/256) · routed_expert_params + 1 · shared_expert_params + attention = roughly 1/18 of total = 37B active out of 671B.
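As a rough cross-check of the two ratios above, a two-term model (total = N · e + c, active = k · e + c) can be solved for the per-expert size e and the always-on remainder c. This is a simplification: it folds the shared expert, attention, dense layers, and embeddings into c, so the results only approximate the per-expert sizes quoted in this section. The function name is illustrative.

```python
def solve_expert_size(total_b, active_b, n_routed, top_k):
    """Solve total = N*e + c and active = k*e + c for (e, c), in billions."""
    expert = (total_b - active_b) / (n_routed - top_k)
    always_on = active_b - top_k * expert
    return expert, always_on


for name, total_b, active_b, n, k in [
    ("Mixtral 8x7B", 47, 13, 8, 2),
    ("DeepSeek V3", 671, 37, 256, 8),
]:
    expert, always_on = solve_expert_size(total_b, active_b, n, k)
    print(f"{name}: active/total = {active_b / total_b:.1%}, "
          f"routed fraction k/N = {k / n:.3f}, "
          f"expert ~ {expert:.2f}B, always-on ~ {always_on:.2f}B")
```

The always-on remainder is why the overall ratio (about 1/18 for DeepSeek V3) is less extreme than the routed fraction (1/32).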

The design choices that enable the high ratio:

  1. Smaller experts. DeepSeek’s experts are ~2.3B each vs Mixtral’s ~5.5B. Smaller experts let you have more of them at the same total parameter budget, with finer specialisation. Each routed expert contributes less per activation but the model has more variety.
  2. Higher N, modest k. Going from N=8 to N=256 with k going from 2 to 8 keeps the “experts per token” similar in absolute terms (2 vs 8 — only 4× more), but the “experts per token / total experts” goes from 1/4 to 1/32. The sparsity is much higher.
  3. Shared expert. The 1 always-active expert provides a baseline that handles general patterns, so the 8 routed experts can specialise more. Without the shared expert, top-8/256 routing would likely lose quality.
  4. Auxiliary-loss-free balancing. Standard balancing aux losses hurt specialisation; DeepSeek’s bias-update scheme balances without forcing the router to “spread tokens fairly”.

The trade-off:

  • Memory cost: 671B params in HBM. Needs ~1.3 TB in fp16 (~670 GB in fp8, ~335 GB in int4), so the weights must be sharded across many GPUs (for example 8× H200 at fp8).
  • Pretraining complexity: more experts means more routing decisions to learn; harder to balance; can collapse if not carefully designed.
  • Inference engineering: requires all-to-all expert routing at inference time (each GPU holds a subset of experts; tokens must be routed across GPUs).
  • Marginal returns: going beyond 18× sparsity (e.g., 1/64) tends to plateau in quality. There’s a quality-vs-sparsity efficient frontier; DeepSeek seems to be near it.

The economic win: 37B-class inference cost for 88% MMLU performance. Before MoE, that was impossible — you needed a dense 70B+ model.

Memory bandwidth: the hidden constraint

At extreme MoE ratios, the inference bottleneck shifts from compute to memory bandwidth. Walk through why.

Per-token inference cost (dense model with N_active parameters):

    FLOPs/token:      2 · N_active               (forward only; training adds another 4 · N_active for backward)
    Mem read:         N_active · sizeof(param)   (each param read once per token)
    Arith intensity:  2 · N_active / (N_active · 2 bytes) = 1 FLOP/byte
                      (catastrophically low; bandwidth-bound)

    With KV cache + activation reuse during decoding, dense inference is typically
    ~2-3 FLOPs/byte → still bandwidth-bound on H100 (3 TB/s HBM and 1 PFLOP/s,
    a ratio of ~333 FLOPs/byte at peak).

MoE inference (per token):

    FLOPs/token:      2 · N_active               (only active experts compute)
    Mem read:         N_active · sizeof(param)   (only active expert weights loaded)
    PLUS routing:     all-to-all exchanges across GPUs in every MoE layer
                      (dispatch tokens to experts, then gather expert outputs)

The catch: in batched inference, DIFFERENT TOKENS in the batch may route to DIFFERENT EXPERTS. To run a batch of B tokens through their assigned experts, you may need to gather/scatter, moving tokens between GPUs. At scale, and for some batch sizes, this communication DOMINATES the wall-clock time.

The decode-side bottleneck for MoE looks like: routing decision (small), gather tokens by expert, run each expert on its assigned tokens (this is the compute), scatter outputs back. The “gather” and “scatter” steps are All-to-All communications, among the most expensive collectives on GPU clusters.

This is why MoE inference engines (vLLM, SGLang, TensorRT-LLM) have specialised expert parallelism strategies that batch tokens routing to the same expert together, even across requests. The compute is then efficient, but the communication overhead is the bottleneck — fundamentally different from dense LLM serving.
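The gather, compute, scatter pattern can be sketched in plain Python. This is a toy model under stated assumptions: experts are placed on GPUs in contiguous blocks, expert outputs are summed without gate weights, and `expert_fn` is a hypothetical stand-in for the expert FFN. Real engines do this with fused kernels and collective communication.

```python
from collections import defaultdict


def dispatch(assignments, experts_per_gpu):
    """Group token indices by expert, then by the GPU hosting that expert."""
    by_expert = defaultdict(list)
    for token, experts in enumerate(assignments):
        for e in experts:
            by_expert[e].append(token)
    by_gpu = defaultdict(dict)
    for e, tokens in sorted(by_expert.items()):
        by_gpu[e // experts_per_gpu][e] = tokens   # contiguous placement (assumed)
    return {gpu: dict(experts) for gpu, experts in by_gpu.items()}


def combine(inputs, by_gpu, expert_fn):
    """Run each expert once on its batch of tokens and scatter-add the results."""
    out = [0.0] * len(inputs)
    for experts in by_gpu.values():
        for e, tokens in experts.items():
            for t in tokens:
                out[t] += expert_fn(e, inputs[t])  # gate weights omitted
    return out


# 3 tokens, top-2 routing over 8 experts, 4 experts per GPU.
assignments = [[0, 5], [5, 6], [0, 1]]
plan = dispatch(assignments, experts_per_gpu=4)
print(plan)

outputs = combine([1.0, 2.0, 3.0], plan, expert_fn=lambda e, x: (e + 1) * x)
print(outputs)
```

In this toy plan, token 0 needs experts on both GPUs, which is exactly the case that forces cross-GPU traffic in a real deployment.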

— think, then check —

Configuration: 4× H100 (4 × 80 GB = 320 GB total HBM, 4 × 3 TB/s = 12 TB/s aggregate bandwidth).

Memory: 671B params in fp16 = 1.3 TB. Doesn’t fit. Need either:

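  • Lower precision: fp8 → ~670 GB (still doesn’t fit). int4 quantization → ~335 GB (still just over the 320 GB available, before any KV cache or activations).
  • Or expert parallelism: split the 256 routed experts across 4 GPUs (64 experts/GPU). Even at fp8, each GPU holds ~64 experts of ~2.3B each ≈ 147 GB, plus attention (~8 GB) and the shared expert (~2.3 GB), for ~160 GB. That is twice the 80 GB per GPU: expert parallelism spreads the weights but does not shrink them, so it needs more GPUs or more aggressive quantization on top.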

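Practical setup: experts at fp8 (1 byte/param), attention at bf16. Total memory: ~700 GB, or ~175 GB per GPU on 4× H100, more than double the 80 GB each one has. This is why DeepSeek V3 is more commonly deployed on 8× H200 (8 × 141 GB ≈ 1.1 TB). The per-token estimates that follow keep the 4× H100 aggregate figures only to show where the time goes.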
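The fit check is plain arithmetic and easy to script. The sketch below counts weights only (no KV cache, activations, or framework overhead) and uses 1 GB = 1e9 bytes; the helper names are illustrative.

```python
def weights_gb(params_billion, bytes_per_param):
    """Weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param


def fits(total_gb, n_gpus, hbm_gb_per_gpu):
    """True if the weights alone fit in aggregate HBM."""
    return total_gb <= n_gpus * hbm_gb_per_gpu


PRECISIONS = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}
CLUSTERS = {"4x H100": (4, 80), "8x H200": (8, 141)}

for prec, nbytes in PRECISIONS.items():
    gb = weights_gb(671, nbytes)
    for label, (n, hbm) in CLUSTERS.items():
        verdict = "fits" if fits(gb, n, hbm) else "does not fit"
        print(f"{prec:>5}: {gb:7.1f} GB on {label} ({n * hbm} GB) -> {verdict}")
```

Since the check ignores everything except weights, a "fits" verdict is a necessary condition rather than a deployment guarantee.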

Per-token compute:

37B active params × 2 FLOPs/param = 74 GFLOPs per token (forward decode).

4× H100 aggregate: 4 × 1 PFLOP = 4 PFLOPs/s.

Compute time: 74 GFLOPs / 4 PFLOPs/s ≈ 18 μs per token. Negligible.

Per-token memory bandwidth:

~37B params @ 1 byte (fp8) = 37 GB to load from HBM per token.

4× H100 aggregate bandwidth: 4 × 3 TB/s = 12 TB/s.

Read time: 37 GB / 12 TB/s ≈ 3.1 ms per token. This is where time goes.

Per-token communication:

Tokens have to be routed to their assigned experts across the 4 GPUs. For top-8 routing, each token may have experts spread across all 4 GPUs. All-to-all communication is required per layer.

One 4-way all-to-all exchange per MoE layer, plus per-batch overhead. For batch=1, this can add up to ~1-3 ms per token across all layers with NVLink-class interconnect, though the routing communication is typically batched and overlapped with compute.

Total per-token time: ~5-10 ms for batch=1 decode, i.e. roughly 100-200 tokens/sec.

Where time actually goes:

  1. HBM bandwidth for loading active params: 3-5 ms (the dominant term).
  2. All-to-all routing between GPUs: 1-3 ms (grows with batch size; can be partly overlapped with compute).
  3. Actual compute: 18 μs (effectively zero).
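The compute and weight-read terms in this breakdown can be reproduced with a short estimate. It assumes ideal aggregate scaling across GPUs, the rounded H100 figures used in this section (1 PFLOP/s, 3 TB/s), and ignores communication and KV cache reads; the function name is illustrative.

```python
def decode_time_s(active_params, bytes_per_param, n_gpus,
                  flops_per_gpu=1e15, bw_per_gpu=3e12):
    """Return (compute_s, read_s) for one decoded token at batch=1."""
    compute_s = 2 * active_params / (n_gpus * flops_per_gpu)
    read_s = active_params * bytes_per_param / (n_gpus * bw_per_gpu)
    return compute_s, read_s


compute_s, read_s = decode_time_s(37e9, bytes_per_param=1, n_gpus=4)
print(f"compute:      {compute_s * 1e6:.1f} us")
print(f"weight read:  {read_s * 1e3:.2f} ms")
print(f"read/compute: {read_s / compute_s:.0f}x")
```

Under these assumptions the weight read takes over a hundred times longer than the arithmetic, which is the sense in which batch=1 decode is bandwidth-bound.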

The key insight: MoE inference at the active-param size is bandwidth-bound, not compute-bound. The 18× sparsity ratio buys you “37B inference cost” — meaning you’re loading 37B of param bytes per token, regardless of how many experts the model actually contains in HBM. Compute scales with active params; bandwidth scales with active params; memory CAPACITY scales with total params.

— think, then check —

The H100 numbers: 80 GB HBM, 3 TB/s bandwidth, 1 PFLOP/s in bf16. Arithmetic intensity (FLOPs per byte loaded) needs to be ≥ 333 to fully utilise compute.

Training:

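Training does forward + backward. Each parameter contributes 2 FLOPs per token in the forward pass and 4 in the backward pass, so 2 + 4 = 6 FLOPs per parameter per token. Global batches run to millions of tokens, and even after sharding, each GPU pushes thousands of tokens through a weight every time it is loaded. Taking ~10,000 tokens per weight load as a representative figure, the arithmetic intensity becomes:

(6 FLOPs/param/token) × (~10,000 tokens per weight load) / (2 bytes/param) ≈ 30,000 FLOPs/byte, i.e. tens of thousands. WAY above 333. Training is compute-bound and can keep the GPU’s FLOPs busy.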

Inference (decode):

For batch=1 token-by-token generation, each parameter is loaded once and used for ~2 FLOPs (forward only). Arithmetic intensity = 2 FLOPs per 2 bytes = 1 FLOP/byte. WAY below 333. Inference is bandwidth-bound — the GPU spends 99% of time waiting for HBM and only 1% computing.

This is the “low arithmetic intensity of decode” problem and it’s why inference economics are so different from training.
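The training-versus-decode contrast can be made concrete with a simple roofline estimate. The sketch below uses the rounded H100 figures from this section and assumes each loaded weight is reused by every token in the batch; the function names are illustrative, and the result is an upper bound on utilisation, not a measurement.

```python
def ridge_point(peak_flops, bandwidth):
    """FLOPs/byte needed to saturate compute."""
    return peak_flops / bandwidth


def decode_intensity(tokens_per_weight_load, bytes_per_param=2):
    """FLOPs per weight byte when each loaded param serves this many tokens."""
    return 2 * tokens_per_weight_load / bytes_per_param


def compute_utilisation(intensity, ridge):
    """Upper bound on the fraction of peak FLOPs reachable (simple roofline)."""
    return min(1.0, intensity / ridge)


ridge = ridge_point(1e15, 3e12)
print(f"ridge point: {ridge:.0f} FLOPs/byte")
for batch in (1, 16, 256, 512):
    ai = decode_intensity(batch)
    print(f"batch={batch:4d}: {ai:6.1f} FLOPs/byte, "
          f"<= {compute_utilisation(ai, ridge):.1%} of peak")
```

Batching is the lever: intensity grows linearly with the number of tokens that share each weight load, so large serving batches move decode toward the compute-bound regime.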

MoE inference:

For top-k MoE at active size N_active:

- FLOPs/token: 2 · N_active (same per-active-param as dense).

- Bytes/token: N_active · sizeof(param) (only active expert weights loaded).

Arithmetic intensity: 2 / 2 = 1 FLOP/byte. Same as dense.

So MoE doesn’t change the arithmetic intensity — it’s still bandwidth-bound. What MoE changes is that you only PAY the bandwidth for the active params. Memory CAPACITY scales with total; bandwidth COST scales with active.

The implication for deployment:

An MoE model with 18× sparsity has 18× the memory cost of its active-size equivalent (a 37B dense), but identical per-token bandwidth and FLOPs. So inference cost per token is comparable to the active size. Memory cost is comparable to total. Storage is the multiplier; compute and bandwidth are the per-token costs.

The implication for hardware: future ML accelerators should optimise for MEMORY CAPACITY at acceptable bandwidth, not only for raw FLOPs. Recent parts point in that direction: the H200 keeps the H100’s compute but raises HBM from 80 GB to 141 GB, and the B200 and MI300X each ship with up to 192 GB.

Next: §17.3 — Expert parallelism and load balancing. The all-to-all communication pattern, why expert collapse is the canonical MoE failure mode, and the load-balancing techniques (aux loss, sequence-balance, DeepSeek’s loss-free approach) that prevent it.