TPUs, Apple Silicon, AMD MI300X — the non-NVIDIA landscape
If you read ML papers from 2017-2024, you might think NVIDIA is the entire compute industry. It isn’t. Google has shipped TPUs since 2016 with a fundamentally different architecture (systolic arrays). Apple Silicon has unified memory + a dedicated Neural Engine, changing cost economics for on-device LLMs. AMD MI300X matches H100 specs on paper with more HBM. AWS Trainium / Inferentia are AWS’s in-house alternatives. Cerebras builds wafer-scale chips. Groq targets ultra-fast inference. This section walks the major non-NVIDIA options, their architectural differences, and the honest question of why NVIDIA still wins the frontier despite all this competition.
TPUs — systolic arrays vs SIMT
Google’s TPU is the longest-running non-NVIDIA ML accelerator. The architectural difference is real:
TPUs are how Google trains Gemini and serves Search’s ML. The architecture is genuinely different — a systolic array doesn’t have the same “many independent threads” model as a GPU.
The TPU bet is structural: matmul is the dominant LLM operation, so build hardware that does matmul exceptionally well and accept that everything else (custom kernels, irregular ops) is slower. For transformer workloads, TPUs are reported to match or beat GPUs on training throughput per dollar.
The catch: TPUs don’t run unmodified PyTorch. You need JAX or TensorFlow with XLA compilation (PyTorch/XLA exists, but it still routes through XLA and requires code changes). The tooling ecosystem is smaller. For a frontier lab investing $100M in a training run, the cost of porting is small; for everyone else, it’s a barrier.
Systolic array structure:
A 2D grid of arithmetic units, each connected to its immediate neighbors. Imagine a 128×128 grid of FMA units, where:
- Input A flows left-to-right across each row.
- Input B flows top-to-bottom down each column.
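- At cycle t, the unit at position (i, j) sees the element of A that entered row i at cycle t-j and the element of B that entered column j at cycle t-i; it multiplies them and adds the product to its local accumulator. Inputs are fed with a per-row and per-column delay (skew) so that matching k-indices meet at each unit.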
- Output accumulators are pulled out at the end of the computation.
For a matmul C = A · B with A as (m, k) and B as (k, n): the array computes one C tile of size (m_grid, n_grid) in about m_grid + k + n_grid cycles (k cycles of streaming plus the fill and drain of the pipeline). After warmup, every unit performs one multiply-accumulate per cycle.
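The dataflow above can be sketched as a small cycle-by-cycle simulation. This is an illustrative model only, assuming an output-stationary grid with exactly one unit per output element; the function name `systolic_matmul` and the register layout are hypothetical and do not describe how a real TPU is built.

```python
def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing C = A @ B.

    A is m x k and B is k x n (lists of lists). The simulated grid has one
    unit per output element, so it is m x n. Returns (C, cycles).
    """
    m, k, n = len(A), len(A[0]), len(B[0])
    acc = [[0] * n for _ in range(m)]        # one local accumulator per unit
    a_reg = [[None] * n for _ in range(m)]   # A value currently held by each unit
    b_reg = [[None] * n for _ in range(m)]   # B value currently held by each unit

    cycles = m + n + k - 2                   # unit (m-1, n-1) finishes last
    for t in range(cycles):
        # Shift: A moves one unit to the right, B moves one unit down.
        # Walk from the far edge backwards so values are not overwritten.
        for i in range(m):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(n):
            for i in range(m - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Feed the edges. Row i of A is delayed by i cycles and column j of B
        # by j cycles (the skew), so matching k-indices meet at unit (i, j).
        for i in range(m):
            s = t - i
            a_reg[i][0] = A[i][s] if 0 <= s < k else None
        for j in range(n):
            s = t - j
            b_reg[0][j] = B[s][j] if 0 <= s < k else None
        # Every unit holding a valid pair does one multiply-accumulate.
        for i in range(m):
            for j in range(n):
                if a_reg[i][j] is not None and b_reg[i][j] is not None:
                    acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
C, cycles = systolic_matmul(A, B)
print(C, cycles)
```

In this model the exact cycle count is m + n + k - 2, which matches the "about m_grid + k + n_grid" estimate: k cycles of useful streaming plus the time to fill and drain the grid.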
Why arithmetic intensity is high:
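Each element of A is LOADED ONCE from memory and flows through n_grid units (one per column). Each element of B similarly flows through m_grid units. One output tile therefore needs k · (m_grid + n_grid) loaded elements for m_grid · n_grid · k multiply-accumulates: m_grid · n_grid / (m_grid + n_grid) MACs per loaded element, or half the grid side when the grid is square. The ratio m_grid · k · n_grid / (m_grid + k + n_grid) is a different quantity: the average number of MACs per cycle, which approaches the full m_grid · n_grid once k is much larger than the grid.
So for a 128×128 grid, each loaded element feeds 64 MACs = 128 FLOPs, about 64 FLOPs/byte at 2 bytes per element, from array-level reuse alone and independent of k. A large k (typically 4096 in LLMs) keeps the array nearly full (4096 / (128 + 4096 + 128) ≈ 94% utilisation); on-chip buffers that hold tiles between passes raise the intensity seen by HBM further.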
Effectively, the systolic array USES EACH LOADED ELEMENT MANY TIMES before discarding it. This is the same “tiling” idea as GPU matmul, but baked into the hardware geometry.
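The reuse argument can be checked by counting loads and multiply-accumulates for one output tile. The sketch below is a back-of-the-envelope model, assuming 2-byte elements, ignoring the write-back of the output tile, and counting only reuse inside the array (not any on-chip buffering); the function names are hypothetical.

```python
def tile_intensity(m_grid, n_grid, k, bytes_per_elem=2):
    """FLOPs per byte loaded for one (m_grid x n_grid) output tile."""
    macs = m_grid * n_grid * k            # one MAC per unit per k-step
    loads = k * (m_grid + n_grid)         # a strip of A plus a strip of B
    flops = 2 * macs                      # one multiply + one add per MAC
    return flops / (loads * bytes_per_elem)


def array_utilisation(m_grid, n_grid, k):
    """Fraction of unit-cycles doing useful work, using the
    m_grid + k + n_grid cycle estimate for one tile."""
    return k / (m_grid + k + n_grid)


print(tile_intensity(128, 128, 4096))      # 64.0 FLOPs/byte
print(tile_intensity(128, 128, 512))       # also 64.0: independent of k
print(array_utilisation(128, 128, 4096))   # roughly 0.94
```

Under these assumptions the intensity depends on the grid dimensions rather than on k, while k determines how well the fill and drain overhead is amortised.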
Comparison to GPU:
GPU’s SIMT model: each thread loads its own elements, uses them, discards. Reuse comes from SRAM caching: thread 1 might load A[i, j]; thread 2 might find it in cache. But there’s no GUARANTEED reuse — depends on cache hit rates.
Systolic array: reuse is STRUCTURAL. An element never needs to be reloaded while it is in the array; it flows through the grid by design. The arithmetic intensity is high by architecture, not by software optimisation.
The trade-off:
Systolic arrays are SPECIALISED. They do matmul brilliantly but struggle at irregular operations (attention’s softmax, custom kernels, branches). GPUs are flexible but require careful kernel design to achieve high intensity.
For pure matmul workloads (large training runs), TPUs often beat GPUs in throughput/$.
For diverse workloads (research, fine-tuning, exploration), GPUs win on flexibility.
Apple Silicon — unified memory changes the game
The structural difference for Apple Silicon (M2 Pro/Max/Ultra, M3 family, M4 family):
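Unified memory is Apple’s secret weapon for LLM inference. The headline: a single Mac Studio with M2 Ultra (192 GB configuration) can run Llama 3 70B in fp16, natively, with no quantization and no multi-GPU setup. A PC built around 24 GB consumer cards or 80 GB H100s needs several GPUs, plus the interconnect overhead between them, to hold the same 140 GB of weights.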
For training, Apple Silicon is much less compelling: lower peak FLOPs (~50 TFLOPs vs H100’s 1000), no NVLink-class interconnect for multi-machine, weaker tooling. But for INFERENCE — particularly very large models — Apple Silicon is genuinely competitive.
(a) 2× H100 PC ($80K, ~1500W):
- Memory: 2× 80 GB = 160 GB HBM. Llama 3 70B fp16 (140 GB) fits with room for KV cache.
- Bandwidth: 2× 3.35 TB/s aggregate. Extremely fast.
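- Throughput at batch=1: bandwidth-bound. Streaming 140 GB of weights per token at 6.7 TB/s aggregate caps decoding at roughly 48 tokens/s, and that assumes ideal tensor parallelism across both GPUs.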
- Inter-GPU comm: NVLink at 900 GB/s.
- Software: PyTorch + CUDA fully supported.
- Catch: $80K is hard to justify for individual use. Used-market prices are also high. Draws ~1500W continuously.
(b) Mac Studio M2 Ultra ($7K, ~370W):
- Memory: 192 GB unified. Llama 3 70B fp16 fits.
- Bandwidth: 800 GB/s — about 1/4 of H100 but applied to a model that fits on ONE device.
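- Throughput at batch=1: bandwidth-bound. At fp16, 800 GB/s ÷ 140 GB caps decoding at roughly 6 tokens/s for 70B; a 4-bit quantization (~42 GB) raises the ceiling to roughly 19 tokens/s.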
- Software: MLX (Apple’s), llama.cpp, Ollama all work; PyTorch via MPS backend (rough but improving).
- Plus: quiet, low power, desk-friendly. The “macOS workstation” feel.
- Catch: lower peak performance than H100. Can’t train large models efficiently.
(c) 4× RTX 4090 PC ($15K, ~1800W):
- Memory: 4× 24 GB = 96 GB total. Llama 3 70B fp16 (140 GB) DOESN’T FIT. Must use quantization (int4) or smaller model.
- With int4 quantization (Q4_K_M, ~42 GB): fits, with KV cache room.
- Bandwidth: 4× 1 TB/s, but per GPU. The cards are PCIe-connected, not NVLink. PCIe 4.0 x16 is about 32 GB/s per direction (64 GB/s bidirectional), which is the bottleneck between any pair of cards.
- Throughput at batch=1: ~30-50 tokens/s for 70B at int4 (limited by inter-GPU comm).
- Software: PyTorch + CUDA fully supported. But multi-GPU coordination is tricky on PCIe.
- Plus: more flexible than Mac; can also run other GPU workloads (gaming, rendering, fine-tuning small models).
- Catch: needs powerful PSU, cooling, requires technical setup. Loud.
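The fit and bandwidth figures for the three setups can be reproduced with simple arithmetic. The sketch below is a rough model, assuming batch=1 decoding reads every weight once per token, that multi-GPU bandwidth aggregates perfectly, and that KV cache, activations and inter-GPU communication are ignored; it gives a ceiling, not a measured throughput. The function names are hypothetical.

```python
def weights_gb(params_billion, bits_per_weight):
    """Weight footprint in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * bits_per_weight / 8


def decode_ceiling(bandwidth_gb_s, model_gb):
    """Upper bound on batch=1 tokens/s if each token streams all weights once."""
    return bandwidth_gb_s / model_gb


fp16_gb = weights_gb(70, 16)   # 140.0 GB for a 70B model in fp16
q4_gb = 42                     # Q4_K_M size quoted in the text

# (name, total memory in GB, aggregate bandwidth in GB/s, model size in GB)
setups = [
    ("2x H100, fp16",     160, 2 * 3350, fp16_gb),
    ("M2 Ultra, fp16",    192, 800,      fp16_gb),
    ("4x RTX 4090, fp16",  96, 4 * 1000, fp16_gb),
    ("4x RTX 4090, Q4",    96, 4 * 1000, q4_gb),
]

for name, mem_gb, bw_gb_s, model_gb in setups:
    if model_gb > mem_gb:
        print(f"{name}: does not fit ({model_gb:.0f} GB > {mem_gb} GB)")
    else:
        ceiling = decode_ceiling(bw_gb_s, model_gb)
        print(f"{name}: fits, ceiling about {ceiling:.0f} tokens/s")
```

Real throughput lands below these ceilings, and the gap is largest for the PCIe-connected setup, where inter-GPU communication rather than memory bandwidth tends to dominate.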
The picks:
- For pure inference, single-user, “I want to run frontier models locally”: Mac Studio M2 Ultra. The simplest setup with the best fit-the-model story.
- For inference + occasional small-model fine-tuning + gaming: 4× RTX 4090 PC.
- For serious development (full fine-tuning, multi-GPU optimisation): 2× H100 PC (but only if budget allows; otherwise rent cloud H100s).
The deep point: Apple’s unified memory created a new product category — “local frontier LLM inference at $7K” — that didn’t exist before. For users whose primary need is inference, this is increasingly competitive with traditional GPU setups.
AMD MI300X — the credible alternative
AMD’s MI300X is the closest direct competitor to H100:
The MI300X case is structural — the 192 GB HBM is genuinely advantageous for many workloads. For pure inference of a 70B model, one MI300X is simpler and cheaper than two H100s. The catch is software maturity; ROCm has lagged CUDA by ~2 years for the past decade. This gap is closing as AMD invests, but it’s real today.
Why NVIDIA still wins
After all this, why is NVIDIA still dominant?
The actual obstacles:
1. CUDA dominance.
Every ML library, every framework, every tool primarily targets CUDA. Porting work is substantial: getting equivalent PyTorch performance on MI300X took AMD ~3 years of full-time effort. JAX on TPU is well-supported but doesn’t have the ecosystem of PyTorch.
Cost to switch: estimated $1-10M+ of engineering investment per major lab to port stack to non-NVIDIA. Not impossible but high.
2. Performance at frontier scale.
Mature kernels on NVIDIA hardware (cuBLAS, cuDNN, FlashAttention) extract 60-70% of peak. AMD’s ROCm extracts 30-50% on the same workloads. TPU’s XLA extracts 50-70% but only on specific workloads.
For training a $100M model, even a 20% efficiency gap means $20M wasted. NVIDIA wins on “usable FLOPs” not just “claimed FLOPs.”
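The "usable FLOPs" point can be made concrete with a small calculation. The sketch below is illustrative only, assuming two accelerators with equal peak FLOPs and taking utilisation values from the middle of the ranges quoted above; the function names are hypothetical and no real prices are implied.

```python
def usable_fraction_lost(rel_efficiency_gap):
    """Share of a fixed budget that buys no useful work when the stack
    runs at (1 - gap) of the reference efficiency."""
    return rel_efficiency_gap


def extra_spend_for_same_work(budget, rel_efficiency_gap):
    """Additional spend needed to finish the same training run."""
    return budget / (1 - rel_efficiency_gap) - budget


def breakeven_price_ratio(util_incumbent, util_challenger):
    """With equal peak FLOPs, the challenger matches the incumbent on cost
    per usable FLOP only if its price is at most this fraction of the
    incumbent's price."""
    return util_challenger / util_incumbent


budget = 100.0  # in $M
print(budget * usable_fraction_lost(0.20))       # $M of the budget wasted
print(extra_spend_for_same_work(budget, 0.20))   # $M extra to finish the run
print(breakeven_price_ratio(0.65, 0.40))         # challenger price ceiling
```

Under these assumptions, a chip that runs at 40% utilisation against an incumbent at 65% has to be priced at roughly 60% of the incumbent just to break even on usable compute.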
3. Networking integration.
NVIDIA + Mellanox (now owned by NVIDIA) is the “end-to-end” stack. Mixing NVIDIA GPUs with non-NVIDIA networking is risky; a pure-NVIDIA cluster is predictable.
4. Talent / knowledge.
Most ML engineers know CUDA. Performance optimisation, debugging, kernel writing — all the lore is NVIDIA-centric. Hiring people who can extract performance from MI300X is much harder.
5. Risk aversion.
Frontier labs aren’t optimising for “lowest cost”; they’re optimising for “lowest risk of training failure.” NVIDIA’s predictability is worth a 20-30% price premium.
What would cause a shift:
- Genuine performance gap at frontier. If AMD or someone delivered 3× better throughput per dollar AND tooling caught up to within 10% of CUDA, labs would switch. AMD is not there yet (closer to 1.2-1.5× at best with software maturity catching up).
- Geographic / supply chain forcing. If NVIDIA’s supply chain were disrupted (export controls, manufacturing issues), labs would have to use alternatives.
- Open-source CUDA equivalent. A community-maintained, CUDA-compatible runtime that works across multiple hardware backends (Triton is partly this as a portable kernel language, but doesn’t yet cover everything).
- Hardware-specific architecture optimisation. A model architecture that maps PERFECTLY to TPU’s systolic array (or to Cerebras’s wafer-scale) might give 2-3× speedup over NVIDIA for that specific architecture. So far no such “TPU-native” or “Cerebras-native” model has emerged.
What’s actually happening in 2025:
- Frontier training outside Google’s TPU fleet: still NVIDIA (95%+).
- Mid-tier training: NVIDIA majority, AMD and TPU minority (5-15%).
- Inference for very large models: AMD MI300X gaining share due to 192 GB HBM.
- On-device inference: Apple Silicon dominant, with Qualcomm chasing.
- Hyperscaler in-house: Google (TPU), AWS (Trainium/Inferentia), Microsoft (announced Maia) — building alternatives for capex reasons but not displacing NVIDIA in commercial offerings.
The honest forecast:
NVIDIA dominance erodes 5-10% over the next 2-3 years. The decline is from the bottom (commodity inference, on-device) rather than the top. The frontier stays NVIDIA until either AMD ships a much-better-than-H100 chip or NVIDIA stumbles. Neither looks imminent.
END OF CH.21 — The hardware substrate.
§1 (memory hierarchy + roofline: H100 ridge at ~300 FLOPs/byte, almost everything below) ·
§2 (tensor cores, fp8, NVLink: how H100 reaches 1 PFLOPs and what’s needed to use it) ·
§3 (TPUs, Apple Silicon, AMD MI300X: the alternatives and the genuine cases where each wins).
Next: Ch.22 — Runtimes & frameworks. PyTorch’s dispatch stack, what a CUDA kernel actually is, the GGUF/ONNX/safetensors formats.