The hardware substrate

§1 GPU memory hierarchy + the roofline model
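GPUs have a multi-tier memory hierarchy: HBM (large but slow), L2 cache (intermediate), and on-chip SRAM (fast but tiny). The roofline model places each operation in one of two regimes according to its arithmetic intensity, the number of FLOPs it performs per byte moved to or from memory: memory-bound (limited by bandwidth) or compute-bound (limited by peak FLOP/s). On an H100 SXM, the boundary between the two is roughly 300 FLOPs/byte (about 989 dense BF16 TFLOP/s divided by 3.35 TB/s of HBM bandwidth), and almost every LLM operation except large GEMM falls below this line. This is why LLM inference is bandwidth-bound, why FlashAttention matters, and why hardware design increasingly prioritizes memory bandwidth over peak FLOPs.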
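To make the roofline boundary concrete, here is a minimal Python sketch that computes the ridge point and classifies two GEMM shapes. The H100 figures (about 989 dense BF16 TFLOP/s and 3.35 TB/s of HBM bandwidth) are assumed datasheet-style values, and the function names and the 4096-wide layer are illustrative only. The intensity formula assumes each matrix is read or written exactly once, which is an idealisation.

```python
# Assumed H100 SXM figures (approximate, for illustration only).
H100_PEAK_FLOPS = 989e12      # dense BF16 FLOP/s
H100_HBM_BANDWIDTH = 3.35e12  # bytes/s


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Arithmetic intensity (FLOPs/byte) where the two roofs meet."""
    return peak_flops / bandwidth


def gemm_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n].

    Idealised: A and B are read once and C is written once.
    """
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def regime(intensity: float, ridge: float) -> str:
    """Classify an operation against the ridge point."""
    return "compute-bound" if intensity >= ridge else "memory-bound"


ridge = ridge_point(H100_PEAK_FLOPS, H100_HBM_BANDWIDTH)  # roughly 295

# Hypothetical 4096-wide layer: one token (decode) vs. 4096 tokens (prefill).
decode = gemm_intensity(1, 4096, 4096)
prefill = gemm_intensity(4096, 4096, 4096)

for name, value in [("decode", decode), ("prefill", prefill)]:
    print(f"{name}: {value:.1f} FLOPs/byte -> {regime(value, ridge)}")
```

Under these assumptions the single-token case sits near 1 FLOP/byte, far below the ridge, while the large square GEMM sits well above it.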
§2 Tensor cores, fp8, NVLink
Modern GPU compute is dominated by tensor cores: specialised units that multiply small matrix tiles, such as 16×16 (or 32×16), as a single hardware-level operation, at roughly 10× the rate of the general-purpose CUDA cores. fp8 (and now fp4) shrink the data further, giving roughly 2× (fp8) to 4× (fp4) the peak matmul throughput of 16-bit formats by quantizing at the matmul boundary. NVLink and NVSwitch provide up to 1.8 TB/s of GPU-to-GPU bandwidth per GPU on B200 (900 GB/s on H100), which is needed because LLM training is multi-GPU. This section walks through the H100 → B200 jump and the trends behind it.
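A rough way to see why interconnect bandwidth matters for multi-GPU training is to estimate how long one gradient synchronisation takes. The sketch below assumes a ring all-reduce, in which each GPU sends about 2(N-1)/N times the payload. It ignores latency and overlap with compute, so it is a bandwidth-only lower bound. The per-direction bandwidths (half of the 1.8 TB/s headline figure for NVLink, and about 64 GB/s for a PCIe Gen5 x16 link) and the 70B-parameter model are assumptions for illustration.

```python
def ring_allreduce_seconds(payload_bytes: float, num_gpus: int,
                           link_bytes_per_s: float) -> float:
    """Bandwidth-only lower bound for a ring all-reduce.

    Each GPU sends 2 * (N - 1) / N * payload bytes in total.
    Latency and compute/communication overlap are ignored.
    """
    if num_gpus < 2:
        return 0.0
    traffic = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return traffic / link_bytes_per_s


# Hypothetical workload: 70B parameters with 2-byte (bf16) gradients.
GRADIENT_BYTES = 70e9 * 2

# Assumed usable per-direction bandwidths, in bytes/s.
NVLINK_PER_DIRECTION = 0.9e12  # half of a 1.8 TB/s bidirectional figure
PCIE_GEN5_X16 = 64e9           # approximate

t_nvlink = ring_allreduce_seconds(GRADIENT_BYTES, 8, NVLINK_PER_DIRECTION)
t_pcie = ring_allreduce_seconds(GRADIENT_BYTES, 8, PCIE_GEN5_X16)

print(f"NVLink: {t_nvlink:.2f} s per all-reduce")
print(f"PCIe:   {t_pcie:.2f} s per all-reduce")
```

With these numbers the NVLink estimate is a fraction of a second while the PCIe estimate is several seconds, which is the gap that makes a dedicated interconnect worthwhile when the synchronisation happens every step.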
§3 TPUs, Apple Silicon, AMD MI300X — the non-NVIDIA landscape
NVIDIA is not the only game in town. Google's TPUs are built around systolic arrays, an architecture fundamentally different from a GPU's. Apple Silicon pairs unified memory with a Neural Engine, which changes the cost calculus for on-device LLMs. AMD's MI300X carries more HBM than the H100 (~192GB versus 80GB) and competes on price/performance. AWS Trainium, Cerebras, and Groq round out the field. This section explains the architectural differences and why NVIDIA still wins at the frontier.
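The practical effect of HBM capacity can be shown with simple arithmetic: how many devices are needed just to hold a model's weights. The sketch below is a lower bound under stated assumptions. It counts weights only (no KV cache, activations, or optimizer state), and `usable_fraction` is a hypothetical headroom knob, not a measured value. The 70B-parameter model and the 80GB H100 capacity are illustrative assumptions.

```python
import math


def min_devices_for_weights(num_params: float, bytes_per_param: float,
                            hbm_bytes: float,
                            usable_fraction: float = 0.9) -> int:
    """Smallest device count whose combined HBM holds the weights.

    Weights only: KV cache, activations, and optimizer state are ignored.
    usable_fraction is a hypothetical allowance for runtime overhead.
    """
    needed = num_params * bytes_per_param
    per_device = hbm_bytes * usable_fraction
    return math.ceil(needed / per_device)


PARAMS = 70e9      # hypothetical 70B-parameter model
FP16_BYTES = 2     # bytes per parameter at 16-bit precision

H100_HBM = 80e9    # assumed 80GB variant
MI300X_HBM = 192e9

h100_count = min_devices_for_weights(PARAMS, FP16_BYTES, H100_HBM)
mi300x_count = min_devices_for_weights(PARAMS, FP16_BYTES, MI300X_HBM)

print(f"H100 devices needed:   {h100_count}")
print(f"MI300X devices needed: {mi300x_count}")
```

Under these assumptions the 16-bit weights (140GB) need two 80GB devices but fit on a single 192GB device, which removes the need to shard the model for this case.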