GPU memory hierarchy (HBM↔SRAM, the FlashAttention motivation generalized), Tensor Cores, TPUs, Apple Silicon, the roofline model.
GPUs have a multi-tier memory hierarchy: HBM (large but slow), L2 cache (intermediate), and on-chip SRAM (fast but tiny). The "roofline model" places each operation in one of two regimes according to its arithmetic intensity, the number of FLOPs performed per byte moved from memory: memory-bound (limited by bandwidth) or compute-bound (limited by peak FLOPs). On an H100, the boundary sits at roughly 300 FLOPs/byte (about 989 dense BF16 TFLOP/s divided by 3.35 TB/s of HBM bandwidth), and almost every LLM operation except large GEMM falls below this line. This is why LLM inference is bandwidth-bound, why FlashAttention matters, and why hardware design prioritizes bandwidth over peak FLOPs.
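The roofline classification can be sketched in a few lines of Python. This is a minimal illustration, not a profiler: the peak-throughput and bandwidth constants are approximate public H100 SXM figures taken as assumptions, the function names are hypothetical, and the GEMM byte count assumes each operand is read or written exactly once in 16-bit precision.

```python
# Roofline sketch: attainable FLOP/s = min(peak compute, bandwidth * intensity).
# Hardware constants are approximate H100 SXM figures (assumptions).
PEAK_FLOPS = 989e12   # dense BF16 Tensor Core FLOP/s
HBM_BW = 3.35e12      # HBM bytes/s


def ridge_point(peak_flops=PEAK_FLOPS, bandwidth=HBM_BW):
    """Arithmetic intensity (FLOPs/byte) where the two regimes meet."""
    return peak_flops / bandwidth


def gemm_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n], each matrix touched once."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def attainable_flops(intensity, peak_flops=PEAK_FLOPS, bandwidth=HBM_BW):
    """Upper bound on throughput for an operation with the given intensity."""
    return min(peak_flops, bandwidth * intensity)


def regime(intensity, peak_flops=PEAK_FLOPS, bandwidth=HBM_BW):
    """Classify an operation by which roof it hits first."""
    if intensity >= ridge_point(peak_flops, bandwidth):
        return "compute-bound"
    return "memory-bound"


if __name__ == "__main__":
    cases = {
        "large GEMM (8192 x 8192 x 8192)": gemm_intensity(8192, 8192, 8192),
        "decode matvec (1 x 8192 x 8192)": gemm_intensity(1, 8192, 8192),
    }
    print(f"ridge point: {ridge_point():.0f} FLOPs/byte")
    for name, ai in cases.items():
        print(f"{name}: {ai:.1f} FLOPs/byte, {regime(ai)}")
```

Under these assumptions, the square GEMM has an intensity in the thousands of FLOPs/byte, while the single-token matrix-vector product used in decoding sits near 1 FLOP/byte, which is the gap the roofline model makes visible.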
Modern GPU compute is dominated by Tensor Cores: specialised units that perform matrix multiply-accumulate on small tiles (for example, 16×16 fragments) in hardware, at roughly 10× or more the throughput of general-purpose CUDA cores. FP8 (and now FP4) shrink the operands further, with quantization applied at the matmul boundary, for 2-4× higher peak matmul throughput than 16-bit formats. NVLink and NVSwitch provide high-bandwidth GPU-to-GPU links (900 GB/s per GPU on H100, 1.8 TB/s on B200), needed because LLM training is multi-GPU. This section walks through the H100 → B200 jump and the trends.
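To put rough numbers on those ratios, the sketch below compares peak throughput across execution paths. The table holds approximate dense (non-sparse) H100 SXM datasheet figures, included here as assumptions rather than values from this chapter, and the helper names are hypothetical. The times it produces are compute-only lower bounds that ignore memory traffic.

```python
# Approximate dense peak throughput for an H100 SXM, in TFLOP/s (assumptions).
PEAK_TFLOPS = {
    "fp32_cuda": 67.0,      # general-purpose CUDA cores
    "bf16_tensor": 989.0,   # Tensor Cores, 16-bit operands
    "fp8_tensor": 1979.0,   # Tensor Cores, 8-bit operands
}


def speedup(fast, slow, table=PEAK_TFLOPS):
    """Ratio of peak throughput between two execution paths."""
    return table[fast] / table[slow]


def gemm_seconds(m, n, k, path, table=PEAK_TFLOPS):
    """Ideal compute-only time for an (m x k) @ (k x n) matmul on a given path."""
    return 2 * m * n * k / (table[path] * 1e12)


if __name__ == "__main__":
    print(f"BF16 Tensor Cores vs FP32 CUDA cores: "
          f"{speedup('bf16_tensor', 'fp32_cuda'):.1f}x")
    print(f"FP8 vs BF16 Tensor Cores: "
          f"{speedup('fp8_tensor', 'bf16_tensor'):.1f}x")
    for path in PEAK_TFLOPS:
        ms = gemm_seconds(8192, 8192, 8192, path) * 1e3
        print(f"{path}: {ms:.2f} ms for an 8192^3 GEMM (ideal)")
```

Real kernels rarely reach these peaks, so the ratios are best read as ceilings on the benefit of moving work onto Tensor Cores or to a narrower format.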
NVIDIA is not the only game in town. Google's TPUs use a systolic-array architecture that differs fundamentally from a GPU's. Apple Silicon combines unified memory with a Neural Engine, which changes the cost calculus for on-device LLMs. AMD's MI300X carries more HBM than the H100 (192 GB versus 80 GB) and competes on price/performance. AWS Trainium, Cerebras, and Groq are further alternatives. This section explains the architectural differences and why NVIDIA still wins at the frontier.
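Why HBM capacity matters can be shown with simple arithmetic. The sketch below checks whether a model's weights alone fit on a single device. The 70-billion-parameter model is a hypothetical example, the 80 GB H100 capacity is an assumed datasheet figure, and the helper names are illustrative. KV cache, activations, and framework overhead are deliberately ignored, so a fit here is necessary but not sufficient.

```python
GB = 1e9  # decimal gigabytes, as used in vendor capacity figures

# Per-device HBM capacity in GB (the H100 figure is an assumption).
HBM_CAPACITY_GB = {"H100": 80, "MI300X": 192}


def weight_gb(n_params, bits_per_param):
    """Memory needed for the weights alone, in GB."""
    return n_params * bits_per_param / 8 / GB


def fits_on_one_device(n_params, bits_per_param, device):
    """True if the weights alone fit in the device's HBM."""
    return weight_gb(n_params, bits_per_param) <= HBM_CAPACITY_GB[device]


if __name__ == "__main__":
    n_params = 70e9  # hypothetical 70B-parameter model
    for bits in (16, 8, 4):
        size = weight_gb(n_params, bits)
        for device in HBM_CAPACITY_GB:
            ok = fits_on_one_device(n_params, bits, device)
            print(f"{bits}-bit weights ({size:.0f} GB) on {device}: "
                  f"{'fits' if ok else 'does not fit'}")
```

At 16 bits the hypothetical model needs about 140 GB for weights, which exceeds one 80 GB device but fits in 192 GB. Halving the precision halves the footprint, which is one reason capacity and low-precision formats are discussed together.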