THE HARDWARE SUBSTRATE
Section 21.2
02

Tensor cores, fp8, NVLink

When NVIDIA quotes “1 PFLOPs” for an H100, the number isn’t coming from the general-purpose CUDA cores you might know from earlier GPUs — it’s coming from tensor cores, specialised hardware blocks that do small matmul operations (16×16 or 16×8 tiles) in a single instruction. Tensor cores are the reason modern GPU FLOPs are 10× higher than general FLOPs at the same transistor count. Stack on top of this fp8 and fp4 — lower-precision data types that further multiply effective FLOPs — plus NVLink for fast GPU-to-GPU communication, and you have the trio that defines the modern LLM training and inference stack. This section walks each piece and the H100 → B200 generation jump.

Tensor cores — what they actually do

A tensor core (H100 Hopper architecture) does one operation per cycle: D = A · B + C where A, B, C, D are small matrices Specifically (for bf16 input → fp32 accumulate): A: 16 × 16 matrix in bf16 (256 elements) B: 16 × 8 matrix in bf16 (128 elements) C: 16 × 8 matrix in fp32 (128 elements) D: 16 × 8 matrix in fp32 (128 elements) The hardware computes 16·16·8 = 2048 FMAs (= 4096 FLOPs) in ONE cycle. H100 has 528 tensor cores, runs at ~1.5 GHz: 528 cores × 4096 FLOPs/cycle × 1.5e9 cycles/s ≈ 3.2 PFLOPs (sparse) or ~1 PFLOPs (dense bf16)

Tensor cores are the reason you can buy a GPU with “1 PFLOPs of compute.” The general CUDA cores on the same chip provide ~50 TFLOPs (20× less) — so most of the chip’s flop budget is in the tensor cores.

For tensor cores to be usable, the matmul must:

  1. Be large enough — minimum dimensions ~16 along each axis. Tiny matmuls bypass tensor cores entirely.
  2. Have the right dtype — bf16, fp16, int8, fp8, etc. fp32 inputs aren’t tensor-core eligible (you’d downcast to bf16 first).
  3. Be properly tiled — the kernel must emit the specific mma.sync instructions (or wmma equivalents). cuBLAS, CUTLASS, Triton, and torch.compile all do this; hand-written naive matmul kernels often don’t.

fp8 and fp4 — the new precisions

Lower precision means:

H100 tensor core throughput by precision: bf16: 989 TFLOPs (dense) fp8: 1979 TFLOPs (2× bf16 — 8-bit values) int8: 1979 TFLOPs (same as fp8) With 2:4 sparsity: 2× of above (~3960 TFLOPs for fp8) B200 jumps further: bf16: 2.25 PFLOPs (2.3× H100) fp8: 4.5 PFLOPs (2.3× H100 fp8) fp4: 9 PFLOPs (NEW in Blackwell) With sparsity: 2× of above The fp4 throughput is what's powering the headline "20 PFLOPs B200" numbers. But fp4 is only usable in INFERENCE for now — training in fp4 doesn't yet converge for most architectures.

The fp8 story is interesting. Two formats exist:

Production training in fp8 typically uses E4M3 for the forward pass and E5M2 for the backward pass, with per-tensor scale factors to bring values into the format’s representable range. Combined: ~2× faster training than bf16 with comparable convergence.

fp8 E4M3 / E5M2 are the formats that enable practical fp8 training. NVIDIA’s “Transformer Engine” library handles the per-layer scale management automatically.

— think, then check —

The gap:

H100 has TWO compute paths: general CUDA cores (~50 TFLOPs total) and tensor cores (~1000 TFLOPs total). A kernel that only uses general CUDA cores is limited to the 50 TFLOPs — about 5% of the chip.

cuBLAS dispatches large matmuls to tensor cores, getting 750+ TFLOPs — ~75% of peak.

Tensor cores’ specific role:

One H100 tensor core instruction performs:

D = A · B + C, where A is 16×16, B is 16×8, C is 16×8 (bf16 inputs, fp32 accumulate).

That’s 16·16·8 = 2048 multiply-adds = 4096 FLOPs per cycle per tensor core.

To use tensor cores: emit mma.sync or wmma::mma_sync PTX instructions (or use a high-level API like CUTLASS or Triton that emits them).

What makes a kernel tensor-core-friendly:

  1. Right dtype: bf16, fp16, int8, fp8, fp4 (depending on generation). fp32 is NOT tensor-core eligible on H100 (would need bf16 downcast first).
  2. Sufficient size: matrix dims ≥ 16 along each axis. Smaller dims fall back to general CUDA cores.
  3. Aligned addresses: tensor cores work on 16-byte aligned data. Misaligned loads waste cycles.
  4. Tiled to SRAM: the A, B, C tiles must fit in SRAM/registers. A naive load-from-HBM-each-step kernel can’t feed tensor cores fast enough.
  5. Proper accumulation: tensor cores accumulate in fp32 (or higher precision than inputs). Kernels that don’t preserve this accuracy lose quality.
  6. Pipelining: overlap load + compute. Tensor core’s 4096 FLOPs/cycle is wasted if data arrives slowly.

cuBLAS implements all of these. CUTLASS is a NVIDIA open-source library that exposes the building blocks for custom kernels. Triton (next chapter) is a higher-level DSL that emits tensor-core kernels automatically.

Hand-written naive kernels usually miss several of these — explaining the 15× gap.

A single H100 has 80 GB of HBM. A 70B model in fp16 needs 140 GB. You can’t fit Llama 2 70B on one H100; you need multi-GPU.

For training, you need 4-16K H100s coordinating. For inference, often 2-8 GPUs per request. Either way, you need fast GPU-to-GPU communication.

NVIDIA's interconnect stack: NVLink (chip-to-chip, intra-node): H100: 900 GB/s per GPU (3rd gen NVLink) B200: 1.8 TB/s per GPU (5th gen) Within an 8-GPU node: each GPU has 900 GB/s ↔ each other GPU NVSwitch (within-rack, multi-node): Connects all GPUs in a "DGX SuperPOD" with full bandwidth Allows any GPU to talk to any other at NVLink speed InfiniBand (between racks): 400 Gbps per port (~50 GB/s) Used for multi-node training: parameter sync across racks PCIe (CPU-to-GPU): Gen5 x16: 64 GB/s Slow — host-to-GPU transfers are bandwidth-limited

The NVLink number (900 GB/s) is huge compared to PCIe (64 GB/s) — about 14× faster. This is why “multi-GPU training in a single node” is dramatically faster than “multi-GPU across PCIe-connected nodes”: NVLink doesn’t hit the PCIe bottleneck.

For multi-node training (when you need more than 8 GPUs), NVLink ends at the node boundary. Inter-node traffic goes over InfiniBand at ~50 GB/s — still fast, but ~18× slower than NVLink. This is why training-job topology matters: certain operations (all-reduce within a TP group) want to stay intra-node; others (DP all-reduces) can spread across nodes.

— think, then check —

Why everyone doesn’t use it (yet):

fp8 has 7-bit precision (E4M3) or 5-bit precision (E5M2) plus exponent. Compared to bf16’s 7-bit mantissa and 8-bit exponent, fp8 has:

  • Much narrower range (E4M3 maxes at ±240; bf16 maxes at ±3.4e38).
  • Lower precision per representable value.

This narrow range means values can OVERFLOW (gradients spike) or UNDERFLOW (small gradients become zero) easily. Without intervention, fp8 training diverges within a few hundred steps.

What can fail:

  • Activation overflow: some layers produce activations > 240. In fp8, they saturate to ±240, losing information.
  • Gradient underflow: small gradients (~1e-5) become 0 in fp8. Effective learning rate drops to zero for those parameters.
  • Loss divergence: accumulated errors from fp8 compound; the model fails to converge.
  • Optimizer state corruption: Adam’s m, v stats need higher precision than fp8 provides.

NVIDIA Transformer Engine’s solution:

  1. Per-tensor scaling: before each matmul, compute a SCALE FACTOR for the tensor: scale = max(|x|) / fp8_max. Multiply the tensor by 1/scale before the matmul; multiply the output by scale after. This brings the values into the representable range.
  2. Mixed format: use E4M3 for forward pass (weights, activations) where precision matters more than range. Use E5M2 for backward pass (gradients) where range matters more than precision. Switch per direction.
  3. Master copy in higher precision: keep weights as bf16 or fp32 master. Cast to fp8 only at matmul time. Optimizer state (Adam m, v) stays fp32.
  4. Recipe tuning: empirically derived combinations of warmup, learning rate, and gradient clipping that make fp8 stable.
  5. Selective fp8: some layers (LayerNorm, final layer) stay in bf16. Most matmuls use fp8. Per-layer dtype selection.

Empirical state:

NVIDIA Transformer Engine + careful tuning achieves fp8 training that matches bf16 perplexity within ~0.01 on Llama-class models. Speedup: 1.5-2× vs bf16.

Production use: GPT-5 reportedly trained partially in fp8. Anthropic’s Claude pretraining uses fp8 selectively. Meta’s research papers describe fp8 training.

The bar is being lowered each year as tooling improves, but as of 2025, fp8 training requires the Transformer Engine library; pure bf16 is still the “safe default” for production training.

The H100 → B200 jump

H100 (Hopper, 2022) vs B200 (Blackwell, 2024) — headline numbers: H100 B200 Ratio HBM capacity 80 GB 192 GB 2.4× HBM bandwidth 3.35 TB/s 8 TB/s 2.4× bf16 FLOPs 989 TFLOPs 2250 2.3× fp8 FLOPs 1979 TFLOPs 4500 2.3× fp4 FLOPs N/A 9000 NEW NVLink bandwidth 900 GB/s 1.8 TB/s 2× Roofline ridge (bf16) ~300 F/B ~280 F/B ~constant Power 700 W 1000 W 1.4× Notable: ridge AI barely changed — the architecture is scaling balanced. fp4 is the new precision; production training in fp4 isn't yet routine. Per-dollar: B200 is ~2-3× more expensive than H100 per unit but offers ~2.5× the throughput. ROI is positive at modern LLM workloads.

The Blackwell architecture’s key innovations beyond raw scale:

— think, then check —

Setup:

16K H100s = 2000 nodes × 8 GPUs/node. Each node has 8 GPUs connected by full NVLink mesh. Nodes are interconnected by InfiniBand (or similar high-speed inter-node fabric).

The 70B model is partitioned via tensor parallelism (TP) and data parallelism (DP) — and maybe pipeline parallelism (PP) too.

Typical 70B-class split:

  • Tensor parallel: TP=8 (split each layer’s matmul across the 8 GPUs of one node). NVLink-intensive.
  • Pipeline parallel: PP=4 (split the 80 layers into 4 sequential pipeline stages). Inter-node communication.
  • Data parallel: DP = 16K / (8 × 4) = 500 ranks. Inter-node.

Data flow per gradient step:

1. Forward pass:

Within each TP group (8 GPUs in a node): every linear layer needs an all-reduce of partial outputs. ~140 GB / 8 = 17.5 GB per all-reduce, ~32 layers × 2 all-reduces/layer = 64 all-reduces. Total: ~1.1 TB of NVLink traffic per forward step. At 900 GB/s, takes ~1 second.

Between PP stages: each rank sends its activations to the next PP stage. Crosses node boundaries → InfiniBand. ~few GB per microbatch.

2. Backward pass:

Symmetric to forward. ~2× the data movement.

3. All-reduce for DP:

At the end of the step, gradients across the 500 DP ranks are all-reduced. ~140 GB of gradients per rank. The all-reduce is a tree/ring operation over 500 ranks across the InfiniBand fabric. Heavy: takes seconds.

Why the assignment matters:

If TP were split ACROSS nodes (instead of within), every layer’s all-reduce would use InfiniBand (50 GB/s) instead of NVLink (900 GB/s) — 18× slower. A single step would take 30+ seconds instead of 6.

If DP were ALSO confined within nodes, the DP all-reduce would still be intra-node — fast — but you’d be limited to 8-way DP. With 16K GPUs you’d run out of TP/PP/EP dimensions; you’d waste GPUs.

The optimal split puts the HIGHEST-BANDWIDTH operations (TP all-reduces, per-layer comms) on the FASTEST link (NVLink intra-node). The LOWER-BANDWIDTH ops (DP all-reduce, once per step) go on InfiniBand.

This is why pre-training engineers obsess over “the parallelism strategy” — it’s not just about dividing the model; it’s about matching the comms hierarchy to the bandwidth hierarchy.

Next: §21.3 — TPUs, Apple Silicon, AMD MI300X. The non-NVIDIA landscape: why TPUs are different, what Apple Silicon’s unified memory gets you, and where AMD stands.