RUNTIMES & FRAMEWORKS
Section 22.1
01

PyTorch eager vs compiled — the dispatch stack

PyTorch is the framework most ML practitioners encounter first, yet few people can say what actually happens when they call torch.matmul(a, b). The chain is: Python interpreter → PyTorch Python binding → PyTorch C++ dispatcher → operation kernel → backend library kernel (cuBLAS / FlashAttention / native ATen implementation) → CUDA driver and GPU. Six layers of indirection. This section walks the dispatch stack, explains the difference between PyTorch’s default EAGER mode and the newer COMPILED mode (torch.compile / Inductor), and shows where typical performance gains and losses come from.

The Python → kernel path

When you write torch.matmul(a, b), here’s what happens:

  1. Python interpreter:
       • Parses the call.
       • Looks up torch.matmul in the torch module.
  2. Python binding (pybind11 layer):
       • Converts Python args to C++ types.
       • Calls into the C++ op at::matmul().
  3. PyTorch dispatcher (the actual ROUTER):
       • Receives (a, b) tensors.
       • Reads each tensor's metadata: dtype, device, layout.
       • Looks up the kernel registered for (matmul, bf16, CUDA, strided).
       • Dispatches to that kernel.
  4. Operation kernel (e.g., the CUDA implementation behind aten::matmul):
       • Plans how to split the work.
       • Falls through to cuBLAS for the actual matmul, OR to a custom kernel.
       • Manages CUDA streams, memory allocation, scheduling.
  5. cuBLAS / FlashAttention / custom CUDA kernel:
       • Issues actual GPU instructions.
       • Reads inputs from HBM, computes, writes to HBM.
  6. CUDA driver → GPU:
       • Hardware does the work.
       • Result becomes available asynchronously.

The PyTorch dispatcher is the central abstraction. Every operation goes through it, and its metadata-based routing is what lets the same Python code run on CPU and GPU.
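The routing idea can be sketched in a few lines of plain Python. This is a toy model, not PyTorch's implementation: ToyTensor, KERNELS, register, and dispatch are hypothetical names invented for illustration, and the lookup key is reduced to (op, device).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyTensor:
    """Stand-in for a tensor: only the metadata a router would inspect."""
    device: str
    dtype: str
    layout: str = "strided"


# Hypothetical registry mapping (op name, device) to a kernel function.
KERNELS = {}


def register(op, device):
    """Decorator that records a kernel under its (op, device) key."""
    def decorator(fn):
        KERNELS[(op, device)] = fn
        return fn
    return decorator


@register("matmul", "cpu")
def matmul_cpu(a, b):
    return "ran matmul on CPU kernel"


@register("matmul", "cuda")
def matmul_cuda(a, b):
    return "ran matmul on CUDA kernel"


def dispatch(op, *tensors):
    """Pick a kernel from tensor metadata; the caller never names a device."""
    devices = {t.device for t in tensors}
    if len(devices) != 1:
        raise RuntimeError(f"tensors on different devices: {sorted(devices)}")
    key = (op, devices.pop())
    if key not in KERNELS:
        raise NotImplementedError(f"no kernel registered for {key}")
    return KERNELS[key](*tensors)


cpu_result = dispatch("matmul", ToyTensor("cpu", "float32"), ToyTensor("cpu", "float32"))
cuda_result = dispatch("matmul", ToyTensor("cuda", "bfloat16"), ToyTensor("cuda", "bfloat16"))
print(cpu_result)
print(cuda_result)
```

The call site is identical in both cases; only the tensor metadata changes which function runs.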

Eager mode — the default

In eager mode, every Python operation triggers an immediate kernel launch:

Eager mode example:

    import torch

    x = torch.randn(1024, 1024, device='cuda', dtype=torch.bfloat16)
    y = torch.relu(x)        # kernel 1: ReLU
    z = torch.matmul(y, x)   # kernel 2: matmul
    w = z + x                # kernel 3: add

Each of relu, matmul, and add launches a CUDA kernel. Each launch has ~5-50 μs of overhead (kernel launch latency). Each intermediate (y, z) is written to HBM and re-read by the next op.

For small operations: kernel launch overhead dominates.
For large operations: HBM bandwidth to write/read intermediates dominates.

The eager mode advantage: it behaves like ORDINARY PYTHON. Every operation runs immediately, so you can print intermediate values, set breakpoints, and mix in arbitrary Python code. This is what makes PyTorch’s debugging story so much better than TensorFlow 1.x’s graph mode.

The eager mode disadvantage: inefficiency from kernel launch overhead and HBM round-trips. For a typical transformer block with ~50 elementary operations, eager mode does ~50 kernel launches and ~50 HBM round-trips. Each is small individually; the aggregate is significant.

Compiled mode — torch.compile

torch.compile (introduced in PyTorch 2.0, 2023) changes the model:

Compiled mode example:

    import torch

    @torch.compile
    def my_block(x):
        y = torch.relu(x)
        z = torch.matmul(y, x)
        return z + x

First call: torch.compile TRACES the function.

  • TorchDynamo (the frontend) records the operations as a "graph" (functional, no side effects).
  • The graph is handed to Inductor (the default backend).
  • Inductor:
       • Decomposes ops into smaller primitives.
       • Performs algebraic simplifications.
       • FUSES adjacent ops where it can (here the elementwise relu and add; the matmul itself usually goes to a library or template kernel).
       • Emits one or a few CUDA / Triton kernels.
  • Compiled code is cached.

Subsequent calls: invoke the compiled code directly, with far less Python overhead (cheap guard checks remain).

Speedup: 1.5-3× typical on transformer code.

Compilation time: 30-300 seconds for a typical model (cached after the first call).

Compilation failures: occasional. Dynamo cannot trace arbitrary Python (data-dependent control flow, external library calls), so it falls back to eager for the problematic parts.

torch.compile brings PyTorch closer to TensorFlow’s old “graph mode” while preserving most of eager mode’s developer ergonomics. The bet: trace the model into a graph at first call; optimize and compile that graph; reuse for all subsequent calls.
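The "trace once, reuse afterwards" bet can be illustrated with a toy cache. This is only a sketch in plain Python: toy_compile is a hypothetical decorator, the "guard" is reduced to the input's type and length as a stand-in for the shape and dtype checks a real compiler would make, and nothing is actually compiled.

```python
import functools


def toy_compile(fn):
    """Toy stand-in for compile-and-cache: one 'compile' per distinct guard."""
    cache = {}
    stats = {"compiles": 0, "hits": 0}

    @functools.wraps(fn)
    def wrapper(xs):
        # Assumption: inputs with the same type and length can share compiled code.
        guard = (type(xs).__name__, len(xs))
        if guard not in cache:
            stats["compiles"] += 1   # slow path: trace, optimize, generate code
            cache[guard] = fn        # a real compiler would store generated code here
        else:
            stats["hits"] += 1       # fast path: reuse the cached artifact
        return cache[guard](xs)

    wrapper.stats = stats
    return wrapper


@toy_compile
def my_block(xs):
    # relu(x) + x, element by element
    return [max(x, 0.0) + x for x in xs]


first = my_block([1.0, -2.0])   # new guard: compile
my_block([3.0, 4.0])            # same length: cache hit
my_block([1.0, 2.0, 3.0])       # new length: compile again
print(first, my_block.stats)
```

The third call shows the cost side of the bet: inputs that violate the cached assumptions trigger another compilation.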


Kernel launch overhead:

Each CUDA kernel launch: ~5-50 μs (depending on driver, queue depth, etc.). Let’s use 20 μs average.

Eager mode: 50 ops × 20 μs = 1 ms of launch overhead per block per forward pass.

Compiled mode: 5-10 kernels × 20 μs = 100-200 μs.

Saves ~800 μs per block. Over 32 blocks in a Llama 7B forward: ~25 ms saved per token.

For a 100-token generation: 2.5 seconds saved. Significant.
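The launch-overhead estimate above can be reproduced with a short calculation. It is a back-of-the-envelope sketch that uses the text's figures (50 eager kernels, up to 10 compiled kernels, 20 μs per launch, 32 blocks) and assumes launches are fully serialized, which overstates the cost whenever asynchronous launches overlap with GPU work.

```python
def launch_overhead_us(num_kernels, per_launch_us=20.0):
    """Total launch overhead if every launch costs the same and none overlap."""
    return num_kernels * per_launch_us


eager_us = launch_overhead_us(50)           # one launch per elementary op
compiled_us = launch_overhead_us(10)        # conservative end of the 5-10 kernel range
saved_per_block_us = eager_us - compiled_us

blocks = 32                                 # transformer blocks in the 7B example
saved_per_token_ms = saved_per_block_us * blocks / 1000
saved_per_100_tokens_s = saved_per_token_ms * 100 / 1000

print(f"saved per block:      {saved_per_block_us:.0f} us")
print(f"saved per token:      {saved_per_token_ms:.1f} ms")
print(f"saved per 100 tokens: {saved_per_100_tokens_s:.2f} s")
```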

HBM round-trips:

Each elementary op (e.g., relu, add, scale) reads its input from HBM and writes its output back. For a 4096-dim tensor in bf16 (4096 × 2 bytes): ~8 KB read + 8 KB write = ~16 KB per op.

Eager: 50 ops × 16 KB ≈ 800 KB of HBM traffic per block per token.

Compiled: fused into ~5 kernels, each handling multiple ops in registers without going back to HBM. ~5 × 16 KB ≈ 80 KB. ~10× less HBM traffic.

HBM bandwidth: 3.35 TB/s on H100. Eager: 800 KB / 3.35 TB/s ≈ 0.24 μs per block. Per token: ~8 μs across 32 blocks. Smallish.
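A small sketch makes the traffic estimate easy to re-derive for other sizes. It computes bytes from the element count and dtype width instead of hard-coding a per-op figure, and it assumes each kernel reads one input tensor and writes one output tensor; the helper names are hypothetical.

```python
BYTES_PER_ELEMENT = {"bf16": 2, "fp16": 2, "fp32": 4}


def elementwise_traffic_bytes(num_elements, dtype, num_kernels):
    """HBM bytes moved if every kernel reads one tensor and writes one tensor."""
    tensor_bytes = num_elements * BYTES_PER_ELEMENT[dtype]
    return num_kernels * 2 * tensor_bytes


def transfer_time_us(num_bytes, bandwidth_bytes_per_s=3.35e12):
    """Time to move num_bytes at the given bandwidth (default: 3.35 TB/s)."""
    return num_bytes / bandwidth_bytes_per_s * 1e6


eager_bytes = elementwise_traffic_bytes(4096, "bf16", num_kernels=50)
fused_bytes = elementwise_traffic_bytes(4096, "bf16", num_kernels=5)
eager_us_per_block = transfer_time_us(eager_bytes)

print(f"eager: {eager_bytes / 1024:.0f} KiB per block")
print(f"fused: {fused_bytes / 1024:.0f} KiB per block")
print(f"reduction: {eager_bytes / fused_bytes:.0f}x")
print(f"eager transfer time: {eager_us_per_block:.2f} us per block")
```

Swapping in a larger element count (for example an attention score matrix) shows how quickly the same formula grows from kilobytes to megabytes.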

But for LARGER intermediates (e.g., attention’s QK^T matrix at 4K context), each op moves megabytes, and fusion saves correspondingly more. FlashAttention is the canonical example: 3 ops fused → no N×N matrix materialized → 3-5× speedup.

Total speedup picture:

For small / kernel-launch-bound workloads (Llama 7B at batch 1): launch overhead dominates → torch.compile gives 2-3× speedup.

For medium / mixed workloads (Llama 7B at batch 16): launch overhead matters but matmuls dominate → torch.compile gives 1.3-1.7× speedup.

For large / compute-bound workloads (Llama 70B training at batch 1024): launch overhead is negligible compared to compute → torch.compile gives 1.05-1.2× speedup.

The benefit scales INVERSELY with how “big” your operations are. Small models / small batches / decode mode = biggest torch.compile wins.
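The inverse scaling falls out of even a very crude model. The sketch below assumes compilation changes only the number of launches (50 down to a hypothetical 8), that launches and compute are fully serialized, and that compute time is untouched; the three compute budgets are illustrative values, not measurements. It ignores HBM savings, so it is not meant to reproduce the ranges quoted above.

```python
def modeled_speedup(compute_us, eager_kernels=50, compiled_kernels=8, launch_us=20.0):
    """Compiled-over-eager speedup when only the launch count changes."""
    eager_total = eager_kernels * launch_us + compute_us
    compiled_total = compiled_kernels * launch_us + compute_us
    return eager_total / compiled_total


# Hypothetical per-block compute budgets, in microseconds.
workloads = [("launch-bound", 500.0), ("mixed", 5_000.0), ("compute-bound", 50_000.0)]
speedups = [modeled_speedup(compute_us) for _, compute_us in workloads]

for (label, _), s in zip(workloads, speedups):
    print(f"{label:>14}: {s:.2f}x")
```

As the compute term grows, the fixed launch savings become a smaller fraction of the total and the modeled speedup approaches 1.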

The dispatch stack in detail

The dispatcher’s routing logic is more subtle than it looks. Several layers can claim an operation:

PyTorch's dispatch keys (rough, simplified):

  • AutogradCUDA: if the tensor is on CUDA and requires_grad, route HERE first. This wraps the actual op with gradient tracking.
  • CUDA: the actual CUDA kernel implementation.
  • CPU: the actual CPU kernel implementation.
  • Sparse: if tensor is sparse, route HERE.
  • Quantized: if tensor is quantized, route HERE.
  • Functorch: for functional transformations (vmap, grad).
  • PythonFallback: custom Python implementation registered via Python API.

For torch.matmul(a, b):

  • If a or b requires_grad → AutogradCUDA → wraps the call with autograd → unwraps → CUDA.
  • If pure inference (no grad) → CUDA directly.

The dispatcher walks these in a specific order, "skipping" keys that don't apply. Custom dispatching is a power-user feature (e.g., registering a quantized matmul that intercepts the dispatch and runs your fp8 kernel instead).

Step 1 — Python binding (pybind11):

Python’s torch.matmul(a, b) maps to a C++ function through PyTorch’s Python bindings (pybind11 plus generated binding code). This function receives (a, b) as torch::Tensor objects (C++ wrappers around the actual tensor data).

Step 2 — Dispatcher receives:

The dispatcher reads tensor a’s metadata:

  • device: CUDA
  • dtype: bfloat16
  • layout: strided
  • requires_grad: True
  • tensor backend: aten (built-in PyTorch ops)

Step 3 — Dispatch key resolution:

The dispatcher walks the dispatch key list IN ORDER. For requires_grad=True tensors on CUDA, the active keys (in priority order) are:

  1. AutogradCUDA (because requires_grad)
  2. CUDA

Step 4 — AutogradCUDA key handler:

This handler is registered as: “for matmul on autograd-tracked CUDA tensors, wrap the call with gradient tracking.”

It does:

  • Calls the inner matmul (CUDA key — actual computation).
  • Records the operation in the autograd graph (for backward pass).
  • Returns the result tensor with grad_fn set (for 2-D inputs this shows up as MmBackward0, because matmul decomposes to mm).

Step 5 — CUDA key handler:

This handler is the actual matmul. It does:

  • Checks the matmul shape (m, k, n).
  • Falls through to cuBLAS for actual matmul (since matmul is dispatched to highly tuned cuBLAS kernels for typical shapes).
  • cuBLAS issues tensor core instructions if matrices are large enough.

Step 6 — cuBLAS handles it:

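  • cuBLAS selects a kernel for the shape and dtype (e.g., a name like “hgemm_128x128_32x8_nn_align4”, where the leading “h” conventionally denotes half precision; bf16 kernels are named differently).
  • Launches it on the current CUDA stream.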
  • GPU runs the kernel.
  • Returns asynchronously (CUDA streams).

Step 7 — Result propagates back:

The result tensor flows back through the dispatcher chain: CUDA → AutogradCUDA (which has now wrapped it with grad_fn) → matmul call site.

Python receives the result tensor, with its autograd machinery attached for the eventual backward pass.
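The walk through the keys, including the autograd wrapper handing off to the backend kernel, can be mimicked with a toy priority list. This is a sketch under simplifying assumptions: only two keys exist, "tensors" are single floats, and all names (ToyTensor, HANDLERS, dispatch, ToyMulBackward) are hypothetical rather than PyTorch internals.

```python
from dataclasses import dataclass
from typing import Optional

KEY_PRIORITY = ["AutogradCUDA", "CUDA"]   # highest priority first
TRACE = []                                # records which handlers ran, in order


@dataclass
class ToyTensor:
    value: float
    requires_grad: bool = False
    grad_fn: Optional[str] = None


def active_keys(tensors, excluded):
    """Keys that apply to these inputs, in priority order, minus excluded ones."""
    keys = {"CUDA"}
    if any(t.requires_grad for t in tensors):
        keys.add("AutogradCUDA")
    return [k for k in KEY_PRIORITY if k in keys and k not in excluded]


def mul_autograd(a, b, excluded):
    TRACE.append("AutogradCUDA")
    # Redispatch with this key masked out so the next handler in line runs.
    out = dispatch("mul", a, b, excluded=excluded | {"AutogradCUDA"})
    out.requires_grad = True
    out.grad_fn = "ToyMulBackward"        # stand-in for recording the backward node
    return out


def mul_cuda(a, b, excluded):
    TRACE.append("CUDA")
    return ToyTensor(a.value * b.value)   # the "actual computation"


HANDLERS = {("mul", "AutogradCUDA"): mul_autograd, ("mul", "CUDA"): mul_cuda}


def dispatch(op, a, b, excluded=frozenset()):
    key = active_keys((a, b), excluded)[0]
    return HANDLERS[(op, key)](a, b, excluded)


grad_out = dispatch("mul", ToyTensor(3.0, requires_grad=True), ToyTensor(4.0))
grad_trace = list(TRACE)

TRACE.clear()
infer_out = dispatch("mul", ToyTensor(3.0), ToyTensor(4.0))
infer_trace = list(TRACE)

print(grad_trace, grad_out)
print(infer_trace, infer_out)
```

With requires_grad set, both handlers run in priority order; without it, the call goes straight to the backend handler.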

Overhead summary:

  • Python → C++ binding: ~1 μs.
  • Dispatcher key resolution: ~2-5 μs.
  • AutogradCUDA wrapping: ~1-2 μs.
  • CUDA kernel launch: ~5-20 μs.
  • Total: ~10-30 μs of “framework overhead” PER OPERATION.

This is why eager mode on tiny operations (e.g., element-wise scalar ops on small tensors) can be 100× slower than a fused kernel: framework overhead dominates the actual compute.

Where the gains come from in torch.compile

torch.compile optimization passes (Inductor backend):

  1. Decomposition: break high-level ops into primitives, e.g., LayerNorm → mean + var + normalize + affine.
  2. Common subexpression elimination: detect repeated computations, share.
  3. Constant folding: precompute things that don't depend on input.
  4. Loop fusion: combine adjacent elementwise ops into one loop, e.g., torch.relu(x) + x → one loop computing both.
  5. Memory layout selection: pick the right tensor layout (NCHW vs NHWC etc.).
  6. Code generation: emit Triton (preferred) or CUDA kernels.
  7. Buffer reuse: allocate intermediate buffers once and reuse, avoiding cudaMalloc overhead.

The biggest wins typically:

  • Fusion of elementwise ops (saves HBM round-trips).
  • Buffer reuse (avoids allocator overhead).
  • Kernel launch reduction.
  • Selective use of Triton kernels for memory-bound ops.
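Loop fusion is the easiest of these passes to picture in code. The sketch below is a plain-Python analogy, not Inductor's algorithm: it handles only a chain of unary elementwise ops (a simplification of the relu-plus-add example), treats each list comprehension as one "kernel" that reads and writes the whole buffer, and uses hypothetical helper names.

```python
ELEMENTWISE = {
    "relu": lambda v: max(v, 0.0),
    "double": lambda v: 2.0 * v,
    "neg": lambda v: -v,
}


def run_unfused(op_names, data):
    """One full pass over the buffer per op, like one kernel launch per op."""
    passes = 0
    for name in op_names:
        data = [ELEMENTWISE[name](v) for v in data]   # intermediate is materialized
        passes += 1
    return data, passes


def run_fused(op_names, data):
    """A single pass; intermediates live in a local variable, never in a buffer."""
    fns = [ELEMENTWISE[name] for name in op_names]
    out = []
    for v in data:
        for fn in fns:
            v = fn(v)
        out.append(v)
    return out, 1


ops = ["relu", "double", "neg"]
x = [-1.0, 0.5, 2.0]
unfused_result, unfused_passes = run_unfused(ops, x)
fused_result, fused_passes = run_fused(ops, x)
print(unfused_result, unfused_passes)
print(fused_result, fused_passes)
```

Both versions produce the same values; the fused one touches the buffer once instead of once per op, which is where the HBM savings come from.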

What fusion does:

Look at a transformer block in eager mode:

    x_norm = LayerNorm(x)            # kernel A: 4 ops (mean, var, normalize, affine)
    q = x_norm @ W_Q                 # kernel B: matmul
    k = x_norm @ W_K                 # kernel C: matmul
    v = x_norm @ W_V                 # kernel D: matmul
    scores = q @ k.T / sqrt(d)       # kernel E: matmul + scale
    weights = softmax(scores)        # kernel F: softmax (3 ops)
    attn = weights @ v               # kernel G: matmul
    attn_out = attn @ W_O            # kernel H: matmul
    x = x + attn_out                 # kernel I: add
    …same for FFN sublayer…          # kernels J-O

Each kernel:

  • Reads its inputs from HBM.
  • Computes.
  • Writes its outputs to HBM.
  • Returns; next kernel starts.

The intermediates (q, k, v, scores, weights, attn) all get materialized in HBM.

What fusion produces:

Inductor (or FlashAttention’s hand-tuned kernel) FUSES the related ops:

    # one fused kernel
    x_norm = LayerNorm(x)
    q, k, v = x_norm @ W_QKV               # one matmul for all three
    scores = q @ k.T / sqrt(d)
    weights = softmax_streaming(scores)
    attn = weights @ v                     # FUSED with softmax; never materialized
    attn_out = attn @ W_O
    x = x + attn_out

Intermediates that fit in SRAM (~228 KB per SM): kept in SRAM. Not written to HBM.

Other intermediates: stored in HBM if needed, but with overlapping loads/computes.
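The streaming softmax is what lets the attention weights stay out of HBM. Below is a minimal single-query sketch in plain Python, assuming list-of-floats vectors and omitting the 1/sqrt(d) scale for brevity; it shows the running-max and running-sum rescaling trick, not FlashAttention's tiled GPU implementation. The function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_naive(q, keys, values):
    """Materializes the full score row before applying softmax."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e * v[d] for e, v in zip(exps, values)) / total for d in range(dim)]


def attention_streaming(q, keys, values):
    """Processes one key/value at a time; keeps only a max, a sum, and an accumulator."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        s = dot(q, k)
        new_max = max(running_max, s)
        rescale = math.exp(running_max - new_max)   # corrects earlier partial results
        w = math.exp(s - new_max)
        running_sum = running_sum * rescale + w
        acc = [a * rescale + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

naive_out = attention_naive(q, keys, values)
streaming_out = attention_streaming(q, keys, values)
print(naive_out)
print(streaming_out)
```

The streaming version never holds more than one score at a time, so its working set is independent of sequence length; that property is what allows the intermediate to live in on-chip memory.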

Why this is the dominant speedup:

Eager-mode HBM traffic per block: ~20-50 MB (Q, K, V, scores, weights, attn, ffn intermediates).

Fused HBM traffic per block: ~5-10 MB (residual stream in + out, weights).

3-5× less HBM traffic. Since most ops are memory-bound (Ch.21), this directly translates to 2-3× wall-clock speedup.

Kernel launch reduction is secondary here: at larger batch sizes, each kernel does enough work that launch overhead is small, and the main win is HBM bandwidth. At batch 1, as computed earlier, launch overhead is the larger term.

The picture:

torch.compile is essentially “automatic FlashAttention for everything else.” It finds opportunities to keep more work in SRAM, just like FlashAttention does for attention, and applies them across the model. The result is broadly applicable speedup without per-model hand-tuning.

Next: §22.2 — CUDA, Triton, MLX. What a ‘kernel’ actually is, what writing one looks like, and the higher-level abstractions (Triton, MLX) that make kernel-writing more accessible.