PyTorch eager vs compiled — the dispatch stack
PyTorch is the framework most ML practitioners encounter first, and almost no one understands what’s actually happening when they call torch.matmul(a, b). The chain is: Python interpreter → PyTorch Python binding → PyTorch C++ dispatcher → operation router → backend kernel (cuBLAS / FlashAttention / aten implementation) → CUDA driver → GPU. Six layers of indirection. This section walks the dispatch stack, explains the difference between PyTorch’s default EAGER mode and the newer COMPILED mode (torch.compile / Inductor), and shows where typical performance gains and losses come from.
The Python → kernel path
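When you write torch.matmul(a, b), the call crosses each layer of that chain before the GPU does any work; “The dispatch stack in detail” below walks through the steps one by one.
The PyTorch dispatcher is the central abstraction. Every operation goes through it, and its metadata-based routing is what makes “the same Python code work on CPU and GPU” possible.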
Eager mode — the default
In eager mode, every Python operation runs immediately: the op is dispatched, and on a GPU its kernel is launched, before the next line of Python executes.
The eager mode advantage: it’s PURE PYTHON. Every operation runs immediately, you can print intermediate values, use breakpoints, mix with arbitrary Python code. This is what makes PyTorch’s debugging story so much better than TensorFlow 1.x’s graph mode.
The eager mode disadvantage: inefficiency from kernel launch overhead and HBM round-trips. For a typical transformer block with ~50 elementary operations, eager mode does ~50 kernel launches and ~50 HBM round-trips. Each is small individually; the aggregate is significant.
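To make the debugging point concrete, here is a minimal sketch of eager execution. The function name tiny_block and the tensor shapes are made up for illustration, and the snippet runs on CPU so it works without a GPU; on a CUDA device each tensor line would additionally launch at least one kernel.

```python
import torch

def tiny_block(x, w):
    # Each line executes as soon as Python reaches it; nothing is deferred.
    h = x @ w
    # Intermediate values are ordinary tensors, so they can be printed or
    # inspected in a debugger mid-computation.
    print("after matmul:", tuple(h.shape), "mean abs =", h.abs().mean().item())
    h = torch.relu(h)
    # Plain Python control flow can depend on tensor values.
    zero_fraction = (h == 0).float().mean().item()
    if zero_fraction > 0.9:
        print("warning: most activations are zero")
    return h + x, zero_fraction

torch.manual_seed(0)
x = torch.randn(4, 8)
w = torch.randn(8, 8)
out, zero_fraction = tiny_block(x, w)
```

Reading a tensor value back into Python (the .item() calls here) forces the CPU to wait for the result, which is part of why this style is convenient for debugging but costly in a hot loop.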
Compiled mode — torch.compile
torch.compile (introduced in PyTorch 2.0, 2023) changes the execution model from op-by-op dispatch to capture-then-compile.
torch.compile brings PyTorch closer to TensorFlow’s old “graph mode” while preserving most of eager mode’s developer ergonomics. The bet: trace the model into a graph at first call; optimize and compile that graph; reuse for all subsequent calls.
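A minimal usage sketch, assuming PyTorch 2.x with a working compiler toolchain for the default backend. The function bias_gelu_residual and the helper outputs_match are hypothetical names chosen for illustration; how many kernels the compiler actually generates depends on the PyTorch version and the hardware.

```python
import torch
import torch.nn.functional as F

def bias_gelu_residual(x, bias, residual):
    # Three element-wise ops. In eager mode each one is dispatched separately;
    # a compiler is free to fuse them into fewer kernels.
    return F.gelu(x + bias) + residual

# torch.compile returns a wrapper; tracing and compilation are deferred
# until the wrapper is first called with real inputs.
compiled_fn = torch.compile(bias_gelu_residual)

def outputs_match(device="cpu"):
    """Run the eager and compiled versions on the same inputs and compare."""
    x = torch.randn(8, 4096, device=device)
    bias = torch.randn(4096, device=device)
    residual = torch.randn(8, 4096, device=device)
    eager_out = bias_gelu_residual(x, bias, residual)
    compiled_out = compiled_fn(x, bias, residual)  # first call pays the compile cost
    return torch.allclose(eager_out, compiled_out, atol=1e-5)
```

Calling outputs_match() (or outputs_match("cuda") on a GPU machine) triggers the one-time compilation; later calls with the same input shapes reuse the compiled code.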
Kernel launch overhead:
Each CUDA kernel launch: ~5-50 μs (depending on driver, queue depth, etc.). Let’s use 20 μs average.
Eager mode: 50 ops × 20 μs = 1 ms of launch overhead per block per forward pass.
Compiled mode: 5-10 kernels × 20 μs = 100-200 μs.
Saves ~800 μs per block. Over 32 blocks in a Llama 7B forward: ~25 ms saved per token.
For a 100-token generation: 2.5 seconds saved. Significant.
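The launch-overhead arithmetic above can be reproduced with a few lines of Python. This is only a back-of-envelope sketch: the constants are the ones quoted in the text, and it assumes one forward pass per generated token.

```python
LAUNCH_US = 20               # assumed average cost of one kernel launch
EAGER_KERNELS = 50           # elementary ops per transformer block in eager mode
COMPILED_KERNELS = (5, 10)   # fused kernels per block (best case, worst case)
BLOCKS = 32                  # transformer blocks in a Llama 7B forward pass
TOKENS = 100                 # tokens generated, one forward pass each

def launch_overhead_us(num_kernels, launch_us=LAUNCH_US):
    return num_kernels * launch_us

eager_us = launch_overhead_us(EAGER_KERNELS)
compiled_us = tuple(launch_overhead_us(n) for n in COMPILED_KERNELS)
saved_per_block_us = tuple(eager_us - c for c in compiled_us)
saved_per_token_us = tuple(s * BLOCKS for s in saved_per_block_us)
saved_per_generation_us = tuple(s * TOKENS for s in saved_per_token_us)

print(f"eager launch overhead per block: {eager_us} us")
print(f"compiled launch overhead per block: {compiled_us[0]}-{compiled_us[1]} us")
print(f"saved per token: {saved_per_token_us[1] / 1e3:.1f}-{saved_per_token_us[0] / 1e3:.1f} ms")
print(f"saved per generation: {saved_per_generation_us[1] / 1e6:.2f}-{saved_per_generation_us[0] / 1e6:.2f} s")
```

The lower end of each range (10 fused kernels) gives the ~800 μs, ~25 ms, and ~2.5 s figures quoted above.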
HBM round-trips:
Each elementary op (e.g., relu, add, scale) reads its input from HBM and writes its output back. For a 4096-dim tensor in bf16 (2 bytes per element): ~8 KB read + ~8 KB write per op.
Eager: 50 ops × 16 KB ≈ 800 KB of HBM traffic per block per token.
Compiled: fused into ~5 kernels, each of which handles multiple ops in registers without going back to HBM. ~5 × 16 KB ≈ 80 KB. ~10× less HBM traffic.
HBM bandwidth: 3.35 TB/s on H100. Eager: 800 KB / 3.35 TB/s ≈ 0.24 μs per block. Per token: ~8 μs across 32 blocks. Smallish.
But for LARGER intermediates (e.g., attention’s QK^T matrix at 4K context), each op moves megabytes, and fusion saves correspondingly more. FlashAttention is the canonical example: 3 ops fused → no N×N matrix materialized → 3-5× speedup.
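For a rough sense of scale in the larger-intermediates case, the sketch below sizes the attention score matrix at 4K context. The 32-head count and the batch size of 1 are assumptions for a Llama-7B-style prefill, not figures from this section; the bandwidth is the H100 number used above.

```python
HBM_BYTES_PER_SECOND = 3.35e12   # H100 HBM bandwidth
BYTES_PER_BF16 = 2

def tensor_bytes(*shape, bytes_per_element=BYTES_PER_BF16):
    """Size in bytes of a dense tensor with the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element

def round_trip_us(num_bytes):
    """Time to write a tensor to HBM once and read it back once, in microseconds."""
    return 2 * num_bytes / HBM_BYTES_PER_SECOND * 1e6

context_len = 4096
num_heads = 32   # assumption: Llama-7B-style attention, batch size 1

scores_bytes = tensor_bytes(num_heads, context_len, context_len)
print(f"QK^T scores: {scores_bytes / 2**20:.0f} MiB per layer")
print(f"one HBM round-trip: {round_trip_us(scores_bytes):.0f} us per layer")
```

Under these assumptions the score matrix is about 1 GiB per layer, and every unfused op that touches it pays roughly 640 μs of pure memory traffic, which is the cost a fused attention kernel avoids.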
Total speedup picture:
For small / kernel-launch-bound workloads (Llama 7B at batch 1): launch overhead dominates → torch.compile gives 2-3× speedup.
For medium / mixed workloads (Llama 7B at batch 16): launch overhead matters but matmuls dominate → torch.compile gives 1.3-1.7× speedup.
For large / compute-bound workloads (Llama 70B training at batch 1024): launch overhead is negligible compared to compute → torch.compile gives 1.05-1.2× speedup.
The benefit scales INVERSELY with how “big” your operations are. Small models / small batches / decode mode = biggest torch.compile wins.
The dispatch stack in detail
The dispatcher’s routing logic is more subtle than it looks. Several layers can claim an operation:
Step 1 — Python binding (pybind11):
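Python’s torch.matmul(a, b) maps to a C++ binding function. For tensor ops these bindings are code-generated against the CPython API; pybind11 is used elsewhere in PyTorch’s bindings. The function parses the arguments and receives (a, b) as torch::Tensor objects (C++ handles to the underlying tensor data).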
Step 2 — Dispatcher receives:
The dispatcher reads tensor a’s metadata:
- device: CUDA
- dtype: bfloat16
- layout: strided
- requires_grad: True
- tensor backend: aten (built-in PyTorch ops)
Step 3 — Dispatch key resolution:
The dispatcher walks the dispatch key list IN ORDER. For requires_grad=True tensors on CUDA, the active keys (in priority order) are:
- AutogradCUDA (because requires_grad)
- CUDA
Step 4 — AutogradCUDA key handler:
This handler is registered as: “for matmul on autograd-tracked CUDA tensors, wrap the call with gradient tracking.”
It does:
- Calls the inner matmul (CUDA key — actual computation).
- Records the operation in the autograd graph (for backward pass).
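- Returns the result tensor with a grad_fn attached (for 2-D inputs it shows up as MmBackward0, because matmul decomposes into mm before the autograd kernel runs).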
Step 5 — CUDA key handler:
This handler is the actual matmul. It does:
- Checks the matmul shape (m, k, n).
- Calls into cuBLAS for the actual matmul (PyTorch routes matmuls of typical shapes to highly tuned cuBLAS kernels).
- cuBLAS picks tensor-core kernels when the dtype, shapes, and alignment allow it.
Step 6 — cuBLAS handles it:
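- cuBLAS selects a kernel variant based on dtype, shape, and alignment (kernel names are internal and differ across cuBLAS versions and GPU architectures).
- Enqueues the kernel on the current CUDA stream.
- Control returns to the CPU immediately; the launch is asynchronous.
- The GPU runs the kernel once it reaches the front of the stream.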
Step 7 — Result propagates back:
The result tensor flows back through the dispatcher chain: CUDA → AutogradCUDA (which has now wrapped it with grad_fn) → matmul call site.
Python receives the result tensor, with its autograd machinery attached for the eventual backward pass.
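The walk-through above can be mimicked with a toy dispatcher in plain Python. This is a sketch of the routing idea only, not PyTorch's implementation: ToyTensor, register, dispatch, and the handler names are all hypothetical, the "CUDA" handler is an ordinary Python matrix multiply standing in for cuBLAS, and the real dispatcher tracks many more keys than the two shown here.

```python
from dataclasses import dataclass
from typing import List, Optional

# Toy model only: key names follow the walk-through above; nothing here
# touches PyTorch or a GPU.
KEY_PRIORITY = ["AutogradCUDA", "CUDA"]   # highest priority first
KERNELS = {}                              # (op name, dispatch key) -> handler
TRACE = []                                # keys visited, in order

@dataclass
class ToyTensor:
    data: List[List[float]]
    device: str = "CUDA"
    requires_grad: bool = False
    grad_fn: Optional[str] = None

    def dispatch_keys(self):
        keys = {self.device}
        if self.requires_grad:
            keys.add("Autograd" + self.device)
        return keys

def register(op, key):
    def decorator(handler):
        KERNELS[(op, key)] = handler
        return handler
    return decorator

def dispatch(op, a, b, exclude=frozenset()):
    # Union the keys of all arguments, drop keys already handled,
    # then pick the highest-priority key that has a registered kernel.
    active = (a.dispatch_keys() | b.dispatch_keys()) - exclude
    for key in KEY_PRIORITY:
        if key in active and (op, key) in KERNELS:
            TRACE.append(key)
            return KERNELS[(op, key)](a, b, exclude)
    raise NotImplementedError(f"no kernel for {op} with keys {sorted(active)}")

@register("matmul", "AutogradCUDA")
def matmul_autograd(a, b, exclude):
    # Redispatch below the autograd key, then attach backward bookkeeping.
    out = dispatch("matmul", a, b, exclude | {"AutogradCUDA"})
    out.requires_grad = True
    out.grad_fn = "toy_matmul_backward"
    return out

@register("matmul", "CUDA")
def matmul_backend(a, b, exclude):
    # Stand-in for the cuBLAS call: a plain Python matrix multiply.
    cols = list(zip(*b.data))
    data = [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a.data]
    return ToyTensor(data, device=a.device)

a = ToyTensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
b = ToyTensor([[1.0, 0.0], [0.0, 1.0]])
result = dispatch("matmul", a, b)
print(TRACE, result.grad_fn)
```

The trace shows the autograd handler running first and the backend handler running inside it, which is the same nesting as Steps 4 and 5; a call with no requires_grad inputs goes straight to the backend handler.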
Overhead summary:
- Python → C++ binding: ~1 μs.
- Dispatcher key resolution: ~2-5 μs.
- AutogradCUDA wrapping: ~1-2 μs.
- CUDA kernel launch: ~5-20 μs.
- Total: ~10-30 μs of “framework overhead” PER OPERATION.
This is why eager mode for tiny operations (e.g., element-wise scalar ops on small tensors) can be 100× slower than a fused kernel: framework overhead dominates the actual compute.
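A quick check of where the ~100× figure can come from, using the overhead ranges from the summary above. The 0.2 μs of useful GPU work for a tiny element-wise op is an assumed number picked for illustration, not a measurement.

```python
# Per-operation overhead ranges from the summary above, in microseconds (low, high).
OVERHEAD_US = {
    "python_to_cpp_binding": (1, 1),
    "dispatcher_key_resolution": (2, 5),
    "autograd_wrapping": (1, 2),
    "cuda_kernel_launch": (5, 20),
}

total_low = sum(low for low, _ in OVERHEAD_US.values())
total_high = sum(high for _, high in OVERHEAD_US.values())

def slowdown_factor(compute_us, overhead_us):
    """How many times longer an op takes once fixed overhead is added."""
    return (compute_us + overhead_us) / compute_us

# Assumed: 0.2 us of real GPU work for a tiny element-wise op.
tiny_op = slowdown_factor(compute_us=0.2, overhead_us=20)
print(f"overhead: {total_low}-{total_high} us; tiny-op slowdown: {tiny_op:.0f}x")
```

The same 20 μs of overhead added to an op that does 2 ms of real work changes its runtime by about 1%, which is why the penalty only shows up on small operations.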
Where the gains come from in torch.compile
What fusion does:
Look at a transformer block in eager mode:
    x_norm = LayerNorm(x)            # kernel A: 4 ops (mean, var, normalize, affine)
    q = x_norm @ W_Q                 # kernel B: matmul
    k = x_norm @ W_K                 # kernel C: matmul
    v = x_norm @ W_V                 # kernel D: matmul
    scores = q @ k.T / sqrt(d)       # kernel E: matmul + scale
    weights = softmax(scores)        # kernel F: softmax (3 ops)
    attn = weights @ v               # kernel G: matmul
    attn_out = attn @ W_O            # kernel H: matmul
    x = x + attn_out                 # kernel I: add
    …same for FFN sublayer…          # kernels J-O
Each kernel:
- Reads its inputs from HBM.
- Computes.
- Writes its outputs to HBM.
- Returns; next kernel starts.
The intermediates (q, k, v, scores, weights, attn) all get materialized in HBM.
What fusion produces:
Inductor (or FlashAttention’s hand-tuned kernel) FUSES the related ops:
    # one fused kernel
    x_norm = LayerNorm(x)
    q, k, v = x_norm @ W_QKV               # one matmul for all three
    scores = q @ k.T / sqrt(d)
    weights = softmax_streaming(scores)
    attn = weights @ v                     # FUSED with softmax: never materialized
    attn_out = attn @ W_O
    x = x + attn_out
Intermediates that fit in SRAM (~228 KB per SM): kept in SRAM. Not written to HBM.
Other intermediates: stored in HBM if needed, but with overlapping loads/computes.
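The softmax_streaming step is what lets the scores and weights stay out of HBM. The sketch below shows the idea for a single query row in plain Python: it keeps only a running maximum, a running sum, and a running output, and rescales them as each key arrives. The function names are illustrative, and this is the standard online-softmax recurrence rather than the code any particular kernel uses; real kernels process keys in tiles rather than one at a time.

```python
import math

def naive_attention_row(q, keys, values):
    """Materialize every score, apply softmax, then take the weighted sum of values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    out = [0.0] * len(values[0])
    for weight, v in zip(exps, values):
        for j, vj in enumerate(v):
            out[j] += (weight / total) * vj
    return out

def streaming_attention_row(q, keys, values):
    """Single pass over the keys; only running max, running sum, and output are kept."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        score = scale * sum(qi * ki for qi, ki in zip(q, k))
        new_max = max(running_max, score)
        # Rescale everything accumulated so far to the new maximum.
        # On the first key this factor is exp(-inf) = 0, which is harmless
        # because the accumulators are still zero.
        correction = math.exp(running_max - new_max)
        weight = math.exp(score - new_max)
        running_sum = running_sum * correction + weight
        acc = [a * correction + weight * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [0.1, -0.2, 0.3, 0.4]
keys = [[1.0, 0.0, 2.0, -1.0], [0.5, 0.5, -0.5, 0.0], [3.0, -2.0, 1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
naive_out = naive_attention_row(q, keys, values)
streaming_out = streaming_attention_row(q, keys, values)
```

Both functions return the same result up to floating-point rounding, but the streaming version never holds the full list of scores, which is what allows a fused kernel to keep its working set in SRAM.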
Why this is the dominant speedup:
Eager-mode HBM traffic per block: ~20-50 MB (Q, K, V, scores, weights, attn, ffn intermediates).
Fused HBM traffic per block: ~5-10 MB (residual stream in + out, weights).
3-5× less HBM traffic. Since most ops are memory-bound (Ch.21), this directly translates to 2-3× wall-clock speedup.
Kernel launch reduction is secondary — at modern batch sizes, each kernel is large enough that launch overhead is small. The main win is HBM bandwidth.
The picture:
torch.compile is essentially “automatic FlashAttention for everything else.” It finds opportunities to keep more work in SRAM, just like FlashAttention does for attention, and applies them across the model. The result is broadly applicable speedup without per-model hand-tuning.
Next: §22.2 — CUDA, Triton, MLX. What a ‘kernel’ actually is, what writing one looks like, and the higher-level abstractions (Triton, MLX) that make kernel-writing more accessible.