CUDA, Triton, MLX — what a ‘kernel’ is
Throughout Parts I-II we wrote SIMD kernels in C: explicit AVX2 intrinsics, manual cache blocking, tight loops. Those run on CPUs. Their GPU equivalents are CUDA kernels, which are also C/C++ and also explicit, but written for thousands of GPU threads running the same code. The catch: writing CUDA is tedious and error-prone. Triton (released by OpenAI in 2021) is a Python DSL that compiles down to the same PTX that CUDA targets, but lets you write kernels in something closer to NumPy. MLX (Apple, 2023) is a higher-level array framework for Apple Silicon. This section shows a real CUDA dot product, compares it to its Triton equivalent, and walks the kernel-writing stack.
A CUDA dot product
The CUDA equivalent of Ch.1’s dot product kernel:
What’s going on in this kernel:
- Many threads run in parallel (typically 128-1024 per block, many blocks per launch).
- Each thread computes a partial sum of the dot product over its assigned range.
- Shared memory (SRAM!) holds partial sums during a tree-reduction within the block.
- atomicAdd combines block sums into a single global result.
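The four steps above can be emulated in plain Python. This is a sketch of the control flow only, not GPU code: the loops over blocks and threads run sequentially here, whereas a GPU runs them in parallel. The function name `dot_emulated`, the grid-stride assignment of elements to threads, and the default sizes are illustrative assumptions, not the chapter's kernel.

```python
def dot_emulated(a, b, num_blocks=4, threads_per_block=8):
    """Sequential emulation of a block-wise GPU dot product (illustrative only)."""
    assert len(a) == len(b)
    # The tree reduction below assumes a power-of-two block size.
    assert threads_per_block & (threads_per_block - 1) == 0
    total_threads = num_blocks * threads_per_block
    result = 0.0  # stands in for the single global accumulator

    for block in range(num_blocks):            # on a GPU, blocks run independently
        shared = [0.0] * threads_per_block     # stands in for per-block shared memory
        for tid in range(threads_per_block):   # on a GPU, threads run in parallel
            gid = block * threads_per_block + tid
            partial = 0.0
            # Assumed work split: each thread strides through the input.
            for i in range(gid, len(a), total_threads):
                partial += a[i] * b[i]
            shared[tid] = partial
        # A real kernel needs __syncthreads() here and after every reduction step.
        stride = threads_per_block // 2
        while stride > 0:                      # tree reduction: halve active threads
            for tid in range(stride):
                shared[tid] += shared[tid + stride]
            stride //= 2
        result += shared[0]                    # stands in for atomicAdd on the global sum
    return result
```

The tree reduction takes log2(threads_per_block) steps instead of threads_per_block - 1 sequential additions, which is why real kernels stage partial sums in shared memory before touching global memory.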
CUDA kernels give full control over the GPU but at the cost of boilerplate. The dot-product example is “simple” by CUDA standards; a FlashAttention kernel is 2000+ lines of CUDA, optimised for specific GPU generations.
Triton — Python DSL that compiles to CUDA
The Triton paper (Tillet et al., 2019, “Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations”) introduced the language and compiler; OpenAI’s 2021 release packaged it as a Python DSL that targets NVIDIA GPUs:
Triton is dramatically more concise than CUDA — typically 5-10× fewer lines for equivalent kernels. The compiler handles:
- Thread indexing within blocks.
- Shared memory allocation and synchronisation.
- Warp-level reduction patterns.
- Pointer arithmetic and masking.
What the programmer specifies:
- The block size and tiling strategy.
- The high-level computation (load, compute, store).
- Memory access patterns (Triton’s compiler optimises these).
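The division of labour above can be illustrated with a plain-Python emulation of the block-level programming style. This is not Triton code and does not import Triton; the helpers `masked_load`, `dot_program`, and `dot_blocked` are hypothetical names that mirror the roles of Triton's `tl.load` (with `mask` and `other`), `tl.program_id`, `tl.arange`, and `tl.sum`.

```python
def masked_load(buf, offsets, mask, other=0.0):
    """Emulate a masked block load: lanes where mask is False receive `other`."""
    return [buf[o] if m else other for o, m in zip(offsets, mask)]


def dot_program(a, b, pid, block_size):
    """One program instance: operates on a whole block of elements at once."""
    # Offsets for this block, analogous to pid * BLOCK + arange(0, BLOCK).
    offsets = [pid * block_size + i for i in range(block_size)]
    # Mask off lanes that fall past the end of the input.
    mask = [o < len(a) for o in offsets]
    x = masked_load(a, offsets, mask)
    y = masked_load(b, offsets, mask)
    # Block-level reduction; masked lanes contribute 0.0 * 0.0.
    return sum(xi * yi for xi, yi in zip(x, y))


def dot_blocked(a, b, block_size=8):
    """Launch enough program instances to cover the input, then combine."""
    num_programs = (len(a) + block_size - 1) // block_size  # ceiling division
    return sum(dot_program(a, b, pid, block_size) for pid in range(num_programs))
```

Note what is absent: no per-thread index, no shared-memory buffer, no explicit synchronisation. The programmer chooses `block_size` and writes load, compute, and reduce on whole blocks; everything below that level is the compiler's job.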
Triton’s hidden cost: it’s less flexible than raw CUDA. Some optimisations (specific warp-level tricks, complex shared-memory layouts) need raw CUDA. But for ~80% of LLM kernel use cases, Triton is enough and dramatically faster to write.
The three levels of the GPU execution hierarchy:
- Thread: the smallest unit of execution. Each thread runs the kernel function independently. A thread has its own registers and program counter. Typical work: ONE output element or a small tile of output.
- Block: a group of threads that run on the same Streaming Multiprocessor (SM). Threads in a block can share memory (via shared memory / SRAM) and synchronise (__syncthreads). Typical block size: 128-1024 threads.
- Grid: all the blocks needed to cover the work. Each block runs INDEPENDENTLY (cannot synchronise with other blocks) on whichever SM has capacity.
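The hierarchy boils down to a small amount of index arithmetic, sketched here in plain Python for a 1D launch. The function names are hypothetical; `global_index` corresponds to the usual CUDA expression `blockIdx.x * blockDim.x + threadIdx.x`.

```python
def launch_config(n, threads_per_block=256):
    """Number of blocks needed so that blocks * threads_per_block >= n."""
    blocks = (n + threads_per_block - 1) // threads_per_block  # ceiling division
    return blocks, threads_per_block


def global_index(block_idx, thread_idx, threads_per_block):
    """Position of a thread within the whole grid."""
    return block_idx * threads_per_block + thread_idx


def covered_elements(n, threads_per_block=256):
    """Enumerate the elements touched by a launch, applying the bounds guard."""
    blocks, tpb = launch_config(n, threads_per_block)
    touched = []
    for b in range(blocks):
        for t in range(tpb):
            i = global_index(b, t, tpb)
            if i < n:  # threads past the end of the data do nothing
                touched.append(i)
    return touched
```

For n = 1000 and 256 threads per block, the launch needs 4 blocks (1024 threads), and the bounds guard idles the last 24 threads.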
Mapping for a matmul C = A · B with A: M×K, B: K×N:
- Block layout: use a 2D grid of blocks, where each block computes a TILE of C of size BLOCK_M × BLOCK_N (e.g., 128 × 128). Total blocks: ⌈M / BLOCK_M⌉ × ⌈N / BLOCK_N⌉, rounding up so that partial tiles at the edges are covered.
- Thread layout: within a block, 256 threads (16 × 16). Each thread computes a small (BLOCK_M/16) × (BLOCK_N/16) chunk of the block’s tile (e.g., 8 × 8 = 64 output values per thread).
- Per-thread work: loop over k from 0 to K in steps of BLOCK_K, loading a strip of A and a strip of B into shared memory and registers, and accumulating into the thread’s output chunk.
- Shared memory usage: within each block, threads cooperatively load A’s tile (BLOCK_M × BLOCK_K) and B’s tile (BLOCK_K × BLOCK_N) into shared memory. Each thread then reads from shared memory for its accumulations.
- Tensor cores: for bf16/fp16 matmul, the inner loop uses tensor-core instructions (mma.sync on Ampere, wgmma on Hopper) that multiply small fixed-size tiles (e.g., 16×8×16) in hardware rather than one scalar at a time.
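The tiling scheme above can be sketched as a sequential Python loop nest. It is a minimal emulation under stated assumptions: matrices are lists of lists, the two outer loops stand in for the grid of blocks, the slices `a_tile` and `b_tile` stand in for shared-memory staging, and the per-thread split and tensor cores are not modelled. `tiled_matmul` and `ceil_div` are hypothetical names.

```python
def ceil_div(a, b):
    """Ceiling division, used to count tiles including partial edge tiles."""
    return (a + b - 1) // b


def tiled_matmul(A, B, block_m=2, block_n=2, block_k=2):
    """Compute C = A @ B one output tile at a time (illustrative emulation)."""
    M, K, N = len(A), len(B), len(B[0])
    assert all(len(row) == K for row in A)
    C = [[0.0] * N for _ in range(M)]

    for bm in range(ceil_div(M, block_m)):        # grid dimension over rows of C
        for bn in range(ceil_div(N, block_n)):    # grid dimension over columns of C
            m0, n0 = bm * block_m, bn * block_n
            m1, n1 = min(m0 + block_m, M), min(n0 + block_n, N)
            for k0 in range(0, K, block_k):       # march along the K dimension
                k1 = min(k0 + block_k, K)
                # Stage one tile of A and one tile of B ("shared memory").
                a_tile = [A[i][k0:k1] for i in range(m0, m1)]
                b_tile = [B[k][n0:n1] for k in range(k0, k1)]
                # Accumulate this K-slice into the block's output tile.
                for i in range(m1 - m0):
                    for j in range(n1 - n0):
                        acc = C[m0 + i][n0 + j]
                        for k in range(k1 - k0):
                            acc += a_tile[i][k] * b_tile[k][j]
                        C[m0 + i][n0 + j] = acc
    return C
```

Each staged tile is reused by every output element in the block, so each value loaded from slow memory feeds many multiply-adds. That reuse is the point of tiling.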
For attention’s Q·K^T:
- Same structure as matmul, with Q and K being the inputs.
- The KEY innovation of FlashAttention: load a Q tile and a K tile into shared memory, compute scores, apply softmax (using a running max and a running sum), multiply by the matching V tile, ALL while staying in on-chip memory. Don’t materialise the full N×N score matrix in HBM.
- This requires more shared memory management than a plain matmul — hence the FlashAttention kernel’s complexity (2000+ lines of CUDA).
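The running-statistics softmax is the step that makes tiling possible, so a small sketch may help. The plain-Python code below processes a single query vector against K and V in tiles and compares against a reference that materialises all scores. It is an illustration of the rescaling idea under simplifying assumptions (one query, no 1/sqrt(d) scaling, no masking), not the FlashAttention kernel; all names are hypothetical.

```python
import math


def attention_reference(q, K, V):
    """Materialise every score, then softmax, then weight V."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(w[j] * V[j][c] for j in range(len(V))) / z for c in range(len(V[0]))]


def attention_streaming(q, K, V, tile=2):
    """Same result, but only one tile of scores exists at any time."""
    d = len(V[0])
    running_max = float("-inf")
    denom = 0.0          # running sum of exp(score - running_max)
    acc = [0.0] * d      # running weighted sum of V rows, same scaling as denom
    for j0 in range(0, len(K), tile):
        k_tile, v_tile = K[j0:j0 + tile], V[j0:j0 + tile]
        s = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(s))
        # Rescale earlier results to the new max; exp(-inf) is 0.0 on the first tile.
        scale = math.exp(running_max - new_max)
        p = [math.exp(x - new_max) for x in s]
        denom = denom * scale + sum(p)
        acc = [a * scale + sum(pj * v[c] for pj, v in zip(p, v_tile))
               for c, a in enumerate(acc)]
        running_max = new_max
    return [a / denom for a in acc]
```

The streaming version keeps only `running_max`, `denom`, and `acc` between tiles, which is what lets the real kernel hold its working set on-chip regardless of sequence length.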
Why this matters:
Good GPU performance comes from:
- Right tile sizes so shared memory is well utilised.
- Coalesced HBM accesses (adjacent threads load adjacent memory).
- Hiding HBM latency by overlapping loads with computes.
- Using tensor cores when possible.
Triton handles much of this automatically; raw CUDA requires careful hand-tuning.
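To make "coalesced" concrete, the sketch below counts how many memory segments one warp touches for a given access pattern. It is a simplified model, assuming a 32-thread warp, 4-byte elements, and 128-byte segments; real hardware behaviour depends on the GPU generation and cache configuration. The function names are hypothetical.

```python
def warp_addresses(base, stride_elems, elem_bytes=4, warp_size=32):
    """Byte address read by each lane when lane i accesses element i * stride_elems."""
    return [base + lane * stride_elems * elem_bytes for lane in range(warp_size)]


def segments_touched(addresses, segment_bytes=128):
    """Number of distinct aligned segments needed to serve all addresses."""
    return len({addr // segment_bytes for addr in addresses})


# Adjacent lanes read adjacent floats: the whole warp fits in one segment.
coalesced = segments_touched(warp_addresses(base=0, stride_elems=1))

# Lanes read floats 32 elements apart (e.g. walking a column of a row-major
# matrix with 32 columns): every lane lands in a different segment.
strided = segments_touched(warp_addresses(base=0, stride_elems=32))
```

In this model the strided pattern moves 32 times as much data from HBM to deliver the same 32 useful values, which is why access layout often matters more than arithmetic.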
MLX — Apple Silicon’s framework
MLX (Apple, 2023) is Apple’s answer to “how do you efficiently run ML on M-series chips”:
MLX is Apple’s bet that the “future of consumer ML” is on-device. It’s well-designed for the M-series architecture (specifically the unified memory) and has good integration with Apple’s tooling. The community is small but growing.
The catch: MLX is Apple-only. There’s no path to CUDA, no community of kernel writers, no path to scale. It’s a specialised tool for a specific platform.
The trade-off between Triton and raw CUDA:
Triton is easier to write, easier to debug, and well supported: PyTorch’s Inductor compiler (the backend behind torch.compile) generates Triton kernels. The compiler handles many optimisations automatically.
Raw CUDA gives full control but requires writing more code and managing more details. Specific low-level optimisations (warp specialisation, specific memory patterns, async copies) are easier in CUDA.
When Triton is enough:
For ~80% of new kernel work:
- Element-wise and row-wise normalisation operations (GELU, LayerNorm, RMSNorm).
- Simple matmuls.
- Fused operations where the fusion is the main win.
- Quick prototyping and iteration.
- Kernels for less-popular precisions (fp4, fp8) where CUDA’s tooling is less mature.
FlashAttention 2’s Triton version achieved ~80% of the hand-tuned CUDA version’s performance with 1/10 the code. For most labs, this is the right trade-off.
When raw CUDA is needed:
For peak performance on specific hardware:
- Cutting-edge optimisations (warp specialisation, async copies, specific PTX instructions).
- Multi-stage pipelining that Triton’s compiler doesn’t optimise.
- Hopper-specific features (TMA, distributed shared memory) for which Triton support has lagged the hardware.
- Performance gaps that matter at billion-dollar training-run scale.
FlashAttention 3 (2024) targets H100 specifically and uses CUDA for the latest hardware features. The 20% performance gap over Triton matters when you’re training $100M models.
The pattern:
- Research / prototyping: Triton (or torch.compile generating Triton).
- Production / standard inference: Triton, hand-tuned where needed.
- Frontier / pre-training: raw CUDA (FlashAttention 3, CUTLASS, custom kernels).
The hierarchy reflects the cost trade-off: Triton saves engineering time at the cost of the last 10-20% of performance. At small scale, save time. At frontier scale, that 20% is worth months of engineering.
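A back-of-envelope calculation shows why the same gap is negligible at one scale and decisive at another. The sketch below is only arithmetic: the $100M budget and the 80% relative performance come from the text above, while `kernel_fraction` (the share of run time spent in the kernel in question) is a hypothetical parameter chosen purely for illustration.

```python
def kernel_gap_cost(run_cost, kernel_fraction, relative_perf):
    """Spend attributable to a kernel running at `relative_perf` of a tuned one.

    run_cost:        total cost of the run with the slower kernel
    kernel_fraction: assumed share of run time spent inside that kernel
    relative_perf:   slower kernel's throughput relative to the tuned kernel
    """
    kernel_cost = run_cost * kernel_fraction
    # The tuned kernel would need only relative_perf of that time.
    return kernel_cost * (1.0 - relative_perf)


# Hypothetical: the kernel accounts for 30% of a $100M run.
frontier = kernel_gap_cost(100e6, kernel_fraction=0.3, relative_perf=0.8)

# Same assumptions on a $50k research run.
research = kernel_gap_cost(50e3, kernel_fraction=0.3, relative_perf=0.8)
```

Under these assumptions the gap is worth about $6M on the large run and about $3k on the small one, so the same months of engineering pay for themselves in one case and not in the other.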
The kernel stack in practice
Why Apple shipped MLX:
PyTorch’s MPS (Metal Performance Shaders) backend works on Mac, but with friction:
- Performance: significantly worse than the same model on equivalent CUDA. Apple Silicon’s hardware can do more than what MPS exposes.
- Reliability: some operations fall back to CPU (slow); others crash or produce wrong results.
- Tooling: limited support for Apple-specific features (unified memory, ANE).
- Roadmap: PyTorch’s MPS development is community-driven; Apple has limited influence over priorities.
Apple’s response: build a framework FROM SCRATCH for their hardware.
MLX’s niche:
- On-device inference: the primary target. Run frontier LLMs locally on Mac without depending on PyTorch.
- Apple Silicon-specific optimisations: tight integration with unified memory; GPU work runs through Metal. MLX targets the CPU and GPU; it does not dispatch to the Neural Engine.
- Research at Apple: Apple’s internal ML research uses MLX; the framework is shaped by their use cases.
- Open source: first major Apple ML framework that’s openly developed; community can contribute.
Vs PyTorch + MPS:
- MLX is FASTER on Mac for typical inference workloads (2-3× in many benchmarks).
- MLX is SIMPLER for Mac-specific use cases (unified memory, on-device).
- MLX is more LIMITED in ecosystem: fewer libraries, fewer pre-trained models, less community knowledge.
- MLX is Mac-only — no path to NVIDIA, no path to scale.
The realistic choice:
If you’re building an Apple-only app (e.g., a desktop LLM tool for Mac, an iOS app): MLX is becoming the right choice.
If you’re doing cross-platform research: PyTorch (with MPS for Mac development, CUDA for actual experiments).
If you’re at a frontier lab: NVIDIA / CUDA, full stop.
MLX is positioned as the “consumer / edge” ML framework for Apple’s ecosystem. It’s not competing with PyTorch for the cloud / research market, but it’s clearly the right tool for the on-device space.
Next: §22.3 — ONNX, GGUF, safetensors. Model interchange formats and what a model file actually contains.