BEYOND TRANSFORMERS
Section 20.1
01

SSMs and the Mamba selectivity mechanism

Attention is O(N²) in sequence length. FlashAttention (Ch.13) made this practical at the constant factor, but didn’t change the asymptotic — at N=256K, even FlashAttention struggles. State-space models (SSMs) propose a fundamentally different mechanism: a linear recurrence that runs in O(N) time and O(N·d_state) memory. The earliest SSMs (Linear State-Space Layers, S4, Gu 2021) were strict linear recurrences with FIXED parameters — they couldn’t do content-addressable lookup like attention. Mamba (Gu & Dao 2023) fixed this by making the recurrence parameters input-dependent — “selectivity” — recovering attention-like capability at linear cost. This section walks the SSM math, the Mamba selectivity mechanism, and a kernel that demonstrates content-addressable memory in O(N) time.

The SSM recurrence

A discrete state-space model is a linear recurrence on a hidden state:

x_t = A · x_{t-1} + B · u_t (state update) y_t = C · x_t (output) where: x_t ∈ ℝ^N hidden state at time t (N = state dimension, typically 16-128) u_t ∈ ℝ^D input at time t (D = model dimension) A ∈ ℝ^{N×N} state transition matrix (usually diagonal for cheapness) B ∈ ℝ^{N×D} input projection (write into state) C ∈ ℝ^{D×N} output projection (read from state) The recurrence runs in O(T) time and O(N) memory (just the running state). Per-step work: O(N²) for the A · x_{t-1} multiplication (or O(N) if A is diagonal). Per-sequence: O(T · N · D) vs attention's O(T² · D).

SSMs are linear recurrences — the structure is the same as classical RNNs (LSTM, GRU) but the math is simpler: pure linear operations with no nonlinearity inside the recurrence. The nonlinearity goes OUTSIDE the recurrence, in pre/post layers.

The classical SSM problem: A, B, C are fixed (time-invariant). The recurrence is “linear time-invariant” (LTI). LTI systems can be analysed cleanly (convolutional form, FFT acceleration) but they have a critical limitation: they cannot do content-addressable lookup. For any fixed B, the state always WRITES the same way regardless of input content. There’s no way for a position to “ignore” certain inputs based on what they are.

Mamba’s selectivity fix

Mamba (Gu & Dao 2023, “Mamba: Linear-Time Sequence Modeling with Selective State Spaces”) made the breakthrough: make B and C input-dependent.

Selective SSM (Mamba): B_t = Linear_B(u_t) ← B is now a function of the current input C_t = Linear_C(u_t) ← C is similarly input-dependent Δ_t = softplus(Linear_Δ(u_t)) ← discretisation step (continuous-time interpretation) x_t = exp(A · Δ_t) · x_{t-1} + Δ_t · B_t · u_t y_t = C_t · x_t The crucial change: B_t and C_t depend on the CURRENT TOKEN's input. This means: - A token can choose to write strongly to the state (large B_t magnitude) - Or barely write at all (small B_t magnitude — "skip this token") - And a later token can choose what to read from the state (C_t pattern)

Mamba selectivity is what gives SSMs attention-like power. By controlling B_t, the model decides “how much should this token write to the running state?” By controlling C_t, the model decides “what aspect of the state should I read out NOW?” Together: input-dependent write + input-dependent read = content addressing.

Now make it run

The kernel implements a selective SSM running over a 64-token sequence. We inject a “key” at position 10 (large B writing to specific state slots) and a “query” at position 30 (large C reading from those slots), and verify the read-out at position 30 recovers the key’s information:

ssm_recurrent.c — ssm_forward C · selective SSM recurrence
/* Run SSM forward in O(T) recurrent form. */
static void ssm_forward(
    const float* U,      /* [T, D_IN] input sequence */
    const float* B,      /* [T, N_STATE × D_IN] per-timestep B */
    const float* C,      /* [T, D_IN × N_STATE] per-timestep C */
    float* Y,            /* [T, D_IN] output */
    int t_seq)
{
    float x[N_STATE];
    memset(x, 0, sizeof(x));   /* x_0 = 0 */

    for (int t = 0; t < t_seq; t++) {
        /* x_t  =  A · x_{t-1}  +  B_t · u_t */
        float x_new[N_STATE];
        for (int i = 0; i < N_STATE; i++) {
            float s = A_diag[i] * x[i];
            for (int j = 0; j < D_IN; j++)
                s += B[t * N_STATE * D_IN + i * D_IN + j] * U[t * D_IN + j];
            x_new[i] = s;
        }
        memcpy(x, x_new, sizeof(x));

        /* y_t  =  C_t · x_t */
        for (int d = 0; d < D_IN; d++) {
            float s = 0;
            for (int i = 0; i < N_STATE; i++)

Output:

Output magnitude (||y_t||²) at selected timesteps:
  t =   0   ||y||² = 0.002
  t =   5   ||y||² = 0.003
  t =  10   ||y||² = 5.593 ← KEY WRITE
  t =  15   ||y||² = 1.551
  t =  20   ||y||² = 6.594 ← (no query)
  t =  25   ||y||² = 2.594
  t =  30   ||y||² = 1565.680 ← QUERY READ
  t =  35   ||y||² = 0.679
  ...

The output at t=30 (the query) is 300× the baseline magnitude — the model successfully recalled the key’s information across 20 timesteps. Without selectivity, the SSM would average across all positions (no content addressing); with selectivity, it acts like attention with O(N) cost.

— think, then check —

The SSM recurrence:

  • x_t = A · x_(t−1) + B · u_t (state update)
  • y_t = C · x_t (output)

Shapes:

  • x_t ∈ ℝ^N where N is the state size (typically 16-128).
  • u_t ∈ ℝ^D where D is the model dim (typically d_model = 4096 in big models).
  • A ∈ ℝ^(N×N), often DIAGONAL (Mamba uses diagonal A) so the cost is O(N) not O(N²).
  • B ∈ ℝ^(N×D), the input-to-state projection.
  • C ∈ ℝ^(D×N), the state-to-output projection.

Per-token cost (with diagonal A):

  • State update: O(N) for diagonal A · x_(t-1), plus O(N·D) for B · u_t. Total: O(N·D).
  • Output: O(N·D) for C · x_t.
  • Per-token: O(N·D).

Per-sequence cost: O(T · N · D), linear in sequence length T.

Vs attention: O(T² · D) — quadratic in T. For T = 128K and N = 16, the ratio is N/T = 16/131072 ≈ 0.0001. SSM is ~8000× cheaper at this sequence length.

This is the structural reason for interest in SSMs: linear scaling makes very long contexts (1M+ tokens) computationally feasible.

— think, then check —

What selectivity enables:

In classical SSMs, B and C are FIXED MATRICES. The same B writes every input into the state; the same C reads the state every step. No matter the content of u_t, the contribution to the state is “B · u_t” — deterministic.

Result: classical SSMs cannot do content-based filtering. They can’t say “this token is a function word, ignore it” or “this token is the SUBJECT, store it carefully.” They average over the input stream.

Mamba’s selectivity makes B and C functions of the current input. Now the model can:

  • Make B_t small for “uninteresting” tokens → state preserves prior info, current token barely contributes.
  • Make B_t large for “key” tokens → state strongly absorbs the current token’s content.
  • Use specific C_t patterns to selectively READ from the state — “extract the subject I stored 30 tokens ago.”

This is content-addressable memory, implemented through the input-dependent recurrence parameters.

The technical cost:

Classical (LTI) SSMs can be computed as a CONVOLUTION over the sequence:

y = u * (CB, CAB, CA²B, CA³B, …)

This convolution can be done in O(T log T) with FFT. Very efficient.

Selective SSMs CAN’T be expressed as a convolution because the recurrence parameters change per step. The convolutional shortcut is gone.

Mamba’s solution: a SPECIALISED PARALLEL SCAN kernel that computes the selective recurrence on GPU using a tree-reduction pattern. The naive sequential recurrence would be O(T) but with VERY low parallelism (each step depends on the previous). The parallel scan achieves O(log T) wall-clock per layer with full GPU utilisation, by exploiting the associativity of the linear recurrence.

Without this kernel, Mamba would be hopelessly slow on GPU. With it, Mamba achieves similar throughput to FlashAttention at long context lengths AND linear scaling — a genuine architectural win.

The kernel is a significant systems-engineering effort (CUDA + careful memory layout); its existence is part of why Mamba beats other SSM variants in practice.

How Mamba compares to attention numerically

Attention at sequence T, head_dim d, head count H: compute: O(H · T² · d) memory: O(H · T²) for the score matrix (or O(H · T) with FlashAttention via streaming) Mamba at sequence T, state size N, model dim D: compute: O(T · N · D) memory: O(T · D) for activations + O(N) for state per layer For typical configs: Llama 7B attention: H=32, d=128, T=4096 → ~67M ops per sequence per layer Mamba 2.8B: N=16, D=2048, T=4096 → ~134M ops per sequence per layer At T=128K: Llama 7B attention: ~134 BILLION ops per sequence per layer (32× more) Mamba: ~4.2 BILLION ops per sequence per layer At T=1M: Attention is essentially intractable; Mamba is fine.

The asymptotic dominance is real, but the constant factor matters. For T < ~8K, FlashAttention + GQA make attention competitive with Mamba on H100. The win starts above ~32K and is decisive above ~128K — exactly the long-context regime that’s now critical for document understanding, codebase navigation, multi-turn assistants.

— think, then check —

Setup: 7B model, 32 layers, 4096 d_model, 256K context.

Attention (Llama-style, with FlashAttention + GQA g=8):

Per-token decode (autoregressive, KV cache hit):

  • Attention compute: per head, per layer, the new Q attends to 256K cached K’s. Cost: ~256K · 128 = 33M FLOPs per head per layer. With 32 heads × 32 layers: ~34 GFLOPs per token JUST for the attention dot products.
  • KV cache memory: 256K · 32 layers · 2(K,V) · 8 KV heads · 128 d_k = 17 GB per sequence (fp16).
  • KV cache bandwidth: that 17 GB is read per token. At 3 TB/s H100 HBM bandwidth: ~5.7 ms per token JUST loading the KV cache. This is the dominant cost.
  • Total per-token time: ~6-8 ms wall-clock. Serving rate: ~125-160 tokens/s.

Mamba (similar 2.8-3B parameter model):

Per-token decode (state cache hit):

  • SSM compute: per layer, the new u_t updates the N-dim state and computes a D-dim output. Cost: ~N·D = 16·4096 = 65K FLOPs per layer. × 32 layers ≈ 2 MFLOPs per token. THIRTEEN THOUSAND TIMES less than attention’s 34 GFLOPs.
  • State cache memory: 16 · 32 layers · 4 bytes = 2 KB. Effectively zero compared to attention’s KV cache.
  • State cache bandwidth: 2 KB per token. Negligible.
  • Total per-token time: limited by the linear projections (W_in, W_out, etc.), not the SSM core. ~0.5-1 ms wall-clock per token. Serving rate: ~1000-2000 tokens/s.

Where the difference comes from:

Attention’s per-token cost scales LINEARLY with sequence length. Every new token needs to read every previous KV cache entry. At T=256K, that’s 256K memory accesses per token; at T=1M it’s 1M.

Mamba’s per-token cost is CONSTANT in sequence length. Every new token does the same fixed-size state update regardless of T. The state vector compresses the entire prefix into a fixed-size representation.

The trade-off:

The Mamba state has fixed capacity N · D scalars. At very long context, it can lose information that attention would still retain. Empirical comparisons show Mamba matches attention on most language modeling tasks but lags on tasks requiring precise long-range recall (e.g., “what was the second-to-last sentence in chapter 4?”). The state’s capacity is a real bottleneck.

Hybrid architectures (§20.2) put attention layers among Mamba layers to recover the precise-recall capability at proportionally less cost.

Next: §20.2 — Hybrid architectures (Jamba, Zamba, Samba). Most 2024+ “post-transformer” models are hybrids — some Mamba layers, some attention. The “best of both” rationale.