BEYOND TRANSFORMERS
Section 20.2
02

Hybrid architectures — Jamba, Zamba, Samba, RWKV

By mid-2024, the empirical lesson was clear: pure-Mamba and pure-attention each have failure modes the other doesn’t. Pure Mamba misses on tasks needing precise long-range recall (the fixed-size state can’t hold everything). Pure attention scales quadratically. The fix that’s now ubiquitous: hybrid stacks. Most layers run Mamba; a few strategically-placed attention layers handle the cases Mamba misses. Jamba, Zamba, Samba, Granite 4, and the smaller-scale Falcon-Mamba all follow this pattern, varying mostly in the Mamba-to-attention ratio. This section covers the rationale, the canonical configurations, and the closely-related RWKV lineage which arrives at similar performance from a different starting point.

Why hybrids

The case for each architecture in isolation:

The hybrid bet: most computation can be Mamba-cheap; rare attention-required steps can use a few attention layers. The total cost is dominated by the cheap Mamba layers; the quality is recovered by the attention layers.

Hybrid architectures have come to dominate the SSM space because they get most of the linear-time benefit while keeping a quality floor.

The canonical configurations

Jamba (AI21, 2024) — 12B/52B MoE: Pattern: 1 attention layer per 8 layers (1:8) Total layers: 32 Attention placement: spaced evenly through the stack Plus: MoE in some layers (Mixtral-style) Context: 256K with linear cost in the Mamba layers Zamba (Zyphra, 2024) — 2.7B: Pattern: 1 shared attention layer reused multiple times Mamba layers: ~30 sequential Mamba blocks The attention layer is "called" several times throughout the stack Reduces total params (shared attention) but keeps the recall benefit Samba (Microsoft, 2024) — 3.8B: Pattern: 1 attention layer per 8 layers Notable: uses Mamba + sliding-window attention (4K window per attention layer) This means even the attention layers are O(N · 4K), not O(N²) Hybrid efficiency: both Mamba (linear) AND windowed-attention (linear in N) Granite 4 (IBM, 2025): Pattern: 1 attention per 9 Mamba layers (~10%) Used for code generation and document understanding First production-grade hybrid at the 7B+ scale shipping in enterprise products

The common element: attention is ~10% of layers; Mamba is ~90%. The few attention layers handle the recall-intensive parts of computation; the many Mamba layers do everything else cheaply.

— think, then check —

What’s failing in pure Mamba:

The Mamba state is a fixed-size vector x ∈ ℝ^(N·D). Typical Mamba: N=16, D=4096 → 64K-element state per layer. This state is a COMPRESSION of the entire prefix.

For most language modeling tasks, this compression is fine — the model only needs aggregate context (topic, style, recent entity names). The state’s lossy compression preserves this.

For “Needle in a Haystack” tasks (place a specific fact deep in a 100K-token context, then ask about it), the needle’s information must SURVIVE in the state. With a 64K-element state compressing 100K tokens × 4096 dim = 410M numbers of input, the compression ratio is 6400×. Specific factoids get washed out.

Result: pure Mamba scores 60-80% on Needle in a Haystack at long context, vs Llama-3’s 95-99%.

What hybrid layers fix:

Even one or two attention layers in the stack have access to the FULL prefix via the KV cache — no information is compressed away. They can “look up” the specific factoid directly.

The Mamba layers do the bulk of the computation (cheap); when the model needs precise recall (a specific token’s value), the attention layers handle it (exact).

With ~10% attention layers, Needle in a Haystack scores recover to 90-95% — close to pure attention’s 95-99% at a fraction of the total cost.

The cost math:

10% attention + 90% Mamba in a 32-layer model = 3-4 attention layers, 28-29 Mamba layers.

At T=256K: attention layers cost ~256K² · 4096 ≈ 270 TFLOPs total (vs 80 TFLOPs for the Mamba part). Wait — that’s attention-dominated. Let me reconsider.

Actually the cost ratio depends on T. At T=8K: attention is small (~2 TFLOPs total) vs Mamba (~80 TFLOPs total). At T=256K: attention dominates (~270 vs 80).

For very long contexts where pure attention is impractical, hybrids use SLIDING-WINDOW attention in the attention layers (Samba’s design): attention only attends to the previous 4K tokens, making the attention layer cost ~O(N · 4K) instead of O(N²). Now total cost stays linear in N for ANY ratio of attention.

The bet: a few attention layers (even with sliding window) provide enough recall capability to handle the cases pure Mamba fails on, while most of the compute remains O(N).

RWKV — the other lineage

Peng 2023 “RWKV: Reinventing RNNs for the Transformer Era” arrived at similar conclusions from a different starting point. RWKV is a “linear attention” formulation that:

RWKV mechanism (simplified): For each timestep t: r_t = W_R · u_t (Receptance — what to receive) k_t = W_K · u_t (Key — what to be looked up by) v_t = W_V · u_t (Value — what to provide) Time-weighted aggregation (similar to attention but in O(T)): A_t = Σ_{s ≤ t} exp(w · (t-s) + k_s) · v_s out_t = sigmoid(r_t) · A_t The exp(w · (t-s)) creates an EXPONENTIAL DECAY based on distance — recent tokens have higher weight than distant ones. The whole thing can be computed in O(T) by maintaining a running weighted sum of (k_s, v_s) pairs.

RWKV is structurally distinct from Mamba — it doesn’t have an explicit “state” vector; instead, it maintains a running weighted aggregate of (k, v) pairs. But it shares the core property: linear-time, content-addressable, no quadratic attention.

RWKV’s distinctive feature: dual-mode. During training, it can be computed in PARALLEL like a transformer (all positions at once). During inference, it can be computed in SERIES like an RNN (token by token, with constant state). This dual-mode property is rare and operationally valuable — train fast, deploy efficient.

— think, then check —

The mechanism:

For each timestep t, compute r_t (receptance), k_t (key), v_t (value) — analogous to attention’s R, K, V but used differently.

The aggregation: out_t depends on a SUM over all prior timesteps s ≤ t:

A_t = Σ_(s ≤ t) exp(w · (t-s) + k_s) · v_s

This looks like attention (sum over keys weighted by similarity), but the WEIGHTING is exp(w · (t-s) + k_s) — an exponential function with two parts:

  • w · (t-s): linear decay with distance. w is a (typically negative) trainable parameter. The larger |w|, the more aggressive the decay; the model “forgets” older tokens.
  • k_s: per-token “salience” — a learned function of the input at time s. High k_s means “remember me longer”; low k_s means “I’m not important.”

Operationally: each prior token contributes to A_t with weight that depends on (1) how long ago it was (exp(w · (t-s)) decays with distance), and (2) how important it was (exp(k_s) amplifies salient tokens).

Why this is O(T):

The naive computation of A_t for each t needs O(T) operations (sum over all prior tokens). Over T tokens, that’s O(T²) — same as attention.

But the sum has a RECURRENT structure: A_t = exp(w) · A_(t-1) + exp(k_t) · v_t (after a normalisation step).

This is a linear recurrence on a running “weighted sum” state. We can maintain it in O(1) per step:

numer_t = exp(w) · numer_(t-1) + exp(k_t) · v_t

denom_t = exp(w) · denom_(t-1) + exp(k_t)

A_t = numer_t / denom_t

Out_t = sigmoid(r_t) · A_t.

The state is just (numer_t, denom_t) — two vectors of size d. Total cost: O(T · d). Linear in T.

Comparison to Mamba:

Both achieve O(T). Both have a “fixed-size state” that summarises the prefix. The difference is the mathematical form:

  • Mamba: x_t = A · x_(t-1) + B · u_t (general linear recurrence with input-dependent B, C).
  • RWKV: state is a (numerator, denominator) pair maintaining an exponentially-weighted sum.

RWKV’s exp-decay structure is more rigid (the time weighting is always exponential), but easier to train and implement. Mamba is more flexible but requires the specialised selective-scan kernel.

Empirically: RWKV and Mamba achieve similar perplexity at similar parameter counts. The choice between them is mostly engineering (kernel availability, fine-tuning ecosystem) rather than fundamental capability.

What’s actually shipping in production

Models that have actually shipped (2024-2025): Pure attention (Llama-style): GPT-4, Claude 3/4, Gemini 1.5/2, Llama 3, Mistral, Qwen 2.5/3, DeepSeek V3. Hybrid attention + SSM: Jamba (AI21, GA Jul 2024), Granite 4 (IBM, 2025), Samba (Microsoft research; not commercial yet). Pure SSM: Falcon-Mamba (TII, 2024), Mamba-2 (academic), (no commercial ones at scale) RWKV: RWKV-7 World (2024, ~7B), open weights. Linear attention variants: Some Gemma variants at small scale. The picture: hybrids are GAINING share at the 7-50B mid-range; pure attention still dominates the frontier (where flat-out quality matters and compute is abundant); pure SSM remains an active research area but hasn't shipped commercial frontier models.
— think, then check —

Jamba’s structure:

32 total layers. Each “Jamba block” is a sequence of:

  • 1 attention layer
  • 7 Mamba layers

That’s 4 attention layers and 28 Mamba layers total. Ratio: 1:7.

Of those layers, half are MoE-d (the FFN replaced with 16 experts, top-2 routing). The other half are dense FFNs. MoE is applied to both Mamba and attention layers’ FFN sublayers, not to the SSM or attention computation itself.

Total parameters: 52B if all experts are counted. Active per token: ~12B.

Comparison to Mixtral 8x7B:

  • Mixtral: 8 experts × 7B per expert, top-2 routing, all 32 layers are pure attention.
  • Jamba: 16 experts (in MoE’d layers), 28/32 layers are Mamba (cheap), 4/32 are attention.

Per-token cost at T = 8K (typical):

  • Mixtral: 32 attention layers × O(8K · d) compute each. Active params ~13B used per token. Cost dominated by FFN + attention.
  • Jamba: 4 attention layers × O(8K · d) + 28 Mamba layers × O(d · N · D) for fixed N. Mamba layers are essentially constant-per-token cost. Attention is much smaller fraction of compute. Per-token compute ~30-40% less than Mixtral for similar quality.

Per-token cost at T = 128K (long context):

  • Mixtral: doesn’t fit (~64 GB attention matrices alone); would need extreme optimisation. KV cache: ~hundreds of GB.
  • Jamba: handles it easily. 4 attention layers’ KV cache is small (each layer’s KV is for one head group); 28 Mamba layers have constant state. Total memory: a few GB. Throughput stays near 8K-context throughput.

The Jamba bet:

“Most of the work in language modeling is OK with O(N) processing; a small number of attention layers cover the cases where exact recall matters. Combining this with MoE (which improves quality at fixed compute) gives a model that’s smaller, faster, and longer-context than a pure-attention MoE of the same active-param count.”

Empirically: Jamba’s 12B-active performance is on par with Mixtral 8x7B (13B-active) on most benchmarks. At long context, Jamba dominates because Mixtral can’t go past ~32K efficiently.

Next: §20.3 — Honest assessment. Where SSMs and hybrids win, where they don’t, and what the realistic 2-3 year trajectory looks like.