Hybrid architectures — Jamba, Zamba, Samba, RWKV

Section 20.2


By mid-2024, the empirical lesson was clear: pure Mamba and pure attention each have failure modes the other doesn’t. Pure Mamba misses on tasks needing precise long-range recall (the fixed-size state can’t hold everything). Pure attention scales quadratically with sequence length. The fix that’s now ubiquitous is the hybrid stack: most layers run Mamba, and a few strategically placed attention layers handle the cases Mamba misses. Jamba, Zamba, Samba, and Granite 4 all follow this pattern, varying mostly in the Mamba-to-attention ratio (Falcon-Mamba, often mentioned alongside them, is a pure-SSM model with no attention layers). This section covers the rationale, the canonical configurations, and the closely related RWKV lineage, which arrives at similar performance from a different starting point.

Why hybrids

The case for each architecture in isolation:

Pure Mamba: linear-time, fixed-size state. Wins on long sequences when content can be summarised. Loses on tasks where every token might be recalled (e.g., “what was the 17th word in paragraph 3?”).
Pure attention: O(N²) but exact recall. Wins on tasks needing precise long-range lookup. Loses on cost at long context.

The hybrid bet: most computation can be Mamba-cheap; rare attention-required steps can use a few attention layers. The total cost is dominated by the cheap Mamba layers; the quality is recovered by the attention layers.

Hybrid architectures have come to dominate the SSM space because they get most of the linear-time benefit while keeping a quality floor.

Hybrid architectures (2024 design pattern): A transformer-like architecture that mixes Mamba/SSM layers and attention layers. Typical ratio: 7-15 Mamba layers per 1 attention layer. The Mamba layers provide cheap O(N) computation; the attention layers provide content-addressable recall when needed. Examples: Jamba (AI21, 2024), Zamba (Zyphra, 2024), Samba (Microsoft, 2024), Granite 4 (IBM, 2025). Has largely displaced pure SSMs in practice.

The canonical configurations

Jamba (AI21, 2024), 12B active / 52B total, MoE:
  Pattern: 1 attention layer per 8 layers (1 attention : 7 Mamba)
  Total layers: 32
  Attention placement: spaced evenly through the stack
  Plus: MoE in some layers (Mixtral-style)
  Context: 256K with linear cost in the Mamba layers

Zamba (Zyphra, 2024), 2.7B:
  Pattern: 1 shared attention layer reused multiple times
  Mamba layers: ~30 sequential Mamba blocks
  The attention layer is "called" several times throughout the stack
  Reduces total params (shared attention) but keeps the recall benefit

Samba (Microsoft, 2024), 3.8B:
  Pattern: 1 attention layer per 8 layers
  Notable: uses Mamba + sliding-window attention (4K window per attention layer)
  This means even the attention layers are O(N · 4K), not O(N²)
  Hybrid efficiency: both Mamba (linear) AND windowed attention (linear in N)

Granite 4 (IBM, 2025):
  Pattern: 1 attention per 9 Mamba layers (~10%)
  Used for code generation and document understanding
  First production-grade hybrid at the 7B+ scale shipping in enterprise products

The common element: attention is ~10% of layers; Mamba is ~90%. The few attention layers handle the recall-intensive parts of computation; the many Mamba layers do everything else cheaply.
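The interleaving pattern described above can be sketched as a simple layer schedule. This is a minimal illustration, not any model's actual configuration code: the function names, the `attn_offset` parameter, and the 40-layer depth used for the second example are all hypothetical.

```python
def hybrid_schedule(n_layers: int, period: int, attn_offset: int = 0) -> list[str]:
    """Return layer kinds with exactly one attention layer in every `period` layers.

    `attn_offset` (hypothetical knob) picks which slot inside each group of
    `period` layers is the attention layer; every other slot is a Mamba layer.
    """
    if not 0 <= attn_offset < period:
        raise ValueError("attn_offset must satisfy 0 <= attn_offset < period")
    return [
        "attention" if i % period == attn_offset else "mamba"
        for i in range(n_layers)
    ]


def attention_fraction(schedule: list[str]) -> float:
    """Fraction of layers in the schedule that are attention layers."""
    return schedule.count("attention") / len(schedule)


# 32 layers, 1 attention layer per 8 layers (the 1:7 pattern).
one_in_eight = hybrid_schedule(n_layers=32, period=8)
print(one_in_eight.count("attention"), one_in_eight.count("mamba"))  # 4 28
print(attention_fraction(one_in_eight))  # 0.125

# 1 attention per 9 Mamba layers; the depth of 40 is an assumed value.
one_in_ten = hybrid_schedule(n_layers=40, period=10)
print(attention_fraction(one_in_ten))  # 0.1
```

Expressing the ratio as a period makes it easy to see why the published designs land near 10% attention: a period of 8 gives 12.5%, a period of 10 gives 10%.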


What’s failing in pure Mamba:

The Mamba state is a fixed-size vector x ∈ ℝ^(N·D). Typical Mamba: N=16, D=4096 → 64K-element state per layer. This state is a COMPRESSION of the entire prefix.

For most language modeling tasks, this compression is fine — the model only needs aggregate context (topic, style, recent entity names). The state’s lossy compression preserves this.

For “Needle in a Haystack” tasks (place a specific fact deep in a 100K-token context, then ask about it), the needle’s information must SURVIVE in the state. With a 64K-element state compressing 100K tokens × 4096 dim ≈ 410M input numbers, the compression ratio is roughly 6,250× (100,000 tokens squeezed into 16 state slots per channel). Specific factoids get washed out.
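The compression arithmetic can be checked directly. The sketch below uses the section's own numbers (N=16, D=4096, a 100K-token context); the variable names are illustrative. The exact ratio comes out at a few thousand either way: 6,250 if "64K" means 65,536 elements, about 6,400 if it is rounded to 64,000.

```python
N_STATE = 16              # state dimension per channel (N in the text)
D_MODEL = 4096            # channel width (D in the text)
CONTEXT_TOKENS = 100_000  # length of the haystack

# Fixed-size recurrent state per layer: one N-vector for each of D channels.
state_elements = N_STATE * D_MODEL

# Raw input the state has to summarise: one D-vector per token.
input_elements = CONTEXT_TOKENS * D_MODEL

# D cancels, so the ratio is simply tokens per state slot.
compression_ratio = input_elements / state_elements

print(state_elements)     # 65536
print(input_elements)     # 409600000
print(compression_ratio)  # 6250.0
```

Because D cancels, the ratio depends only on context length and N: doubling the context doubles the compression the state must achieve, while an attention layer's KV cache simply grows to match.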

Result: pure Mamba scores 60-80% on Needle in a Haystack at long context, vs Llama-3’s 95-99%.

What hybrid layers fix:

Even one or two attention layers in the stack have access to the FULL prefix via the KV cache — no information is compressed away. They can “look up” the specific factoid directly.

The Mamba layers do the bulk of the computation (cheap); when the model needs precise recall (a specific token’s value), the attention layers handle it (exact).

With ~10% attention layers, Needle in a Haystack scores recover to 90-95% — close to pure attention’s 95-99% at a fraction of the total cost.

The cost math:

10% attention + 90% Mamba in a 32-layer model = 3-4 attention layers, 28-29 Mamba layers.

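The cost ratio depends on T. Full attention scales as T², while the Mamba layers scale as T. At T=8K the attention layers are a small fraction of the total (roughly 2 TFLOPs vs roughly 80 TFLOPs for the Mamba part, by this section’s rough estimate). Going from 8K to 256K multiplies the attention term by 1,024 (32²) but the Mamba term by only 32. At that length a single attention layer’s score computation alone is on the order of 256K² · 4096 ≈ 2.7 × 10^14 operations, so even a handful of full-attention layers is no longer a negligible share of the compute.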

For very long contexts where pure attention is impractical, hybrids use SLIDING-WINDOW attention in the attention layers (Samba’s design): attention only attends to the previous 4K tokens, making the attention layer cost ~O(N · 4K) instead of O(N²). Now total cost stays linear in N for ANY ratio of attention.
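A back-of-the-envelope sketch of the scaling argument above. It counts only the multiply-accumulates of the query-key score computation for one layer and ignores causal masking, the value aggregation, projections, and FFNs, so the absolute numbers are illustrative; the function names are hypothetical. What it shows is the growth rate, which is the point of the sliding-window design.

```python
def full_attention_ops(seq_len: int, d_model: int) -> int:
    """Score-matrix multiply-accumulates when every token attends to every token."""
    return seq_len * seq_len * d_model


def windowed_attention_ops(seq_len: int, d_model: int, window: int) -> int:
    """Same count when each token attends to at most `window` previous tokens."""
    return seq_len * min(seq_len, window) * d_model


D_MODEL = 4096
WINDOW = 4096           # the 4K sliding window from the text
SHORT, LONG = 8_192, 262_144  # 8K and 256K tokens (a 32x increase in length)

full_growth = full_attention_ops(LONG, D_MODEL) // full_attention_ops(SHORT, D_MODEL)
windowed_growth = (
    windowed_attention_ops(LONG, D_MODEL, WINDOW)
    // windowed_attention_ops(SHORT, D_MODEL, WINDOW)
)

print(full_growth)      # 1024  (quadratic: 32 squared)
print(windowed_growth)  # 32    (linear: same factor as the sequence length)
```

With the window in place, the attention layers grow at the same rate as the Mamba layers, so their share of total compute stays fixed as the context gets longer.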

The bet: a few attention layers (even with sliding window) provide enough recall capability to handle the cases pure Mamba fails on, while most of the compute remains O(N).

↳ §20.2 + Jamba/Samba papers

RWKV — the other lineage

Peng et al. (2023), “RWKV: Reinventing RNNs for the Transformer Era”, arrived at similar conclusions from a different starting point. RWKV is a “linear attention” formulation that replaces the softmax over all token pairs with a time-decayed running aggregate:

RWKV mechanism (simplified):

For each timestep t:
  r_t = W_R · u_t   (Receptance: what to receive)
  k_t = W_K · u_t   (Key: what to be looked up by)
  v_t = W_V · u_t   (Value: what to provide)

Time-weighted aggregation (similar to attention but in O(T)):
  A_t = Σ_{s ≤ t} exp(w · (t-s) + k_s) · v_s   (normalised by the sum of the same weights)
  out_t = sigmoid(r_t) · A_t

The exp(w · (t-s)) factor, with w negative, creates an EXPONENTIAL DECAY based on distance: recent tokens have higher weight than distant ones. The whole thing can be computed in O(T) by maintaining a running weighted sum of (k_s, v_s) pairs.

RWKV is structurally distinct from Mamba: rather than a general state-space hidden vector, it maintains a running weighted aggregate of (k, v) pairs. But it shares the core property: linear-time, content-addressable, no quadratic attention.

RWKV (alternative architecture): A linear-time alternative to attention (Peng et al., 2023) that combines RNN-style state propagation with transformer-style learned KV projections. The mechanism: aggregate (k_s, v_s) pairs over the prefix with an exponentially decaying time weight; the running sum can be maintained recurrently in O(T). Trains efficiently in parallel mode (transformer-style); runs efficiently in serial mode (RNN-style). 14B+ models published; competitive on most benchmarks with a smaller compute footprint.

RWKV’s distinctive feature: dual-mode. During training, it can be computed in PARALLEL like a transformer (all positions at once). During inference, it can be computed in SERIES like an RNN (token by token, with constant state). This dual-mode property is rare and operationally valuable — train fast, deploy efficient.


The mechanism:

For each timestep t, compute r_t (receptance), k_t (key), v_t (value), analogous to attention’s Q, K, V but used differently.

The aggregation: out_t depends on a SUM over all prior timesteps s ≤ t:

A_t = Σ_(s ≤ t) exp(w · (t-s) + k_s) · v_s

This looks like attention (sum over keys weighted by similarity), but the WEIGHTING is exp(w · (t-s) + k_s) — an exponential function with two parts:

w · (t-s): an exponent that is linear in distance, so the weight exp(w · (t-s)) decays exponentially. w is a (typically negative) trainable parameter. The larger |w|, the more aggressive the decay; the model “forgets” older tokens.
k_s: per-token “salience”, a learned function of the input at time s. High k_s gives the token a larger starting weight, so it stays influential for longer (“remember me longer”); low k_s means “I’m not important.”

Operationally: each prior token contributes to A_t with weight that depends on (1) how long ago it was (exp(w · (t-s)) decays with distance), and (2) how important it was (exp(k_s) amplifies salient tokens).

Why this is O(T):

The naive computation of A_t for each t needs O(T) operations (sum over all prior tokens). Over T tokens, that’s O(T²) — same as attention.

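But the sum has a RECURRENT structure. Pulling one factor of exp(w) out of every term with s < t shows that the unnormalised sum at step t equals exp(w) times the unnormalised sum at step t-1, plus exp(k_t) · v_t. The normaliser (the sum of the weights) obeys the same recurrence without the v_t factor.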

This is a linear recurrence on a running “weighted sum” state. We can maintain it in O(1) per step:

numer_t = exp(w) · numer_(t-1) + exp(k_t) · v_t

denom_t = exp(w) · denom_(t-1) + exp(k_t)

A_t = numer_t / denom_t

out_t = sigmoid(r_t) · A_t.

The state is just (numer_t, denom_t) — two vectors of size d. Total cost: O(T · d). Linear in T.
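The equivalence between the O(T²) sum and the O(T) recurrence can be checked numerically. The sketch below follows this section's simplified formulas only: it uses a single scalar decay w shared by all channels, omits the receptance gate, and makes no attempt at numerical stability (practical implementations rescale the exponents to avoid overflow). The function names are illustrative, not RWKV's actual API.

```python
import math
import random


def rwkv_naive(keys, values, w):
    """O(T^2) reference: for each t, sum over all s <= t, then normalise."""
    n_steps, dim = len(keys), len(keys[0])
    outputs = []
    for t in range(n_steps):
        row = []
        for c in range(dim):
            weights = [math.exp(w * (t - s) + keys[s][c]) for s in range(t + 1)]
            numer = sum(wt * values[s][c] for s, wt in enumerate(weights))
            row.append(numer / sum(weights))
        outputs.append(row)
    return outputs


def rwkv_recurrent(keys, values, w):
    """O(T) version: carry (numer, denom) forward, one update per token."""
    dim = len(keys[0])
    numer = [0.0] * dim
    denom = [0.0] * dim
    decay = math.exp(w)  # applied once per step to everything already stored
    outputs = []
    for k_t, v_t in zip(keys, values):
        for c in range(dim):
            salience = math.exp(k_t[c])
            numer[c] = decay * numer[c] + salience * v_t[c]
            denom[c] = decay * denom[c] + salience
        outputs.append([n / d for n, d in zip(numer, denom)])
    return outputs


rng = random.Random(0)
T, D = 6, 3
keys = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(T)]
values = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(T)]

slow = rwkv_naive(keys, values, w=-0.5)
fast = rwkv_recurrent(keys, values, w=-0.5)
match = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
    for row_a, row_b in zip(slow, fast)
    for a, b in zip(row_a, row_b)
)
print(match)  # True
```

The two functions correspond to the two modes described earlier: the naive form exposes every position at once (the shape a parallel training kernel exploits), while the recurrent form is what runs token by token at inference with constant memory.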

Comparison to Mamba:

Both achieve O(T). Both have a “fixed-size state” that summarises the prefix. The difference is the mathematical form:

Mamba: x_t = A · x_(t-1) + B · u_t (general linear recurrence with input-dependent B, C).
RWKV: state is a (numerator, denominator) pair maintaining an exponentially-weighted sum.

RWKV’s exp-decay structure is more rigid (the time weighting is always exponential), but easier to train and implement. Mamba is more flexible but requires the specialised selective-scan kernel.

Empirically: RWKV and Mamba achieve similar perplexity at similar parameter counts. The choice between them is mostly engineering (kernel availability, fine-tuning ecosystem) rather than fundamental capability.

↳ §20.2 + RWKV paper

What’s actually shipping in production

Models that have actually shipped (2024-2025):

Pure attention (Llama-style): GPT-4, Claude 3/4, Gemini 1.5/2, Llama 3, Mistral, Qwen 2.5/3, DeepSeek V3.
Hybrid attention + SSM: Jamba (AI21, GA Jul 2024), Granite 4 (IBM, 2025), Samba (Microsoft research; not commercial yet).
Pure SSM: Falcon-Mamba (TII, 2024), Mamba-2 (academic); no commercial ones at scale.
RWKV: RWKV-7 World (2024, ~7B), open weights.
Linear attention variants: some Gemma variants at small scale.

The picture: hybrids are GAINING share at the 7-50B mid-range; pure attention still dominates the frontier (where flat-out quality matters and compute is abundant); pure SSM remains an active research area but hasn't shipped commercial frontier models.


Jamba’s structure:

32 total layers. Each “Jamba block” is a sequence of:

1 attention layer
7 Mamba layers

That’s 4 attention layers and 28 Mamba layers total. Ratio: 1:7.

Of those 32 layers, half are MoE’d (the FFN is replaced with 16 experts, top-2 routing). The other half keep dense FFNs. MoE is applied to the FFN sublayers of both Mamba and attention layers, not to the SSM or attention computation itself.

Total parameters: 52B if all experts are counted. Active per token: ~12B.
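The layer bookkeeping above can be written out as a small layout sketch. The counts (32 layers, 1 attention per 8, MoE on half the layers) come from the text; the exact positions are assumptions made for illustration, namely attention in the first slot of each block and MoE on even-indexed layers. `LayerSpec` and `jamba_like_layout` are hypothetical names, not part of any Jamba codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerSpec:
    mixer: str  # "attention" or "mamba": the sequence-mixing sublayer
    ffn: str    # "moe" or "dense": the feed-forward sublayer


def jamba_like_layout(n_blocks: int = 4, layers_per_block: int = 8,
                      moe_every: int = 2) -> list[LayerSpec]:
    """Build a layer list with one attention layer per block and MoE on a subset.

    Placement is assumed: attention sits in slot 0 of each block, and the FFN
    is MoE on every `moe_every`-th layer starting from layer 0.
    """
    layers = []
    for i in range(n_blocks * layers_per_block):
        mixer = "attention" if i % layers_per_block == 0 else "mamba"
        ffn = "moe" if i % moe_every == 0 else "dense"
        layers.append(LayerSpec(mixer=mixer, ffn=ffn))
    return layers


layout = jamba_like_layout()
n_attention = sum(layer.mixer == "attention" for layer in layout)
n_mamba = sum(layer.mixer == "mamba" for layer in layout)
n_moe = sum(layer.ffn == "moe" for layer in layout)

print(len(layout), n_attention, n_mamba, n_moe)  # 32 4 28 16
```

Keeping the mixer and the FFN as separate fields mirrors the point made above: MoE changes only the feed-forward sublayer, independently of whether the layer mixes tokens with attention or with Mamba.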

Comparison to Mixtral 8x7B:

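Mixtral: 8 experts per FFN layer, top-2 routing, about 47B total parameters (not 8 × 7B, because only the FFNs are replicated; attention weights are shared), all 32 layers are pure attention.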
Jamba: 16 experts (in MoE’d layers), 28/32 layers are Mamba (cheap), 4/32 are attention.

Per-token cost at T = 8K (typical):

Mixtral: 32 attention layers × O(8K · d) compute each. Active params ~13B used per token. Cost dominated by FFN + attention.
Jamba: 4 attention layers × O(8K · d) + 28 Mamba layers × O(d · N · D) for fixed N. Mamba layers are essentially constant-per-token cost. Attention is much smaller fraction of compute. Per-token compute ~30-40% less than Mixtral for similar quality.

Per-token cost at T = 128K (long context):

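Mixtral: all 32 layers keep a KV cache that grows with T. With grouped-query attention (8 KV heads of dimension 128, 16-bit values) that is about 16 GB per sequence at 128K, on top of the weights, and several times more without grouped-query attention or at larger batch sizes. The model was also trained with a 32K window, so 128K is outside its supported range.
Jamba: handles it easily. Only the 4 attention layers keep a KV cache, which under the same assumptions is about 2 GB per sequence; the 28 Mamba layers carry constant-size state. Total cache memory: a few GB. Throughput stays near 8K-context throughput.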
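The memory comparison can be made concrete with a KV-cache size estimate. The formula itself is standard (keys plus values, per layer, per KV head, per token); the specific shapes below are assumptions: 8 KV heads of dimension 128 and 16-bit values for both models, batch size 1, and 131,072 tokens for "128K". Results change substantially with head configuration and batch size, so treat the figures as one illustrative setting.

```python
def kv_cache_bytes(n_attn_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2, batch: int = 1) -> int:
    """Bytes needed to cache keys and values for every attention layer."""
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_value  # K and V
    return batch * seq_len * n_attn_layers * per_token_per_layer


GIB = 1024 ** 3
SEQ_LEN = 131_072  # 128K tokens

# All-attention stack: 32 attention layers (assumed 8 KV heads x 128 dims).
all_attention = kv_cache_bytes(n_attn_layers=32, n_kv_heads=8, head_dim=128,
                               seq_len=SEQ_LEN)

# Hybrid stack: only 4 of 32 layers are attention (same assumed head shapes).
hybrid = kv_cache_bytes(n_attn_layers=4, n_kv_heads=8, head_dim=128,
                        seq_len=SEQ_LEN)

print(all_attention / GIB)     # 16.0
print(hybrid / GIB)            # 2.0
print(all_attention // hybrid) # 8
```

With identical head shapes the ratio is just the ratio of attention-layer counts, 32 to 4, which is why the hybrid's cache advantage holds at any context length.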

The Jamba bet:

“Most of the work in language modeling is OK with O(N) processing; a small number of attention layers cover the cases where exact recall matters. Combining this with MoE (which improves quality at fixed compute) gives a model that’s smaller, faster, and longer-context than a pure-attention MoE of the same active-param count.”

Empirically: Jamba’s 12B-active performance is on par with Mixtral 8x7B (13B-active) on most benchmarks. At long context, Jamba dominates because Mixtral can’t go past ~32K efficiently.

↳ §20.2 + Jamba paper

Next: §20.3 — Honest assessment. Where SSMs and hybrids win, where they don’t, and what the realistic 2-3 year trajectory looks like.