Hybrid architectures — Jamba, Zamba, Samba, RWKV
By mid-2024, the empirical lesson was clear: pure-Mamba and pure-attention each have failure modes the other doesn’t. Pure Mamba misses on tasks needing precise long-range recall (the fixed-size state can’t hold everything). Pure attention scales quadratically with sequence length. The fix that is now widespread: hybrid stacks. Most layers run Mamba; a few strategically placed attention layers handle the cases Mamba misses. Jamba, Zamba, Samba, and Granite 4 all follow this pattern, varying mostly in the Mamba-to-attention ratio. (Falcon-Mamba, by contrast, is an attention-free pure-Mamba model.) This section covers the rationale, the canonical configurations, and the closely related RWKV lineage, which arrives at similar performance from a different starting point.
Why hybrids
The case for each architecture in isolation:
- Pure Mamba: linear-time, fixed-size state. Wins on long sequences when content can be summarised. Loses on tasks where every token might be recalled (e.g., “what was the 17th word in paragraph 3?”).
- Pure attention: O(N²) but exact recall. Wins on tasks needing precise long-range lookup. Loses on cost at long context.
The hybrid bet: most computation can be Mamba-cheap; rare attention-required steps can use a few attention layers. The total cost is dominated by the cheap Mamba layers; the quality is recovered by the attention layers.
Hybrid architectures have come to dominate the SSM space because they get most of the linear-time benefit while keeping a quality floor.
The canonical configurations
The common element: attention is ~10% of layers; Mamba is ~90%. The few attention layers handle the recall-intensive parts of computation; the many Mamba layers do everything else cheaply.
What’s failing in pure Mamba:
The Mamba state is a fixed-size vector x ∈ ℝ^(N·D). Typical Mamba: N=16, D=4096 → 64K-element state per layer. This state is a COMPRESSION of the entire prefix.
For most language modeling tasks, this compression is fine — the model only needs aggregate context (topic, style, recent entity names). The state’s lossy compression preserves this.
For “Needle in a Haystack” tasks (place a specific fact deep in a 100K-token context, then ask about it), the needle’s information must SURVIVE in the state. With a 64K-element state compressing 100K tokens × 4096 dim ≈ 410M numbers of input, the compression ratio is about 6,250× (equivalently, 100K tokens divided by N = 16). Specific factoids get washed out.
Result: pure Mamba scores 60-80% on Needle in a Haystack at long context, vs Llama-3’s 95-99%.
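The compression arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the function name and arguments are illustrative, and it counts raw numbers only (not bits of information), using the N=16, D=4096 figures from the text.

```python
def state_compression_ratio(num_tokens: int, d_model: int, state_n: int) -> float:
    """Ratio of raw prefix activations to the per-layer recurrent state size."""
    prefix_numbers = num_tokens * d_model  # one d_model-dim vector per token
    state_numbers = state_n * d_model      # N state entries per channel
    return prefix_numbers / state_numbers


# The state size is fixed, so the ratio grows linearly with context length.
for tokens in (8_192, 100_000, 256_000):
    ratio = state_compression_ratio(tokens, d_model=4096, state_n=16)
    print(f"{tokens:>7} tokens -> {ratio:,.0f}x compression")
```

Because D cancels, the ratio is simply the number of tokens divided by N: every extra token makes the fixed state work harder, which is why the recall problem gets worse as context grows.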
What hybrid layers fix:
Even one or two attention layers in the stack have access to the FULL prefix via the KV cache — no information is compressed away. They can “look up” the specific factoid directly.
The Mamba layers do the bulk of the computation (cheap); when the model needs precise recall (a specific token’s value), the attention layers handle it (exact).
With ~10% attention layers, Needle in a Haystack scores recover to 90-95% — close to pure attention’s 95-99% at a fraction of the total cost.
The cost math:
10% attention + 90% Mamba in a 32-layer model = 3-4 attention layers, 28-29 Mamba layers.
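The split depends on sequence length T. Attention costs on the order of T² · d per layer, while the Mamba layers cost on the order of T, so going from T=8K to T=256K multiplies the Mamba cost by 32 but the attention cost by 32² = 1024.
With d = 4096 and ignoring constant factors, T² · d is roughly 0.27 TFLOPs per attention layer at T=8K and roughly 280 TFLOPs per attention layer at T=256K. At short context the 3-4 attention layers are a small fraction of total compute; at T=256K they dominate it, despite being only ~10% of the layers.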
For very long contexts where full attention is impractical, hybrids can use SLIDING-WINDOW attention in the attention layers (Samba’s design): each token attends only to a fixed window of recent tokens (say the previous 4K), making the attention layer cost O(N · 4K) instead of O(N²). Total cost then stays linear in N for ANY ratio of attention.
The bet: a few attention layers (even with sliding window) provide enough recall capability to handle the cases pure Mamba fails on, while most of the compute remains O(N).
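To make the scaling argument concrete, here is a rough sketch that compares full and sliding-window attention cost for one layer. The function name is hypothetical, the count is T · context · d with all constant factors dropped (no projections, no causal halving), and the 4K window is the illustrative value used above.

```python
from typing import Optional


def attention_cost(seq_len: int, d_model: int, window: Optional[int] = None) -> int:
    """Order-of-magnitude cost of one attention layer over a whole sequence.

    Each of the seq_len queries looks at `context` keys of width d_model.
    With no window the context is the full sequence (quadratic in seq_len);
    with a window it is capped, so the cost becomes linear in seq_len.
    """
    context = seq_len if window is None else min(window, seq_len)
    return seq_len * context * d_model


for seq_len in (8_192, 262_144):
    full = attention_cost(seq_len, d_model=4096)
    windowed = attention_cost(seq_len, d_model=4096, window=4096)
    print(f"T={seq_len}: full ~{full / 1e12:.2f} TFLOPs, "
          f"4K window ~{windowed / 1e12:.2f} TFLOPs per layer")
```

Growing the sequence 32× multiplies the full-attention figure by 1024 but the windowed figure by only 32, the same growth rate as the Mamba layers.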
RWKV — the other lineage
Peng et al. 2023, “RWKV: Reinventing RNNs for the Transformer Era”, arrived at similar conclusions from a different starting point. RWKV is a “linear attention” formulation.
RWKV is structurally distinct from Mamba: its state is not the hidden vector of a state-space model but a running weighted aggregate of (k, v) pairs. It shares the core property, though: linear-time, content-addressable, no quadratic attention.
RWKV’s distinctive feature: dual-mode. During training, it can be computed in PARALLEL like a transformer (all positions at once). During inference, it can be computed in SERIES like an RNN (token by token, with constant state). This dual-mode property is rare and operationally valuable — train fast, deploy efficient.
The mechanism:
For each timestep t, compute r_t (receptance), k_t (key), v_t (value). These are analogous to attention’s Q, K, V but used differently.
The aggregation: out_t depends on a SUM over all prior timesteps s ≤ t:
A_t = Σ_(s ≤ t) exp(w · (t-s) + k_s) · v_s
This looks like attention (sum over keys weighted by similarity), but the WEIGHTING is exp(w · (t-s) + k_s) — an exponential function with two parts:
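- w · (t-s): a term linear in distance inside the exponent, so the weight itself decays exponentially with distance. w is a (typically negative) trainable parameter. The larger |w|, the more aggressive the decay; the model “forgets” older tokens faster.
- k_s: per-token “salience”, a learned function of the input at time s. High k_s means “weight me heavily”; low k_s means “I’m not important.” It scales a token’s contribution but does not change how fast that contribution decays.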
Operationally: each prior token contributes to A_t with weight that depends on (1) how long ago it was (exp(w · (t-s)) decays with distance), and (2) how important it was (exp(k_s) amplifies salient tokens).
Why this is O(T):
The naive computation of A_t for each t needs O(T) operations (sum over all prior tokens). Over T tokens, that’s O(T²) — same as attention.
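But the sum has a RECURRENT structure: pulling one factor of exp(w) out of every earlier term gives A_t = exp(w) · A_(t-1) + exp(k_t) · v_t.
This is a linear recurrence on a running “weighted sum” state. RWKV also normalises by the total weight, which obeys the same recurrence, so the normalised output (written A_t from here on) can be maintained in O(1) per step: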
numer_t = exp(w) · numer_(t-1) + exp(k_t) · v_t
denom_t = exp(w) · denom_(t-1) + exp(k_t)
A_t = numer_t / denom_t
Out_t = sigmoid(r_t) · A_t.
The state is just (numer_t, denom_t) — two vectors of size d. Total cost: O(T · d). Linear in T.
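The equivalence between the quadratic sum and the O(1)-per-step recurrence can be checked directly. The sketch below handles a single channel with scalar k, v, r per token; the function names and toy values are illustrative, and it follows the simplified formulas in this section rather than any released RWKV kernel (which, among other things, has to guard exp against overflow).

```python
import math


def wkv_naive(w, k, v):
    """O(T^2): rebuild the normalised weighted sum from scratch at every t."""
    out = []
    for t in range(len(k)):
        weights = [math.exp(w * (t - s) + k[s]) for s in range(t + 1)]
        numer = sum(wt * v[s] for s, wt in enumerate(weights))
        out.append(numer / sum(weights))
    return out


def wkv_recurrent(w, k, v):
    """O(T): carry (numer, denom) forward with one constant-time update per token."""
    decay = math.exp(w)
    numer = denom = 0.0
    out = []
    for k_t, v_t in zip(k, v):
        numer = decay * numer + math.exp(k_t) * v_t
        denom = decay * denom + math.exp(k_t)
        out.append(numer / denom)
    return out


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


w = -0.5                      # negative, so older tokens decay
k = [0.1, 2.0, -1.0, 0.3]     # per-token salience
v = [1.0, -2.0, 0.5, 4.0]     # per-token values
r = [0.0, 1.0, -1.0, 2.0]     # per-token receptance

a_naive = wkv_naive(w, k, v)
a_rec = wkv_recurrent(w, k, v)
out = [sigmoid(r_t) * a_t for r_t, a_t in zip(r, a_rec)]  # gated output
print(a_naive)
print(a_rec)
```

The two lists agree up to floating-point rounding. The naive form is the "all positions at once" view used for parallel training; the recurrent form is what runs token by token at inference.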
Comparison to Mamba:
Both achieve O(T). Both have a “fixed-size state” that summarises the prefix. The difference is the mathematical form:
- Mamba: x_t = A_t · x_(t-1) + B_t · u_t, y_t = C_t · x_t (a linear recurrence whose step size Δ and matrices B, C depend on the input, so the effective decay A_t varies token by token).
- RWKV: state is a (numerator, denominator) pair maintaining an exponentially-weighted sum.
RWKV’s exp-decay structure is more rigid (in the form shown here, the decay rate w is a learned constant rather than a function of the input), but easier to train and implement. Mamba is more flexible but requires the specialised selective-scan kernel.
Empirically: RWKV and Mamba achieve similar perplexity at similar parameter counts. The choice between them is mostly engineering (kernel availability, fine-tuning ecosystem) rather than fundamental capability.
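For contrast with the fixed-decay RWKV update, here is a simplified sketch of one token of a Mamba-style selective recurrence for a single channel. It assumes a diagonal A and the commonly described discretisation (decay exp(Δ · A), input scale Δ · B); how Δ, B and C are computed from the input is left out, and all names and values are hypothetical.

```python
import math


def selective_ssm_step(h, u_t, a, b_t, c_t, delta_t):
    """One token of a diagonal selective SSM for one channel.

    h       : list of N state values carried from the previous token
    u_t     : scalar input for this token
    a       : list of N negative decay rates (learned, not input-dependent)
    b_t, c_t: lists of N values that would be computed from the input
    delta_t : non-negative step size that would be computed from the input
    """
    h_new = [
        math.exp(delta_t * a_n) * h_n + delta_t * b_n * u_t
        for h_n, a_n, b_n in zip(h, a, b_t)
    ]
    y_t = sum(c_n * h_n for c_n, h_n in zip(c_t, h_new))
    return h_new, y_t


h = [1.0, 2.0]
a = [-1.0, -2.0]
ones = [1.0, 1.0]

# delta = 0: the token is ignored and the state passes through untouched.
kept, y_kept = selective_ssm_step(h, u_t=5.0, a=a, b_t=ones, c_t=ones, delta_t=0.0)
# large delta: the old state is almost entirely overwritten by the new input.
reset, y_reset = selective_ssm_step(h, u_t=0.0, a=a, b_t=ones, c_t=ones, delta_t=10.0)
print(kept, y_kept)
print(reset, y_reset)
```

The point of the sketch is the role of delta_t: because it varies per token, the model can choose to keep or flush its state, whereas the RWKV form in this section applies the same exp(w) at every step.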
What’s actually shipping in production
Jamba’s structure:
32 total layers. Each “Jamba block” is a sequence of:
- 1 attention layer
- 7 Mamba layers
That’s 4 attention layers and 28 Mamba layers total. Ratio: 1:7.
Of those layers, half are MoE-d (the FFN replaced with 16 experts, top-2 routing). The other half are dense FFNs. MoE is applied to both Mamba and attention layers’ FFN sublayers, not to the SSM or attention computation itself.
Total parameters: 52B if all experts are counted. Active per token: ~12B.
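The layer counts above can be sketched as a small layout generator. The text only fixes the ratios (1 attention per 8 layers, MoE on every other layer); where the attention layer sits inside each block and which alternate layers carry MoE are assumptions here, exposed as hypothetical offset parameters.

```python
from collections import Counter


def hybrid_layout(num_layers=32, attn_period=8, attn_offset=0,
                  moe_period=2, moe_offset=1):
    """Return a list of (mixer, ffn) pairs, one per layer.

    mixer is "attention" once every attn_period layers, "mamba" otherwise.
    ffn is "moe" once every moe_period layers, "dense" otherwise.
    The offsets are illustrative; the source text does not specify them.
    """
    layers = []
    for i in range(num_layers):
        mixer = "attention" if i % attn_period == attn_offset else "mamba"
        ffn = "moe" if i % moe_period == moe_offset else "dense"
        layers.append((mixer, ffn))
    return layers


layout = hybrid_layout()
mixer_counts = Counter(mixer for mixer, _ in layout)
ffn_counts = Counter(ffn for _, ffn in layout)
print(mixer_counts)
print(ffn_counts)
for i, layer in enumerate(layout[:8]):  # first block
    print(i, layer)
```

Changing attn_period is the single knob that moves a stack along the Mamba-to-attention spectrum discussed in this section.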
Comparison to Mixtral 8x7B:
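- Mixtral: 8 experts per MoE layer, top-2 routing, ~47B total parameters (less than the “8x7B” name suggests, because only the FFN experts are replicated and the attention weights are shared), all 32 layers use attention.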
- Jamba: 16 experts (in MoE’d layers), 28/32 layers are Mamba (cheap), 4/32 are attention.
Per-token cost at T = 8K (typical):
- Mixtral: 32 attention layers × O(8K · d) compute each. Active params ~13B used per token. Cost dominated by FFN + attention.
- Jamba: 4 attention layers × O(8K · d) + 28 Mamba layers × O(N · D) for fixed N. Mamba layers are essentially constant-per-token cost, so attention is a much smaller fraction of compute. Per-token compute ~30-40% less than Mixtral for similar quality.
Per-token cost at T = 128K (long context):
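- Mixtral: well outside its 32K training window, and expensive even if extended. The KV cache alone is about 16 GB per sequence in 16-bit precision (32 layers × 8 KV heads × 128 dims × K and V × 2 bytes ≈ 128 KB per token, times 128K tokens), and every one of the 32 layers attends over the full 128K context for each new token.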
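- Jamba: handles it comfortably. Only the 4 attention layers keep a KV cache, so it is 1/8 the size of a 32-attention-layer model with the same head configuration (about 2 GB at 128K in 16-bit precision, assuming 8 KV heads of dimension 128); the 28 Mamba layers have constant-size state. Throughput degrades far more slowly with context length than in a pure-attention model.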
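A back-of-envelope KV-cache calculator makes the memory comparison easy to re-run with other settings. The head configuration (8 KV heads of dimension 128) and 16-bit storage are assumptions chosen for illustration, and the function name is hypothetical; weights, activations and Mamba state are not counted.

```python
def kv_cache_bytes(seq_len, attn_layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    Each attention layer stores two tensors (K and V), each with one
    kv_heads * head_dim row per cached token.
    """
    return 2 * attn_layers * kv_heads * head_dim * bytes_per_value * seq_len


GIB = 1024 ** 3
seq_len = 128 * 1024  # 128K tokens

pure_attention = kv_cache_bytes(seq_len, attn_layers=32, kv_heads=8, head_dim=128)
hybrid = kv_cache_bytes(seq_len, attn_layers=4, kv_heads=8, head_dim=128)

print(f"32 attention layers: {pure_attention / GIB:.0f} GiB")
print(f" 4 attention layers: {hybrid / GIB:.0f} GiB")
```

The cache scales linearly with both the number of attention layers and the sequence length, so replacing 28 of 32 attention layers with Mamba cuts it by 8× at every context length.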
The Jamba bet:
“Most of the work in language modeling is OK with O(N) processing; a small number of attention layers cover the cases where exact recall matters. Combining this with MoE (which improves quality at fixed compute) gives a model that’s smaller, faster, and longer-context than a pure-attention MoE of the same active-param count.”
Empirically: Jamba’s 12B-active performance is on par with Mixtral 8x7B (13B-active) on most benchmarks. At long context, Jamba dominates because Mixtral can’t go past ~32K efficiently.
Next: §20.3 — Honest assessment. Where SSMs and hybrids win, where they don’t, and what the realistic 2-3 year trajectory looks like.