Prefix cache reuse is broken for all hybrid-architecture models (sliding window, SSM/Mamba)
Summary
Prompt prefix caching — the mechanism that reuses computed KV states across requests sharing a common prefix — only works for pure full-attention models. Any model using sliding window attention, Mamba/SSM layers, or mixed attention types silently falls back to full prompt recomputation on every request. This makes multi-turn conversations unusably slow for the majority of modern open-weight models.
This is not a single bug but a systemic gap in how mlx-lm handles non-standard cache types. I'm filing this as a unifying issue because the symptoms are spread across many separate reports that all trace back to the same root causes.
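For readers unfamiliar with the mechanism, here is a minimal sketch of the prefix-matching step that reuse depends on. It is plain Python over token ID lists; the function names are hypothetical and are not mlx-lm or mlx-engine APIs.

```python
from typing import List, Sequence, Tuple


def common_prefix_len(cached: Sequence[int], prompt: Sequence[int]) -> int:
    """Count how many leading token IDs the cached prompt and the new prompt share."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n


def split_for_reuse(cached: Sequence[int], prompt: Sequence[int]) -> Tuple[int, List[int]]:
    """Return (tokens to keep in the cache, tokens that still need a forward pass)."""
    keep = common_prefix_len(cached, prompt)
    # Leave at least one token to process so the model still produces logits.
    keep = max(0, min(keep, len(prompt) - 1))
    return keep, list(prompt[keep:])
```

When reuse works, only the second element of the returned tuple is run through the model; the failure described in this issue is equivalent to `keep` always being 0 for hybrid models.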
Empirical Evidence
Tested on Mac Studio M3 Ultra (512GB unified memory), LM Studio 0.4.6, mlx-engine. Three sequential requests with identical system prompts, different user messages, max_tokens=30:
MiniMax M2.5 (pure attention, MoE) — caching works
| Request | Time | Speedup |
| --- | --- | --- |
| 1 (cold) | 29.33s | - |
| 2 (warm) | 6.15s | 4.8x |
| 3 (warm) | 2.79s | 10.5x |
GPT-OSS 120B (sliding_attention + full_attention hybrid) — no caching
| Request | Time | Speedup |
| --- | --- | --- |
| 1 (cold) | 1.54s | - |
| 2 (warm) | 1.77s | none |
| 3 (warm) | 1.67s | none |
Qwen 3.5 9B (attention + Mamba/SSM hybrid) — no caching
| Request | Time | Speedup |
| --- | --- | --- |
| 1 (cold) | 5.02s | - |
| 2 (warm) | 7.76s | none (slower) |
| 3 (warm) | 8.00s | none (slower) |
MiniMax shows clear prefix reuse (29s → 3s). GPT-OSS and Qwen 3.5 show zero improvement — the full prompt is recomputed every turn.
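The speedup figures are the cold request time divided by each warm request time. A short sketch that reproduces them from the timings reported above (the helper name is hypothetical):

```python
def speedups(times_s):
    """Ratio of the first (cold) request time to each later (warm) request time."""
    cold = times_s[0]
    return [round(cold / t, 1) for t in times_s[1:]]


minimax = [29.33, 6.15, 2.79]   # pure attention
gpt_oss = [1.54, 1.77, 1.67]    # sliding + full attention
qwen = [5.02, 7.76, 8.00]       # attention + Mamba/SSM

for name, times in [("MiniMax M2.5", minimax), ("GPT-OSS 120B", gpt_oss), ("Qwen 3.5 9B", qwen)]:
    print(name, speedups(times))
```

A ratio below 1.0 means the warm request was slower than the cold one, which is what "none" and "none (slower)" denote in the tables.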
Root Cause Analysis
There are two distinct failure modes, both stemming from the assumption that all layers use identical, trimmable KV caches:
1. Sliding window models → RotatingKVCache can't be trimmed
Models like GPT-OSS 120B and Gemma 3 27B alternate between sliding window and full attention layers.
Sliding window layers use RotatingKVCache (circular buffer). When the cache wrapper attempts to trim to a common prefix for reuse, the circular buffer state can't be meaningfully trimmed — so the entire cache is erased and recomputed from scratch. See lmstudio-ai/mlx-engine#177.
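To illustrate why a wrapped circular buffer cannot be rolled back to a prefix, here is a toy model. It is a simplified stand-in written for this issue, not mlx-lm's actual RotatingKVCache; it stores token positions instead of key/value tensors.

```python
class ToyRotatingCache:
    """Fixed-size circular buffer that remembers only the last `max_size` positions."""

    def __init__(self, max_size: int):
        self.max_size = max_size
        self.slots = [None] * max_size
        self.offset = 0  # total number of tokens ever written

    def append(self, position: int) -> None:
        # Once offset >= max_size, each write overwrites the oldest entry.
        self.slots[self.offset % self.max_size] = position
        self.offset += 1

    def held_positions(self):
        return sorted(p for p in self.slots if p is not None)

    def can_trim_to(self, prefix_len: int) -> bool:
        # Resuming at prefix_len needs the window that ends at prefix_len - 1.
        needed = range(max(0, prefix_len - self.max_size), prefix_len)
        held = set(self.held_positions())
        return all(p in held for p in needed)


cache = ToyRotatingCache(max_size=4)
for pos in range(10):
    cache.append(pos)

print(cache.held_positions())  # only the most recent four positions survive
print(cache.can_trim_to(5))    # the window ending at position 4 was overwritten
```

In this toy, trimming back to an earlier prefix is only possible while the buffer has not yet wrapped, which matches the symptom described above: after wrap-around, the entries needed at the prefix boundary no longer exist.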
2. SSM/Mamba hybrid models → non-trimmable state
Models like Qwen 3.5 (all sizes) use attention + Mamba layers:
# Qwen 3.5 — hybrid attention + SSM
Attention layers: standard KVCache (trimmable)
Mamba/SSM layers: recurrent state (NOT trimmable)
The Mamba state is fundamentally different from a KV cache: it is a compressed recurrent state that can't be split at an arbitrary token boundary. Additionally, the KVCache.make_mask() interface requires window_size and return_array arguments that don't apply to SSM state, causing a TypeError on multi-turn prefill (see QwenLM/Qwen3.6#37).
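A toy recurrence makes the "can't be split" point concrete. The scalar update below is an illustrative assumption, not Mamba's real state equation; the snapshot at the prefix boundary is likewise only one conceivable workaround, not something this issue claims mlx-lm implements.

```python
def step(state: float, token: float, decay: float = 0.5) -> float:
    """One recurrent update: the old state is folded into the new one."""
    return decay * state + token


def run(tokens, state: float = 0.0) -> float:
    for t in tokens:
        state = step(state, t)
    return state


prefix = [1.0, 2.0, 3.0]
turn_a = prefix + [4.0, 5.0]
turn_b = prefix + [7.0]

state_after_a = run(turn_a)   # a single value with all five tokens mixed in
state_at_prefix = run(prefix)  # only available if it was saved at the boundary

# Resuming from the saved prefix state matches a full recompute of turn_b.
resumed = run([7.0], state=state_at_prefix)
assert resumed == run(turn_b)
```

Unlike a KV cache, where dropping the last two entries recovers the prefix, `state_after_a` carries no per-token structure to trim. Reuse is only possible from a state captured exactly at the shared prefix boundary.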
Affected Models
Every popular hybrid-architecture model is affected. This covers the majority of modern open-weight models:
| Model | Architecture | Cache behavior |
| --- | --- | --- |
| Qwen 3.5 (all sizes) | Attention + Mamba/SSM | Broken: crashes or no reuse |
| GPT-OSS 120B / 20B | Sliding + full attention | Broken: full recompute |
| Gemma 3 (all sizes) | 5:1 sliding + global | Broken: full recompute |
| Llama 4 Scout/Maverick | iRoPE chunked (8K) + NoPE | Likely broken |
| Qwen2.5-VL | Partial sliding window | Likely broken |
| MiniMax M2.5 | Pure full attention (MoE) | Works |
As of March 2026, MiniMax M2.5 appears to be the only major model where prefix caching works correctly on MLX.
Impact
This is particularly painful for agentic workloads where:
System prompts are large (tool definitions, personas, instructions)
Conversations are multi-turn (each turn should only process new tokens)
Multiple agents share the same model (each request recomputes from scratch)
Without prefix caching, a 40K-token context takes ~200 seconds to process vs ~5 seconds with cache reuse. For agentic frameworks running on local MLX models, this is the difference between usable and unusable.
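As a rough back-of-the-envelope check on those numbers, the sketch below derives the implied prefill rate and estimates per-turn cost when only new tokens are processed. It assumes a constant prefill rate, which is a simplification: real prefill cost also grows with the length of the context being attended to.

```python
context_tokens = 40_000
cold_seconds = 200.0

# Implied average prefill throughput for a full recompute.
prefill_rate = context_tokens / cold_seconds  # tokens per second


def prefill_seconds(new_tokens: int, rate: float = prefill_rate) -> float:
    """Estimated prefill time if only `new_tokens` must be processed."""
    return new_tokens / rate


print(prefill_rate)            # 200.0
print(prefill_seconds(1_000))  # 5.0
```

Under this assumption, the ~5 second figure corresponds to roughly 1,000 new tokens per turn on top of a reused 40K-token prefix.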
Proposed Solution
Implement per-layer cache logic instead of assuming uniform cache types:
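For sliding window layers: Either make RotatingKVCache trimmable to a prefix boundary, or maintain a parallel standard cache for the prefix portion that gets replayed into the rotating buffer on reuse.
For SSM/Mamba layers: Use make_prompt_cache(model) (which correctly creates ArrayCache for linear attention layers) instead of uniform KVCache() allocation. The workaround in QwenLM/Qwen3.6#37 demonstrates that this works for Qwen 3.5 at the code level.
Cache type introspection: The cache wrapper should inspect what type of cache each layer requires and handle trim/reuse differently per type.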
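One way to picture per-layer handling is a small dispatch over cache types. Everything below is hypothetical: the Toy* classes, the action names, and the `snapshot_at` field are illustrative assumptions, not mlx-lm or mlx-engine types.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyKVCache:
    offset: int = 0  # tokens currently stored


@dataclass
class ToyRotatingCache:
    max_size: int
    offset: int = 0  # tokens ever written; wraps once offset >= max_size


@dataclass
class ToyRecurrentState:
    offset: int = 0
    snapshot_at: Optional[int] = None  # token index of a saved state, if any


def reuse_action(cache, prefix_len: int) -> str:
    """Pick a reuse strategy for one layer, based on its cache type."""
    if isinstance(cache, ToyKVCache):
        return "trim" if prefix_len <= cache.offset else "recompute"
    if isinstance(cache, ToyRotatingCache):
        not_wrapped = cache.offset < cache.max_size
        if not_wrapped and prefix_len <= cache.offset:
            return "trim"
        return "replay_prefix"  # rebuild the window from a parallel prefix cache
    if isinstance(cache, ToyRecurrentState):
        return "restore_snapshot" if cache.snapshot_at == prefix_len else "recompute"
    raise TypeError(f"unknown cache type: {type(cache).__name__}")


def plan_reuse(layer_caches, prefix_len: int) -> List[str]:
    return [reuse_action(c, prefix_len) for c in layer_caches]


layers = [
    ToyKVCache(offset=100),
    ToyRotatingCache(max_size=32, offset=100),
    ToyRecurrentState(offset=100, snapshot_at=80),
]
print(plan_reuse(layers, prefix_len=80))
```

The point of the sketch is the shape of the decision, not the specific strategies: a single "trim everything or erase everything" path is replaced by one decision per layer.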
PR #923 proposes a partial fix for Qwen 3.5 specifically. The RotatingKVCache trim issue has PRs at lmstudio-ai/mlx-engine#188 and #192. These should be unified into a comprehensive solution.
Related Issues
lmstudio-ai/mlx-engine#177: RotatingKVCache trim behavior causes context overflow policies to always erase the whole cache
Environment