Summary
mlx_lm/models/cache.py imports tree_flatten, tree_map, tree_unflatten from mlx.utils but the _BaseCache.nbytes property (line 322) calls tree_reduce, which is never imported. Any code path that reads .nbytes on a _BaseCache instance (or a subclass that inherits it) raises:
NameError: name 'tree_reduce' is not defined
Repro
Running qwen3.5-397b-a17b-mlx (a Qwen3.5 MoE, model_type: qwen3_5_moe) in LM Studio 0.4.12+1 with its bundled mlx-lm==0.31.3:
curl -sS -X POST http://127.0.0.1:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3.5-397b-a17b-mlx","stream":false,"max_tokens":20,
       "messages":[{"role":"user","content":"Say hello."}]}'
returns:
{"error":"Error in iterating prediction stream: NameError: name 'tree_reduce' is not defined"}
Both streaming and non-streaming paths fail, because _BaseCache.nbytes is read before the first token is produced.
Why MoE models surface this
Dense architectures use cache subclasses that override nbytes with their own implementation and never hit _BaseCache.nbytes. MoE architectures (Qwen3.5 MoE, etc.) inherit the base implementation, which is why MoE users see it and dense-model users don't.
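The mechanism can be illustrated without mlx at all. The sketch below is a standalone illustration, not the mlx-lm code: the class names and `missing_helper` are hypothetical stand-ins. It shows that a property referencing an undefined name imports and defines cleanly, and only raises NameError when the inherited property body actually runs.

```python
# Hypothetical stand-ins for illustration; these are not the mlx-lm classes.
class BaseCacheSketch:
    def __init__(self, sizes):
        self.sizes = sizes

    @property
    def nbytes(self):
        # `missing_helper` is deliberately never defined or imported,
        # playing the role that tree_reduce plays in this report.
        return missing_helper(self.sizes)


class OverridingCache(BaseCacheSketch):
    @property
    def nbytes(self):
        # Own implementation: the base property body is never executed.
        return sum(self.sizes)


class InheritingCache(BaseCacheSketch):
    # No override, so reads of .nbytes fall through to the base property.
    pass


def read_nbytes(cache):
    """Return cache.nbytes, or a description of the NameError it raised."""
    try:
        return cache.nbytes
    except NameError as exc:
        return f"NameError: {exc}"


print(read_nbytes(OverridingCache([1, 2, 3])))
print(read_nbytes(InheritingCache([1, 2, 3])))
```

Because Python resolves global names at call time, nothing fails at import or class definition, which is consistent with the bug staying hidden until a model whose cache inherits the base property is loaded.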
Root cause
mlx_lm/models/cache.py line 10:
from mlx.utils import tree_flatten, tree_map, tree_unflatten
line 322:
@property
def nbytes(self):
    return tree_reduce(lambda a, x: a + x.nbytes, (self.keys, self.values), 0)
tree_reduce is used but not imported. One-line fix — PR to follow.
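Assuming mlx.utils exports tree_reduce, as the call on line 322 expects, the fix is presumably to add tree_reduce to the import on line 10. To show what the property computes once the name resolves, here is a rough pure-Python sketch. `tree_reduce_sketch` and `FakeArray` are hypothetical stand-ins written for this example, not the mlx implementation, and the sketch only assumes the fold visits every leaf of the (keys, values) tuple.

```python
from dataclasses import dataclass


@dataclass
class FakeArray:
    """Hypothetical stand-in for an array that exposes .nbytes."""
    nbytes: int


def tree_reduce_sketch(fn, tree, initializer):
    """Fold fn over the leaves of nested lists, tuples, and dicts."""
    acc = initializer
    if isinstance(tree, (list, tuple)):
        for item in tree:
            acc = tree_reduce_sketch(fn, item, acc)
        return acc
    if isinstance(tree, dict):
        for item in tree.values():
            acc = tree_reduce_sketch(fn, item, acc)
        return acc
    # Anything else is treated as a leaf.
    return fn(acc, tree)


def cache_nbytes(keys, values):
    # Same shape as the property body quoted above: sum .nbytes over all leaves.
    return tree_reduce_sketch(lambda a, x: a + x.nbytes, (keys, values), 0)


print(cache_nbytes(FakeArray(1024), FakeArray(2048)))
```

With the name resolvable, the property reduces to a sum of the byte sizes of the key and value arrays, so the fix should not change behavior for caches that already override nbytes.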
Environment
- macOS 15.x, Apple Silicon (M3 Ultra)
- mlx==0.31.1, mlx-lm==0.31.3
- Confirmed on main at 62f38ae
- Also reproduces in LM Studio bundled runtimes 1.4.0, 1.5.0, 1.6.0