Summary
Loading inferencerlabs/GLM-5.1-MLX-4.8bit-INF (model_type glm_moe_dsa) with stock mlx-lm 0.31.3 fails because the model contains 2170 quantized parameters that the stock glm_moe_dsa model definition doesn't know how to map. The model is tagged with library_name: mlx and tags: [mlx, safetensors, glm_moe_dsa, quantized], so users reasonably expect it to load with stock mlx-lm.
Repro
pip install mlx-lm==0.31.3
hf download inferencerlabs/GLM-5.1-MLX-4.8bit-INF --local-dir ./glm51
python -m mlx_lm generate --model ./glm51 --prompt "hi" --max-tokens 5
Error
File "mlx_lm/utils.py", line 415, in load_model
model.load_weights(list(weights.items()), strict=strict)
File "mlx/nn/layers/base.py", line 185, in load_weights
raise ValueError(...)
ValueError: Received 2170 parameters not in model:
lm_head.biases, lm_head.scales,
model.embed_tokens.biases, model.embed_tokens.scales,
model.layers.0.mlp.down_proj.biases, .scales,
model.layers.0.mlp.gate_proj.biases, .scales,
model.layers.0.mlp.up_proj.biases, .scales,
model.layers.0.self_attn.indexer.weights_proj.biases, .scales,
model.layers.0.self_attn.indexer.wk.biases, .scales,
model.layers.0.self_attn.indexer.wq_b.biases, .scales,
model.layers.0.self_attn.kv_a_proj_with_mqa.biases, .scales,
model.layers.0.self_attn.o_proj.biases, .scales,
model.layers.0.self_attn.q_a_proj.biases, .scales,
model.layers.0.self_attn.q_b_proj.biases, .scales,
model.layers.0.self_attn.unembed_out.biases, .scales,
... (across all 78 layers)
The .biases/.scales suffixes indicate MLX-quantized parameters on tensors that the stock mlx_lm/models/glm_moe_dsa.py does not quantize. Notably, the DSA indexer projections (weights_proj, wk, wq_b) and unembed_out are present in the saved weights, but the stock model definition either leaves them unquantized or does not define them at the same path.
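To see which module types carry the extra quantization params without reading the full dump, something like the following could be run against the downloaded checkpoint. This is a rough sketch using only the standard library: the helper names are hypothetical, and it assumes the checkpoint ships the usual sharded index file model.safetensors.index.json with a weight_map key, which I have not verified for this repo.

```python
import json
import re
from collections import Counter
from pathlib import Path

# MLX quantized layers store these two extra tensors next to .weight.
QUANT_SUFFIXES = (".scales", ".biases")


def quantized_modules(weight_names):
    """Return the module paths that have .scales or .biases tensors."""
    modules = set()
    for name in weight_names:
        for suffix in QUANT_SUFFIXES:
            if name.endswith(suffix):
                modules.add(name[: -len(suffix)])
    return modules


def summarize_by_role(modules):
    """Collapse layer indices so the same module in every layer counts once per layer."""
    return Counter(re.sub(r"\.layers\.\d+\.", ".layers.N.", m) for m in modules)


def load_weight_names(model_dir):
    """Read tensor names from the sharded index (file name and key are assumptions)."""
    index_path = Path(model_dir) / "model.safetensors.index.json"
    index = json.loads(index_path.read_text())
    return list(index["weight_map"])


if __name__ == "__main__":
    index_file = Path("./glm51") / "model.safetensors.index.json"
    if index_file.exists():
        names = load_weight_names("./glm51")
        for role, count in sorted(summarize_by_role(quantized_modules(names)).items()):
            print(f"{count:4d}  {role}")
```

Comparing that list against the modules the stock glm_moe_dsa definition actually quantizes should show exactly where the two diverge.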
Context
The model card states:
Quantized with a modified version of MLX
The Q4.8-INF format appears to be a custom quantization scheme from inferencerlabs that quantizes more components than the stock mlx_lm glm_moe_dsa definition expects. The fork is not publicly linked.
Asks
- Could glm_moe_dsa in stock mlx-lm be extended to handle the additional quantized params (treating them as quantized indexer / lm_head / embed_tokens), or is this format incompatible by design?
- If incompatible, would a clearer error message be useful, e.g. "this model requires a non-standard mlx fork", rather than a dump of 2170 parameter names?
- Pointer to which fork actually loads this would help the broader MLX user base.
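For the error-message ask, here is a minimal sketch of what a condensed message might look like. describe_extra_params is a hypothetical helper, not an existing mlx or mlx-lm function, and the wording is only a suggestion; it groups the unexpected names by module type instead of listing every one.

```python
import re

# MLX quantized layers store these two extra tensors next to .weight.
QUANT_SUFFIXES = (".scales", ".biases")


def describe_extra_params(extra_names, max_examples=5):
    """Build a short summary of parameters the model definition did not expect."""
    quant = [n for n in extra_names if n.endswith(QUANT_SUFFIXES)]
    other = [n for n in extra_names if not n.endswith(QUANT_SUFFIXES)]

    # Drop the suffix and the layer index to get distinct module types.
    roles = sorted(
        {re.sub(r"\.layers\.\d+\.", ".layers.N.", n.rsplit(".", 1)[0]) for n in quant}
    )

    lines = [f"Received {len(extra_names)} parameters not in model."]
    if quant:
        lines.append(
            f"{len(quant)} of them are quantization params (.scales/.biases) on "
            f"{len(roles)} module type(s) the model definition does not quantize, "
            "e.g.: " + ", ".join(roles[:max_examples])
        )
        lines.append("The checkpoint may have been produced by a modified quantizer.")
    if other:
        lines.append(
            f"{len(other)} other unexpected parameter(s), e.g.: "
            + ", ".join(other[:max_examples])
        )
    return "\n".join(lines)
```

Calling it with the names from the traceback above would produce a few lines naming lm_head, model.embed_tokens, and the per-layer module types once each, which seems easier to act on than the full list.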
Tested on M3 Ultra 512 GB, mlx-lm 0.31.3, mlx_vlm 0.4.4, Python 3.12.13.