Support for inferencerlabs Q4.8-INF / glm_moe_dsa quantization (2170 unmatched params) #1259

@trevorgordon981

Summary

Loading inferencerlabs/GLM-5.1-MLX-4.8bit-INF (model_type glm_moe_dsa) with stock mlx-lm 0.31.3 fails because the model contains 2170 quantized parameters that the stock glm_moe_dsa model definition doesn't know how to map. The model is tagged with library_name: mlx and tags: [mlx, safetensors, glm_moe_dsa, quantized], so users reasonably expect it to load with stock mlx-lm.

Repro

pip install mlx-lm==0.31.3
hf download inferencerlabs/GLM-5.1-MLX-4.8bit-INF --local-dir ./glm51
python -m mlx_lm generate --model ./glm51 --prompt "hi" --max-tokens 5

Error

File "mlx_lm/utils.py", line 415, in load_model
    model.load_weights(list(weights.items()), strict=strict)
File "mlx/nn/layers/base.py", line 185, in load_weights
    raise ValueError(...)
ValueError: Received 2170 parameters not in model:
  lm_head.biases, lm_head.scales,
  model.embed_tokens.biases, model.embed_tokens.scales,
  model.layers.0.mlp.down_proj.biases, .scales,
  model.layers.0.mlp.gate_proj.biases, .scales,
  model.layers.0.mlp.up_proj.biases, .scales,
  model.layers.0.self_attn.indexer.weights_proj.biases, .scales,
  model.layers.0.self_attn.indexer.wk.biases, .scales,
  model.layers.0.self_attn.indexer.wq_b.biases, .scales,
  model.layers.0.self_attn.kv_a_proj_with_mqa.biases, .scales,
  model.layers.0.self_attn.o_proj.biases, .scales,
  model.layers.0.self_attn.q_a_proj.biases, .scales,
  model.layers.0.self_attn.q_b_proj.biases, .scales,
  model.layers.0.self_attn.unembed_out.biases, .scales,
  ... (across all 78 layers)

The .biases/.scales suffixes are the extra tensors MLX stores alongside a quantized layer's weight, and here they are attached to modules that the stock mlx_lm/models/glm_moe_dsa.py doesn't quantize. Notably, the DSA indexer projections (weights_proj, wk, wq_b) and unembed_out are present in the saved weights, but the stock model definition either leaves them unquantized or doesn't define them at the same path.
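To see at a glance which modules carry quantization tensors, the parameter names can be grouped by module path with the layer index collapsed. The following is a minimal sketch in plain Python; `quantized_modules` is a hypothetical helper written for this issue, not part of mlx-lm, and it assumes per-layer keys follow the `model.layers.<n>.` prefix seen in the error above.

```python
import re
from collections import defaultdict

# Suffixes that accompany a quantized layer's weight in the error output.
QUANT_SUFFIXES = (".scales", ".biases")


def quantized_modules(keys):
    """Map each module path that has .scales/.biases to the number of layers it appears in.

    Per-layer modules are collapsed to 'model.layers.*.<path>'; modules outside
    the layer stack (e.g. lm_head) are reported once under their own name.
    """
    layers = defaultdict(set)
    for key in keys:
        for suffix in QUANT_SUFFIXES:
            if not key.endswith(suffix):
                continue
            module = key[: -len(suffix)]
            match = re.match(r"model\.layers\.(\d+)\.(.+)", module)
            if match:
                layers["model.layers.*." + match.group(2)].add(int(match.group(1)))
            else:
                layers[module].add(-1)  # sentinel: not a per-layer module
    return {name: len(indices) for name, indices in sorted(layers.items())}


if __name__ == "__main__":
    sample = [
        "lm_head.scales",
        "lm_head.biases",
        "model.layers.0.self_attn.indexer.wk.scales",
        "model.layers.1.self_attn.indexer.wk.scales",
        "model.layers.0.mlp.down_proj.weight",
    ]
    for name, count in quantized_modules(sample).items():
        print(f"{name}: {count}")
```

For a sharded checkpoint, the keys could be taken from the `weight_map` entry of `model.safetensors.index.json`, assuming the downloaded repo ships such an index file.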

Context

The model card states:

Quantized with a modified version of MLX

The Q4.8-INF format appears to be a custom quantization scheme from inferencerlabs that quantizes more components than the stock mlx_lm glm_moe_dsa definition expects. The fork is not publicly linked.

Asks

  1. Could glm_moe_dsa in stock mlx-lm be extended to handle the additional quantized params (treating them as quantized indexer / lm_head / embed_tokens), or is this format incompatible by design?
  2. If it is incompatible, would a clearer error message be useful? For example, "this model requires a non-standard mlx fork" rather than a 2170-line param dump.
  3. A pointer to the fork that actually loads this model would help the broader MLX user base.
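As an illustration of the second ask, a shorter message could report the count, note how many of the unmatched names are quantization tensors, and show only a few examples. This is a sketch under assumptions: `format_unmatched_error` is a hypothetical function, not an existing mlx or mlx-lm API, and the wording of the message is only a suggestion.

```python
def format_unmatched_error(names, max_examples=5):
    """Build a short summary of unmatched parameter names instead of listing all of them."""
    names = sorted(names)
    # Names ending in .scales/.biases hint at extra quantized modules in the checkpoint.
    quant = [n for n in names if n.endswith((".scales", ".biases"))]

    lines = [f"Received {len(names)} parameters not in model."]
    if quant:
        lines.append(
            f"{len(quant)} of them are .scales/.biases tensors, which suggests the "
            "checkpoint quantizes modules that this model definition does not."
        )
    examples = "Examples: " + ", ".join(names[:max_examples])
    if len(names) > max_examples:
        examples += f", ... ({len(names) - max_examples} more)"
    lines.append(examples)
    return "\n".join(lines)


if __name__ == "__main__":
    unmatched = [f"model.layers.{i}.mlp.up_proj.scales" for i in range(10)]
    unmatched.append("extra.weight")
    print(format_unmatched_error(unmatched))
```

Keeping the summary to a few lines leaves room for a hint about non-standard quantization while still showing enough example names to search for.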

Tested on M3 Ultra 512 GB, mlx-lm 0.31.3, mlx_vlm 0.4.4, Python 3.12.13.
