Support for inferencerlabs Q4.8-INF / glm_moe_dsa quantization (2170 unmatched params)

## Summary

Loading `inferencerlabs/GLM-5.1-MLX-4.8bit-INF` (model_type `glm_moe_dsa`) with stock mlx-lm 0.31.3 fails: the checkpoint contains 2170 quantization parameters (`.scales`/`.biases`) that the stock `glm_moe_dsa` model definition has no matching entries for. The model is tagged with `library_name: mlx` and `tags: [mlx, safetensors, glm_moe_dsa, quantized]`, so users reasonably expect it to load with stock mlx-lm.

## Repro

```
pip install mlx-lm==0.31.3
hf download inferencerlabs/GLM-5.1-MLX-4.8bit-INF --local-dir ./glm51
python -m mlx_lm generate --model ./glm51 --prompt "hi" --max-tokens 5
```

## Error

```
File "mlx_lm/utils.py", line 415, in load_model
    model.load_weights(list(weights.items()), strict=strict)
File "mlx/nn/layers/base.py", line 185, in load_weights
    raise ValueError(...)
ValueError: Received 2170 parameters not in model:
  lm_head.biases, lm_head.scales,
  model.embed_tokens.biases, model.embed_tokens.scales,
  model.layers.0.mlp.down_proj.biases, .scales,
  model.layers.0.mlp.gate_proj.biases, .scales,
  model.layers.0.mlp.up_proj.biases, .scales,
  model.layers.0.self_attn.indexer.weights_proj.biases, .scales,
  model.layers.0.self_attn.indexer.wk.biases, .scales,
  model.layers.0.self_attn.indexer.wq_b.biases, .scales,
  model.layers.0.self_attn.kv_a_proj_with_mqa.biases, .scales,
  model.layers.0.self_attn.o_proj.biases, .scales,
  model.layers.0.self_attn.q_a_proj.biases, .scales,
  model.layers.0.self_attn.q_b_proj.biases, .scales,
  model.layers.0.self_attn.unembed_out.biases, .scales,
  ... (across all 78 layers)
```

The `.biases`/`.scales` suffixes indicate MLX-quantized parameters on tensors that the stock `mlx_lm/models/glm_moe_dsa.py` does not quantize. Notably, the DSA `indexer` projections (`weights_proj`, `wk`, `wq_b`) and `unembed_out` are present in the saved weights, but the stock model definition either leaves them unquantized or does not define them at the same path.
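To make the dump easier to read, the unmatched names can be collapsed by replacing the per-layer index with a placeholder, so each repeated per-layer name shows up once with a count. This is a minimal sketch using only the standard library; `summarize_unmatched` is a hypothetical helper, not part of mlx-lm, and the sample input is illustrative rather than the full list from the error.

```python
import re
from collections import Counter

# Matches the per-layer index in names such as "model.layers.12.mlp.up_proj.scales".
_LAYER_INDEX = re.compile(r"\.layers\.\d+\.")


def summarize_unmatched(names):
    """Collapse layer indices so repeated per-layer names count as one pattern."""
    patterns = Counter(_LAYER_INDEX.sub(".layers.N.", name) for name in names)
    # Sort by pattern name for stable, readable output.
    return sorted(patterns.items())


if __name__ == "__main__":
    # Illustrative input only: a few of the reported names, repeated over 3 layers.
    sample = ["lm_head.scales", "lm_head.biases"]
    for i in range(3):
        sample.append(f"model.layers.{i}.self_attn.indexer.wk.scales")
        sample.append(f"model.layers.{i}.self_attn.indexer.wk.biases")
    for pattern, count in summarize_unmatched(sample):
        print(f"{count:5d}  {pattern}")
```

Feeding it the names from the `ValueError` message (or the tensor names in the checkpoint) should show at a glance which module paths carry quantization parameters that the model definition lacks.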

## Context

The model card states:
> Quantized with a modified version of MLX

The Q4.8-INF format appears to be a custom quantization scheme from inferencerlabs that quantizes more components than the stock `mlx_lm` `glm_moe_dsa` definition expects. The modified MLX fork is not publicly linked.

## Asks

1. Could `glm_moe_dsa` in stock mlx-lm be extended to handle the additional quantized params (treating them as quantized indexer / lm_head / embed_tokens), or is this format incompatible by design?
2. If it is incompatible, would a clearer error message be useful, e.g. "this model requires a non-standard mlx fork" rather than a 2170-line param dump?
3. A pointer to the fork that actually loads this model would help the broader MLX user base.
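As a rough illustration of ask 2, the sketch below builds a short message from the list of unexpected parameter names instead of printing all of them. It is an assumption about what such a message could look like, not existing mlx or mlx-lm behavior; `describe_extra_params`, `QUANT_SUFFIXES`, and `max_examples` are hypothetical names.

```python
# Suffixes that MLX-quantized layers store alongside the packed weight.
QUANT_SUFFIXES = (".scales", ".biases")


def describe_extra_params(names, max_examples=5):
    """Build a short message for checkpoint parameters the model does not define."""
    names = sorted(names)
    quantized = [n for n in names if n.endswith(QUANT_SUFFIXES)]
    lines = [f"Received {len(names)} parameters not in model."]
    # Only add the hint when every unexpected name is a quantization parameter.
    if names and len(quantized) == len(names):
        lines.append(
            "All of them are quantization scales/biases, so the checkpoint "
            "was likely quantized with a scheme this model definition does "
            "not expect."
        )
    examples = ", ".join(names[:max_examples])
    remaining = len(names) - min(len(names), max_examples)
    if remaining > 0:
        examples += f", ... ({remaining} more)"
    if examples:
        lines.append(f"Examples: {examples}")
    return "\n".join(lines)
```

With the names from the error above, this would yield a three-line message: the count, the quantization hint, and a handful of example names.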

Tested on M3 Ultra 512 GB, mlx-lm 0.31.3, mlx_vlm 0.4.4, Python 3.12.13.
