Add pipeline parallel support for Qwen3 MoE and MiniMax models by qubitcontracting · Pull Request #1138 · ml-explore/mlx-lm

qubitcontracting · 2026-04-09T10:32:43Z

Summary

Adds pipeline parallel support (PipelineMixin) to Qwen3 MoE and MiniMax model architectures.

Changes

mlx_lm/models/qwen3_moe.py

Qwen3MoeModel extends PipelineMixin
Forward pass: recv_like → pipeline_layers → send → all_gather → norm
make_cache() returns KV cache entries only for this rank's layers
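
The stage order listed above can be illustrated with a toy, single-process simulation. This is only a sketch: it makes no MLX calls, and `pipeline_forward`, `stages`, and `norm` are hypothetical names chosen for illustration, not identifiers from the PR. Each "rank" applies its own slice of layers to the hidden state it received, hands the result on, and every rank ends with the same final value, as the all_gather step intends.

```python
from typing import Callable, List, Sequence

Layer = Callable[[float], float]


def pipeline_forward(
    x: float, stages: Sequence[Sequence[Layer]], norm: Layer
) -> List[float]:
    """Simulate recv -> pipeline layers -> send -> all_gather -> norm.

    `stages[i]` holds the layers owned by the i-th pipeline stage.
    Returns one output per stage, mimicking what each rank would hold.
    """
    h = x  # stands in for the embedded input
    for stage_layers in stages:
        # "recv": a stage after the first starts from the previous stage's h
        # (here h is simply carried over in the loop variable).
        for layer in stage_layers:
            h = layer(h)
        # "send": h is handed to the next stage on the next iteration.
    # "all_gather": every stage ends up holding the final hidden state.
    gathered = [h for _ in stages]
    return [norm(g) for g in gathered]


# Toy layers: plain arithmetic instead of transformer blocks.
layers: List[Layer] = [lambda h: h + 1, lambda h: h * 2, lambda h: h - 3]
identity_norm: Layer = lambda h: h

single_node = pipeline_forward(5, [layers], identity_norm)
two_stage = pipeline_forward(5, [layers[:2], layers[2:]], identity_norm)
```

With a single stage the loop degenerates to an ordinary forward pass, which mirrors the single-node behavior the PR describes for `pipeline_size == 1`.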

mlx_lm/models/minimax.py

Same pipeline support as qwen3_moe
make_cache() delegation from outer Model class
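
A minimal sketch of the delegation pattern, under stated assumptions: `KVCache`, `InnerModel`, and `Model` below are stand-in classes written for this example, not the actual mlx-lm classes, and the `start_idx`/`end_idx` attributes are hypothetical. The point is that the outer class forwards `make_cache()` to the inner model, which creates one cache entry per layer owned by this rank rather than one per layer in the whole model.

```python
class KVCache:
    """Stand-in for a per-layer KV cache; constructed with no arguments."""

    def __init__(self):
        self.keys = None
        self.values = None


class InnerModel:
    """Stand-in for the inner model that owns the layer list."""

    def __init__(self, num_layers: int, start_idx: int, end_idx: int):
        self.layers = [f"layer_{i}" for i in range(num_layers)]
        # Hypothetical bounds of the slice assigned to this rank.
        self.start_idx = start_idx
        self.end_idx = end_idx

    @property
    def pipeline_layers(self):
        # Only the layers this rank is responsible for.
        return self.layers[self.start_idx : self.end_idx]

    def make_cache(self):
        # One cache entry per local layer, so the cache length matches
        # the number of layers this rank actually runs.
        return [KVCache() for _ in self.pipeline_layers]


class Model:
    """Stand-in for the outer wrapper that callers interact with."""

    def __init__(self, inner: InnerModel):
        self.model = inner

    def make_cache(self):
        # Delegate so callers that only see the outer model get the
        # rank-local cache list.
        return self.model.make_cache()


outer = Model(InnerModel(num_layers=94, start_idx=70, end_idx=84))
cache = outer.make_cache()
```

Here `len(cache)` equals the 14 local layers, not the 94 layers of the full model.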

Testing

Qwen3-235B-A22B-Instruct-2507-8bit on 2 nodes (80/14 layer split, 17.8 tok/s) and 3 nodes (70/14/10 layer split, 11.3 tok/s)
Tested on mlx-lm 0.31.1 and 0.31.2
Requires PR #1137 (Pipeline parallel: memory-proportional splitting and inference sync) for the pipeline infrastructure
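
To make the split figures above concrete, here is a small sketch that turns per-node layer counts into contiguous index ranges. It assumes the figures 80/14 and 70/14/10 are layer counts per node (both sum to 94); `split_to_ranges` is a hypothetical helper, not a function from the PR, and which rank receives which range is not stated in the description.

```python
from typing import List, Sequence, Tuple


def split_to_ranges(layers_per_stage: Sequence[int]) -> List[Tuple[int, int]]:
    """Convert per-stage layer counts into half-open (start, end) ranges."""
    ranges = []
    start = 0
    for count in layers_per_stage:
        ranges.append((start, start + count))
        start += count
    return ranges


two_node = split_to_ranges([80, 14])
three_node = split_to_ranges([70, 14, 10])
total_layers = three_node[-1][1]
```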

Adds PipelineMixin to Qwen3MoeModel and MiniMaxModel, enabling pipeline parallel inference across multiple nodes.

Changes per model:
- Inherit PipelineMixin for pipeline-aware layer splitting
- Use pipeline_layers instead of layers in forward pass
- Add distributed recv/send between pipeline ranks
- Add all_gather so all ranks have final output for lm_head
- Add make_cache() returning KV cache only for this rank's layers (prevents cache-layer count mismatch that causes GPU timeout)

For Qwen3 MoE, also passes return_array=True to create_attention_mask, which is required when the mask flows through distributed ops.

Depends on PipelineMixin from pipeline.py. When pipeline_size == 1 (single node), all distributed ops are skipped and behavior is identical to the original code.

Tested with:
- Qwen3-235B-A22B-Instruct 8-bit (94 layers, 3 nodes)
- Qwen3-Coder-480B-A35B-Instruct 4-bit (94 layers, 3 nodes)
- MiniMax-M2.5-8bit (80 layers, 3 nodes)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- KVCache() takes no arguments in 0.31.2 (was head_dim, num_kv_heads)
- Add make_cache() to outer Model class that delegates to inner model (make_prompt_cache calls model.make_cache(), not model.model.make_cache())

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

qubitcontracting marked this pull request as draft April 9, 2026 12:41

qubitcontracting marked this pull request as ready for review April 9, 2026 13:54

qubitcontracting wants to merge 2 commits into ml-explore:main from qubitcontracting:pipeline-model-support
