
Add pipeline parallel support for Qwen3 MoE and MiniMax models #1138

Open
qubitcontracting wants to merge 2 commits into ml-explore:main from qubitcontracting:pipeline-model-support

Conversation

@qubitcontracting qubitcontracting commented Apr 9, 2026

Summary

Adds pipeline parallel support (PipelineMixin) to Qwen3 MoE and MiniMax model architectures.

Changes

mlx_lm/models/qwen3_moe.py

  • Qwen3MoeModel extends PipelineMixin
  • Forward pass: recv_like → pipeline_layers → send → all_gather → norm
  • make_cache() returns KV cache entries only for this rank's layers
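As a rough illustration of the layer-splitting this relies on, a pipeline mixin could carve a contiguous slice of the layer stack out for each rank. This is a pure-Python sketch; `pipeline_slice` is a hypothetical helper, not the actual PipelineMixin API from pipeline.py:

```python
def pipeline_slice(num_layers: int, rank: int, pipeline_size: int) -> range:
    """Indices of the layers owned by `rank`; earlier ranks absorb any remainder."""
    base, extra = divmod(num_layers, pipeline_size)
    start = rank * base + min(rank, extra)
    return range(start, start + base + (1 if rank < extra else 0))
```

For the 94-layer Qwen3 models on 3 nodes this yields a 32/31/31 split; whether pipeline.py distributes the remainder to earlier ranks like this is an assumption.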

mlx_lm/models/minimax.py

  • Same pipeline support as qwen3_moe
  • make_cache() delegation from outer Model class
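The delegation can be sketched in isolation with stub classes (the real mlx_lm wiring and class internals are assumptions here):

```python
class MiniMaxModelStub:
    """Stand-in for the inner pipeline-aware model."""

    pipeline_layers = ["layer_a", "layer_b"]  # this rank's layers only

    def make_cache(self):
        # One cache entry per local layer, mirroring the PR's approach.
        return [{} for _ in self.pipeline_layers]


class Model:
    """Outer wrapper, as in mlx_lm/models/minimax.py."""

    def __init__(self):
        self.model = MiniMaxModelStub()

    def make_cache(self):
        # make_prompt_cache calls model.make_cache(), not
        # model.model.make_cache(), so the outer class must forward the call.
        return self.model.make_cache()
```

Without the forwarding method, cache construction would fall back to one entry per *global* layer, recreating the mismatch the PR fixes.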

Testing

Adds PipelineMixin to Qwen3MoeModel and MiniMaxModel, enabling
pipeline parallel inference across multiple nodes.

Changes per model:
- Inherit PipelineMixin for pipeline-aware layer splitting
- Use pipeline_layers instead of layers in forward pass
- Add distributed recv/send between pipeline ranks
- Add all_gather so all ranks have final output for lm_head
- Add make_cache() returning KV cache only for this rank's layers
  (prevents cache-layer count mismatch that causes GPU timeout)
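The per-rank control flow in the list above can be sketched without mlx; plain callables stand in for the distributed ops, and the names are illustrative rather than the pipeline.py API:

```python
def pipeline_forward(x, rank, size, layers, recv, send, all_gather):
    """Run this rank's layer slice, exchanging activations with neighbours."""
    if size > 1 and rank > 0:
        x = recv(x)               # activations from the previous rank
    for layer in layers:          # only this rank's slice of the stack
        x = layer(x)
    if size > 1 and rank < size - 1:
        send(x)                   # hand off to the next rank
    if size > 1:
        x = all_gather(x)         # every rank needs the output for lm_head
    return x
```

Note the `size > 1` guards: on a single node every distributed call is skipped and the function degenerates to a plain sequential forward pass, matching the PR's claim about pipeline_size == 1.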

For Qwen3 MoE, this also passes return_array=True to create_attention_mask,
which is required when the mask flows through distributed ops.
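The reason an array mask matters: create_attention_mask can return a sentinel (such as the string "causal") that only the local attention kernel understands, and a sentinel cannot travel through distributed ops. A toy sketch of the distinction, with invented behavior that only loosely mirrors mlx_lm:

```python
def create_mask(seq_len: int, return_array: bool = False):
    """Toy mask factory: sentinel fast path vs. explicit array path."""
    if not return_array and seq_len > 1:
        # Fast path: a sentinel the local attention kernel resolves itself.
        return "causal"
    # Array path: explicit lower-triangular mask that can cross ranks.
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]
```

With return_array=True the caller always gets real data, so sending the mask between pipeline ranks is well-defined.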

Depends on PipelineMixin from pipeline.py. When pipeline_size == 1
(single node), all distributed ops are skipped and behavior is
identical to the original code.

Tested with:
- Qwen3-235B-A22B-Instruct 8-bit (94 layers, 3 nodes)
- Qwen3-Coder-480B-A35B-Instruct 4-bit (94 layers, 3 nodes)
- MiniMax-M2.5-8bit (80 layers, 3 nodes)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@qubitcontracting qubitcontracting marked this pull request as draft April 9, 2026 12:41
- KVCache() takes no arguments in 0.31.2 (was head_dim, num_kv_heads)
- Add make_cache() to outer Model class that delegates to inner model
  (make_prompt_cache calls model.make_cache(), not model.model.make_cache())

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@qubitcontracting qubitcontracting marked this pull request as ready for review April 9, 2026 13:54
