
feat(moe): parallel execution prefetch queue for SSD expert streaming#8

Merged
solderzzc merged 66 commits into main from feature/papps-ssd-streaming
Apr 14, 2026

Conversation

@solderzzc
Member

No description provided.

DePasqualeOrg and others added 30 commits April 6, 2026 13:55
* Add doc comment verification script and CI step
* Discover doc verification targets dynamically and report all failures
Support multiple parallel tool calls and buffering for Llama 3

Llama 3 natively supports tool calling through an ipython environment, which
can emit multiple parallel tool invocations at once. Depending on the model
size and prompt, it generates either a JSON list of function objects or a
Python-style array of function calls.

- Sets `startTag` to `<|python_tag|>` to ensure `ToolCallProcessor`
  correctly buffers tool output without leaking it to the streaming UI.
- Upgrades `Llama3ToolCallParser` to parse multiple parallel tool calls
  from JSON array payloads `[{"name": ...}]` during `parseEOS`.
- Upgrades `PythonicToolCallParser` to extract multiple sequential
  pythonic function calls `[func1(), func2()]` via `parseEOS`.
- Refactors `PythonicToolCallParser` to use Swift 5.7+ `Regex` literals
  instead of legacy `NSRegularExpression`.
- Adds integration unit tests for both parsers to verify multi-call arrays.
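The pythonic-array case above can be illustrated with a small sketch. This is not the real `PythonicToolCallParser` from the repository; the function and its regex are illustrative, assuming only that each tool call looks like `name(args)` with no nested parentheses, as is typical for these payloads:

```swift
import Foundation

// Hedged sketch of multi-call extraction from a pythonic payload such as
// "[get_weather(city=\"Paris\"), get_time(zone=\"CET\")]".
// The names here are illustrative only; the real parser lives in the
// repository's tool-calling module.
func extractPythonicCalls(from payload: String) -> [(name: String, arguments: String)] {
    // One call: an identifier followed by an argument list without
    // nested parentheses (sufficient for flat tool-call arguments).
    let call = /([A-Za-z_][A-Za-z0-9_]*)\(([^()]*)\)/
    return payload.matches(of: call).map { match in
        (name: String(match.1), arguments: String(match.2))
    }
}
```

For example, `extractPythonicCalls(from: "[f(x=1), g()]")` yields two calls, `f` with arguments `x=1` and `g` with empty arguments, which is the shape `parseEOS` needs to emit one `ToolCall` per element.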
…models (ml-explore#142)

* Add Mistral3, Nemotron, and Qwen3.5 tool call integration test helpers
* Add MLXLMIntegrationTests
* Update documentation

* Use actual asserts in IntegrationTestHelpers.swift

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Avoid converting decoded JSON arguments through Any before constructing ToolCall.Function.

This fixes Swift 6 Sendable compilation errors in Llama3ToolCallParser during Release iOS builds.

Written by Codex.
Aegis-AI and others added 27 commits April 12, 2026 16:34
…MLXArrays to natively resolve 'scale' unpacking
* Add gemma 4 model (text, vision, MoE)
…ashes

This commit resolves the persistent garbled-output issues by:
1. Fixing LCache retrieval for shared KV RoPE logic to properly align positional phases.
2. Aligning the Per-Layer Embedding application and RoPE proportional parameters with vLLM references.
3. Ripping out manual lm_head.weight injection from Gemma4VL sanitize that caused unhandled key fatal errors.
4. Enhancing KVCache dimensions and routing handling for Gemma4 architecture.
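Point 3 above removes a manual `lm_head.weight` injection from the sanitize step. As a hedged sketch of the opposite, correct direction, a sanitize pass can simply drop checkpoint keys the module tree does not declare instead of injecting them; the dictionary value type and key names here are illustrative stand-ins, not the repository's actual `sanitize` signature:

```swift
// Hedged sketch: filter out redundant lm_head keys during weight
// sanitization. With tied embeddings the lm_head weight is derived
// from the embedding table, so leaving the key in place triggers
// "unhandled key" failures at load time.
func sanitize(weights: [String: Float]) -> [String: Float] {
    weights.filter { !$0.key.hasPrefix("lm_head.") }
}
```

The design choice is that sanitize should only ever narrow the weight set toward what the module tree declares; synthesizing keys there couples the loader to one checkpoint layout and fails hard on the next one.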
…utions and PLE conditioning scale preservation
* Add Gemma 4 text model support (E2B and E4B)

Port of gemma4.py and gemma4_text.py from mlx-lm. Adds support for
Gemma 4's text-only architecture including:
- Per-Layer Embeddings (PLE) with gated residual
- Shared KV cache across later layers
- Dual RoPE (proportional for full attention, default for sliding)
- ProportionalRoPE with partial_rotary_factor support
- Global head dimensions (512) for full-attention layers
- Double-wide MLP for KV-shared layers
- Logit softcapping
- LoRA support

Registers gemma4 and gemma4_text model types plus E2B/E4B 4-bit
model configurations.
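Of the features listed above, logit softcapping has a compact standard form: logits are squeezed smoothly into `(-cap, cap)` via `cap * tanh(logits / cap)`, as in earlier Gemma releases. A plain-Swift sketch on `[Float]` (the real model applies this to MLXArrays; this function is illustrative):

```swift
import Foundation

// Logit softcapping: capped = cap * tanh(logits / cap).
// Small logits pass through almost unchanged (tanh(x) ≈ x near 0),
// while large logits saturate smoothly at ±cap.
func softcap(_ logits: [Float], cap: Float) -> [Float] {
    logits.map { cap * tanh($0 / cap) }
}
```

For instance, `softcap([0.5], cap: 30)` is nearly `[0.5]`, while `softcap([1_000_000], cap: 30)` saturates at 30.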

* Fix Gemma 4 model IDs and EOS tokens

- Model IDs: use correct HuggingFace repo names (lowercase, no "-lm-")
  - gemma-4-E2B-it-lm-4bit → gemma-4-e2b-it-4bit
  - gemma-4-E4B-it-lm-4bit → gemma-4-e4b-it-4bit
- EOS token: <end_of_turn> (Gemma 3) → <turn|> (Gemma 4, token ID 106)

MLXArray.ones([1]) defaults to float32, which can cause dtype
promotion when multiplied with bfloat16/float16 model tensors.
Specify .float16 explicitly to avoid hidden AsType nodes.
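A minimal sketch of the pitfall described above, assuming mlx-swift's `ones` factory accepts an explicit dtype as the commit message implies (the exact parameter spelling may differ across mlx-swift versions):

```swift
import MLX

// ones([1]) defaults to float32, so multiplying it with float16 model
// tensors inserts an implicit AsType node and promotes the result.
let scale32 = MLXArray.ones([1])                   // float32 (default)

// Matching the model dtype up front avoids the hidden conversion.
let scale16 = MLXArray.ones([1], dtype: .float16)  // stays float16
```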

* Fix weight key mapping for language_model property

Add @ModuleInfo(key: "language_model") so the property matches the
snake_case key in checkpoint weight files. Without this, weight loading
fails with keyNotFound for the language_model subtree.

Reported-by: john-rocky (PR ml-explore#185 comment)

* Address review feedback: remove force unwraps, use shared ProportionalRoPE

- Make layerTypes non-optional in config (decode or derive from pattern)
- Replace vProj! force unwrap with if let binding
- Switch from local ProportionalRoPE to shared initializeRope() factory
- Remove 60-line local ProportionalRoPE class (now in RoPEUtils.swift)

---------

Co-authored-by: Stefan Geens <stefan.geens@gmail.com>
…erflow shattering on Apple Silicon without compromising base magnitudes
- see ml-explore#189
- download jinja files in all paths
- macros use fully qualified type names
…wift (MLXVLM)

Upstream ml-explore/mlx-swift-lm now ships native Gemma4 VLM support
with clean text/vision separation. Our custom Gemma4VL.swift is no
longer needed. SSD streaming, speculative decoding, and Load.swift
router patches are preserved.
…ppleParavirtCommandBuffer concurrency assertions on macOS runners
… prevent interleaving parameterized Metal invocations
…MLXTestingSuite trait to definitively fix Paravirt bounds assertion
…te using matmul instead of element-wise multiplication
solderzzc merged commit 8c4a4f2 into main on Apr 14, 2026
4 checks passed
solderzzc deleted the feature/papps-ssd-streaming branch on April 14, 2026 at 05:06