feat(moe): parallel execution prefetch queue for SSD expert streaming #8
Merged
Conversation
* Add doc comment verification script and CI step
* Discover doc verification targets dynamically and report all failures
Support multiple parallel tool calls and buffering for Llama 3
Llama 3 natively supports tool calling through an ipython environment, which
emits arrays for multiple parallel tool invocations. Depending on the
model size and prompt, it generates either a JSON list of function objects
or a Python-style array of function calls.
- Sets `startTag` to `<|python_tag|>` to ensure `ToolCallProcessor`
correctly buffers tool output without leaking it to the streaming UI.
- Upgrades `Llama3ToolCallParser` to parse multiple parallel tool calls
from JSON array payloads `[{"name": ...}]` during `parseEOS`.
- Upgrades `PythonicToolCallParser` to extract multiple sequential
pythonic function calls `[func1(), func2()]` via `parseEOS`.
- Refactors `PythonicToolCallParser` to use modern high-performance
Swift 5.7+ Regex literals instead of legacy NSRegularExpression.
- Adds integration unit tests for both parsers to verify multi-call arrays.
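The pythonic parsing step above can be sketched with a Swift 5.7+ Regex literal. This is a minimal illustration, not the repository's actual parser: the `ParsedCall` type and `parsePythonicCalls` helper are hypothetical names, and the pattern deliberately does not handle nested parentheses inside argument lists.

```swift
import Foundation

// Hypothetical sketch: extract function names and raw argument strings from a
// pythonic tool-call array such as [get_weather(city="Paris"), get_time()].
struct ParsedCall {
    let name: String
    let rawArguments: String
}

func parsePythonicCalls(_ payload: String) -> [ParsedCall] {
    // Swift Regex literal: an identifier followed by a parenthesized
    // argument list (no nested parentheses in this sketch).
    let callPattern = /([A-Za-z_][A-Za-z0-9_]*)\(([^()]*)\)/
    return payload.matches(of: callPattern).map { match in
        ParsedCall(name: String(match.1), rawArguments: String(match.2))
    }
}

let calls = parsePythonicCalls(#"[get_weather(city="Paris"), get_time()]"#)
// calls[0].name == "get_weather", calls[1].name == "get_time"
```

Compared to `NSRegularExpression`, the Regex literal is checked at compile time and yields typed capture groups (`match.1`, `match.2`) instead of `NSRange`-based extraction.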
…models (ml-explore#142)

* Add Mistral3, Nemotron, and Qwen3.5 tool call integration test helpers
* Add MLXLMIntegrationTests
* Update documentation
* Use actual asserts in IntegrationTestHelpers.swift

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…tions spm linkage
Avoid converting decoded JSON arguments through Any before constructing ToolCall.Function.

This fixes Swift 6 Sendable compilation errors in Llama3ToolCallParser during Release iOS builds.

Written by Codex.
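The fix above can be illustrated with a minimal sketch: decode the payload straight into a concrete `Codable` (and `Sendable`) type rather than routing through `[String: Any]`, which is not `Sendable` under Swift 6 strict concurrency checking. The `FunctionCall` type here is hypothetical, and real tool arguments are heterogeneous (simplified below to string values only):

```swift
import Foundation

// Hypothetical sketch: a concrete Codable + Sendable target type means the
// decoded values never pass through Any, so they can safely cross
// concurrency boundaries under Swift 6 strict checking.
struct FunctionCall: Codable, Sendable {
    let name: String
    let arguments: [String: String]  // simplified: string-valued args only
}

let payload = #"[{"name": "get_weather", "arguments": {"city": "Paris"}}]"#
let calls = try! JSONDecoder().decode([FunctionCall].self,
                                      from: Data(payload.utf8))
// calls[0].name == "get_weather"
```

The design point is that `JSONSerialization`-style decoding produces `Any` values that the compiler cannot prove `Sendable`, whereas a typed `JSONDecoder` path keeps everything statically checkable.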
…ations, and Gemma4VL support
…git Softcapping in Gemma4 configuration
…ping and KVCache alignments
…ent dense layout allocation crashes
…resolve missing xctest failure
…MLXArrays to natively resolve 'scale' unpacking
* Add gemma 4 model (text, vision, MoE)
…ashes

This commit resolves the persistent linguistic corruption issues by:
1. Fixing LCache retrieval for shared KV RoPE logic to properly align positional phases.
2. Aligning the Per-Layer Embedding application and RoPE proportional parameters with vLLM references.
3. Ripping out manual lm_head.weight injection from Gemma4VL sanitize that caused unhandled key fatal errors.
4. Enhancing KVCache dimensions and routing handling for Gemma4 architecture.
…utions and PLE conditioning scale preservation
* Add Gemma 4 text model support (E2B and E4B)

  Port of gemma4.py and gemma4_text.py from mlx-lm. Adds support for Gemma 4's text-only architecture including:
  - Per-Layer Embeddings (PLE) with gated residual
  - Shared KV cache across later layers
  - Dual RoPE (proportional for full attention, default for sliding)
  - ProportionalRoPE with partial_rotary_factor support
  - Global head dimensions (512) for full-attention layers
  - Double-wide MLP for KV-shared layers
  - Logit softcapping
  - LoRA support

  Registers gemma4 and gemma4_text model types plus E2B/E4B 4-bit model configurations.

* Fix Gemma 4 model IDs and EOS tokens

  - Model IDs: use correct HuggingFace repo names (lowercase, no "-lm-")
    - gemma-4-E2B-it-lm-4bit → gemma-4-e2b-it-4bit
    - gemma-4-E4B-it-lm-4bit → gemma-4-e4b-it-4bit
  - EOS token: <end_of_turn> (Gemma 3) → <turn|> (Gemma 4, token ID 106)

  MLXArray.ones([1]) defaults to float32, which can cause dtype promotion when multiplied with bfloat16/float16 model tensors. Specify .float16 explicitly to avoid hidden AsType nodes.

* Fix weight key mapping for language_model property

  Add @ModuleInfo(key: "language_model") so the property matches the snake_case key in checkpoint weight files. Without this, weight loading fails with keyNotFound for the language_model subtree.

  Reported-by: john-rocky (PR ml-explore#185 comment)

* Address review feedback: remove force unwraps, use shared ProportionalRoPE

  - Make layerTypes non-optional in config (decode or derive from pattern)
  - Replace vProj! force unwrap with if let binding
  - Switch from local ProportionalRoPE to shared initializeRope() factory
  - Remove 60-line local ProportionalRoPE class (now in RoPEUtils.swift)

---------

Co-authored-by: Stefan Geens <stefan.geens@gmail.com>
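The partial-rotary idea behind ProportionalRoPE can be sketched numerically. This is a plain-Swift illustration under stated assumptions, not the repository's `initializeRope()` factory: `ropeInverseFrequencies` and its parameter names are hypothetical, and it only shows the frequency setup, where a `partial_rotary_factor` fraction of the head dimension participates in rotation.

```swift
import Foundation

// Hypothetical sketch of partial-rotary RoPE frequency setup: only
// `partialRotaryFactor` of the head dimension gets rotary frequencies;
// the remaining dimensions pass through unrotated.
func ropeInverseFrequencies(headDim: Int,
                            partialRotaryFactor: Double,
                            base: Double = 10_000) -> [Double] {
    // Number of dimensions that participate in rotation (paired, so even).
    let rotaryDim = Int(Double(headDim) * partialRotaryFactor)
    // Standard RoPE inverse frequencies: base^(-i/rotaryDim) for even i.
    return stride(from: 0, to: rotaryDim, by: 2).map { i in
        1.0 / pow(base, Double(i) / Double(rotaryDim))
    }
}

let freqs = ropeInverseFrequencies(headDim: 8, partialRotaryFactor: 0.5)
// headDim 8 with factor 0.5 → 4 rotary dims → 2 frequencies, first is 1.0
```

With `partialRotaryFactor: 1.0` this reduces to the usual full-rotation RoPE; smaller factors leave the tail of each head vector position-independent.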
…scription interpretation natively
…erflow shattering on Apple Silicon without compromising base magnitudes
- see ml-explore#189
- download jinja files in all paths
- macros use fully qualified type names
…ing structures before upstream sync
…s Gemma4 structures
…wift (MLXVLM)

Upstream ml-explore/mlx-swift-lm now ships native Gemma4 VLM support with clean text/vision separation. Our custom Gemma4VL.swift is no longer needed. SSD streaming, speculative decoding, and Load.swift router patches are preserved.
…N-decoded configuration
…sed dependencies
…invalid identifiers)
…de xctest bundles
…ppleParavirtCommandBuffer concurrency assertions on macOS runners
…solve Metal Paravirt concurrency faults
… prevent interleaving parameterized Metal invocations
…MLXTestingSuite trait to definitively fix Paravirt bounds assertion
…te using matmul instead of element-wise multiplication
… Swift 6 preconcurrency warnings