Auto-detect LFM2.5 VL MLX models#57
Merged
d612286 to
4398222
Compare
solderzzc
added a commit
that referenced
this pull request
Apr 16, 2026
Three root causes addressed:
1. Cache key mismatch: speculative jobs used spm-SwiftLM-v2 while build_and_unit_test uses v3, causing a full rebuild on every run. Unified all jobs to v3.
2. Wrong pre-downloaded model: the speculative-decoding job pre-fetched Qwen3.5-9B-4bit, but test-speculative.sh MAIN_MODEL defaults to Qwen3.5-2B-4bit. The 9B model was never used by the test, and the 2B model was not cached, causing a re-download and OOM on the 7 GB runner. Pre-download now fetches 2B + 0.8B (matching the test defaults).
3. Fragile metallib compile: replaced python setup.py build_ext with the proven pip install mlx approach already used in build_and_unit_test, which eliminates the 5-min compile step and the pybind11/cmake dependency chain.
speculative-decoding-eval retains the 9B model for its heavier memory evaluation (continue-on-error: true already covers OOM).
Fixes: CI failure on PR #57
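The model mismatch in item 2 is the kind of drift a small consistency check can catch. The following is a minimal sketch, not part of this PR: `missing_models` is a hypothetical helper, and the lists are illustrative values taken from the commit message (only the main model is checked, since the message does not say what else was cached before the fix).

```python
def missing_models(prefetched, required):
    """Return the required models that the CI cache step did not pre-download."""
    return sorted(set(required) - set(prefetched))

# MAIN_MODEL default in test-speculative.sh, per the commit message.
required = ["Qwen3.5-2B-4bit"]

# Hypothetical "before" and "after" pre-download lists for the job.
before = ["Qwen3.5-9B-4bit"]
after = ["Qwen3.5-2B-4bit", "Qwen3.5-0.8B"]

print("missing before fix:", missing_models(before, required))
print("missing after fix:", missing_models(after, required))
```

A non-empty result means the test would fall back to downloading the model at run time, which is the situation the commit describes.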
4398222 to
4a08e7e
Compare
…, 2B model
- Use cmake build.sh approach (not python setup.py) for SharpAI fork metallib
- Cache key v2 → v3 to share build cache with build_and_unit_test job
- Pre-download Qwen3.5-2B-4bit (matches MAIN_MODEL default in test-speculative.sh), not 9B, which was never used and caused OOM on the 7 GB runner (Trace/BPT trap: 5)
- Add model cache step to speculative-decoding-eval (was missing)
- hf installed via pip in same venv, source activated before download
4a08e7e to
0317475
Compare
- Reverts auto-detection of vision capabilities from overriding the user's --vision flag in Server.swift
- Re-adds qwen3_5 to ModelArchitectureProbe since it technically supports vision (has image_token_id)
- By disabling the override, speculative decoding tests (which use Qwen3.5-2B-4bit text paths) will correctly start in text-only mode and avoid the reshape crash
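The precedence change described above can be illustrated with a small sketch. This is not the Server.swift code: both function names are hypothetical, and the "legacy" rule is an assumption about how an override might have behaved, written only to contrast with the flag-only rule the commit describes.

```python
def legacy_vision_mode(user_vision_flag: bool, probe_reports_vision: bool) -> bool:
    """Assumed old behavior: a positive probe result overrides the user's flag."""
    return user_vision_flag or probe_reports_vision

def resolve_vision_mode(user_vision_flag: bool, probe_reports_vision: bool) -> bool:
    """Behavior after the revert: only the --vision flag decides the mode.

    The probe result is accepted but deliberately ignored.
    """
    return user_vision_flag

# A model whose config exposes image_token_id, started without --vision,
# as in the speculative decoding tests.
print("legacy:", legacy_vision_mode(False, True))
print("after revert:", resolve_vision_mode(False, True))
```

Under the flag-only rule, a vision-capable architecture launched without `--vision` stays on the text path, which matches the commit's stated goal for the Qwen3.5-2B-4bit tests.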
639109d to
7407167
Compare
Auto-detection was removed from Server.swift (users use --vision flag). The vision integration test for LFM2.5-VL-450M was relying on that auto-detection by passing 'no' for the vision flag argument. Now passes 'yes' so --vision is always provided for VLM models.
Qwen3.5-9B-4bit (~5.8GB) + Qwen3.5-0.8B (~0.5GB) = 6.3GB base, which leaves no room for KV cache expansion on macos-15 7GB runners. The server crashes with Abort trap: 6 (malloc assertion) after generating only a few tokens.
Use Qwen3.5-2B-4bit (~1.7GB) with NUM_DRAFT_TOKENS=2 instead:
- Main: 2B (1.7GB) + Draft: 0.8B (0.5GB) = 2.2GB, which fits easily
- NUM_DRAFT_TOKENS=2 (vs 4 in the main speculative-decoding job) keeps this job testing a distinct speculation depth configuration
The job comment saying '9B' was aspirational only: 9B cannot reliably run on 7GB CI runners without a dedicated large-runner budget.
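The memory arithmetic above can be reproduced with a short sketch. The sizes are the approximate figures quoted in the commit message; `headroom_gb` is a hypothetical helper, and it counts only model weights, so the real headroom is smaller once the OS, the server process, and the KV cache are included.

```python
# Approximate figures quoted in the commit message, in GB.
RUNNER_RAM_GB = 7.0
MODEL_SIZES_GB = {
    "Qwen3.5-9B-4bit": 5.8,
    "Qwen3.5-2B-4bit": 1.7,
    "Qwen3.5-0.8B": 0.5,
}

def headroom_gb(main_model, draft_model, ram_gb=RUNNER_RAM_GB):
    """Return the RAM left after loading the main and draft model weights."""
    base = MODEL_SIZES_GB[main_model] + MODEL_SIZES_GB[draft_model]
    return ram_gb - base

for main in ("Qwen3.5-9B-4bit", "Qwen3.5-2B-4bit"):
    left = headroom_gb(main, "Qwen3.5-0.8B")
    print(f"{main} + Qwen3.5-0.8B leaves about {left:.1f} GB")
```

With the 9B main model, well under 1 GB remains for everything else on the runner; with the 2B model, several GB remain.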
No description provided.