Auto-detect LFM2.5 VL MLX models #57

Merged
solderzzc merged 8 commits into main from codex/lfm25-vl-mlx-regression
Apr 17, 2026
@solderzzc
Member

No description provided.

@solderzzc solderzzc force-pushed the codex/lfm25-vl-mlx-regression branch from d612286 to 4398222 Compare April 16, 2026 21:27
solderzzc added a commit that referenced this pull request Apr 16, 2026
Three root causes addressed:

1. Cache key mismatch: speculative jobs used spm-SwiftLM-v2 while
   build_and_unit_test uses v3, causing a full rebuild on every run.
   Unified all jobs to v3.

2. Wrong pre-downloaded model: speculative-decoding job pre-fetched
   Qwen3.5-9B-4bit but test-speculative.sh MAIN_MODEL defaults to
   Qwen3.5-2B-4bit. The 9B model was never used by the test, and the
   2B model was not cached, causing re-download + OOM on 7 GB runner.
   Pre-download now fetches 2B + 0.8B (matching the test defaults).

3. Fragile metallib compile: replaced python setup.py build_ext
   with the proven pip install mlx approach already used in
   build_and_unit_test, which eliminates the 5-min compile step and
   the pybind11/cmake dependency chain.

speculative-decoding-eval retains the 9B model for its heavier
memory evaluation (continue-on-error: true already covers OOM).

Fixes: CI failure on PR #57
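
Root cause 2 is a mismatch between what the job pre-fetches and what the test loads. A minimal sketch of a check that would flag such a mismatch, assuming the model lists are available as plain strings; the helper names `missing_models` and `unused_models` are hypothetical and not part of this repository:

```python
from typing import Iterable, List


def missing_models(prefetched: Iterable[str], required: Iterable[str]) -> List[str]:
    """Models the test needs that the job did not pre-download."""
    return sorted(set(required) - set(prefetched))


def unused_models(prefetched: Iterable[str], required: Iterable[str]) -> List[str]:
    """Models the job pre-downloaded that the test never loads."""
    return sorted(set(prefetched) - set(required))


# Main model only, as described in the commit message above:
# the job fetched 9B while MAIN_MODEL defaults to 2B.
prefetched = ["Qwen3.5-9B-4bit"]
required = ["Qwen3.5-2B-4bit"]

print(missing_models(prefetched, required))  # ['Qwen3.5-2B-4bit']
print(unused_models(prefetched, required))   # ['Qwen3.5-9B-4bit']
```

Both lists being empty would indicate that the pre-download step and the test defaults agree.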
@solderzzc solderzzc force-pushed the codex/lfm25-vl-mlx-regression branch from 4398222 to 4a08e7e Compare April 17, 2026 00:04
…, 2B model

- Use cmake build.sh approach (not python setup.py) for SharpAI fork metallib
- Cache key v2 → v3 to share build cache with build_and_unit_test job
- Pre-download Qwen3.5-2B-4bit (matches the MAIN_MODEL default in test-speculative.sh)
  instead of 9B, which was never used and caused OOM on the 7 GB runner (Trace/BPT trap: 5)
- Add model cache step to speculative-decoding-eval (was missing)
- hf is installed via pip in the same venv, and the venv is activated (source) before the download
@solderzzc solderzzc force-pushed the codex/lfm25-vl-mlx-regression branch from 4a08e7e to 0317475 Compare April 17, 2026 00:42
- Reverts the change that let auto-detected vision capabilities override the user's --vision flag in Server.swift
- Re-adds qwen3_5 to ModelArchitectureProbe since it technically supports vision (has image_token_id)
- By disabling the override, speculative decoding tests (which use Qwen3.5-2B-4bit text paths) will correctly start in text-only mode and avoid the reshape crash
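
The flag precedence described in the bullets above can be sketched as follows. This is an illustration in Python, not the Swift code in Server.swift: the function names are hypothetical, the `image_token_id` check is a simplification of whatever ModelArchitectureProbe actually does, and the token id value is a placeholder.

```python
from typing import Dict, Tuple


def probe_supports_vision(config: Dict) -> bool:
    """Hypothetical probe: a config that declares image_token_id counts as vision-capable."""
    return "image_token_id" in config


def resolve_vision_mode(vision_flag: bool, config: Dict) -> Tuple[bool, str]:
    """The user's --vision flag decides; the probe result is advisory only."""
    capable = probe_supports_vision(config)
    if vision_flag:
        note = "vision enabled by --vision"
    elif capable:
        note = "vision-capable model, but --vision not passed: starting text-only"
    else:
        note = "text-only model"
    # The probe never overrides the flag.
    return vision_flag, note


# A qwen3_5 config has image_token_id (placeholder value here), yet without
# --vision the server stays in text-only mode.
enabled, note = resolve_vision_mode(False, {"model_type": "qwen3_5", "image_token_id": 1})
print(enabled, "-", note)
```

Under this rule a model that merely declares `image_token_id` is not pushed onto the vision path unless the caller asks for it, which is the behaviour the speculative decoding tests rely on.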
@solderzzc solderzzc force-pushed the codex/lfm25-vl-mlx-regression branch from 639109d to 7407167 Compare April 17, 2026 02:05
Auto-detection was removed from Server.swift (users use --vision flag).
The vision integration test for LFM2.5-VL-450M was relying on that
auto-detection by passing 'no' for the vision flag argument. Now passes
'yes' so --vision is always provided for VLM models.
Qwen3.5-9B-4bit (~5.8GB) + Qwen3.5-0.8B (~0.5GB) = 6.3GB base, which
leaves no room for KV cache expansion on macos-15 7GB runners. The
server crashes with Abort trap: 6 (malloc assertion) after generating
only a few tokens.

Use Qwen3.5-2B-4bit (~1.7GB) with NUM_DRAFT_TOKENS=2 instead:
- Main: 2B (1.7GB) + Draft: 0.8B (0.5GB) = 2.2GB — fits easily
- NUM_DRAFT_TOKENS=2 (vs 4 in the main speculative-decoding job)
  keeps this job testing a distinct speculation depth configuration

The job comment saying '9B' was aspirational only — 9B cannot reliably
run on 7GB CI runners without a dedicated large-runner budget.
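
The memory arithmetic in this commit message can be reproduced with a short sketch. The sizes are the approximate figures quoted above; the helper name `headroom_gb` is hypothetical, and the calculation ignores OS and process overhead, so real headroom is smaller than the number printed.

```python
RUNNER_RAM_GB = 7.0  # macos-15 runner size quoted in the commit message


def headroom_gb(main_gb: float, draft_gb: float, runner_gb: float = RUNNER_RAM_GB) -> float:
    """Nominal RAM left for KV cache after loading main + draft weights."""
    return round(runner_gb - (main_gb + draft_gb), 1)


# Qwen3.5-9B-4bit (~5.8 GB) + Qwen3.5-0.8B (~0.5 GB)
print("9B + 0.8B headroom:", headroom_gb(5.8, 0.5), "GB")

# Qwen3.5-2B-4bit (~1.7 GB) + Qwen3.5-0.8B (~0.5 GB)
print("2B + 0.8B headroom:", headroom_gb(1.7, 0.5), "GB")
```

With the 9B pair, well under 1 GB remains before any system overhead is counted, which is consistent with the crash after only a few generated tokens.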
@solderzzc solderzzc merged commit 70b9398 into main Apr 17, 2026
8 checks passed
@solderzzc solderzzc deleted the codex/lfm25-vl-mlx-regression branch April 17, 2026 04:22