
Commit 8617e8e

Merge branch 'main' into feature/swiftbuddy-mempalace-v1

2 parents 0c3281b + 00ce868

File tree: 8 files changed, +839 −9 lines

**.agents/workflows/run-benchmark.md**

Lines changed: 19 additions & 0 deletions

@@ -52,6 +52,25 @@ The profiler will:

- **Different contexts**: Change `--contexts` (comma-separated list of token counts)
- **Output file**: Change `--out` path

## Expert Top-K Tuning for MoE Models

For Mixture of Experts (MoE) models (like `Qwen3.5-122B-A10B-4bit`), you can override the number of dynamically routed experts per token with the `SWIFTLM_TOP_K` environment variable. By default, SwiftLM evaluates the maximum number of experts defined by the model architecture; reducing this trades marginal quality for significant memory and streaming-speed gains.

Set the variable when running the profiler:

```bash
SWIFTLM_TOP_K=6 python3 -u scripts/profiling/profile_runner.py ...
```
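The resolution logic the variable implies can be mirrored in a few lines of Python. This is a hypothetical sketch, not SwiftLM's actual implementation (the real clamping lives in the Swift runtime; `read_top_k` is an illustrative name):

```python
import os

def read_top_k(model_max_experts: int) -> int:
    """Resolve the number of routed experts per token.

    Mirrors the documented behaviour: use SWIFTLM_TOP_K when set,
    otherwise fall back to the model's architectural maximum, and
    reject values outside [1, max].
    """
    raw = os.environ.get("SWIFTLM_TOP_K")
    if raw is None:
        return model_max_experts
    k = int(raw)
    if not 1 <= k <= model_max_experts:
        raise ValueError(f"SWIFTLM_TOP_K must be in [1, {model_max_experts}], got {k}")
    return k

os.environ["SWIFTLM_TOP_K"] = "6"
print(read_top_k(model_max_experts=8))  # 6
```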
### Reference Pipeline (M1 Ultra 64GB, Qwen3.5-122B-A10B-4bit)

| Configuration | tok/s | vs. Original | Notes |
|---|---|---|---|
| Original `--stream-experts` | 0.58 | baseline | Sequential pread, 1 NVMe queue |
| `SWIFTLM_TOP_K=8` | 4.95 | 8.5× | All 8 experts evaluated (full quality) |
| `SWIFTLM_TOP_K=6` | 5.20 | 9.0× | Recommended default |
| `SWIFTLM_TOP_K=4` | 5.91 | 10.2× | Best quality/speed tradeoff (speed mode) |
| `SWIFTLM_TOP_K=2` | 6.52 | 11.2× | Still coherent output (turbo mode) |
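The "vs. Original" column is simply each configuration's throughput divided by the 0.58 tok/s baseline, rounded to one decimal. A quick sanity check:

```python
baseline = 0.58  # original --stream-experts throughput, tok/s

rows = {
    "SWIFTLM_TOP_K=8": 4.95,
    "SWIFTLM_TOP_K=6": 5.20,
    "SWIFTLM_TOP_K=4": 5.91,
    "SWIFTLM_TOP_K=2": 6.52,
}

# Reproduce the speedup column from the raw tok/s figures.
for config, tok_s in rows.items():
    print(f"{config}: {tok_s / baseline:.1f}x")
```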
## After the Benchmark

4. Review the generated markdown file and check for any `FAILED / OOM` entries.

**.github/workflows/benchmark.yml**

Lines changed: 82 additions & 0 deletions (new file)

```yaml
name: Performance Benchmark

on:
  workflow_dispatch:
    inputs:
      model_id:
        description: 'HuggingFace Model ID (must be ungated and fit in 7GB RAM)'
        required: true
        default: 'mlx-community/gemma-4-e4b-it-4bit'
      contexts:
        description: 'Comma separated context lengths'
        required: true
        default: '512,1024,4096'
      use_ssd_stream:
        description: 'Enable SSD Expert Streaming'
        type: boolean
        required: false
        default: false

jobs:
  benchmark:
    runs-on: macos-15
    timeout-minutes: 60
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: recursive

      - name: Install Metal Toolchain
        run: xcodebuild -downloadComponent MetalToolchain || true

      - name: Cache Swift packages
        uses: actions/cache@v4
        with:
          path: .build
          key: ${{ runner.os }}-spm-SwiftLM-v2-${{ hashFiles('Package.resolved') }}
          restore-keys: |
            ${{ runner.os }}-spm-SwiftLM-v2-

      - name: Resolve dependencies
        run: swift package resolve

      - name: Build (Release)
        run: swift build -c release

      - name: Install MLX Metal library & Profiling Dependencies
        run: |
          python3 -m venv /tmp/mlx_venv
          /tmp/mlx_venv/bin/pip install --quiet mlx psutil requests
          cp /tmp/mlx_venv/lib/python*/site-packages/mlx/lib/mlx.metallib .build/release/

      - name: Cache MLX models
        uses: actions/cache@v4
        with:
          path: ~/.cache/huggingface
          key: mlx-benchmark-model-${{ github.event.inputs.model_id }}

      - name: Run Benchmark Script
        env:
          HF_HUB_DOWNLOAD_TIMEOUT: "900"
        run: |
          EXTRA_FLAGS=""
          if [ "${{ github.event.inputs.use_ssd_stream }}" = "true" ]; then
            EXTRA_FLAGS="--ssd-only"
            echo "Enabled SSD Streaming mode"
          fi

          # Use the environment Python that has the pip dependencies
          source /tmp/mlx_venv/bin/activate

          python3 -u scripts/profiling/profile_runner.py \
            --model "${{ github.event.inputs.model_id }}" \
            --contexts "${{ github.event.inputs.contexts }}" \
            $EXTRA_FLAGS \
            --out "./github-action-benchmark.md"

      - name: Upload Benchmark Results
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          path: ./github-action-benchmark.md
          retention-days: 7
```
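The flag-assembly logic in the `Run Benchmark Script` step can be sketched outside of shell. The following hypothetical Python equivalent (the function name is illustrative, not part of the repo) builds the same `profile_runner.py` invocation, including the conditional `--ssd-only` flag:

```python
def build_profiler_cmd(model_id: str, contexts: str, use_ssd_stream: bool,
                       out: str = "./github-action-benchmark.md") -> list[str]:
    """Assemble the profiler command line the workflow step runs."""
    cmd = ["python3", "-u", "scripts/profiling/profile_runner.py",
           "--model", model_id, "--contexts", contexts]
    if use_ssd_stream:  # mirrors the EXTRA_FLAGS branch in the shell step
        cmd.append("--ssd-only")
    cmd += ["--out", out]
    return cmd

print(build_profiler_cmd("mlx-community/gemma-4-e4b-it-4bit", "512,1024,4096", True))
```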

**.github/workflows/ci.yml**

Lines changed: 138 additions & 1 deletion

The artifact retention for CI test logs is bumped from 1 day to 7:

```diff
@@ -115,4 +115,141 @@ jobs:
         with:
           name: ci-test-logs-${{ matrix.modality }}
           path: /tmp/SwiftLM-test-*.log
-          retention-days: 1
+          retention-days: 7
```

Two new jobs are appended under `jobs:`:

```yaml
  # ── Speculative Decoding E2E (dual-model: 0.8B draft + 4B main) ──
  # Uses the standard macos-15 runner (7 GB RAM).
  # We test the 4B main model which safely fits within memory.
  speculative-decoding:
    runs-on: macos-15
    timeout-minutes: 45
    needs: ci  # Only run after core CI passes
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: recursive

      - name: Install Metal Toolchain
        run: xcodebuild -downloadComponent MetalToolchain || true

      - name: Cache Swift packages
        uses: actions/cache@v4
        with:
          path: .build
          key: ${{ runner.os }}-spm-SwiftLM-v2-${{ hashFiles('Package.resolved') }}
          restore-keys: |
            ${{ runner.os }}-spm-SwiftLM-v2-

      - name: Clear stale module cache
        run: find .build -type d -name ModuleCache -exec rm -rf {} + 2>/dev/null || true

      - name: Resolve dependencies
        run: swift package resolve

      - name: Build (Release)
        run: swift build -c release

      - name: Install MLX Metal library
        run: |
          python3 -m venv /tmp/mlx_venv
          /tmp/mlx_venv/bin/pip install --quiet mlx
          cp /tmp/mlx_venv/lib/python*/site-packages/mlx/lib/mlx.metallib .build/release/

      - name: Cache MLX models (draft + main)
        uses: actions/cache@v4
        with:
          path: ~/.cache/huggingface
          key: mlx-speculative-qwen35-0.8b-9b

      - name: Run speculative decoding E2E
        env:
          HF_HUB_DOWNLOAD_TIMEOUT: "900"
          SWIFTLM_TOP_K: "4"
        run: |
          chmod +x tests/test-speculative.sh
          for attempt in 1 2 3; do
            echo "Attempt $attempt of 3..."
            if tests/test-speculative.sh .build/release/SwiftLM 15414; then
              exit 0
            fi
            if [ "$attempt" -lt 3 ]; then
              echo "Test failed, retrying in 10s..."
              sleep 10
            fi
          done
          echo "All attempts failed"
          exit 1

      - name: Upload speculative test logs on failure
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: speculative-test-logs
          path: /tmp/SwiftLM-test-speculative.log
          retention-days: 7

  # ── Speculative Decoding Memory Evaluation ──
  # Runs the 9B model with NUM_DRAFT_TOKENS=2 to check peak
  # memory compression/efficiency. Allowed to OOM/fail.
  speculative-decoding-eval:
    runs-on: macos-15
    timeout-minutes: 45
    needs: ci
    continue-on-error: true
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: recursive

      - name: Install Metal Toolchain
        run: xcodebuild -downloadComponent MetalToolchain || true

      - name: Cache Swift packages
        uses: actions/cache@v4
        with:
          path: .build
          key: ${{ runner.os }}-spm-SwiftLM-v2-${{ hashFiles('Package.resolved') }}
          restore-keys: |
            ${{ runner.os }}-spm-SwiftLM-v2-

      - name: Clear stale module cache
        run: find .build -type d -name ModuleCache -exec rm -rf {} + 2>/dev/null || true

      - name: Resolve dependencies
        run: swift package resolve

      - name: Build (Release)
        run: swift build -c release

      - name: Install MLX Metal library
        run: |
          python3 -m venv /tmp/mlx_venv
          /tmp/mlx_venv/bin/pip install --quiet mlx
          cp /tmp/mlx_venv/lib/python*/site-packages/mlx/lib/mlx.metallib .build/release/

      - name: Run speculative evaluation E2E
        env:
          HF_HUB_DOWNLOAD_TIMEOUT: "900"
          SWIFTLM_TOP_K: "4"
        run: |
          chmod +x tests/test-speculative-eval.sh
          for attempt in 1 2 3; do
            echo "Attempt $attempt of 3..."
            if tests/test-speculative-eval.sh .build/release/SwiftLM 15414; then
              exit 0
            fi
            if [ "$attempt" -lt 3 ]; then
              echo "Test failed, retrying in 10s..."
              sleep 10
            fi
          done
          echo "All attempts failed"
          exit 1

      - name: Upload speculative eval logs on failure
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: speculative-eval-logs
          path: /tmp/SwiftLM-test-speculative-eval.log
```
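Both jobs share the same three-attempt retry shape around a flaky end-to-end test (model downloads over HF can time out). As a language-neutral sketch of that loop (the function name is illustrative, not part of the repo):

```python
import time

def run_with_retries(test, attempts: int = 3, delay_s: float = 10.0) -> bool:
    """Run `test` (a zero-arg callable returning bool) up to `attempts`
    times, sleeping between failures, mirroring the shell loop above."""
    for attempt in range(1, attempts + 1):
        print(f"Attempt {attempt} of {attempts}...")
        if test():
            return True
        if attempt < attempts:
            print(f"Test failed, retrying in {delay_s:.0f}s...")
            time.sleep(delay_s)
    print("All attempts failed")
    return False
```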

**README.md**

Lines changed: 70 additions & 3 deletions

```diff
@@ -86,7 +86,8 @@ Benchmark results for `gemma-4-26b-a4b-it-4bit` (26B MoE, 4-bit) on M5 Pro 64 GB
 - 👁️ **Vision-Language Models (VLM)**: Native multimodal vision processing natively on Metal via the `--vision` flag, supporting real-time base64 image parsing (e.g., Qwen2-VL, PaliGemma).
 - 🎧 **Audio-Language Models (ALM)**: High-performance audio ingestion via the `--audio` flag, decoding OpenAI-spec `input_audio` payloads with AVFoundation WAV extraction.
 - ⚡️ **TurboQuantization Integrated**: Custom low-level MLX Metal primitives that apply extremely fast quantization for KV caching out-of-the-box.
-- 💾 **SSD Expert Streaming**: *Experimental* zero-copy streaming that swaps Mixture of Experts (MoE) layers directly from the NVMe SSD to the GPU command buffer without trashing macOS Unified Memory (prevents Watchdog OS kernel panics on 122B+ models). Read the [SSD Streaming Architecture limits & documentation](docs/moe_ssd_streaming_architecture.md).
+- 💾 **SSD Expert Streaming (10x)**: High-performance NVMe streaming that loads Mixture of Experts (MoE) layers directly from SSD to GPU — engineered by [@ericjlake](https://github.com/ericjlake), achieving **10x speedup** (0.58 → 5.91 tok/s) on 122B+ models with only ~10 GB resident memory. Uses cross-projection batching, concurrent pread (QD=24), asyncEval pipeline, and runtime top-k expert selection.
+- 🔮 **Speculative Decoding**: Load a small draft model (e.g. 9B) alongside a large main model to generate candidate tokens and verify in bulk — accelerating in-RAM inference.
 - 🎛️ **Granular Memory Control**: Integrated Layer Partitioning (`--gpu-layers`) and Wisdom Auto-Calibration for squeezing massive models into RAM.

 ---
```
The next hunk (`@@ -173,6 +174,64 @@`, after the `turboquant-mlx` reference implementations) adds a new section:

## 💾 SSD Expert Streaming: 10x MoE Speedup

SwiftLM implements a **rewritten SSD expert streaming pipeline** (engineered by [Eric Lake](https://github.com/ericjlake)) that achieves a 10x generation speedup for massive Mixture of Experts (MoE) models running on memory-constrained Apple Silicon. This enables running models like **Qwen3.5-122B** (69.6 GB) and **Qwen3.5-397B** (209 GB) on a **64 GB Mac** by streaming expert weights from NVMe SSD.

### Benchmark Results (M1 Ultra 64GB, Qwen3.5-122B-A10B-4bit)

| Configuration | tok/s | vs. Original | Notes |
|---|---|---|---|
| Original `--stream-experts` | 0.58 | baseline | Sequential pread, 1 NVMe queue |
| **This PR (top-k=8, full quality)** | **4.95** | **8.5×** | All 8 experts evaluated |
| **This PR (top-k=6, default)** | **5.20** | **9.0×** | Recommended default |
| **This PR (top-k=4, speed mode)** | **5.91** | **10.2×** | Best quality/speed tradeoff |
| **This PR (top-k=2, turbo mode)** | **6.52** | **11.2×** | Still coherent output |

> Memory stable at **~10.6 GB resident**, no swap activity. Tested over 200-token generation runs.

### The Approach: Small Model Helps Large Model

A novel aspect of this architecture is the **dual-model speculative decoding** pattern: a small draft model (e.g. Qwen3.5-9B at 73 tok/s) runs **entirely in RAM** while the large MoE model (e.g. 122B) streams experts from SSD. The draft model generates candidate tokens at high speed, and the main model verifies them in bulk — dramatically reducing the number of SSD-bound generation rounds needed.
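The draft/verify loop can be sketched with greedy verification. This is a simplified model of the pattern, not SwiftLM's implementation: real systems batch the verify pass into one forward call and handle sampling; `draft_model` and `main_model` here are stand-in callables over token lists.

```python
def speculative_step(prefix, draft_model, main_model, num_draft_tokens=4):
    """One round of speculative decoding with greedy verification.

    The cheap draft model proposes `num_draft_tokens` tokens; the main
    model scores the extended sequence and accepts the longest prefix of
    draft tokens it agrees with, plus one token of its own.
    """
    draft, ctx = [], list(prefix)
    for _ in range(num_draft_tokens):
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)
    # Stand-in for the single batched verify pass: the main model's
    # greedy prediction at each draft position (plus the bonus slot).
    verified = [main_model(list(prefix) + draft[:i]) for i in range(len(draft) + 1)]
    accepted = []
    for d, v in zip(draft, verified):
        if d != v:
            break
        accepted.append(d)
    # Append the main model's token at the first disagreement (or bonus).
    accepted.append(verified[len(accepted)])
    return accepted
```

On agreement the round emits up to `num_draft_tokens + 1` tokens for one main-model pass, which is where the in-RAM speedup comes from.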
> **Important finding:** Speculative decoding is **counterproductive for SSD-streaming MoE** specifically. The verify pass sends N+1 tokens, each routing to *different* experts — SSD I/O scales with the *union* of all positions' expert selections. Speculative decoding is therefore enabled only for **in-RAM models**.
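The union effect is easy to quantify: with E experts and top-k routing per position, verifying n positions touches about E·(1 − (1 − k/E)^n) distinct experts under independent routing, far more SSD reads than a single token's k. An illustrative calculation (E=128 and k=8 are example values, not necessarily Qwen3.5's actual configuration):

```python
def expected_union(total_experts: int, top_k: int, positions: int) -> float:
    """Expected number of distinct experts touched when `positions`
    tokens each independently route to `top_k` of `total_experts`."""
    miss_one = 1 - top_k / total_experts  # P(one position skips a given expert)
    return total_experts * (1 - miss_one ** positions)

# Single decoded token vs. a 5-position verify pass (illustrative sizes):
print(expected_union(128, 8, 1))  # 8.0 experts' worth of SSD reads
print(expected_union(128, 8, 5))  # ~35 experts, i.e. ~4.4x the I/O per round
```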
### Optimization Techniques

1. **Cross-Projection Batching**: Collapses ~1,400 per-expert `eval()` calls down to ~48 per token by orchestrating gate/up/down projections together in `SwitchGLU`.
2. **Concurrent NVMe pread (QD=24)**: Replaces sequential pread with `DispatchQueue.concurrentPerform`, saturating the NVMe controller's queue depth (8 experts × 3 projections = 24 parallel reads).
3. **AsyncEval Pipeline with Speculative Pread**: Overlaps GPU compute with SSD I/O — uses previous-token routing to speculatively pre-load experts for the next token during the GPU async window (~70% hit rate). Only missed experts (~30%) require on-demand pread after routing sync.
4. **Persistent Metal Buffers**: Expert weight buffers are allocated once per `SwitchGLU` layer and reused across tokens, eliminating per-token allocation overhead.
5. **Runtime Top-K Expert Selection**: The `SWIFTLM_TOP_K` environment variable reduces the number of active experts per token at runtime without model recompilation — trading marginal quality for significant speed gains.
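Technique 2, keeping many positional reads in flight so the NVMe queue stays full, can be pictured in Python. This is a sketch of the idea only: the Swift implementation uses `DispatchQueue.concurrentPerform`, and the helper name and slice layout below are illustrative.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def read_expert_slices(path: str, slices: list[tuple[int, int]],
                       queue_depth: int = 24) -> list[bytes]:
    """Read many (offset, length) slices of one file concurrently.

    Each worker issues an independent positional read (os.pread), so up
    to `queue_depth` requests can be in flight at the NVMe controller at
    once — the same idea as QD=24 concurrent pread in expert streaming.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=queue_depth) as pool:
            # os.pread(fd, length, offset) never moves the shared file
            # position, so the reads need no locking.
            return list(pool.map(lambda s: os.pread(fd, s[1], s[0]), slices))
    finally:
        os.close(fd)
```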
### Key Engineering Findings

| Finding | Detail |
|---|---|
| **GPU compute is the bottleneck** | At steady state, GPU compute is ~190ms of ~200ms per-token time. The OS page cache serves ~90% of expert reads from RAM. |
| **Don't cache experts in application memory** | An LRU expert cache *stole* from the OS page cache and regressed performance (4.84 → 4.01 tok/s). Let the kernel manage it. |
| **MambaCache requires checkpoint rollback** | Unlike attention KV caches (trim = decrement offset), Mamba's recurrent state integrates all history and cannot be partially undone. We implemented `checkpoint()`/`restore()` for speculative decoding on hybrid Attention+Mamba architectures (Qwen3.5). |
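The MambaCache fix can be pictured with a minimal state container. This is a sketch of the checkpoint/restore contract only, with a toy recurrence standing in for Mamba's state update; it is not SwiftLM's actual type.

```python
import copy

class RecurrentCache:
    """Minimal recurrent-state cache with checkpoint/restore.

    Unlike a KV cache, the rolled-up state cannot be trimmed by dropping
    entries, so speculative decoding snapshots it before drafting and
    restores it when draft tokens are rejected.
    """
    def __init__(self):
        self.state = {"h": 0.0, "tokens_seen": 0}
        self._snapshot = None

    def update(self, token_value: float) -> None:
        # Toy recurrence: every token is folded irreversibly into `h`.
        self.state["h"] = 0.9 * self.state["h"] + token_value
        self.state["tokens_seen"] += 1

    def checkpoint(self) -> None:
        self._snapshot = copy.deepcopy(self.state)

    def restore(self) -> None:
        assert self._snapshot is not None, "checkpoint() must come first"
        self.state = copy.deepcopy(self._snapshot)
```

Attention KV caches can simply decrement an offset to reject tokens; here the snapshot is the only way back.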
214+
215+
### Usage
216+
217+
```bash
218+
# Standard SSD streaming (recommended, top-k=6):
219+
SWIFTLM_TOP_K=6 SwiftLM --port 8002 \
220+
--model <path>/Qwen3.5-122B-A10B-4bit --stream-experts
221+
222+
# Speed mode (top-k=4):
223+
SWIFTLM_TOP_K=4 SwiftLM --port 8002 \
224+
--model <path>/Qwen3.5-122B-A10B-4bit --stream-experts
225+
226+
# With speculative decoding (in-RAM models only):
227+
SwiftLM --port 8002 \
228+
--model <path>/Qwen3.5-27B-4bit \
229+
--draft-model <path>/Qwen3.5-9B-4bit \
230+
--num-draft-tokens 4
231+
```
232+
233+
---
234+
## 💻 Benchmarks & Testing

Run our automated benchmark suites via the interactive script:

```diff
@@ -280,8 +339,10 @@ curl http://localhost:5413/v1/chat/completions \
 | `--max-tokens` | `2048` | Max tokens limit per generation |
 | `--prefill-size`| `512` | Prompt prefill chunk size (micro-batching for long contexts) |
 | `--gpu-layers` | `model_default`| Restrict the amount of layers allocated to GPU hardware |
-| `--stream-experts` | `false` | Enable experimental SSD streaming for MoE model expert matrices |
+| `--stream-experts` | `false` | Enable SSD expert streaming for MoE models (10x speedup) |
 | `--turbo-kv` | `false` | Enable TurboQuant 3-bit KV cache compression |
+| `--draft-model` | (none) | Draft model path/ID for speculative decoding (in-RAM models only) |
+| `--num-draft-tokens` | `4` | Number of draft tokens per speculation round |

 ## 📦 Requirements
```

```diff
@@ -301,7 +362,13 @@ The model instantly woke up from "whispering" whitespace and successfully respon
 ## 🙏 Acknowledgments & Credits

-`SwiftLM` leverages the powerful foundation of the Apple MLX community and relies heavily on the open-source ecosystem. While the custom C++ implementations, Metal optimizations, and high-performance pipeline architecture were engineered natively for this engine, we owe massive thanks to the following projects for their indispensable reference materials and underlying protocols:
+`SwiftLM` leverages the powerful foundation of the Apple MLX community and relies heavily on the open-source ecosystem. While the custom C++ implementations, Metal optimizations, and high-performance pipeline architecture were engineered natively for this engine, we owe massive thanks to the following projects and contributors for their indispensable reference materials and underlying protocols:
+
+### Contributors
+
+- **[Eric Lake](https://github.com/ericjlake)** — Engineered the **SSD Expert Streaming 10x rewrite** ([PR #26](https://github.com/SharpAI/SwiftLM/pull/26)), achieving 10× generation speedup on 122B+ MoE models via cross-projection batching, concurrent NVMe pread (QD=24), asyncEval pipeline with speculative pread, and runtime top-k expert selection. Also implemented the **speculative decoding infrastructure** with `DraftModelRef`, dual-model loading, and **MambaCache checkpoint/restore** for hybrid Attention+Mamba architectures.
+
+### Projects & References

 - **[mlx-swift](https://github.com/ml-explore/mlx-swift)** — The core Apple MLX wrapper bringing Metal-accelerated operations into the Swift ecosystem.
 - **[mlx-lm](https://github.com/ml-explore/mlx/tree/main/mlx_lm)** — The official Python language models implementation, serving as the core inspiration for our chunked-prefill architecture and attention manipulation logic.
```
