
Commit b33801a

Merge pull request #77 from SharpAI/fix/issue-72-draft-model-ssd-ram
fix: memory auto-cap strategy for SSD MoE streaming + speculative decoding (Issue #72)
2 parents 336c8a8 + 8385350 commit b33801a

5 files changed

Lines changed: 690 additions & 20 deletions


.github/workflows/ci.yml

Lines changed: 244 additions & 4 deletions
@@ -219,8 +219,8 @@ jobs:
           retention-days: 7
 
   # ── Speculative Decoding Memory Evaluation ──
-  # Runs the 9B model with NUM_DRAFT_TOKENS=2 to check peak
-  # memory compression/efficiency. Allowed to OOM/fail.
+  # Runs the 2B model with NUM_DRAFT_TOKENS=2 to check peak
+  # memory compression/efficiency. Emits vm_stat readings as step summary.
   speculative-decoding-eval:
     runs-on: macos-15
     timeout-minutes: 45
@@ -277,7 +277,7 @@ jobs:
           python3 -m venv /tmp/mlx_venv
           /tmp/mlx_venv/bin/pip install --quiet huggingface_hub hf
 
-      - name: Cache MLX models (draft + 9B)
+      - name: Cache MLX models (draft + 2B)
        uses: actions/cache@v4
        with:
          path: ~/.cache/huggingface
@@ -288,6 +288,19 @@
           source /tmp/mlx_venv/bin/activate
           hf download mlx-community/Qwen3.5-2B-4bit || true
           hf download mlx-community/Qwen3.5-0.8B-MLX-4bit || true
+
+      - name: Snapshot RAM before test
+        id: ram_before
+        run: |
+          PAGE_SIZE=$(sysctl -n hw.pagesize)
+          RAM=$(vm_stat | awk -v page_size="$PAGE_SIZE" '
+            /Pages active:/ { v=$3; gsub(/\./, "", v); act=v+0 }
+            /Pages wired down:/ { v=$4; gsub(/\./, "", v); wire=v+0 }
+            /Pages occupied by compressor:/ { v=$5; gsub(/\./, "", v); comp=v+0 }
+            END { printf "%.2f", (act+wire+comp)*page_size/1073741824 }
+          ')
+          echo "ram_before=$RAM" >> $GITHUB_OUTPUT
+          echo "RAM before eval: ${RAM} GB"
 
       - name: Run speculative evaluation E2E
         env:
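
The vm_stat arithmetic above can be sanity-checked outside CI. A minimal sketch with made-up page counts (macOS prints each counter with a trailing period, hence the `gsub`; the 16384-byte page size assumes Apple Silicon, while Intel Macs report 4096):

```sh
# Pipe fabricated vm_stat-style lines through the same awk program.
PAGE_SIZE=16384  # assumed Apple Silicon page size; use `sysctl -n hw.pagesize` for real runs
printf 'Pages active:                    200000.\nPages wired down:                 80000.\nPages occupied by compressor:     40000.\n' |
awk -v page_size="$PAGE_SIZE" '
  /Pages active:/                 { v=$3; gsub(/\./, "", v); act=v+0 }
  /Pages wired down:/             { v=$4; gsub(/\./, "", v); wire=v+0 }
  /Pages occupied by compressor:/ { v=$5; gsub(/\./, "", v); comp=v+0 }
  END { printf "%.2f GB\n", (act+wire+comp)*page_size/1073741824 }'
# (200000 + 80000 + 40000) pages × 16384 B = 5,242,880,000 B ≈ 4.88 GB
```
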
@@ -309,11 +322,238 @@
           done
           echo "All attempts failed"
           exit 1
-
+
+      - name: Snapshot RAM after test
+        if: always()
+        id: ram_after
+        run: |
+          PAGE_SIZE=$(sysctl -n hw.pagesize)
+          RAM=$(vm_stat | awk -v page_size="$PAGE_SIZE" '
+            /Pages active:/ { v=$3; gsub(/\./, "", v); act=v+0 }
+            /Pages wired down:/ { v=$4; gsub(/\./, "", v); wire=v+0 }
+            /Pages occupied by compressor:/ { v=$5; gsub(/\./, "", v); comp=v+0 }
+            END { printf "%.2f", (act+wire+comp)*page_size/1073741824 }
+          ')
+          echo "ram_after=$RAM" >> $GITHUB_OUTPUT
+          echo "RAM after eval: ${RAM} GB"
+
+      - name: Emit memory summary
+        if: always()
+        run: |
+          BEFORE="${{ steps.ram_before.outputs.ram_before }}"
+          AFTER="${{ steps.ram_after.outputs.ram_after }}"
+          TOTAL=$(sysctl -n hw.memsize | awk '{printf "%.1f", $1/1073741824}')
+          {
+            echo "## 📊 Speculative Eval — Memory Readings"
+            echo "| Metric | Value |"
+            echo "|--------|-------|"
+            echo "| Runner physical RAM | ${TOTAL} GB |"
+            echo "| RAM before test | ${BEFORE} GB |"
+            echo "| RAM after test | ${AFTER} GB |"
+            echo "| Delta | $(echo "$AFTER $BEFORE" | awk '{printf "%.2f", $1-$2}') GB |"
+          } >> $GITHUB_STEP_SUMMARY
 
       - name: Upload speculative eval logs on failure
         if: failure()
         uses: actions/upload-artifact@v4
         with:
           name: speculative-eval-logs
           path: /tmp/SwiftLM-test-speculative-eval.log
+
+  # ── Issue #72 Regression: SSD streaming + draft model RAM guard ──────────────
+  # Mandatory (not continue-on-error). Enforces the auto-cap-to-1 fix and the
+  # memoryLimit sentinel on every PR. Uses tiny models (2B main + 0.8B draft)
+  # sized for the 7 GB macos-15 runner.
+  #
+  # Three checks mirror the local Test 10 in run_benchmark.sh:
+  #   [1] Auto-cap warning present in server log
+  #   [2] Peak RAM ≤ 85% of runner physical RAM during inference
+  #   [3] /v1/chat/completions returns valid content
+  ssd-draft-memory-guard:
+    runs-on: macos-15
+    timeout-minutes: 45
+    needs: build_and_unit_test
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          submodules: recursive
+
+      - name: Download Binary Artifact
+        uses: actions/download-artifact@v4
+        continue-on-error: true # fall back to building if artifact expired
+        with:
+          name: swiftlm-architecture
+          path: .build/release/
+
+      - name: Build (Release) if artifact missing
+        run: |
+          if [ ! -f ".build/release/SwiftLM" ]; then
+            swift build -c release
+          fi
+          chmod +x .build/release/SwiftLM
+
+      - name: Install MLX Metal library
+        run: |
+          python3 -m venv /tmp/mlx_venv
+          /tmp/mlx_venv/bin/pip install --quiet mlx huggingface_hub hf
+          cp /tmp/mlx_venv/lib/python*/site-packages/mlx/lib/mlx.metallib .build/release/
+
+      - name: Cache MLX models (2B main + 0.8B draft)
+        uses: actions/cache@v4
+        with:
+          path: ~/.cache/huggingface
+          key: mlx-ssd-draft-guard-qwen35-2b-0.8b
+
+      - name: Pre-download models
+        run: |
+          source /tmp/mlx_venv/bin/activate
+          hf download mlx-community/Qwen3.5-2B-4bit || true
+          hf download mlx-community/Qwen3.5-0.8B-MLX-4bit || true
+
+      - name: Snapshot RAM baseline
+        id: ram_base
+        run: |
+          PAGE_SIZE=$(sysctl -n hw.pagesize)
+          RAM=$(vm_stat | awk -v page_size="$PAGE_SIZE" '
+            /Pages active:/ { v=$3; gsub(/\./, "", v); act=v+0 }
+            /Pages wired down:/ { v=$4; gsub(/\./, "", v); wire=v+0 }
+            /Pages occupied by compressor:/ { v=$5; gsub(/\./, "", v); comp=v+0 }
+            END { printf "%.2f", (act+wire+comp)*page_size/1073741824 }
+          ')
+          TOTAL=$(sysctl -n hw.memsize | awk '{printf "%.0f", $1/1073741824}')
+          LIMIT=$(echo "$TOTAL * 0.85" | bc | cut -d. -f1)
+          echo "ram_base=$RAM" >> $GITHUB_OUTPUT
+          echo "runner_ram=$TOTAL" >> $GITHUB_OUTPUT
+          echo "ram_limit=$LIMIT" >> $GITHUB_OUTPUT
+          echo "Baseline RAM: ${RAM} GB | Runner: ${TOTAL} GB | Limit: ${LIMIT} GB"
+
+      - name: Start SSD + draft server (Issue #72 scenario)
+        id: server
+        run: |
+          # Launch with --num-draft-tokens 4 intentionally — the auto-cap should
+          # automatically reduce it to 1 and log the advisory message.
+          .build/release/SwiftLM \
+            --model mlx-community/Qwen3.5-2B-4bit \
+            --draft-model mlx-community/Qwen3.5-0.8B-MLX-4bit \
+            --stream-experts \
+            --num-draft-tokens 4 \
+            --port 15473 \
+            --max-tokens 64 \
+            > /tmp/ssd_draft_guard.log 2>&1 &
+          PID=$!
+          echo "server_pid=$PID" >> $GITHUB_OUTPUT
+
+          echo "Waiting for server (up to 300s)..."
+          for i in $(seq 1 300); do
+            if ! kill -0 $PID 2>/dev/null; then
+              echo "Server died early:"
+              cat /tmp/ssd_draft_guard.log
+              exit 1
+            fi
+            if curl -sf http://127.0.0.1:15473/health >/dev/null 2>&1; then
+              echo "Server ready after ${i}s"
+              break
+            fi
+            sleep 1
+            if [ "$i" -eq 300 ]; then echo "Timeout"; exit 1; fi
+          done
+
+      - name: Snapshot RAM after model load
+        id: ram_loaded
+        run: |
+          PAGE_SIZE=$(sysctl -n hw.pagesize)
+          RAM=$(vm_stat | awk -v page_size="$PAGE_SIZE" '
+            /Pages active:/ { v=$3; gsub(/\./, "", v); act=v+0 }
+            /Pages wired down:/ { v=$4; gsub(/\./, "", v); wire=v+0 }
+            /Pages occupied by compressor:/ { v=$5; gsub(/\./, "", v); comp=v+0 }
+            END { printf "%.2f", (act+wire+comp)*page_size/1073741824 }
+          ')
+          echo "ram_loaded=$RAM" >> $GITHUB_OUTPUT
+          echo "RAM after load: ${RAM} GB"
+
+      - name: "[1/3] Verify auto-cap warning in server log"
+        run: |
+          if grep -q "auto-capping" /tmp/ssd_draft_guard.log; then
+            echo "✅ Auto-cap warning found — numDraftTokens correctly reduced to 1"
+          else
+            echo "❌ Auto-cap warning NOT found in server log"
+            echo "--- Last 20 lines of server log ---"
+            tail -20 /tmp/ssd_draft_guard.log
+            exit 1
+          fi
+
+      - name: "[2/3] Run inference and snapshot peak RAM"
+        id: ram_peak
+        run: |
+          RESULT=$(curl -sf --max-time 90 http://127.0.0.1:15473/v1/chat/completions \
+            -H "Content-Type: application/json" \
+            -d '{"model":"test","messages":[{"role":"user","content":"What is 2+2? One word."}],"max_tokens":32,"stream":false}' \
+            2>/dev/null || echo "{}")
+          echo "$RESULT" > /tmp/inf_result.json
+
+          PAGE_SIZE=$(sysctl -n hw.pagesize)
+          RAM=$(vm_stat | awk -v page_size="$PAGE_SIZE" '
+            /Pages active:/ { v=$3; gsub(/\./, "", v); act=v+0 }
+            /Pages wired down:/ { v=$4; gsub(/\./, "", v); wire=v+0 }
+            /Pages occupied by compressor:/ { v=$5; gsub(/\./, "", v); comp=v+0 }
+            END { printf "%.2f", (act+wire+comp)*page_size/1073741824 }
+          ')
+          echo "ram_peak=$RAM" >> $GITHUB_OUTPUT
+          echo "RAM after inference: ${RAM} GB"
+
+          LIMIT="${{ steps.ram_base.outputs.ram_limit }}"
+          OK=$(echo "$RAM <= $LIMIT" | bc -l)
+          if [ "$OK" = "1" ]; then
+            echo "✅ RAM=${RAM}GB ≤ ${LIMIT}GB (85% of ${{ steps.ram_base.outputs.runner_ram }}GB runner RAM)"
+          else
+            echo "❌ RAM=${RAM}GB EXCEEDS limit ${LIMIT}GB — Issue #72 regression detected"
+            echo " (memoryLimit sentinel or auto-cap may have regressed)"
+            exit 1
+          fi
+
+      - name: "[3/3] Validate inference response"
+        run: |
+          RESULT=$(cat /tmp/inf_result.json)
+          if echo "$RESULT" | grep -q '"content"'; then
+            TEXT=$(echo "$RESULT" | python3 -c \
+              "import sys,json;d=json.load(sys.stdin);print(d['choices'][0]['message']['content'])" \
+              2>/dev/null || echo "(parse error)")
+            echo "✅ Response: $TEXT"
+          else
+            echo "❌ No content in response — server may have crashed or returned empty"
+            echo "Raw: ${RESULT:0:300}"
+            exit 1
+          fi
+
+      - name: Stop server
+        if: always()
+        run: kill ${{ steps.server.outputs.server_pid }} 2>/dev/null || true
+
+      - name: Emit memory summary to step summary
+        if: always()
+        run: |
+          BASE="${{ steps.ram_base.outputs.ram_base }}"
+          LOADED="${{ steps.ram_loaded.outputs.ram_loaded }}"
+          PEAK="${{ steps.ram_peak.outputs.ram_peak }}"
+          TOTAL="${{ steps.ram_base.outputs.runner_ram }}"
+          LIMIT="${{ steps.ram_base.outputs.ram_limit }}"
+          {
+            echo "## 🛡️ Issue #72 — SSD + Draft Model RAM Guard"
+            echo "| Metric | Value | Threshold |"
+            echo "|--------|-------|-----------|"
+            echo "| Runner physical RAM | ${TOTAL} GB | — |"
+            echo "| RAM baseline (before server) | ${BASE} GB | — |"
+            echo "| RAM after model load | ${LOADED} GB | — |"
+            echo "| RAM after inference (peak) | ${PEAK} GB | ≤ ${LIMIT} GB (85%) |"
+            echo "| Load delta | $(echo "$LOADED $BASE" | awk '{printf "%.2f", $1-$2}') GB | — |"
+            echo "| Inference delta | $(echo "$PEAK $LOADED" | awk '{printf "%.2f", $1-$2}') GB | — |"
+          } >> $GITHUB_STEP_SUMMARY
+
+      - name: Upload server log on failure
+        if: failure()
+        uses: actions/upload-artifact@v4
+        with:
+          name: ssd-draft-guard-log
+          path: /tmp/ssd_draft_guard.log
+          retention-days: 7

README.md

Lines changed: 17 additions & 4 deletions
@@ -242,7 +242,11 @@ SwiftLM implements a **rewritten SSD expert streaming pipeline** (engineered by
 
 A novel aspect of this architecture is the **dual-model speculative decoding** pattern: a small draft model (e.g. Qwen3.5-9B at 73 tok/s) runs **entirely in RAM** while the large MoE model (e.g. 122B) streams experts from SSD. The draft model generates candidate tokens at high speed, and the main model verifies them in bulk — dramatically reducing the number of SSD-bound generation rounds needed.
 
-> **Important finding:** Speculative decoding is **counterproductive for SSD-streaming MoE** specifically. The verify pass sends N+1 tokens, each routing to *different* experts — SSD I/O scales with the *union* of all positions' expert selections. Speculative decoding is therefore routed exclusively to **in-RAM models**.
+> **Performance note:** Combining `--stream-experts` with `--draft-model` requires care. The verify pass sends N+1 tokens simultaneously, each routing to *different* experts — SSD I/O scales with the *union* of all positions' expert selections. At the default `--num-draft-tokens 4` this creates a **5× I/O fan-out** that regresses throughput below solo SSD streaming.
+>
+> **Auto-cap strategy (Issue #72 fix):** SwiftLM automatically caps `--num-draft-tokens` to **1** when both flags are active. With 1 draft token the verify pass covers only 2 positions (2× fan-out). If the draft model's acceptance rate is ≥ 50% — typical for same-family models — the net throughput is still positive despite the 2× I/O overhead. A startup advisory is printed when the cap fires.
+>
+> For maximum throughput: use `--stream-experts` alone (no draft model).
 
 ### Optimization Techniques
 
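
The fan-out figures in the new note are pure counting: N draft tokens mean the verify pass covers N+1 positions at once, and in the worst case every position routes to a disjoint top-k expert set, so SSD reads grow by up to a factor of N+1. A quick illustration of that counting (worst case only; real unions shrink whenever positions happen to share experts):

```sh
# Worst-case SSD I/O fan-out per verify pass for a given --num-draft-tokens N.
for N in 1 2 4; do
  awk -v n="$N" 'BEGIN {
    printf "num-draft-tokens=%d -> %d verify positions -> up to %dx solo-streaming I/O\n",
           n, n + 1, n + 1
  }'
done
# N=4 yields the 5x fan-out the note describes; the auto-cap's N=1 bounds it at 2x.
```
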

@@ -271,11 +275,20 @@ SWIFTLM_TOP_K=6 SwiftLM --port 8002 \
 SWIFTLM_TOP_K=4 SwiftLM --port 8002 \
   --model <path>/Qwen3.5-122B-A10B-4bit --stream-experts
 
-# With speculative decoding (in-RAM models only):
+# With speculative decoding (in-RAM models only — both models fit in RAM):
 SwiftLM --port 8002 \
   --model <path>/Qwen3.5-27B-4bit \
   --draft-model <path>/Qwen3.5-9B-4bit \
   --num-draft-tokens 4
+
+# With SSD streaming + draft model (auto-cap mode):
+# SwiftLM automatically caps --num-draft-tokens to 1 to minimise the
+# verify-pass I/O fan-out. Net positive if draft acceptance rate ≥ 50%.
+SwiftLM --port 8002 \
+  --model <path>/Qwen3.5-122B-A10B-4bit \
+  --stream-experts \
+  --draft-model <path>/Qwen3.5-9B-4bit
+# ↑ num-draft-tokens is auto-capped to 1 at startup
 ```
 
 ---
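
To confirm locally that the cap fired, the same check the CI guard job runs as step [1/3] works against wherever the server output was redirected. A sketch (the log path and sleep duration are illustrative):

```sh
# The startup advisory contains the substring "auto-capping",
# which is exactly what the CI guard job greps for.
.build/release/SwiftLM --port 8002 \
  --model <path>/Qwen3.5-122B-A10B-4bit \
  --stream-experts \
  --draft-model <path>/Qwen3.5-9B-4bit \
  > /tmp/swiftlm.log 2>&1 &
sleep 10
grep "auto-capping" /tmp/swiftlm.log && echo "auto-cap advisory present"
```
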
@@ -404,8 +417,8 @@ curl http://localhost:5413/v1/chat/completions \
 | `--gpu-layers` | `model_default` | Restrict the number of layers allocated to GPU hardware |
 | `--stream-experts` | `false` | Enable SSD expert streaming for MoE models (10x speedup) |
 | `--turbo-kv` | `false` | Enable TurboQuant 3-bit KV cache compression (activates after 2048 tokens, server-wide) |
-| `--draft-model` | (none) | Draft model path/ID for speculative decoding (in-RAM models only) |
-| `--num-draft-tokens` | `4` | Number of draft tokens per speculation round |
+| `--draft-model` | (none) | Draft model path/ID for speculative decoding. When used with `--stream-experts`, `--num-draft-tokens` is auto-capped to 1 to minimise SSD I/O fan-out (see performance note above). |
+| `--num-draft-tokens` | `4` | Tokens per speculation round. Auto-capped to 1 when combined with `--stream-experts`. |
 
 ## 🔧 Per-Request API Parameters
