Commit 6ad2d01

Merge pull request #28 from SharpAI/feature/speculative-decoding-ci

Feature/speculative decoding ci

2 parents 7f68fca + 087e8d9

3 files changed: 23 additions, 2 deletions

.agents/workflows/run-benchmark.md

19 additions, 0 deletions
@@ -52,6 +52,25 @@ The profiler will:
 - **Different contexts**: Change `--contexts` (comma-separated list of token counts)
 - **Output file**: Change `--out` path
+
+## Expert Top-K Tuning for MoE Models
+
+For Mixture-of-Experts (MoE) models (such as `Qwen3.5-122B-A10B-4bit`), you can override the number of dynamically routed experts per token with the `SWIFTLM_TOP_K` environment variable. By default, SwiftLM evaluates the maximum number of experts defined by the model architecture; reducing this trades a marginal quality loss for large memory and streaming-speed gains.
+
+Set the variable when running the profiler:
+
+```bash
+SWIFTLM_TOP_K=6 python3 -u scripts/profiling/profile_runner.py ...
+```
+
+### Reference Pipeline (M1 Ultra 64GB, Qwen3.5-122B-A10B-4bit)
+
+| Configuration | tok/s | vs. Original | Notes |
+|---|---|---|---|
+| Original `--stream-experts` | 0.58 | baseline | Sequential pread, 1 NVMe queue |
+| `SWIFTLM_TOP_K=8` | 4.95 | 8.5× | All 8 experts evaluated (full quality) |
+| `SWIFTLM_TOP_K=6` | 5.20 | 9.0× | Recommended default |
+| `SWIFTLM_TOP_K=4` | 5.91 | 10.2× | Best quality/speed tradeoff (Speed mode) |
+| `SWIFTLM_TOP_K=2` | 6.52 | 11.2× | Still coherent output (Turbo mode) |
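The env-var override described above can be sketched as follows. This is an illustrative sketch only, not SwiftLM's actual code: the function name `resolve_top_k` and the clamp-to-architecture-maximum behavior are assumptions about how such an override would typically be read.

```python
import os


def resolve_top_k(model_max_experts: int, env_var: str = "SWIFTLM_TOP_K") -> int:
    """Number of experts to route per token.

    Falls back to the model's architectural maximum when the override is
    unset or not an integer, and never exceeds that maximum.
    (Hypothetical helper for illustration, not part of SwiftLM.)
    """
    raw = os.environ.get(env_var)
    if raw is None:
        return model_max_experts
    try:
        k = int(raw)
    except ValueError:
        return model_max_experts
    return max(1, min(k, model_max_experts))


# Demonstrate with an 8-expert architecture, as in the table above.
os.environ.pop("SWIFTLM_TOP_K", None)
print(resolve_top_k(8))  # no override set -> 8
os.environ["SWIFTLM_TOP_K"] = "6"
print(resolve_top_k(8))  # override -> 6
```

Clamping keeps an out-of-range override (e.g. `SWIFTLM_TOP_K=99`) from requesting more experts than the model defines.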
 ## After the Benchmark
 
 4. Review the generated markdown file and check for any `FAILED / OOM` entries.

.github/workflows/ci.yml

2 additions, 0 deletions
@@ -154,6 +154,7 @@ jobs:
       - name: Run speculative decoding E2E
         env:
           HF_HUB_DOWNLOAD_TIMEOUT: "900"
+          SWIFTLM_TOP_K: "4"
         run: |
           chmod +x tests/test-speculative.sh
           for attempt in 1 2 3; do
@@ -219,6 +220,7 @@ jobs:
       - name: Run speculative evaluation E2E
         env:
           HF_HUB_DOWNLOAD_TIMEOUT: "900"
+          SWIFTLM_TOP_K: "4"
         run: |
           chmod +x tests/test-speculative-eval.sh
           for attempt in 1 2 3; do
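Both CI steps wrap the test script in the same `for attempt in 1 2 3` retry loop. That pattern can be sketched as a small helper; `retry3` is an illustrative name, not a function defined in the repository.

```shell
#!/bin/sh
# Sketch of the retry pattern used by the CI steps above: run a command
# up to three times and succeed as soon as one attempt passes.
retry3() {
  for attempt in 1 2 3; do
    if "$@"; then
      return 0
    fi
    echo "attempt ${attempt} failed" >&2
  done
  return 1
}

retry3 true && echo "command succeeded"
```

Retrying the whole script masks transient failures (e.g. slow Hugging Face Hub downloads, hence the 900-second `HF_HUB_DOWNLOAD_TIMEOUT`) while still failing the job if all three attempts fail.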

tests/test-speculative-eval.sh

2 additions, 2 deletions
@@ -23,7 +23,7 @@ PORT="${2:-15414}"
 HOST="127.0.0.1"
 MAIN_MODEL="${MAIN_MODEL:-mlx-community/Qwen3.5-9B-4bit}"
 DRAFT_MODEL="${DRAFT_MODEL:-mlx-community/Qwen3.5-0.8B-MLX-4bit}"
-NUM_DRAFT_TOKENS=2
+NUM_DRAFT_TOKENS=1
 URL="http://${HOST}:${PORT}"
 PASS=0
 FAIL=0
@@ -148,7 +148,7 @@ log "Test 3: Streaming speculative generation"
 
 STREAM_OUTPUT=$(curl -sf -N --max-time 120 -X POST "$URL/v1/chat/completions" \
   -H "Content-Type: application/json" \
-  -d "{\"model\":\"$MAIN_MODEL\",\"stream\":true,\"max_tokens\":30,\"messages\":[{\"role\":\"user\",\"content\":\"Name three fruits.\"}]}" \
+  -d "{\"model\":\"$MAIN_MODEL\",\"stream\":true,\"max_tokens\":10,\"messages\":[{\"role\":\"user\",\"content\":\"Name three fruits.\"}]}" \
   2>/dev/null || true)
 
 if echo "$STREAM_OUTPUT" | grep -q "data: \[DONE\]"; then
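The completion check in Test 3 relies on a convention of OpenAI-compatible streaming endpoints: the server-sent-event stream terminates with a literal `data: [DONE]` event. A minimal standalone sketch of that check, using canned data in place of a live server response:

```shell
#!/bin/sh
# Sketch of the Test 3 completion check: grep the captured SSE stream for
# the terminal "data: [DONE]" event. STREAM_OUTPUT here is canned example
# data, not output from a running SwiftLM server.
STREAM_OUTPUT='data: {"choices":[{"delta":{"content":"apple"}}]}

data: [DONE]'

if echo "$STREAM_OUTPUT" | grep -q "data: \[DONE\]"; then
  echo "stream completed"
else
  echo "stream truncated"
fi
```

Because the `curl` invocation above ends with `|| true`, a connection failure yields an empty `$STREAM_OUTPUT` rather than aborting the script, and this grep is what actually decides pass or fail.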
