Commit 6ad2d01

Merge pull request #28 from SharpAI/feature/speculative-decoding-ci

Feature/speculative decoding ci

2 parents 7f68fca + 087e8d9

3 files changed: 23 additions, 2 deletions

.agents/workflows/run-benchmark.md

19 additions, 0 deletions
@@ -52,6 +52,25 @@ The profiler will:
 - **Different contexts**: Change `--contexts` (comma-separated list of token counts)
 - **Output file**: Change `--out` path
+
+## Expert Top-K Tuning for MoE Models
+
+For Mixture-of-Experts (MoE) models (such as `Qwen3.5-122B-A10B-4bit`), you can override the number of dynamically routed experts per token with the `SWIFTLM_TOP_K` environment variable. By default, SwiftLM evaluates the maximum number of experts defined by the model architecture; reducing this trades a marginal quality loss for large memory and streaming-speed gains.
+
+Set the variable when running the profiler:
+
+```bash
+SWIFTLM_TOP_K=6 python3 -u scripts/profiling/profile_runner.py ...
+```
+
+### Reference Pipeline (M1 Ultra 64GB, Qwen3.5-122B-A10B-4bit)
+
+| Configuration | tok/s | vs. Original | Notes |
+|---|---|---|---|
+| Original `--stream-experts` | 0.58 | baseline | Sequential pread, 1 NVMe queue |
+| `SWIFTLM_TOP_K=8` | 4.95 | 8.5× | All 8 experts evaluated (full quality) |
+| `SWIFTLM_TOP_K=6` | 5.20 | 9.0× | Recommended default |
+| `SWIFTLM_TOP_K=4` | 5.91 | 10.2× | Best quality/speed tradeoff (Speed mode) |
+| `SWIFTLM_TOP_K=2` | 6.52 | 11.2× | Still coherent output (Turbo mode) |
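The env-var override described above can be sketched as follows. This is an illustrative sketch only, not SwiftLM's actual code: the function name `resolve_top_k` and the clamp-to-architecture-maximum behavior are assumptions about how such an override would typically be read.

```python
import os


def resolve_top_k(model_max_experts: int, env_var: str = "SWIFTLM_TOP_K") -> int:
    """Number of experts to route per token.

    Falls back to the model's architectural maximum when the override is
    unset or not an integer, and never exceeds that maximum.
    (Hypothetical helper for illustration, not part of SwiftLM.)
    """
    raw = os.environ.get(env_var)
    if raw is None:
        return model_max_experts
    try:
        k = int(raw)
    except ValueError:
        return model_max_experts
    return max(1, min(k, model_max_experts))


# Demonstrate with an 8-expert architecture, as in the table above.
os.environ.pop("SWIFTLM_TOP_K", None)
print(resolve_top_k(8))  # no override set -> 8
os.environ["SWIFTLM_TOP_K"] = "6"
print(resolve_top_k(8))  # override -> 6
```

Clamping keeps an out-of-range override (e.g. `SWIFTLM_TOP_K=99`) from requesting more experts than the model defines.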
 ## After the Benchmark
 
 4. Review the generated markdown file and check for any `FAILED / OOM` entries.

.github/workflows/ci.yml

2 additions, 0 deletions
@@ -154,6 +154,7 @@ jobs:
       - name: Run speculative decoding E2E
         env:
           HF_HUB_DOWNLOAD_TIMEOUT: "900"
+          SWIFTLM_TOP_K: "4"
         run: |
           chmod +x tests/test-speculative.sh
           for attempt in 1 2 3; do
@@ -219,6 +220,7 @@ jobs:
       - name: Run speculative evaluation E2E
         env:
           HF_HUB_DOWNLOAD_TIMEOUT: "900"
+          SWIFTLM_TOP_K: "4"
         run: |
           chmod +x tests/test-speculative-eval.sh
           for attempt in 1 2 3; do
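Both CI steps wrap the test script in the same `for attempt in 1 2 3` retry loop. That pattern can be sketched as a small helper; `retry3` is an illustrative name, not a function defined in the repository.

```shell
#!/bin/sh
# Sketch of the retry pattern used by the CI steps above: run a command
# up to three times and succeed as soon as one attempt passes.
retry3() {
  for attempt in 1 2 3; do
    if "$@"; then
      return 0
    fi
    echo "attempt ${attempt} failed" >&2
  done
  return 1
}

retry3 true && echo "command succeeded"
```

Retrying the whole script masks transient failures (e.g. slow Hugging Face Hub downloads, hence the 900-second `HF_HUB_DOWNLOAD_TIMEOUT`) while still failing the job if all three attempts fail.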

tests/test-speculative-eval.sh

2 additions, 2 deletions
@@ -23,7 +23,7 @@ PORT="${2:-15414}"
 HOST="127.0.0.1"
 MAIN_MODEL="${MAIN_MODEL:-mlx-community/Qwen3.5-9B-4bit}"
 DRAFT_MODEL="${DRAFT_MODEL:-mlx-community/Qwen3.5-0.8B-MLX-4bit}"
-NUM_DRAFT_TOKENS=2
+NUM_DRAFT_TOKENS=1
 URL="http://${HOST}:${PORT}"
 PASS=0
 FAIL=0
@@ -148,7 +148,7 @@ log "Test 3: Streaming speculative generation"
 
 STREAM_OUTPUT=$(curl -sf -N --max-time 120 -X POST "$URL/v1/chat/completions" \
   -H "Content-Type: application/json" \
-  -d "{\"model\":\"$MAIN_MODEL\",\"stream\":true,\"max_tokens\":30,\"messages\":[{\"role\":\"user\",\"content\":\"Name three fruits.\"}]}" \
+  -d "{\"model\":\"$MAIN_MODEL\",\"stream\":true,\"max_tokens\":10,\"messages\":[{\"role\":\"user\",\"content\":\"Name three fruits.\"}]}" \
   2>/dev/null || true)
 
 if echo "$STREAM_OUTPUT" | grep -q "data: \[DONE\]"; then
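The completion check in Test 3 relies on a convention of OpenAI-compatible streaming endpoints: the server-sent-event stream terminates with a literal `data: [DONE]` event. A minimal standalone sketch of that check, using canned data in place of a live server response:

```shell
#!/bin/sh
# Sketch of the Test 3 completion check: grep the captured SSE stream for
# the terminal "data: [DONE]" event. STREAM_OUTPUT here is canned example
# data, not output from a running SwiftLM server.
STREAM_OUTPUT='data: {"choices":[{"delta":{"content":"apple"}}]}

data: [DONE]'

if echo "$STREAM_OUTPUT" | grep -q "data: \[DONE\]"; then
  echo "stream completed"
else
  echo "stream truncated"
fi
```

Because the `curl` invocation above ends with `|| true`, a connection failure yields an empty `$STREAM_OUTPUT` rather than aborting the script, and this grep is what actually decides pass or fail.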
