Commit 80997fd
Make the output of MoE forward method have expected output in non cuda backends (#19170)
Dropped the unconditional `.float()` from the `temperature is None`
branch of `Qwen35MoE.forward` so that its output keeps the dtype the
model author expects.
# Qwen 3.5 MoE perf comparison between this PR and e2eb417
I ran a detailed performance comparison between this PR and the state
before the CUDA sampler was applied (commit e2eb417) to see whether we
can bring perf back.
TLDR: With this PR, perf is the same as or better than the previous
state on the tiny model across MLX and Metal, and on the full model with
MLX. The full model could not be measured on Metal because the export
crashed.
## Tiny Model
**Setup:** M3 Max 128 GB · macOS 26.4 · Xcode 26.4.1 · `--tiny-test`
model · MLX `--qlinear 4w --qlinear-group-size 32` · Metal `--qlinear
fpa4w` · all measurements use in-process warmup (MLX: warmup at prefill
+ decode shapes + force-eval; Metal: `--warmup_iters 2
--warmup_decode_steps 4 --ignore_eos`) · median of 3-6 trials.
### MLX (Python pybindings)
| Config | Metric | Before this PR | After this PR | Δ |
|---|---|---:|---:|---:|
| prompt-len=4, max-new=5 | Prefill tok/s | 1077 | **1195** | **+11%** |
| prompt-len=4, max-new=5 | Decode tok/s | 294 | **350** | **+19%** |
| prompt-len=32, max-new=31 | Prefill tok/s | 7060 | **10842** | **+54%** |
| prompt-len=32, max-new=31 | Decode tok/s | 314 | 267\* | −15% (within trial noise, see note) |
\* prompt=32 decode trial-by-trial: 281 / 267 / 247 (3 trials).
Trial-to-trial spread is ~14%, so the apparent regression is within
noise.
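
The arithmetic behind the note above can be reproduced with a short sketch. It uses only the numbers quoted in the table and the note; the helper names `pct_delta` and `relative_spread` are hypothetical, and defining the spread as (max − min) / min is an assumption chosen because it matches the quoted ~14%.

```python
from statistics import median


def pct_delta(before, after):
    """Percent change from `before` to `after`."""
    return (after - before) / before * 100.0


def relative_spread(trials):
    """Trial-to-trial spread as (max - min) / min, in percent.

    Assumption: this definition is picked because it reproduces the ~14%
    figure quoted in the note; the source does not state its formula.
    """
    return (max(trials) - min(trials)) / min(trials) * 100.0


# Values taken from the MLX tiny-model table and its footnote.
before_decode = 314              # decode tok/s before this PR
after_trials = [281, 267, 247]   # decode tok/s per trial after this PR

after_median = median(after_trials)
delta = pct_delta(before_decode, after_median)
spread = relative_spread(after_trials)

print(f"median after: {after_median} tok/s")
print(f"delta vs before: {delta:+.1f}%")
print(f"trial spread: {spread:.1f}%")
```

The same `pct_delta` helper applies to every other Δ cell in the tables, for example the full-model MLX decode row 36.4 → 44.7.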
### Metal (C++ runner)
| Config | Metric | Before this PR (median of 6) | After this PR (median of 6) | Δ |
|---|---|---:|---:|---:|
| prompt-len=32, max-new=31 | Prefill tok/s (mean ex-cold) | 5351 | **5988** | **+12%** |
| prompt-len=32, max-new=31 | Decode tok/s (mean ex-cold) | 217 | **286** | **+32%** |
| prompt-len=32, max-new=31 | Decode tok/s (median ex-cold) | 237 | **290** | **+22%** |
## Full Model
**Setup:** Qwen/Qwen3.5-35B-A3B (40 layers, 2048d, 256 experts top-8, 67
GB safetensors) · M3 Max 128 GB · macOS 26.4 · Xcode 26.4.1 · MLX
`--qlinear 4w --qlinear-group-size 64` · in-process warmup at
prefill+decode shapes + force-eval after prefill · median of 3 trials
per config.
### MLX (full Qwen 3.5 MoE 35B-A3B)
| Config | Metric | Before this PR | After this PR | Δ |
|---|---|---:|---:|---:|
| prompt=4, max-new=5 | Prefill tok/s | 133.7 | **163.6** | **+22%** |
| prompt=4, max-new=5 | Decode tok/s | 36.4 | **44.7** | **+23%** |
| prompt=32, max-new=32 | Prefill tok/s | 404.3 | **443.4** | **+10%** |
| prompt=32, max-new=32 | Decode tok/s | 37.2 | **43.4** | **+17%** |
| prompt=128, max-new=64 | Prefill tok/s | 650.3 | **711.5** | **+9%** |
| prompt=128, max-new=64 | Decode tok/s | 38.5 | **43.1** | **+12%** |
Trial-to-trial variance is small (≤1 tok/s on decode, ≤5% on prefill) so
all deltas are signal.
### Metal (full Qwen 3.5 MoE 35B-A3B)
**Not measured.** Metal export of the 35B model OOM-kills on the 128 GB
Mac during AOTI inductor compilation (`Killed: 9` exit 137). Confirmed
across 3 attempts: default settings, `TORCHINDUCTOR_COMPILE_THREADS=1`,
and `--max-seq-len 1024`. The transient peak during AOTI lowering
exceeds available RAM. Tiny-model Metal A/B (already collected, see
prior summary) shows the same pattern: prefill +12%, decode +22~+32%.
## Conclusion
**No regression on either backend; meaningful uplift on both.** MLX
shows the cleanest improvement on prefill (+11~+54%) and decode (+19% at
small prompt). Metal shows +12% prefill and +22~+32% decode at
prompt=32. The single MLX prompt-32 decode delta is within
trial-to-trial variance.

1 parent cf01617 commit 80997fd
3 files changed
Lines changed: 70 additions & 18 deletions