Commit 5863ffa
Update on "[ET Device Support] CUDA-native Qwen 3.5 MoE inference with device tensor pipeline"
Integrate the ET device tensor pipeline into the Qwen 3.5 MoE model to
eliminate unnecessary H2D/D2H copies during inference.
- Export: Multi-method export (`forward` + `sample`) with device memory
planning enabled and method-level H2D/D2H skipping.
- Runner: Custom CUDA-native inference loop that keeps logits on GPU
between forward and sample, reuses CUDA tensors across iterations,
and only copies the 8-byte token ID back to CPU for EOS checking.
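
The runner loop described above can be sketched roughly as follows. This is a plain-Python mock, not the actual runner: `DeviceBuffer`, `TransferCounter`, `toy_forward`, `greedy_sample`, and `generate` are hypothetical names, and the "device" is simulated so that the number of bytes copied back to the host can be counted. It only illustrates the control flow: buffers are allocated once and reused, logits never leave the device, and each step copies back a single 8-byte token ID for the EOS check.

```python
from dataclasses import dataclass


@dataclass
class DeviceBuffer:
    """Hypothetical stand-in for a tensor that stays in GPU memory."""
    data: list


class TransferCounter:
    """Counts simulated device-to-host traffic in bytes."""

    def __init__(self):
        self.d2h_bytes = 0

    def copy_token_to_host(self, buf):
        # Only the int64 token ID crosses the device boundary: 8 bytes.
        self.d2h_bytes += 8
        return int(buf.data[0])


VOCAB = 5  # toy vocabulary size, assumption for illustration


def toy_forward(token_buf, logits_buf):
    # Stand-in for the exported `forward` method: writes logits in place.
    nxt = (token_buf.data[0] + 1) % VOCAB
    logits_buf.data[:] = [1.0 if i == nxt else 0.0 for i in range(VOCAB)]


def greedy_sample(logits_buf, token_buf):
    # Stand-in for the exported `sample` method: argmax, written in place.
    logits = logits_buf.data
    token_buf.data[0] = max(range(len(logits)), key=logits.__getitem__)


def generate(forward, sample, prompt_token, eos_id, max_new_tokens, counter):
    # Buffers are created once and reused on every iteration.
    token_buf = DeviceBuffer([prompt_token])
    logits_buf = DeviceBuffer([])
    out = []
    for _ in range(max_new_tokens):
        forward(token_buf, logits_buf)  # logits stay on the "device"
        sample(logits_buf, token_buf)   # next token stays on the "device"
        token = counter.copy_token_to_host(token_buf)  # the only D2H copy
        out.append(token)
        if token == eos_id:
            break
    return out


counter = TransferCounter()
tokens = generate(toy_forward, greedy_sample, prompt_token=0, eos_id=3,
                  max_new_tokens=10, counter=counter)
```

In this mock the host-visible traffic grows by 8 bytes per generated token regardless of vocabulary size, which is the property the device tensor pipeline is meant to provide.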
Differential Revision: [D100133933](https://our.internmc.facebook.com/intern/diff/D100133933/)
[ghstack-poisoned]
1 parent 90b8a6e commit 5863ffa
2 files changed
Lines changed: 11 additions & 3 deletions