Is --spec-draft-ngl working for Qwen3.x MTP? #24167

kstoykov · 2026-06-05T08:21:22Z

I'm using Qwen MTP models. They need a bit more VRAM than the non-MTP versions, which is expected because of the draft model (weights + KV cache). I see there is a --spec-draft-ngl flag, which should specify how many of the draft model's layers are offloaded to the GPU. I tried 'auto', 'all', '0', and '999', but every option resulted in the same VRAM consumption and performance. Is this expected? I would have expected the value 0 to move the entire draft model to the CPU.
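For anyone who wants to reproduce the comparison, here is a minimal sketch of how the per-setting VRAM readings could be collected and compared. It assumes an NVIDIA GPU with `nvidia-smi` on the PATH. The helper names (`parse_used_mib`, `read_used_mib`, `spread_mib`) are made up for illustration and are not part of any project. How the server is launched with each --spec-draft-ngl value is left to you.

```python
import subprocess


def parse_used_mib(smi_output: str) -> list[int]:
    """Parse nvidia-smi CSV output (noheader, nounits): one MiB value per GPU line."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def read_used_mib() -> list[int]:
    """Query the currently used memory of every visible GPU, in MiB.

    Assumption: nvidia-smi is installed and on the PATH.
    """
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_used_mib(out)


def spread_mib(readings: dict[str, int]) -> int:
    """Difference between the largest and smallest reading across settings, in MiB."""
    return max(readings.values()) - min(readings.values())
```

Call `read_used_mib()` once the model has finished loading under each setting, store the values in a dict keyed by the flag value ('auto', 'all', '0', '999'), and pass it to `spread_mib`. A spread close to zero matches what I observed: the setting does not change VRAM use.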

Answered by Diablo-D3

Diablo-D3 · 2026-06-05T13:23:05Z

MTP has no layers to move to the CPU. It doesn't use a second model; it uses the same model with different, less complex heads. The extra heads only take up about half a gig.

