I'm using Qwen MTP models. They require a bit more VRAM than the non-MTP versions, which is expected because of the draft model (weights + KV cache). I see that there is a flag
MTP has no layers to move to the CPU. It doesn't use a second model; it uses the same model with additional, less complex heads. The extra heads only take up about half a gig.
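To put "about half a gig" in perspective, here is a rough back-of-the-envelope sketch that converts a memory budget into a parameter count at a few common weight precisions. It assumes "half a gig" means 0.5 GiB, and the bytes-per-parameter values are idealized (2 for bf16, 1 for 8-bit, 0.5 for 4-bit). Real quantized formats also store per-block scales, so actual counts would be somewhat lower. `params_for_budget` is a hypothetical helper, not part of any library.

```python
def params_for_budget(budget_bytes: int, bytes_per_param: float) -> int:
    """Roughly how many weights fit in a memory budget at a given precision.

    Idealized: ignores quantization scales, padding and runtime overhead.
    """
    return int(budget_bytes / bytes_per_param)


# Assumption: "about half a gig" read as 0.5 GiB.
HALF_GIB = 512 * 1024**2

# Idealized bytes per parameter for a few common weight formats.
PRECISIONS = [("bf16", 2), ("8-bit", 1), ("4-bit", 0.5)]

if __name__ == "__main__":
    for name, bytes_per_param in PRECISIONS:
        count = params_for_budget(HALF_GIB, bytes_per_param)
        print(f"{name}: ~{count / 1e6:.0f}M parameters")
```

Under these assumptions, the extra heads would correspond to roughly 268M parameters at bf16, or about 1B at 4-bit. That is small next to the full model, which is why there is little to gain from offloading them.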