Unsuccessful attempt at obtaining a working NVFP4 GGUF from a working NVFP4 model. #23627
-
|
Hi, I have a more theoretical question: is it possible that some models can't be converted to working NVFP4 GGUFs by the method of using Using the code from PR #21095, I converted Firworks/Devstral-Small-2-24B-Instruct-2512-nvfp4 "successfully". But it doesn't work for inference in the llama.cpp server: after it reads and processes the first prompt, it never outputs a response. I got exactly the same result a while ago with an NVFP4 version I produced myself (using the ModelOpt structure) from the original model mistralai/Devstral-Small-2-24B-Instruct-2512, so I think the problem is not with the code from the above-mentioned PR but with the original model itself. I don't know enough to figure out why. Possibly it is a special case because the original Devstral model is already in BF16/FP8 mixed precision? (First I blamed my incompetence at quantizing to |
Replies: 6 comments 15 replies
-
|
Fixed with #23629 👍 |
-
|
What are the benefits of using NVFP4 instead of one of Unsloth's dynamic Q4 quants? |
-
|
To be honest, I was hoping that the NVFP4 version of the above Devstral model would be faster on my Blackwell RTX 5060 Ti 16 GB than Unsloth's IQ4_NL. In practice it is not: it is even bigger in size and doesn't fit in VRAM at a useful context size (IQ4_NL doesn't either). The prefill is a little faster than with IQ4_NL, but after that the CPU gets involved and the output tanks to 0.x t/s (vs 2.x with IQ4_NL). Maybe it would have been faster if the model had been small enough to fit in VRAM, I don't know. This model is simply more than my GPU can chew. |
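A rough back-of-the-envelope check of why a 24B model is tight on a 16 GB card. This is only a sketch: the parameter count is taken from the "24B" in the model name, and both formats are assumed to cost a nominal 4.5 bits per weight (NVFP4: 4-bit values plus one 8-bit scale per 16 weights; IQ4_NL: 18 bytes per 32 weights). Real GGUF files usually keep some tensors at higher precision, so actual sizes will differ from this estimate.

```python
def weights_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

N_PARAMS = 24e9        # assumption: nominal count from the "24B" model name
NOMINAL_BPW = 4.5      # assumption: nominal bits per weight for NVFP4 and IQ4_NL

size = weights_gib(N_PARAMS, NOMINAL_BPW)
print(f"weights alone: about {size:.2f} GiB")
print(f"left on a 16 GiB card: about {16 - size:.2f} GiB for KV cache and buffers")
```

Under these assumptions the weights alone take roughly 12.6 GiB, which leaves only a few GiB for the KV cache and compute buffers before layers have to be offloaded to the CPU.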
-
|
Yes, MXFP6 and MXFP8 both are as well. I've tried both for the KV cache and it is not faster: it takes more time to generate the scales. FP8, on the other hand, is very fast.
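For context on what "generating the scales" involves, here is a minimal sketch of the per-block shared scale as described in the OCP Microscaling (MX) convention: every block of 32 values needs a max-reduction before any element can be encoded. This is an illustration only, not the code from llama.cpp or from any PR mentioned here; the function names and the handling of all-zero blocks are assumptions.

```python
import math

MX_BLOCK = 32  # number of elements sharing one scale in the MX formats

def shared_exponent(block, emax_elem):
    """Power-of-two exponent of the shared scale for one block.

    emax_elem is the largest exponent of the element format,
    e.g. 8 for FP8 E4M3 and 2 for FP6 E2M3 or FP4 E2M1.
    """
    amax = max(abs(x) for x in block)  # reduction over the whole block
    if amax == 0.0:
        return 0  # assumption: any exponent works for an all-zero block
    return math.floor(math.log2(amax)) - emax_elem

def block_scales(values, emax_elem):
    """One power-of-two scale per consecutive block of MX_BLOCK values."""
    return [
        2.0 ** shared_exponent(values[i:i + MX_BLOCK], emax_elem)
        for i in range(0, len(values), MX_BLOCK)
    ]

print(block_scales([1.0] * 32 + [16.0] * 32, emax_elem=2))
```

Plain FP8 without block scaling skips this reduction, which is consistent with it being faster for a cache that is rewritten on every generated token.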
On Fri, Jun 5, 2026, 12:04 PM jovan2009 wrote:
@michaelw9999 <https://github.com/michaelw9999>
I apologize if this is a silly question: are there any quants in llama.cpp
besides nvfp4 that are "accelerated" on Blackwell the same way nvfp4 is?
Maybe your mxfp6? I'm interested in quants that can be used for the KV cache
(for which nvfp4 is not available). I notice a dramatic slowdown with any
model as the "consumed" context window grows, and I wonder whether "more
accelerated" quants would help. Lately I have noticed that with kv_cache q5_0
and q4_0 I'm not getting a completely dumb model, and I'm thinking that
mxfp6 + nvfp4 would be really nice for kv_cache if that were
possible. In ComfyUI territory there is even an mxfp8 quant supported by
https://github.com/silveroxides/convert_to_quant; I'm not sure if or how
that would translate to llama.cpp.
|
-
|
No, it is actually compressing the words and the content that you type into
the AI, and the entire context history of the session.
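To put rough numbers on that: the KV cache holds one key and one value vector per layer for every token in the context, so its size grows linearly with context length. The sketch below estimates it for a few llama.cpp cache types, using their block sizes in bits per element (f16: 16, q8_0: 8.5, q5_0: 5.5, q4_0: 4.5). The model dimensions are illustrative placeholders, not values read from any model discussed here, and the estimate ignores sliding-window attention and other per-model details.

```python
# Bits stored per cached element, including per-block scales.
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q5_0": 5.5, "q4_0": 4.5}

def kv_cache_gib(n_ctx, cache_type, n_layers=40, n_kv_heads=8, head_dim=128):
    """Approximate KV cache size in GiB, same type for keys and values.

    n_layers, n_kv_heads and head_dim are hypothetical defaults.
    """
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = keys + values
    return elements * BITS_PER_ELEMENT[cache_type] / 8 / 2**30

for cache_type in BITS_PER_ELEMENT:
    print(cache_type, round(kv_cache_gib(32768, cache_type), 2), "GiB")
```

With these placeholder dimensions, a 32K context costs about 5 GiB at f16 and about 1.4 GiB at q4_0, which is the whole point of quantizing the cache.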
|
-
|
Google released a bunch of QAT q4_0 Gemma 4 models: https://huggingface.co/google/gemma-4-26B-A4B-it-qat-q4_0-unquantized I feel they are potentially good candidates for "pure" NVFP4. Unsloth's GGUF version https://huggingface.co/unsloth/gemma-4-26B-A4B-it-qat-GGUF runs pretty OK on my system (about 2x the speed of NVFP4 Devstral, but still slow: ~2.1 t/s inference with 91% of the 256K context window filled, kv_cache q8_0/q8_0). |
