Unsuccessful attempt at obtaining a working NVFP4 GGUF from a working NVFP4 model. #23627

jovan2009 · 2026-05-24T18:21:33Z

jovan2009
May 24, 2026

Hi, I have a more theoretical question: is it possible that some models can't be converted to working NVFP4 GGUFs by running convert_hf_to_gguf.py on safetensors that are already quantized to NVFP4?

Using the code from PR #21095, I converted Firworks/Devstral-Small-2-24B-Instruct-2512-nvfp4 "successfully". But it doesn't work for inference in llama.cpp server: after it reads/processes the first prompt, it never outputs anything as a response.

Because I got the exact same results a while ago with an NVFP4 version I made myself (using the ModelOpt structure) from the original model mistralai/Devstral-Small-2-24B-Instruct-2512, I think the problem is not with the code from the above-mentioned PR but with the original model itself. I don't know enough to figure out why. Possibly it is a special case because the original Devstral model is already in BF16/FP8 mixed precision?

(At first I blamed my incompetence at quantizing to NVFP4, but now that I get the same results using Firworks/Devstral-Small-2-24B-Instruct-2512-nvfp4, I have begun to think differently.)

Answered by michaelw9999


michaelw9999 · 2026-05-24T23:09:23Z

michaelw9999
May 24, 2026

Fixed with #23629 👍

0 replies

kstoykov · 2026-06-04T20:56:39Z

kstoykov
Jun 4, 2026

What are the benefits of using NVFP4 instead of one of Unsloth's dynamic Q4 quants?

7 replies

michaelw9999 Jun 5, 2026

There's a weird thing about the name "dynamic". None of the tensors actually change or use more or fewer bits. A Q4_K tensor is a Q4_K tensor. Its block and bit size are fixed. There is not a single model out there that is "all" Q4 or "all" NVFP4. Every single GGUF you will find is just a mixed blend. Some tensors are BF16, some are F32, some are Q4_K, and some will be Q6_K and Q8, too. If you look at the code in the original llama.cpp quantization tool, every "type" is actually called a MOSTLY type, named for the predominant type, and then certain tensors are upgraded or downgraded to other types depending on what they are.

    // for arches that share the same tensor between the token embeddings and the output, we quantize the token embeddings
    // with the quantization of the output tensor
    if (category == tensor_category::OUTPUT || (qs.has_tied_embeddings && category == tensor_category::TOKEN_EMBD)) {
        if (qs.params->output_tensor_type < GGML_TYPE_COUNT) {
            new_type = qs.params->output_tensor_type;
        } else {
            const int64_t nx = tensor->ne[0];
            const int64_t qk_k = ggml_blck_size(new_type);

            if (ftype == LLAMA_FTYPE_MOSTLY_MXFP4_MOE) {
                new_type = GGML_TYPE_Q8_0;
            }
            else if (arch == LLM_ARCH_FALCON || nx % qk_k != 0) {
                new_type = GGML_TYPE_Q8_0;
            }
            else if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_XXS || ftype == LLAMA_FTYPE_MOSTLY_IQ2_XS || ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS ||
                     ftype == LLAMA_FTYPE_MOSTLY_IQ1_S   || ftype == LLAMA_FTYPE_MOSTLY_IQ2_S  || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M   ||
                     ftype == LLAMA_FTYPE_MOSTLY_IQ1_M) {
                new_type = GGML_TYPE_Q5_K;
            }
            else if (new_type != GGML_TYPE_Q8_0) {
                new_type = GGML_TYPE_Q6_K;
            }
        }

So every tensor that is Q4_K is Q4_K in full; its bits are fixed and it never changes again. That's the same for every type across the board: once the model is quantized, it's not changing again. What "dynamic" really means is that when the model is made, some tensors are assigned a heavier or lighter type, with correspondingly better or worse quality, depending on what the quantizer or the person running it decides. It can be changed manually with command-line flags on the quantizer. In my quantizer tool, it's also dynamic in the sense that the choice is computed as the process goes on, using an importance score to see how much error each tensor has and whether it might need more bits to compensate, but then you lose speed and the model gets bigger. More NVFP4 = more speed. Less NVFP4 = less speed. If you made every tensor in the model NVFP4, it would not even run properly. MXFP4 has a lot more error, for example, so in an "MXFP4 quant" only the MoE tensors are MXFP4; everything else is generally Q8.

NVFP4 on its own is not even 4 bits - the scales add bytes, so NVFP4 is 4.5 bpw. That is exactly the same as Q4_K, which is also 4.5 bpw. Q4_K has less error, however, because its scaling design is completely different from NVFP4's, but there is no GPU that can decode Q4_K 'natively' the way Blackwell decodes NVFP4. So NVFP4 is faster than Q4_K on Blackwell. There is little reason to use it otherwise. But remember, this is on a tensor-by-tensor basis. So to compensate for NVFP4's relative amount of error, other tensors have to be upgraded.
The same is true for Q4_K as well. The UD-Q4 models are not all Q4. Many tensors are Q5, some might be Q3. Again, this is the case for every GGUF out there, and the mixture in each model is going to be different and all over the place. You might see NVFP4 versions of the same model at 20GB and at 15GB. The only reason is what proportion of the tensors is each type. The safetensors may be larger on those quantizations because some tensors are kept as BF16 for whatever reason (conservative recipe, etc). Converting to GGUF is not going to change that - BF16 is BF16. But if you wanted to change it, you can make BF16 tensors Q8 in many cases, which roughly halves their size. There may be tensors that are kept as FP8, but they could be Q6_K or Q5_K instead without much of a problem either. So the blend of everything is what is unique. But as far as the tensors themselves, or the quantization types, there isn't such a thing as "mixed" or "dynamic" types. Each is independently its own type with its own kernels running it.
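The bits-per-weight figures in this comment can be checked with a little arithmetic. The following is a minimal sketch, not code from llama.cpp: `BlockLayout`, `bits_per_weight` and `tensor_bytes` are hypothetical names, the block sizes are the layouts as I understand them from ggml, and the NVFP4 entry counts only the per-16-weight FP8 scale (any per-tensor scale is ignored).

```cpp
#include <cassert>
#include <cstddef>

// Storage layout of one quantization block: how many weights it holds and
// how many bytes it occupies (packed weights plus scales).
// Hypothetical helper type for illustration; not a ggml structure.
struct BlockLayout {
    const char *name;
    std::size_t weights_per_block;
    std::size_t bytes_per_block;
};

// Bits per weight = total bits in a block / weights in that block.
constexpr double bits_per_weight(const BlockLayout &b) {
    return 8.0 * static_cast<double>(b.bytes_per_block) /
           static_cast<double>(b.weights_per_block);
}

// Bytes needed to store n_weights weights; n_weights must fill whole blocks.
inline std::size_t tensor_bytes(std::size_t n_weights, const BlockLayout &b) {
    assert(n_weights % b.weights_per_block == 0);
    return n_weights / b.weights_per_block * b.bytes_per_block;
}

// NVFP4: 16 four-bit weights (8 bytes) + one 8-bit FP8 scale = 9 bytes.
constexpr BlockLayout NVFP4{"NVFP4", 16, 9};
// Q4_K: 256 weights = 128 bytes of nibbles + 12 bytes of packed
// sub-block scales/mins + two fp16 super-block values (4 bytes).
constexpr BlockLayout Q4_K{"Q4_K", 256, 144};
// Q3_K: 256 weights = 64 + 32 bytes of quants + 12 bytes of scales + one fp16.
constexpr BlockLayout Q3_K{"Q3_K", 256, 110};
// Q8_0: 32 one-byte weights + one fp16 scale.
constexpr BlockLayout Q8_0{"Q8_0", 32, 34};
// BF16: no blocks, two bytes per weight.
constexpr BlockLayout BF16{"BF16", 1, 2};
```

Under these layouts NVFP4 and Q4_K both come out at 4.5 bpw, Q3_K at 3.4375 bpw, and Q8_0 at 8.5 bpw against 16 for BF16, which is where "roughly half" for BF16 to Q8 comes from.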

On Hugging Face, you can click on any quantization type and it will show you the type of every tensor. For example, this is Unsloth's Qwen3.6-35B-A3B-UD-Q3_K_XL:

[Screenshot of the per-tensor type listing not included]

To make @jovan2009's Devstral work in 16GB VRAM, it needs to be made smaller: the model should probably be no more than 14.5-15GB to leave room for the cache. So if you want to keep calling it NVFP4, some tensors need to go down in size. That means Q3_K. It's an excellent type too, 3.4375 bpw, and comparable enough to NVFP4 in quality, but about 3/4 the size. So if you take 1GB of NVFP4 tensors and make them Q3_K, they will be about 750MB. That gives up some speed while maintaining quality, so you'll want to keep as much NVFP4 as will fit to keep it fast.
Hope that makes sense!
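As a rough illustration of that trade-off, here is a minimal sketch that estimates the size of a two-type NVFP4/Q3_K blend and the largest NVFP4 share that fits a byte budget. The function names are hypothetical, the only inputs taken from the thread are the 4.5 and 3.4375 bpw figures, and it deliberately ignores tensors kept at higher precision (embeddings, norms, output) and the KV cache, so treat the result as a ballpark rather than a recipe.

```cpp
#include <cassert>

// Bits per weight quoted in the thread.
constexpr double kNvfp4Bpw = 4.5;
constexpr double kQ3kBpw   = 3.4375;

// Size in bytes of n_weights weights when a fraction `nvfp4_fraction`
// of them is stored as NVFP4 and the rest as Q3_K.
inline double blend_bytes(double n_weights, double nvfp4_fraction) {
    assert(nvfp4_fraction >= 0.0 && nvfp4_fraction <= 1.0);
    const double bpw = nvfp4_fraction * kNvfp4Bpw +
                       (1.0 - nvfp4_fraction) * kQ3kBpw;
    return n_weights * bpw / 8.0;
}

// Largest NVFP4 fraction whose blend still fits in budget_bytes.
// Returns 1.0 if everything fits as NVFP4, and -1.0 if even an
// all-Q3_K model is too large for the budget.
inline double max_nvfp4_fraction(double n_weights, double budget_bytes) {
    const double all_q3k   = blend_bytes(n_weights, 0.0);
    const double all_nvfp4 = blend_bytes(n_weights, 1.0);
    if (budget_bytes < all_q3k) {
        return -1.0;
    }
    if (budget_bytes >= all_nvfp4) {
        return 1.0;
    }
    // Size is linear in the fraction, so interpolate between the extremes.
    return (budget_bytes - all_q3k) / (all_nvfp4 - all_q3k);
}
```

The ratio 3.4375 / 4.5 is about 0.76, which matches the "1GB becomes about 750MB" rule of thumb above.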

kstoykov Jun 5, 2026

Yes, it totally makes sense.

By saying "dynamic" I meant exactly what you said, maybe I just didn't say it well. Thank you very much for the detailed information!

I thought that NVFP4 gives higher quality than Q4. Obviously I was wrong. Now I understand why an NVFP4 model is larger than the corresponding Q4.

I have never done any quantization myself and I have some follow-up questions:

Are all Qx actually ints? I mean, can I assume that Q4 is int4, Q5 is int5, etc?
Each piece of hardware supports different data types that it can operate on natively. I assume that types like Q5 are dequantized to some type supported by the given hardware. Is the Q5 dequantization target fixed in llama.cpp? That is, for example, is Q5 always dequantized to fp16, or is the type that it is dequantized to chosen based on the hardware?

I'm trying to understand whether llama.cpp actually uses FP4 acceleration on Blackwell if I'm running a Q4 model, or whether, in order to use any acceleration, it must specifically be an NVFP4 GGUF.

kstoykov Jun 5, 2026

I think I found an answer to my last question. As you said, I can just check the tensor type. If it says "nvfp4" then it will be accelerated on Blackwell; otherwise, no.

What about Qx? Are they accelerated by any architecture? For example, is Q8 accelerated by INT8 (Intel GPU), or is Q4 dequantized to INT8 (for Intel GPU acceleration) or FP4 (for Blackwell acceleration)?

michaelw9999 Jun 5, 2026

> I think I found an answer to my last question. As you said, I can just check the tensor type. If it says "nvfp4" then it will be accelerated on Blackwell; otherwise, no.

I don't believe it properly shows NVFP4 as the tensor type in the Hugging Face tensor list, though. It might be blank or show U8 or something else, but at least trust that if the model is listed as NVFP4 at all, it will have some portion that is natively accelerated on Blackwell.

> What about Qx? Are they accelerated by any architecture? For example, is Q8 accelerated by INT8 (Intel GPU), or is Q4 dequantized to INT8 (for Intel GPU acceleration) or FP4 (for Blackwell acceleration)?

They are all accelerated by all GPUs when running the right version. The only thing about NVFP4 that is hardware accelerated differently is one tiny detail: the hardware does the multiplication by the NVFP4 weight scales during its MMA step. Those are FP8 scales on NVFP4, and every 16 weights have their own scale, so there are a lot of them.
Any GPU can do this multiplication just fine; it is what GPUs already do billions of times without a problem. The difference is that when you run NVFP4 on Blackwell, the code sends the GPU both the scale and the weight in one step, and gets the result back with that multiplication already done. If it's not on Blackwell, you have to write a separate line of code to do the multiplication yourself.

Q4 is not 'native FP4' even on Blackwell. The GPU has no idea what Q4 is. So in that case, it's just doing normal math like any other GPU does. It's still hardware accelerated, but it's hardware-accelerated multiplication. When it's NVFP4, the command is "do multiplication, and this is NVFP4 and here's the scale". That is much faster.
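To make the "separate line of code" concrete, here is a minimal CPU-side sketch of the manual path for one 16-weight NVFP4 block: decode the FP8 scale, decode each 4-bit weight, and multiply. It is not the llama.cpp kernel. The struct `Nvfp4Block`, the function names, and the nibble packing order are assumptions for illustration, the scale is treated as a standard E4M3 value (bias 7, NaN encodings not handled), and any per-tensor scale is left out.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>

// The 8 magnitudes an FP4 (E2M1) code can take; bit 3 of the nibble is the sign.
constexpr float kE2M1[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

inline float decode_e2m1(std::uint8_t nibble) {
    assert(nibble < 16);
    const float mag = kE2M1[nibble & 0x7];
    return (nibble & 0x8) ? -mag : mag;
}

// Decode an FP8 E4M3 value: 1 sign bit, 4 exponent bits, 3 mantissa bits,
// exponent bias 7. NaN encodings are not handled in this sketch.
inline float decode_e4m3(std::uint8_t bits) {
    const int sign = bits >> 7;
    const int exp  = (bits >> 3) & 0xF;
    const int man  = bits & 0x7;
    const float frac = static_cast<float>(man) / 8.0f;
    const float mag = (exp == 0)
        ? std::ldexp(frac, -6)             // subnormal: no implicit leading 1
        : std::ldexp(1.0f + frac, exp - 7);
    return sign ? -mag : mag;
}

// One block: 16 weights packed two per byte, plus one FP8 scale.
// Hypothetical layout: low nibble = even index, high nibble = odd index.
struct Nvfp4Block {
    std::uint8_t scale;
    std::array<std::uint8_t, 8> qs;
};

// The "manual" path: decode the scale once, then multiply every weight by it.
inline std::array<float, 16> dequantize(const Nvfp4Block &b) {
    const float s = decode_e4m3(b.scale);
    std::array<float, 16> out{};
    for (int i = 0; i < 8; ++i) {
        const std::uint8_t lo = static_cast<std::uint8_t>(b.qs[i] & 0x0F);
        const std::uint8_t hi = static_cast<std::uint8_t>(b.qs[i] >> 4);
        out[2 * i]     = s * decode_e2m1(lo);
        out[2 * i + 1] = s * decode_e2m1(hi);
    }
    return out;
}
```

On Blackwell, the `s * ...` multiplications in `dequantize` are the part that the hardware folds into its MMA step, as described above; on other GPUs the kernel has to perform them explicitly.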

jovan2009 Jun 5, 2026
Author

@michaelw9999
I apologize if this is a silly question: are there any quants in llama.cpp besides nvfp4 that are "accelerated" on Blackwell the same way nvfp4 is? Maybe your mxfp6? I'm interested in quants that can be used for the kv cache (for which nvfp4 is not available). I notice a dramatic slowdown with any model as more of the context window is "consumed", and I wonder whether "more accelerated" quants would help. Lately I have noticed that with kv_cache q5_0 and q4_0 I'm not getting a completely dumb model, and I'm thinking that mxfp6 + nvfp4 would be really nice for kv_cache if that were possible. In ComfyUI territory there is even an mxfp8 quant supported by https://github.com/silveroxides/convert_to_quant, but I'm not sure if or how that would translate to llama.cpp.

jovan2009 · 2026-06-04T22:41:26Z

jovan2009
Jun 4, 2026
Author

To be honest I was hoping that the NVFP4 version of the above Devstral model would be faster on my Blackwell RTX 5060 Ti 16 GB compared with Unsloth's IQ4_NL. In practice it is not, it is even bigger in size and doesn't fit in VRAM at a useful context size (iq4_nl doesn't either). The prefill is a little bit faster than iq4_nl but after that the CPU gets involved and the output tanks to 0.x t/s (vs 2.x with iq4_nl). Maybe it would have been faster if the model would have been small enough to fit in VRAM, I don't know. This model is simply more than what my GPU can bite.

7 replies

jovan2009 Jun 5, 2026
Author

> To be honest I was hoping that the NVFP4 version of the above Devstral model would be faster on my Blackwell RTX 5060 Ti 16 GB compared with Unsloth's IQ4_NL. In practice it is not, it is even bigger in size and doesn't fit in VRAM at a useful context size (iq4_nl doesn't either). The prefill is a little bit faster than iq4_nl but after that the CPU gets involved and the output tanks to 0.x t/s (vs 2.x with iq4_nl). Maybe it would have been faster if the model would have been small enough to fit in VRAM, I don't know. This model is simply more than what my GPU can bite.

I have to retract that: the NVFP4 version is faster than iq4_nl (marginally). I checked again with the same settings. Using a llama.cpp built with GGML_CUDA_FA_ALL_QUANTS, --cache-type-k q5_0, --cache-type-v q4_0, a 128k --ctx-size, and a very long first prompt (multiple python, json and txt files), I get a much faster prefill rate (~500-600 t/s vs 200-300) and somewhat faster inference (1.1 t/s vs 1.0). I can't comment on accuracy.

jovan2009 Jun 5, 2026
Author

> @jovan2009 I'll take a look at it. One of the challenges in converting the GGUFs from other SFT forms is that in many cases it's just leaving tensors as BF16 when they could be perfectly fine as Q8, or even less. The goal is to try to get the model as small as possible and balance out quality/speed, I've just finished another big update to the quantizer to do just that. I will try your model next to keep testing. Do you have an imatrix you already would use for it or a dataset?

@michaelw9999
BTW, I am very excited about your quantizer and very frustrated at the same time because I can't build it on Windows. Any news on that front? I was daydreaming about a useful feature that I would like: fitting a high-precision model into a given GPU's VRAM, meaning finding the optimal quantizations for different layers so that the final GGUF along with the context fits in the user's GPU VRAM (even if it means very drastic low-bit quantizations for some layers).

michaelw9999 Jun 5, 2026

@jovan2009 I'll take a look at it. One of the challenges in converting the GGUFs from other SFT forms is that in many cases it just leaves tensors as BF16 when they could be perfectly fine as Q8, or even less. The goal is to get the model as small as possible and balance out quality/speed. I've just finished another big update to the quantizer to do just that. I will try your model next to keep testing. Do you have an imatrix you would already use for it, or a dataset?

> @michaelw9999 BTW, I am very excited about your quantizer and very frustrated at the same time because I can't build it in Windows, any news on that front? I was daydreaming about a useful feature that I would like:

Sorry, I've just been pegging the GPU nonstop finishing it up and getting the last models quantized, and I would have to use a laptop to try to build on Windows. Why not try with WSL? It will work just fine in there, and I imagine llama-server would be faster on Linux than on Windows too, but don't quote me on that.

> fit a high precision model into a given GPU VRAM, meaning finding the optimal quantizations for different layers so the final GGUF along with the context fits in the user's GPU VRAM (even if it means very drastic low bit quantizations for some layers)

I've been quantizing and working on Qwen3.6-27B-NVFP4-Small to fit onto 16GB VRAM and try to keep the NVFP4 speed up as much as possible :) Will post soon.

jovan2009 Jun 5, 2026
Author

> Sorry I've just been pegging the GPU nonstop at finishing it up and getting the last models quantized up, and would have to use a laptop to try to build on Windows. Why not try with wsl? It will work just fine in there and I imagine llama-server would be faster on Linux too, than Windows, but don't quote me on that.

There are multiple reasons (meaning excuses, LoL). With learning how to compile from source and vibecoding little Python scripts, I'm already stretching my abilities way beyond my lane. I'm not a developer but a graphic designer, long tied to Windows graphics apps. In the AI era my main interest is ComfyUI, but llama.cpp is my "new love". I have noticed the lack of Windows support in various AI-related Python packages, and from what I have heard I would get much more performance under Linux. I considered Linux/WSL, but first I need a thorough cleanup/backup of my disks and a reformat of my Windows-ingrained brain (which is probably more difficult). For now I simply have no space for another OS; I am running on "fumes". But somehow I am always more curious/excited to try a new model than to do housekeeping, LoL.

michaelw9999 Jun 5, 2026

> I've been quantizing and working on Qwen3.6-27B-NVFP4-Small to fit onto 16GB VRAM and try to keep the NVFP4 speed up as much as possible :) Will post soon.

Give Qwen3.6-27B-NVFP4-SMALL-MTP-GGUF a try and see if that one works for you. This is a predominantly Q3_K/NVFP4 blend. It might be very tight on the context you can fit before it runs out of VRAM. It is not quite as fast as the larger version, but at least you might be able to run it without any spilling.

michaelw9999 · 2026-06-05T19:10:13Z

michaelw9999
Jun 5, 2026

Yes... MXFP6 and MXFP8 both are as well. I've tried both for the KV cache and it is not faster. It takes more time to generate the scales. FP8, on the other hand, is very fast.


1 reply

jovan2009 Jun 5, 2026
Author

I see. I suppose generating the optimal scales isn't something that can be done in advance for a particular model, in the manner of imatrix.gguf.

michaelw9999 · 2026-06-05T19:45:05Z

michaelw9999
Jun 5, 2026

No, the KV cache quantization is actually compressing the words and the content that you type into the AI, and the entire context history of the session, so its scales can only be generated while the session runs.
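Because the cache holds the session's own context, its size grows with every token, and the cache type sets how many bits each cached value costs. The following is a rough sketch of that arithmetic, not llama.cpp's allocator: `ModelDims` and `kv_cache_bytes` are hypothetical names, the dimensions in the usage note are made up rather than taken from any model in this thread, and it assumes every layer caches full-length keys and values (no sliding window or other savings).

```cpp
#include <cassert>
#include <cstdint>

// Bits per cached value for some llama.cpp cache types
// (block bytes * 8 / values per block).
constexpr double kF16Bpw  = 16.0;
constexpr double kQ8_0Bpw = 8.5;  // 34 bytes per 32 values
constexpr double kQ5_0Bpw = 5.5;  // 22 bytes per 32 values
constexpr double kQ4_0Bpw = 4.5;  // 18 bytes per 32 values

// Hypothetical attention dimensions of a model.
struct ModelDims {
    int n_layers;
    int n_kv_heads;
    int head_dim;
};

// Approximate KV cache size in bytes for n_ctx cached tokens, with
// separately chosen bit widths for the K and V caches.
inline double kv_cache_bytes(const ModelDims &m, std::int64_t n_ctx,
                             double k_bpw, double v_bpw) {
    assert(n_ctx >= 0);
    // Values stored per token for K (and the same number again for V).
    const double values_per_token = static_cast<double>(m.n_layers) *
                                    m.n_kv_heads * m.head_dim;
    return values_per_token * static_cast<double>(n_ctx) *
           (k_bpw + v_bpw) / 8.0;
}
```

For a made-up model with 40 layers, 8 KV heads and a head dimension of 128, a full 131072-token context would need 20 GiB at f16/f16 but about 6.25 GiB at q5_0/q4_0.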

0 replies

jovan2009 · 2026-06-06T03:01:31Z

jovan2009
Jun 6, 2026
Author

Google released a bunch of QAT q4_0 Gemma 4 models: https://huggingface.co/google/gemma-4-26B-A4B-it-qat-q4_0-unquantized. I feel they are potentially good candidates for "pure" NVFP4. Unsloth's GGUF version, https://huggingface.co/unsloth/gemma-4-26B-A4B-it-qat-GGUF, runs pretty OK on my system (~2x the speed of NVFP4 Devstral but still slow: ~2.1 t/s inference with 91% of the 256K context window filled, kv_cache q8_0/q8_0).

0 replies
