Add NVFP4 (4-bit floating point) quantization support for Blackwell GPUs#486
Conversation
…own3D, DupUp3D, and Wan2_2_VAE wrapper class
- Add VAEArchitectureConfig for encoder/decoder configuration
- Add VAEEncodingConfig for encoding parameters
- Add VAEModelConfig for complete model configuration
- Implement VAEConfigManager with full CRUD operations
- Support JSON serialization/deserialization
- Include predefined configs for Wan2.1 and Wan2.2
- Add config cloning, updating, saving, and loading
- Support batch import/export operations
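The commit message above describes the shape of the config layer; a minimal sketch of what such a manager could look like follows. The class layout and field names here are illustrative assumptions, not the actual code from this PR.

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class VAEModelConfig:
    # Illustrative fields only; the real VAEModelConfig may differ.
    name: str
    latent_channels: int = 16
    temporal_downsample: int = 4
    extra: dict = field(default_factory=dict)

class VAEConfigManager:
    """Minimal CRUD + JSON (de)serialization sketch."""

    def __init__(self):
        self._configs = {}

    def add(self, cfg: VAEModelConfig):
        self._configs[cfg.name] = cfg

    def get(self, name: str) -> VAEModelConfig:
        return self._configs[name]

    def to_json(self) -> str:
        # Serialize every registered config to one JSON document.
        return json.dumps({n: asdict(c) for n, c in self._configs.items()})

    @classmethod
    def from_json(cls, text: str) -> "VAEConfigManager":
        # Rebuild the manager (and its configs) from the JSON document.
        mgr = cls()
        for data in json.loads(text).values():
            mgr.add(VAEModelConfig(**data))
        return mgr

mgr = VAEConfigManager()
mgr.add(VAEModelConfig(name="Wan2.1"))
restored = VAEConfigManager.from_json(mgr.to_json())
print(restored.get("Wan2.1").latent_channels)  # 16
```

Saving/loading to disk and batch import/export would just wrap `to_json`/`from_json` around file I/O.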
Co-authored-by: naxci1 <206254294+naxci1@users.noreply.github.com>
…it__
Add NVFP4 (4-bit floating point) quantization support for Blackwell GPUs
NVFP4 model download links:

Model: 3b_nvfp4
Model: 3bQ8
…ackwell GPUs
…fication, remove private APIs
Add NVFP4 async offloading and pinned memory for Blackwell GPU optimization
run_nvidia_gpu_fast_fp16_accumulation - firefox.zip

You need to create a startup file like this for it to become active.
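For reference, a ComfyUI portable startup file of this kind is typically a one-line .bat placed next to the stock run_nvidia_gpu.bat. The contents below are an assumption modeled on that stock launcher plus ComfyUI's --fast fp16_accumulation flag, not the exact file from the zip above:

```shell
REM run_nvidia_gpu_fast_fp16_accumulation.bat (assumed contents)
REM Same as the stock run_nvidia_gpu.bat, with the fp16-accumulation fast path enabled.
.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build --fast fp16_accumulation
pause
```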
….h when unavailable
Integrate NVIDIA GPU Optimizations: Async Offloading, Pinned Memory, torch.compile for Windows/Blackwell
…-gpu-memory Revert "Integrate NVIDIA GPU Optimizations: Async Offloading, Pinned Memory, torch.compile for Windows/Blackwell"
…ell support
…ization-again Integrate SpargeAttn/Sage2 block-sparse attention with Blackwell GPU optimizations
Add performance_mode dropdown for Blackwell-specific sparge_sage2 tuning
I recommend building SageAttention 2 and 3 from source, since pip install points to the older SageAttention 1:

```
pip install packaging setuptools ninja
# install sageattention 3
cd ~
# install sageattention 2++ for fallback
cd ~
```
While version 2.5.23 used 15 GB of VRAM, this version uses 14 GB. The processing speed was 1.8 fps in version 2.5.23, but it's 1.91 fps in this version. Although the NVFP4 model doesn't give me exactly the results I want, it's worthwhile to add the other improvements to the main code.
@naxci1 The VAE decoding step is the slowest, so can we replace its fp16 model with NVFP4 as well?
I tried many different methods, but unfortunately it didn't work with the NVFP4 VAE model. SeedVR2's DNA is designed very differently: it doesn't accept other VAE models; it absolutely has to use this particular VAE, which is unfortunately very slow. The main slowdown occurs when decoding the VAE. I hope SeedVR3 comes out soon, with new code and a new VAE model. The code of this old model is very complex.
@naxci1 are you 100% sure you implemented nvfp4 correctly? I'd like to try your fork on an RTX 5090 today, but I'm put off by the fact that you aren't getting any speedup...
https://huggingface.co/Nexus24/vaeGGUF/tree/main?show_file_info=vae_nvfp4_blackwell.safetensors Here you can see the details of the model yourself; theoretically, everything is complete. Claude just couldn't fully optimize it, and it took me several days. The problem is that the SeedVR2 code is very complex: there are unnecessary repetitions, its DNA is very different, and the DiT and VAE are tightly interconnected, unlike in other models. In FlashVSR I can integrate what I want much more easily, but SeedVR2 really needs to be rewritten from scratch. That's why experts in this field need to get involved, and since they don't have the time, it remains unfinished. After all, I'm not a programmer; I'm just doing vibe coding.
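The per-tensor details shown on that HuggingFace file page come straight from the file's JSON header: a .safetensors file begins with an 8-byte little-endian length followed by a JSON table of dtypes, shapes, and data offsets, so a model's layout can be checked without loading the weights. A self-contained sketch (it builds a dummy in-memory file rather than downloading the real one, and the tensor name is made up):

```python
import io
import json
import struct

def read_safetensors_header(buf):
    """Parse the JSON header of a .safetensors stream.

    Layout: u64 little-endian header length, then that many bytes of JSON
    mapping tensor names to {"dtype", "shape", "data_offsets"}.
    """
    (hlen,) = struct.unpack("<Q", buf.read(8))
    return json.loads(buf.read(hlen))

# Build a tiny in-memory file so the example runs without downloading anything.
header = {"blocks.0.weight": {"dtype": "U8", "shape": [16, 8], "data_offsets": [0, 128]}}
hbytes = json.dumps(header).encode()
blob = io.BytesIO(struct.pack("<Q", len(hbytes)) + hbytes + b"\x00" * 128)

info = read_safetensors_header(blob)
print(info["blocks.0.weight"]["dtype"], info["blocks.0.weight"]["shape"])
# U8 [16, 8]
```

Against a real checkpoint, the same function can be pointed at an open file handle; NVFP4 weights would typically show up as packed integer storage plus separate scale tensors.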
@naxci1, I'm testing out the latest commit with the Blackwell optimizations right now (the one without NVFP4, since that’s on another branch). |
Which one do you mean? Write the link. |
@zelenooki87 Thanks, I had left this unfinished and forgot about it.
I just tested it with the 3bQ8 model and the speed was the same for me. I have a 5070 Ti GPU and you have a 5090. What were the FPS differences between the old and new versions for you?
@naxci1 , how is the nvfp4 integration going? I saw you created a new fork/repo with the -new suffix. |
Hi @zelenooki87 Actually, I've integrated and tested many new methods and technologies, but the tests take a very long time, and it leaves me frustrated, so I leave them unfinished. Unfortunately, AI isn't very smart when it comes to coding. Sometimes it claims to have done something, but when you look, it hasn't actually done it, which is why the tests take so long. There are many branches in my repository, all of which I've left unfinished. I abandon them when I don't get the results I want in the tests. I still haven't achieved the speed I want, unfortunately. SeedVR2 actually needs to be rewritten; it's very old and uses outdated methods, especially optimized for the H100 GPU. Therefore, changing some methods creates other problems, leading to wasted time. @IceClear actually wrote that he would be making SeedVR3; it would have been better if he had rewritten it. Complete code and methods would have been changed, and if it had been built for RT Tensors and especially if VAE had been chosen, integrating new features would have been much easier. Honestly, I don't have much time these days either. It's easier to fix code in FlashVSR; I can do everything in an hour, but integrating new features in SeedVR2 takes days. The code DNA is so complex that even Claude can't handle it. |
I know I shouldn't be talking as a 9070XT user here, but the "old" seedvr2_7b_nvfp4.safetensors model is working with ROCm. Though none of the new ones (like seedvr2_nvfp4_blackwell.safetensors) do. Seems to execute faster too than the current 7b fp16 model. |
Actually, I could convert those models again; it only takes a minute to convert, but it didn't give me the speed I wanted, so I deleted them. The highest quality and fastest model is the 3bQ8; I recommend using that one. |
Which model is that? Is it named differently in your HuggingFace repo? Also, if you could reconvert and upload them somewhere, I'd love to test them again. Thanks!
This is SeedVR2's own model; it downloads as standard, and its name is listed in the model section: 3B Q8 model. |

- src/optimization/nvfp4.py with E2M1 weight format support and E4M3 scaling factors
- src/core/model_loader.py to detect and load NVFP4 .safetensors weights
- src/optimization/compatibility.py
- optimization/__init__.py and use lazy imports in dit_model_loader.py
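The E2M1/E4M3 scheme mentioned for src/optimization/nvfp4.py can be sketched in a few lines: weights are grouped into 16-element blocks, each block gets a scale chosen so its maximum magnitude maps onto the largest FP4 E2M1 value (6.0), and each element is rounded to the nearest representable E2M1 value. The function below is a simplified illustration, not the module's real API; a real NVFP4 kernel would also store the scale itself in E4M3 and pack two 4-bit codes per byte.

```python
# Representable magnitudes of FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # NVFP4 scales weights in blocks of 16 elements

def quantize_block(values):
    """Quantize-dequantize one block of floats.

    Picks a scale so max|x| maps to 6.0, then rounds each element to the
    nearest signed E2M1 value and scales back, returning the dequantized
    values plus the block scale.
    """
    assert len(values) <= BLOCK
    amax = max(abs(v) for v in values)
    scale = amax / 6.0 if amax > 0 else 1.0
    out = []
    for v in values:
        mag = min(E2M1_VALUES, key=lambda m: abs(abs(v) / scale - m))
        out.append(mag * scale * (1 if v >= 0 else -1))
    return out, scale

deq, scale = quantize_block([0.1, -0.4, 0.8, 6.0] + [0.0] * 12)
print(round(scale, 3), [round(x, 2) for x in deq[:4]])
# 1.0 [0.0, -0.5, 1.0, 6.0]
```

The example shows both the appeal and the cost of the format: large values survive almost exactly, while small values near zero get snapped to the nearest coarse level, which is why block-wise scaling (rather than one scale per tensor) matters for quality.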