feat(dflash): add NVFP4 per-tensor scale2 support #146
Open
phazei wants to merge 1 commit into
Conversation
Add support for NVFP4-quantized GGUF models (e.g. LibertAI Qwen3.6-27B-NVFP4) by loading per-tensor weight scales and applying them in the target graph.

Scale values are read as host-side floats from the GGUF mmap at load time and applied via ggml_scale(), a scalar multiply baked in at graph build time with zero extra kernel launches. This avoids ggml_mul() with [1]-shaped GPU tensors, which adds 768 kernel launches per forward pass and roughly 30x overhead in batched DDTree verify mode; switching to ggml_scale() brings a step from 1001ms down to 43ms on an RTX 5090.

Supports both naming conventions:

- LibertAI: blk.N.ffn_gate.scale
- Heretic: blk.N.ffn_gate.weight.scale

Non-NVFP4 models (Q4_K_M etc.) are unaffected: scale fields default to 1.0f and apply_scale2() returns early with zero overhead.

Also removes the DFLASH27B_USE_BLACKWELL_CONSUMER_FIX CMake workaround, which incorrectly assumed consumer Blackwell GPUs (RTX 5090) lack FP4 MMA instructions. The RTX 5090 fully supports sm_120a and native FP4 tensor cores.

Note: full native FP4 MMA performance requires upstream PR ggml-org#22196 to be merged into the Luce-Org llama.cpp submodule fork. Without it, NVFP4 models still work correctly via the generic dequant-to-Q8_1 fallback path.
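For reviewers who want the mechanics at a glance, here is a minimal sketch of the two pieces described above. The helper names (read_scale_or_one, the apply_scale2 signature), the mmap pointer, and the assumption that the scales are stored as single-element F32 tensors in the GGUF data section are illustrative only; the actual code in this PR may differ.

```cpp
// Sketch only: hypothetical helper names, not necessarily what this PR ships.
// gguf_* declarations live in gguf.h in current ggml trees (older trees put them in ggml.h).
#include "ggml.h"
#include "gguf.h"

#include <cstdint>
#include <cstring>
#include <string>

// Load time: read a per-tensor scale straight out of the mmap'd file data,
// trying both naming conventions. A missing scale means a non-NVFP4 model,
// so fall back to the identity value 1.0f.
static float read_scale_or_one(const gguf_context * gg, const uint8_t * mapped, const std::string & base) {
    for (const char * suffix : { ".scale", ".weight.scale" }) {   // LibertAI / Heretic naming
        const auto i = gguf_find_tensor(gg, (base + suffix).c_str());
        if (i >= 0) {
            const size_t off = gguf_get_data_offset(gg) + gguf_get_tensor_offset(gg, i);
            float s;
            std::memcpy(&s, mapped + off, sizeof(float));         // single F32 element
            return s;
        }
    }
    return 1.0f;
}

// Graph build time: bake the scalar into a ggml_scale() node instead of
// broadcasting a [1]-shaped GPU tensor through ggml_mul().
static ggml_tensor * apply_scale2(ggml_context * ctx, ggml_tensor * cur, float s) {
    if (s == 1.0f) {
        return cur;   // identity scale: no node added, zero overhead
    }
    return ggml_scale(ctx, cur, s);
}
```

Because the scalar ends up in the ggml_scale() op parameters rather than in a separate GPU tensor, nothing extra has to be uploaded, and identity scales add no nodes to the target graph at all.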
Although I implemented and tested this, I didn't get the results I was hoping for: in my benchmarks, NVFP4-Q8-GGUF was the same speed as regular Q4. Perhaps that was due to the Q8 side, and an NVFP4-Q4-GGUF would have been faster. For this to work, the llama.cpp submodule needs updates as well, just the removal of some guards stopping sm_120 from using the sm_120a code.