
feat(dflash): add NVFP4 per-tensor scale2 support #146

Open

phazei wants to merge 1 commit into Luce-Org:main from phazei:feat/nvfp4-scale2-support

feat(dflash): add NVFP4 per-tensor scale2 support#146
phazei wants to merge 1 commit into
Luce-Org:mainfrom
phazei:feat/nvfp4-scale2-support

Conversation


phazei (Contributor) commented May 10, 2026

Although I implemented and tested this, I didn't get the results I was hoping for: in benchmarks, NVFP4-Q8-GGUF ran at the same speed as regular Q4. That may be down to the Q8 side; NVFP4-Q4-GGUF might have been faster. For this to work, the llama.cpp submodule also needs updating, specifically removing some guards that stop sm_120 from using sm_120a code.
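
To make the submodule note concrete: the guards in question are CUDA architecture checks that only admit data-center Blackwell into the native FP4 MMA path. The snippet below is a hypothetical illustration of that pattern, not the actual llama.cpp source; the real guard names and conditions live in the submodule.

```cpp
// Hypothetical illustration only -- not the real llama.cpp guard.
// Before: the FP4 MMA path is compiled only for sm_100a (__CUDA_ARCH__ == 1000),
// so consumer Blackwell (sm_120 / sm_120a, __CUDA_ARCH__ == 1200) falls back
// to the generic dequant path.
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ == 1000
    // ... native FP4 tensor-core kernels ...
#endif

// After: also admit sm_120a, which exposes the same FP4 MMA instructions.
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ == 1000 || __CUDA_ARCH__ == 1200)
    // ... native FP4 tensor-core kernels ...
#endif
```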


Add support for NVFP4-quantized GGUF models (e.g. LibertAI Qwen3.6-27B-NVFP4) by loading per-tensor weight scales and applying them in the target graph.

Scale values are read as host-side floats from the GGUF mmap at load time and applied via ggml_scale() — a compile-time scalar multiply with zero extra kernel launches. This avoids ggml_mul() with [1]-shaped GPU tensors, which adds 768 kernel launches per forward pass and causes ~30x overhead in batched DDTree verify mode (1001ms -> 43ms per step on RTX 5090).
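
A minimal sketch of the load-and-apply flow. The struct layout, the read_scale2() helper name, and the exact apply_scale2() signature are assumptions for illustration; only ggml_scale() and the default-to-1.0f behavior come from the change itself.

```cpp
#include <cstring>
#include "ggml.h"

// Load time: pull the single F32 value out of the (mmap-backed) scale tensor,
// defaulting to 1.0f when no per-tensor scale is present.
static float read_scale2(const struct ggml_tensor * t) {
    if (t == nullptr || t->type != GGML_TYPE_F32 || t->data == nullptr) {
        return 1.0f;   // no scale tensor -> identity
    }
    float s = 1.0f;
    std::memcpy(&s, t->data, sizeof(float));
    return s;
}

// Graph build time: fold the scale in as a by-value scalar multiply.
// ggml_scale() carries the scalar inside the op, so no [1]-shaped GPU tensor
// and no broadcasting ggml_mul() is needed.
static struct ggml_tensor * apply_scale2(struct ggml_context * ctx,
                                         struct ggml_tensor * cur,
                                         float scale2) {
    if (scale2 == 1.0f) {
        return cur;    // non-NVFP4 models: early return, zero overhead
    }
    return ggml_scale(ctx, cur, scale2);
}
```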

Supports both naming conventions:

  • LibertAI: blk.N.ffn_gate.scale
  • Heretic: blk.N.ffn_gate.weight.scale

Non-NVFP4 models (Q4_K_M etc) are unaffected — scale fields default to 1.0f and apply_scale2() returns early with zero overhead.
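
A sketch of how both naming conventions can be probed. ggml_get_tensor() is the stock ggml lookup by name; the helper name and the assumption that the scale tensors sit in the same ggml context as the weights are illustrative, and the actual dflash loader may resolve names differently.

```cpp
#include <cstdio>
#include "ggml.h"

// Probe both naming conventions for layer `il`, e.g. base = "ffn_gate".
static struct ggml_tensor * find_scale2_tensor(struct ggml_context * ctx,
                                               int il, const char * base) {
    char name[128];

    // LibertAI convention: blk.N.ffn_gate.scale
    std::snprintf(name, sizeof(name), "blk.%d.%s.scale", il, base);
    struct ggml_tensor * t = ggml_get_tensor(ctx, name);

    // Heretic convention: blk.N.ffn_gate.weight.scale
    if (t == nullptr) {
        std::snprintf(name, sizeof(name), "blk.%d.%s.weight.scale", il, base);
        t = ggml_get_tensor(ctx, name);
    }
    return t;   // nullptr for non-NVFP4 models -> scale stays at 1.0f
}
```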

Also removes the DFLASH27B_USE_BLACKWELL_CONSUMER_FIX CMake workaround, which incorrectly assumed consumer Blackwell GPUs (RTX 5090) lack FP4 MMA instructions. The RTX 5090 fully supports sm_120a and native FP4 tensor cores.

Note: full native FP4 MMA performance requires upstream PR ggml-org#22196 to be merged into the Luce-Org llama.cpp submodule fork. Without it, NVFP4 models still work correctly via the generic dequant-to-Q8_1 fallback path.


cubic-dev-ai (bot) left a comment


No issues found across 4 files
