
feat(dflash): add NVFP4 per-tensor scale2 support #146

Open

phazei wants to merge 1 commit into Luce-Org:main from phazei:feat/nvfp4-scale2-support

feat(dflash): add NVFP4 per-tensor scale2 support#146
phazei wants to merge 1 commit into
Luce-Org:mainfrom
phazei:feat/nvfp4-scale2-support

Conversation


phazei (Contributor) commented May 10, 2026

Although I implemented and tested this, I didn't get the results I was hoping for: in benchmarks, NVFP4-Q8-GGUF ran at the same speed as regular Q4. That may be down to the Q8 side; NVFP4-Q4-GGUF might have been faster. For this to work, the llama.cpp submodule also needs updating, specifically removing some guards that stop sm_120 from using sm_120a code.
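
To make the submodule note concrete: the guards in question are CUDA architecture checks that only admit data-center Blackwell into the native FP4 MMA path. The snippet below is a hypothetical illustration of that pattern, not the actual llama.cpp source; the real guard names and conditions live in the submodule.

```cpp
// Hypothetical illustration only -- not the real llama.cpp guard.
// Before: the FP4 MMA path is compiled only for sm_100a (__CUDA_ARCH__ == 1000),
// so consumer Blackwell (sm_120 / sm_120a, __CUDA_ARCH__ == 1200) falls back
// to the generic dequant path.
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ == 1000
    // ... native FP4 tensor-core kernels ...
#endif

// After: also admit sm_120a, which exposes the same FP4 MMA instructions.
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ == 1000 || __CUDA_ARCH__ == 1200)
    // ... native FP4 tensor-core kernels ...
#endif
```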


Add support for NVFP4-quantized GGUF models (e.g. LibertAI Qwen3.6-27B-NVFP4) by loading per-tensor weight scales and applying them in the target graph.

Scale values are read as host-side floats from the GGUF mmap at load time and applied via ggml_scale() — a compile-time scalar multiply with zero extra kernel launches. This avoids ggml_mul() with [1]-shaped GPU tensors, which adds 768 kernel launches per forward pass and causes ~30x overhead in batched DDTree verify mode (1001ms -> 43ms per step on RTX 5090).
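
A minimal sketch of the load-and-apply flow. The struct layout, the read_scale2() helper name, and the exact apply_scale2() signature are assumptions for illustration; only ggml_scale() and the default-to-1.0f behavior come from the change itself.

```cpp
#include <cstring>
#include "ggml.h"

// Load time: pull the single F32 value out of the (mmap-backed) scale tensor,
// defaulting to 1.0f when no per-tensor scale is present.
static float read_scale2(const struct ggml_tensor * t) {
    if (t == nullptr || t->type != GGML_TYPE_F32 || t->data == nullptr) {
        return 1.0f;   // no scale tensor -> identity
    }
    float s = 1.0f;
    std::memcpy(&s, t->data, sizeof(float));
    return s;
}

// Graph build time: fold the scale in as a by-value scalar multiply.
// ggml_scale() carries the scalar inside the op, so no [1]-shaped GPU tensor
// and no broadcasting ggml_mul() is needed.
static struct ggml_tensor * apply_scale2(struct ggml_context * ctx,
                                         struct ggml_tensor * cur,
                                         float scale2) {
    if (scale2 == 1.0f) {
        return cur;    // non-NVFP4 models: early return, zero overhead
    }
    return ggml_scale(ctx, cur, scale2);
}
```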

Supports both naming conventions:

  • LibertAI: blk.N.ffn_gate.scale
  • Heretic: blk.N.ffn_gate.weight.scale

Non-NVFP4 models (Q4_K_M etc) are unaffected — scale fields default to 1.0f and apply_scale2() returns early with zero overhead.
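
A sketch of how both naming conventions can be probed. ggml_get_tensor() is the stock ggml lookup by name; the helper name and the assumption that the scale tensors sit in the same ggml context as the weights are illustrative, and the actual dflash loader may resolve names differently.

```cpp
#include <cstdio>
#include "ggml.h"

// Probe both naming conventions for layer `il`, e.g. base = "ffn_gate".
static struct ggml_tensor * find_scale2_tensor(struct ggml_context * ctx,
                                               int il, const char * base) {
    char name[128];

    // LibertAI convention: blk.N.ffn_gate.scale
    std::snprintf(name, sizeof(name), "blk.%d.%s.scale", il, base);
    struct ggml_tensor * t = ggml_get_tensor(ctx, name);

    // Heretic convention: blk.N.ffn_gate.weight.scale
    if (t == nullptr) {
        std::snprintf(name, sizeof(name), "blk.%d.%s.weight.scale", il, base);
        t = ggml_get_tensor(ctx, name);
    }
    return t;   // nullptr for non-NVFP4 models -> scale stays at 1.0f
}
```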

Also removes the DFLASH27B_USE_BLACKWELL_CONSUMER_FIX CMake workaround, which incorrectly assumed consumer Blackwell GPUs (RTX 5090) lack FP4 MMA instructions. The RTX 5090 fully supports sm_120a and native FP4 tensor cores.

Note: full native FP4 MMA performance requires upstream PR ggml-org#22196 to be merged into the Luce-Org llama.cpp submodule fork. Without it, NVFP4 models still work correctly via the generic dequant-to-Q8_1 fallback path.


cubic-dev-ai (bot) left a comment


No issues found across 4 files
