Add NVFP4 per-token quantization recipe #3045
Draft
cael-ling wants to merge 2 commits into
Each row gets its own outer amax instead of sharing a per-tensor scalar; the FP4 + 1x16 e4m3 inner SF layout stays bit-identical to the per-tensor production cast, so the output is consumable by cuBLASLt NVFP4 with no CUTLASS fork.

Ships the single-tensor cast (TMA + mbarrier + 64x64 sub-tile pipeline with K1 amax / K2 encode / composite entries), a grouped multi-tensor variant for MoE / expert-batched paths, plus matching C-API, Pybind, and Python wrapper. The cuBLASLt GEMM + per-row post-scale wrapper and tests / benches land in a follow-up commit on this branch.

Signed-off-by: Cael Ling <caell@nvidia.com>
for more information, see https://pre-commit.ci
Description
This PR adds an NVFP4 per-token quantization fast path for bf16 inputs on Blackwell (SM100+), targeting model pre-training. Per-token quantization uses a per-row / per-column outer amax instead of the per-tensor scalar amax. Because each outer scale is constant along K, it factors cleanly out of the GEMM K-summation, so the inner GEMM stays on production cuBLASLt NVFP4 with a thin trailing post-scale; no CUTLASS fork is required.
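The factoring argument above can be checked numerically. The following is a minimal NumPy sketch, not the kernel in this PR: it performs no FP4 or e4m3 rounding, the helper names `outer_scale_per_row` and `gemm_with_post_scale` are hypothetical, and the outer scale `amax / (6 * 448)` is an assumption based on the usual NVFP4 convention (FP4 E2M1 max of 6, FP8 E4M3 max of 448). It only shows that per-row outer scales on both operands can be applied after the K-summation.

```python
import numpy as np

FP4_MAX = 6.0     # largest magnitude representable in FP4 E2M1
E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3


def outer_scale_per_row(x):
    """One outer decode scale per row (token), derived from that row's amax.

    Assumed convention: scale = amax / (FP4_MAX * E4M3_MAX).
    """
    amax = np.abs(x).max(axis=1, keepdims=True)
    amax = np.where(amax == 0.0, 1.0, amax)  # keep all-zero rows finite
    return amax / (FP4_MAX * E4M3_MAX)


def gemm_with_post_scale(a, b):
    """Compute a @ b.T via a GEMM on outer-normalized operands plus a post-scale.

    a has shape (M, K) and b has shape (N, K). The normalized operands stand in
    for "FP4 data times inner block scales"; no rounding is applied here.
    """
    s_a = outer_scale_per_row(a)  # (M, 1), one scale per output row
    s_b = outer_scale_per_row(b)  # (N, 1), one scale per output column
    a_n = a / s_a
    b_n = b / s_b
    inner = a_n @ b_n.T           # stand-in for the unmodified inner GEMM
    # Each scale is constant along K, so it multiplies the finished sum.
    return inner * s_a * s_b.T


rng = np.random.default_rng(0)
a = rng.standard_normal((4, 32))
b = rng.standard_normal((3, 32))
assert np.allclose(gemm_with_post_scale(a, b), a @ b.T)
```

In a real pipeline the normalized operands would be quantized, so the result would match the reference only up to quantization error; the post-scale step itself introduces no additional approximation.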
Status: draft. The cast kernel and GEMM composite, their byte-equal verification against a Python reference, and benchmarking against the per-tensor (RHT + SR) recipe are still in progress. Partial experimental results are shown below.
Cast-only (fwd/bwd: amax+cast) performance comparisons in Eager mode:
Cast-only (fwd/bwd: amax+cast) performance comparisons in Graph mode:
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: