Add NVFP4 per-token quantization recipe by cael-ling · Pull Request #3045 · NVIDIA/TransformerEngine

cael-ling · 2026-05-26T15:18:26Z

Description

This PR adds an NVFP4 per-token quantization fast path for bf16 inputs on Blackwell (SM100+), aimed at model pre-training. The per-token recipe uses a per-row / per-column outer amax instead of the per-tensor scalar amax. These outer scales factor cleanly out of the GEMM K-summation, so the inner GEMM stays on production cuBLASLt NVFP4 with a thin trailing post-scale; no CUTLASS fork is required.
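To illustrate why the outer scales factor out of the K-summation, here is a minimal pure-Python sketch. It is not the PR's kernel or API: the helper names `matmul` and `post_scaled_gemm` and the toy operands are hypothetical, and the "quantized" operands are plain floats. It only checks the algebraic identity `sum_k (r[i] * A[i][k]) * (c[j] * B[k][j]) == r[i] * c[j] * sum_k A[i][k] * B[k][j]`.

```python
import math

def matmul(a, b):
    """Plain triple-loop matrix product for small lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def post_scaled_gemm(a_q, b_q, row_scale, col_scale):
    """Run the inner GEMM on unscaled operands, then apply a trailing post-scale."""
    inner = matmul(a_q, b_q)
    return [[row_scale[i] * col_scale[j] * inner[i][j]
             for j in range(len(col_scale))] for i in range(len(row_scale))]

# Toy operands (hypothetical values): M=2, K=3, N=2.
a_q = [[1.0, -2.0, 0.5], [3.0, 0.0, -1.5]]
b_q = [[2.0, 1.0], [-1.0, 4.0], [0.5, -3.0]]
row_scale = [0.25, 8.0]   # one outer scale per row of A
col_scale = [2.0, 0.5]    # one outer scale per column of B

# Reference path: apply the scales to the operands first, then multiply.
a = [[row_scale[i] * v for v in row] for i, row in enumerate(a_q)]
b = [[col_scale[j] * v for j, v in enumerate(row)] for row in b_q]
reference = matmul(a, b)

# Factored path: multiply first, scale the output afterwards.
factored = post_scaled_gemm(a_q, b_q, row_scale, col_scale)
```

Because neither scale depends on the summation index `k`, the two paths agree. A single per-tensor scalar factors out the same way; the per-token variant just replaces the scalar with one factor per output row and column.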

Status: draft. Work on the cast kernel and GEMM composite, including byte-equal verification against a Python reference and benchmarking against the per-tensor (RHT + SR) recipe, is still in progress. Partial experimental results follow.

Cast-only (fwd/bwd: amax+cast) performance comparisons in Eager mode:

M	K	N	per-token (us)	per-tensor (us)	per-token/per-tensor
128	4096	4096	61.4	106.8	0.575
512	4096	4096	58.7	101.4	0.578
1024	4096	4096	57.1	105.8	0.540
4096	4096	4096	62.3	101.9	0.612
8192	4096	4096	61.6	104.9	0.587
16384	4096	4096	86.0	118.2	0.728
32768	4096	4096	146.5	198.0	0.740

Cast-only (fwd/bwd: amax+cast) performance comparisons in Graph mode:

M	K	N	per-token (us)	per-tensor (us)	per-token/per-tensor
128	4096	4096	16.7	28.9	0.579
512	4096	4096	17.8	28.0	0.637
1024	4096	4096	19.5	37.0	0.527
4096	4096	4096	28.9	39.2	0.737
8192	4096	4096	46.5	67.3	0.691
16384	4096	4096	80.3	120.2	0.668
32768	4096	4096	138.7	199.8	0.694

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Change A
Change B

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Each row gets its own outer amax instead of sharing a per-tensor scalar; the FP4 + 1x16 e4m3 inner SF layout stays bit-identical to the per-tensor production cast, so the output is consumable by cuBLASLt NVFP4 with no CUTLASS fork.

Ships the single-tensor cast (TMA + mbarrier + 64x64 sub-tile pipeline with K1 amax / K2 encode / composite entries), a grouped multi-tensor variant for MoE / expert-batched paths, plus matching C-API, Pybind, and Python wrapper. The cuBLASLt GEMM + per-row post-scale wrapper and tests / benches land in a follow-up commit on this branch.

Signed-off-by: Cael Ling <caell@nvidia.com>
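The two-level scaling described in the commit message (one outer amax per row, plus one inner scale factor per 1x16 block) can be sketched in pure Python as follows. This is a simplified illustration under stated assumptions, not the PR's cast kernel: the function names are hypothetical, the outer-scale formula `row_amax / (448 * 6)` is assumed by analogy with per-tensor two-level scaling, inner scales are kept as Python floats rather than rounded to e4m3, and rounding ties go to the smaller magnitude rather than following hardware rounding.

```python
import math

E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # magnitudes representable in FP4 E2M1
FP4_MAX = 6.0      # largest E2M1 magnitude
E4M3_MAX = 448.0   # largest finite e4m3 magnitude
BLOCK = 16         # 1x16 inner scale-factor block

def round_to_e2m1(x):
    """Round to the nearest E2M1 magnitude, keeping the sign (simplified tie handling)."""
    mag = min(E2M1_VALUES, key=lambda v: abs(abs(x) - v))
    return -mag if x < 0 else mag

def quantize_row(row):
    """Quantize one row: outer scale from the row amax, inner scale per 16 elements."""
    row_amax = max(abs(v) for v in row)
    # Assumed outer scale: the largest inner scale then lands at E4M3_MAX.
    outer = row_amax / (E4M3_MAX * FP4_MAX) if row_amax > 0 else 1.0
    codes, inner_scales = [], []
    for start in range(0, len(row), BLOCK):
        block = row[start:start + BLOCK]
        block_amax = max(abs(v) for v in block)
        # Stored as e4m3 in the real layout; kept as a float in this sketch.
        inner = block_amax / (FP4_MAX * outer) if block_amax > 0 else 1.0
        inner_scales.append(inner)
        codes.extend(round_to_e2m1(v / (inner * outer)) for v in block)
    return codes, inner_scales, outer

def dequantize_row(codes, inner_scales, outer):
    """Reconstruct approximate values from FP4 codes and both scale levels."""
    return [c * inner_scales[i // BLOCK] * outer for i, c in enumerate(codes)]

row = [float(i - 8) for i in range(32)]   # two blocks: -8..7 and 8..23
codes, inner_scales, outer = quantize_row(row)
restored = dequantize_row(codes, inner_scales, outer)
```

Only the outer scale differs between rows in this sketch; the FP4 codes and the per-block inner scales keep the same shape as in a per-tensor cast, which is the property the commit message relies on for cuBLASLt compatibility.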

github-actions bot added the community-contribution label (PRs from external contributors outside the core maintainers, representing community-driven work) on May 26, 2026

[pre-commit.ci] auto fixes from pre-commit.com hooks

6f17fe4

for more information, see https://pre-commit.ci

cael-ling wants to merge 2 commits into NVIDIA:main from cael-ling:feature/nvfp4-per-token-recipe
