Add NVFP4 per-token quantization recipe #3045

Draft
cael-ling wants to merge 2 commits into NVIDIA:main from cael-ling:feature/nvfp4-per-token-recipe

Contributor

@cael-ling cael-ling commented May 26, 2026

Description

This PR adds an NVFP4 per-token quantization fast path for bf16 inputs on Blackwell (SM100+), targeting model pre-training. Per-token quantization replaces the per-tensor scalar amax with a per-row / per-column outer amax. Because each outer scale is constant along K, it factors cleanly out of the GEMM K-summation, so the inner GEMM stays on production cuBLASLt NVFP4 followed by a thin trailing post-scale; no CUTLASS fork is required.
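The factoring argument above can be checked numerically. The following is a minimal NumPy sketch, not code from this PR: it assumes both operands are stored with K as the last dimension and each carries one outer scale per row, and it uses plain row normalization as a stand-in for the real FP4 cast (no FP4 rounding and no 1x16 inner scale factors). All names (`s_a`, `s_w`, `a_hat`, `w_hat`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, N = 8, 64, 16

# Rows with deliberately different magnitudes, as in per-token activations.
a = rng.standard_normal((M, K)) * rng.uniform(0.1, 10.0, size=(M, 1))
w = rng.standard_normal((N, K)) * rng.uniform(0.1, 10.0, size=(N, 1))

# Per-row outer scales: one per token of `a`, one per output channel of `w`.
s_a = np.abs(a).max(axis=1, keepdims=True)  # shape (M, 1)
s_w = np.abs(w).max(axis=1, keepdims=True)  # shape (N, 1)

# Stand-ins for the quantized operands: each row divided by its own scale.
a_hat = a / s_a
w_hat = w / s_w

# Inner GEMM on the normalized operands, then a rank-1 post-scale.
# c[i, j] = s_a[i] * s_w[j] * sum_k a_hat[i, k] * w_hat[j, k]
inner = a_hat @ w_hat.T
c_post = inner * (s_a * s_w.T)  # broadcast (M, 1) * (1, N) -> (M, N)

# Reference GEMM on the unscaled operands.
c_ref = a @ w.T
print(np.allclose(c_post, c_ref))
```

The scales can be pulled out of the sum only because they do not depend on the reduction index k. A scale that varied along K would have to be applied inside the K-loop and could not be deferred to a post-scale of the output.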

Status: draft. The following are still in progress: the cast kernel and GEMM composite, byte-equal verification against a Python reference, and benchmarking against the per-tensor (RHT + SR) recipe. Partial experimental results follow.

Cast-only (fwd/bwd: amax+cast) performance comparisons in Eager mode:

| M | K | N | per-token (µs) | per-tensor (µs) | per-token / per-tensor |
|---|---|---|---|---|---|
| 128 | 4096 | 4096 | 61.4 | 106.8 | 0.575 |
| 512 | 4096 | 4096 | 58.7 | 101.4 | 0.578 |
| 1024 | 4096 | 4096 | 57.1 | 105.8 | 0.540 |
| 4096 | 4096 | 4096 | 62.3 | 101.9 | 0.612 |
| 8192 | 4096 | 4096 | 61.6 | 104.9 | 0.587 |
| 16384 | 4096 | 4096 | 86.0 | 118.2 | 0.728 |
| 32768 | 4096 | 4096 | 146.5 | 198.0 | 0.740 |

Cast-only (fwd/bwd: amax+cast) performance comparisons in Graph mode:

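| M | K | N | per-token (µs) | per-tensor (µs) | per-token / per-tensor |
|---|---|---|---|---|---|
| 128 | 4096 | 4096 | 16.7 | 28.9 | 0.579 |
| 512 | 4096 | 4096 | 17.8 | 28.0 | 0.637 |
| 1024 | 4096 | 4096 | 19.5 | 37.0 | 0.527 |
| 4096 | 4096 | 4096 | 28.9 | 39.2 | 0.737 |
| 8192 | 4096 | 4096 | 46.5 | 67.3 | 0.691 |
| 16384 | 4096 | 4096 | 80.3 | 120.2 | 0.668 |
| 32768 | 4096 | 4096 | 138.7 | 199.8 | 0.694 |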

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Each row gets its own outer amax instead of sharing a per-tensor
scalar; the FP4 + 1x16 e4m3 inner SF layout stays bit-identical to
the per-tensor production cast, so the output is consumable by
cuBLASLt NVFP4 with no CUTLASS fork. Ships the single-tensor cast
(TMA + mbarrier + 64x64 sub-tile pipeline with K1 amax / K2 encode /
composite entries), a grouped multi-tensor variant for MoE /
expert-batched paths, and the matching C API, Pybind, and Python
wrappers. The cuBLASLt GEMM + per-row post-scale wrapper and the
tests / benchmarks land in a follow-up commit on this branch.

Signed-off-by: Cael Ling <caell@nvidia.com>
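
The two-level scaling described in the commit message (a per-row outer amax plus a 1x16 inner scale factor over FP4 values) can be illustrated with a small NumPy reference. This is a hedged sketch, not the PR's Python reference: the scale formulas (outer scale = row amax / (6 × 448), inner scale relative to the outer scale) are an assumption modelled on the usual NVFP4 convention, the inner scale factor is left unrounded rather than cast to e4m3, and the function names are hypothetical.

```python
import numpy as np

# Magnitudes representable by FP4 E2M1 (sign is handled separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_MAX = 6.0     # largest E2M1 magnitude
E4M3_MAX = 448.0  # largest finite e4m3 value
BLOCK = 16        # 1x16 inner scale-factor block


def round_to_e2m1(x):
    """Round each element to the nearest E2M1 magnitude, keeping the sign."""
    idx = np.abs(np.abs(x)[..., None] - E2M1_GRID).argmin(axis=-1)
    return np.sign(x) * E2M1_GRID[idx]


def quantize_per_token(x):
    """Hypothetical reference: per-row outer scale + 1x16 inner scale factors.

    Returns (codes, inner_sf, outer_scale). The inner scale factor is left
    unrounded here; a real cast would store it as e4m3.
    """
    m, k = x.shape
    assert k % BLOCK == 0
    blocks = x.reshape(m, k // BLOCK, BLOCK)
    # Outer scale: one per row, chosen so inner scale factors stay <= 448.
    row_amax = np.abs(x).max(axis=1)
    outer = np.where(row_amax > 0, row_amax / (FP4_MAX * E4M3_MAX), 1.0)
    # Inner scale factor: one per 16-element block, relative to the row scale.
    block_amax = np.abs(blocks).max(axis=2)
    inner = block_amax / (FP4_MAX * outer[:, None])
    # Guard all-zero blocks against division by zero.
    denom = inner * outer[:, None]
    denom = np.where(denom > 0, denom, 1.0)
    codes = round_to_e2m1(blocks / denom[..., None])
    return codes.reshape(m, k), inner, outer


def dequantize(codes, inner, outer):
    """Invert the cast: value = code * inner_sf * outer_scale."""
    m, k = codes.shape
    blocks = codes.reshape(m, k // BLOCK, BLOCK)
    return (blocks * inner[..., None] * outer[:, None, None]).reshape(m, k)


# Rows spanning seven orders of magnitude, to mimic outlier tokens.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)) * np.array([1e-3, 1.0, 1e2, 1e4])[:, None]
codes, inner, outer = quantize_per_token(x)
x_hat = dequantize(codes, inner, outer)
rel_err = np.abs(x_hat - x).max(axis=1) / np.abs(x).max(axis=1)
print(rel_err)
```

Because every row has its own outer scale, the small-magnitude rows use the full FP4 range just like the large ones. With a single per-tensor scalar sized for the 1e4 row, the inner scale factors of the 1e-3 row would be far smaller and could lose precision or underflow once stored as e4m3.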
github-actions bot added the community-contribution label on May 26, 2026