18 commits
928ab1c
Add NVFP4 per-token GEMM, fused grouped amax, cast, tests and benches
cael-ling May 26, 2026
c378056
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 27, 2026
f10e505
Add optional col-wise RHT to NVFP4 per-token amax+quant (single + gro…
cael-ling May 28, 2026
1f43683
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 28, 2026
2d2899a
Fuse rowwise SF swizzle into NVFP4 per-token K2 + bench scaffolding
cael-ling May 29, 2026
15a24ab
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 29, 2026
2b77211
Add CUTLASS NVFP4 GEMM with per-token rescale fused into EVT epilogue
cael-ling May 30, 2026
9d7f381
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 30, 2026
844def6
Add NVFP4 per-token E2E fwd/bwd benchmarks and accuracy tests
cael-ling May 31, 2026
0742286
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 31, 2026
558ed14
Add per-token NVFP4 backward path (dgrad/wgrad) and opt-in RHT
cael-ling Jun 2, 2026
f33c0a4
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 2, 2026
8b8c607
Add stochastic rounding to the per-token NVFP4 path
cael-ling Jun 2, 2026
7d6b782
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 2, 2026
4501480
Add stochastic rounding to grouped NVFP4 per-token quantize
cael-ling Jun 4, 2026
294bab9
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 4, 2026
eadd276
Add opt-in 2D weight quantization for the NVFP4 per-token recipe
cael-ling Jun 8, 2026
bb7bac6
Add docs and Megatron-Core example for the NVFP4 per-token recipe
cael-ling Jun 11, 2026
2 changes: 2 additions & 0 deletions docs/api/common.rst
@@ -14,6 +14,8 @@ Common API

.. autoapiclass:: transformer_engine.common.recipe.NVFP4BlockScaling(fp4_format=Format.E2M1)

.. autoapiclass:: transformer_engine.common.recipe.NVFP4PerTokenBlockScaling(fp4_format=Format.E2M1)

.. autoapiclass:: transformer_engine.common.recipe.Float8CurrentScaling(fp8_format=Format.HYBRID)

.. autoapiclass:: transformer_engine.common.recipe.Float8BlockScaling(fp8_format=Format.E4M3)
42 changes: 42 additions & 0 deletions docs/envvars.rst
@@ -287,6 +287,48 @@ Kernel Configuration
:Default: ``0``
:Description: Enable row-scaled NVFP4 tensors for forward activation quantizers in the ``NVFP4BlockScaling`` recipe. When set to ``1`` (or when ``NVFP4BlockScaling(row_scaled_activation=True)`` is used), rowwise ``amax`` metadata is stored as one FP32 value per tensor row instead of a single scalar.

.. envvar:: NVTE_NVFP4_DISABLE_RHT

:Type: ``int`` (0 or 1)
:Default: ``0``
:Description: Opt out of the random Hadamard transform (RHT) in the per-tensor ``NVFP4BlockScaling`` recipe. RHT is applied by default to the forward activation and backward gradient quantizers. Set to ``1`` (or use ``NVFP4BlockScaling(disable_rht=True)``) to disable it. No effect on the per-token path (see :envvar:`NVTE_NVFP4_PER_TOKEN_RHT`).

.. envvar:: NVTE_NVFP4_DISABLE_STOCHASTIC_ROUNDING

:Type: ``int`` (0 or 1)
:Default: ``0``
:Description: Opt out of stochastic rounding (SR) in the per-tensor ``NVFP4BlockScaling`` recipe. SR is applied by default to the backward gradient quantizer. Set to ``1`` (or use ``NVFP4BlockScaling(disable_stochastic_rounding=True)``) to disable it. No effect on the per-token path (see :envvar:`NVTE_NVFP4_PER_TOKEN_SR`).

.. envvar:: NVTE_NVFP4_DISABLE_2D_QUANTIZATION

:Type: ``int`` (0 or 1)
:Default: ``0``
:Description: Opt out of 2D (16x16 inner tile + scalar outer amax) weight quantization in the per-tensor ``NVFP4BlockScaling`` recipe. 2D weight quantization is enabled by default. Set to ``1`` (or use ``NVFP4BlockScaling(disable_2d_quantization=True)``) to fall back to 1D (16-element block) weight quantization. On the per-token path this opt-out is always in effect (the per-token cast hard-disables 2D); see :envvar:`NVTE_NVFP4_PER_TOKEN_WEIGHT_2D` for the per-token weight-2D route.

.. envvar:: NVTE_NVFP4_PER_TOKEN

:Type: ``int`` (0 or 1)
:Default: ``0``
:Description: Flip a plain ``NVFP4BlockScaling`` recipe into per-token mode (per-row / per-col outer ``amax`` cast plus the fused-EVT CUTLASS GEMM) without changing the recipe class. This lets frameworks that already construct a default ``NVFP4BlockScaling`` (e.g. Megatron-Core with ``--fp4-format e2m1``) opt into per-token purely from the launch environment. Equivalent to constructing the explicit ``NVFP4PerTokenBlockScaling`` recipe. The per-token forward path currently requires the unfused norm+amax path: also set ``NVTE_NORM_FWD_USE_CUDNN=1`` (the fused norm+amax path rejects per-token quantizers).

.. envvar:: NVTE_NVFP4_PER_TOKEN_RHT

:Type: ``int`` (0 or 1)
:Default: ``0``
:Description: Per-token only. Opt into the random Hadamard transform (RHT) on the per-token forward activation and backward gradient quantizers. The per-token path disables RHT by default (its per-row outer amax already mitigates the long-tail outliers RHT targets); set to ``1`` (or use ``NVFP4PerTokenBlockScaling(per_token_rht=True)``) to re-enable it. No effect on the per-tensor path.

.. envvar:: NVTE_NVFP4_PER_TOKEN_SR

:Type: ``int`` (0 or 1)
:Default: ``0``
:Description: Per-token only. Opt into stochastic rounding (SR) on the per-token backward gradient quantizer (the K2 encode kernel implements a Philox-dithered FP4 cast). The per-token path disables SR by default; set to ``1`` (or use ``NVFP4PerTokenBlockScaling(per_token_sr=True)``) to re-enable it. No effect on the per-tensor path.

.. envvar:: NVTE_NVFP4_PER_TOKEN_WEIGHT_2D

:Type: ``int`` (0 or 1)
:Default: ``0``
:Description: Per-token only. Quantize the forward weight with the per-tensor 2D cast (16x16 inner tile + scalar outer amax) emitted in per-token layout, instead of the per-token 1D weight cast. 2D weight quantization is transposition-invariant, so forward (rowwise) and dgrad (columnwise) see the same weight, removing the 1D path's weight-gradient bias. Activations and gradients stay on the standard per-token 1D cast. Set to ``1`` (or use ``NVFP4PerTokenBlockScaling(per_token_weight_2d=True)``). No effect on the per-tensor path.

.. envvar:: NVTE_NVFP4_4OVER6

:Type: ``str`` (``none``, ``weights``, ``activations``, or ``all``)
83 changes: 83 additions & 0 deletions docs/features/low_precision_training/nvfp4/nvfp4.rst
@@ -207,6 +207,89 @@ NVFP4 all-gather is supported.

*Figure 6. Quantization and all-gather flow for NVFP4 showing amax synchronization and hierarchical scaling.*

Per-token NVFP4
---------------

The default ``NVFP4BlockScaling`` recipe computes a single per-tensor outer
``amax`` (``s_global``) for each tensor. The **per-token** variant instead
computes a per-row outer ``amax`` (length ``M``) for rowwise data and a per-col
outer ``amax`` (length ``K``) for columnwise data, giving each token/row its own
global scale. This finer outer-scale granularity can improve accuracy, and the
per-token cast feeds a dedicated fused-EVT CUTLASS GEMM that consumes the vector
outer ``amax`` directly (cuBLASLt cannot).

There are two ways to select per-token, both equivalent:

* **Explicit recipe class** ``NVFP4PerTokenBlockScaling`` (recommended for code
that constructs its own recipe).
* **Environment variable** ``NVTE_NVFP4_PER_TOKEN=1`` on a plain
``NVFP4BlockScaling``. This lets frameworks that only ever build a default
``NVFP4BlockScaling`` (for example Megatron-Core) opt into per-token purely
from the launch environment, with no framework-side code change.

.. code-block:: python

    from transformer_engine.common.recipe import NVFP4PerTokenBlockScaling
    import transformer_engine.pytorch as te

    # RHT and SR are OFF by default on the per-token path; opt in as needed.
    recipe = NVFP4PerTokenBlockScaling(per_token_rht=True, per_token_sr=True)

    # `model` and `inp` are an existing TE module and its input tensor.
    with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
        out = model(inp)

**Differences from the per-tensor default**

* RHT and stochastic rounding are **off by default** on the per-token path (the
per-row outer ``amax`` already mitigates the long-tail outliers RHT targets).
Opt in with ``per_token_rht=True`` / ``per_token_sr=True`` (env vars
:envvar:`NVTE_NVFP4_PER_TOKEN_RHT` / :envvar:`NVTE_NVFP4_PER_TOKEN_SR`).
* 2D weight quantization is disabled by default. The per-token weight-2D route
(``per_token_weight_2d=True`` / :envvar:`NVTE_NVFP4_PER_TOKEN_WEIGHT_2D`)
quantizes the forward weight with the transposition-invariant 2D cast emitted
in per-token layout, removing the 1D weight-gradient bias.
* ``row_scaled_activation`` and 4over6 are forced off (mutually exclusive with
the per-token amax layout).

**Requirement: unfused norm forward.** The per-token forward path requires the
unfused norm+amax implementation; the fused norm+amax path rejects per-token
quantizers. When the first GEMM consumes a fused norm output (for example
``LayerNormLinear``), also set ``NVTE_NORM_FWD_USE_CUDNN=1``.

**Currently unsupported on the per-token path**: ``fuse_wgrad_accumulation=True``,
forward/backward output quantization, and communication/bulk overlap.

Running per-token NVFP4 with Megatron-Core
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Megatron-Core builds a plain ``NVFP4BlockScaling`` for ``--fp4-format e2m1`` and
has no CLI flag for per-token, so per-token is selected entirely through TE
environment variables. A minimal launch looks like:

.. code-block:: bash

    # Select NVFP4 per-token via TE env vars (read at recipe construction).
    export NVTE_NVFP4_PER_TOKEN=1
    export NVTE_NORM_FWD_USE_CUDNN=1 # required: unfused norm forward
    # Optional per-token knobs:
    # export NVTE_NVFP4_PER_TOKEN_RHT=1
    # export NVTE_NVFP4_PER_TOKEN_SR=1
    # export NVTE_NVFP4_PER_TOKEN_WEIGHT_2D=1

    python pretrain_gpt.py \
        --transformer-impl transformer_engine \
        --fp4-format e2m1 \
        --no-gradient-accumulation-fusion \
        ... # remaining model / data / optimizer args

Notes:

* ``--no-gradient-accumulation-fusion`` is required because the per-token kernel
does not yet support fused wgrad accumulation.
* To keep the first/last transformer layers in BF16, use Megatron's
``--first-last-layers-bf16 --num-layers-at-start-in-bf16 N
--num-layers-at-end-in-bf16 M`` flags (those layers simply skip the FP4
autocast; the recipe is unchanged).

Examples
--------

140 changes: 140 additions & 0 deletions examples/pytorch/nvfp4_per_token_megatron/README.md
@@ -0,0 +1,140 @@
# NVFP4 per-token training with Megatron-Core

This example shows how to train a small Mixture-of-Experts (MoE) model with
the **NVFP4 per-token** quantization recipe on a single GPU using
[Megatron-Core](https://github.com/NVIDIA/Megatron-LM), and how to compare it
against the per-tensor NVFP4 recipe and an unquantized BF16 baseline.

The same model / data / seed are used across all modes; only the GEMM precision
changes, so the runs are directly comparable.

## How per-token interacts with Megatron-Core

Megatron-Core builds a plain `transformer_engine.common.recipe.NVFP4BlockScaling`
for `--fp4-format e2m1` and has **no CLI flag for per-token**. Per-token is
selected entirely through Transformer Engine environment variables, read when the
recipe is constructed:

| Variable | Effect |
| --- | --- |
| `NVTE_NVFP4_PER_TOKEN=1` | **Required**: Flip the recipe into per-token mode (per-row/per-col outer amax + fused CUTLASS GEMM) |
| `NVTE_NORM_FWD_USE_CUDNN=1` | **Required** with per-token: forces the unfused norm forward (the fused norm+amax path currently rejects per-token quantizers) |
| `NVTE_NVFP4_PER_TOKEN_RHT=1` | Opt into the random Hadamard transform (off by default) |
| `NVTE_NVFP4_PER_TOKEN_SR=1` | Opt into stochastic rounding (off by default) |
| `NVTE_NVFP4_PER_TOKEN_WEIGHT_2D=1` | Use the transposition-invariant 2D weight cast in per-token layout |

For the per-tensor recipe, the analogous knobs are
`NVTE_NVFP4_DISABLE_RHT`, `NVTE_NVFP4_DISABLE_STOCHASTIC_ROUNDING`, and
`NVTE_NVFP4_DISABLE_2D_QUANTIZATION`.
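
To make the "per-row/per-col outer amax" wording in the table concrete, here is a minimal plain-Python sketch of the three reductions involved. It only illustrates the granularity of the outer `amax` (one scalar per tensor versus one value per row or per column); it is not the TE kernel, and the function names are hypothetical.

```python
# Illustrative only: plain-Python stand-ins for the outer-amax reduction.
# These helpers are hypothetical and are not part of Transformer Engine.

def per_tensor_amax(x):
    """Per-tensor recipe: one scalar outer amax for the whole M x K tensor."""
    return max(abs(v) for row in x for v in row)


def per_row_amax(x):
    """Per-token recipe, rowwise data: one outer amax per row (length M)."""
    return [max(abs(v) for v in row) for row in x]


def per_col_amax(x):
    """Per-token recipe, columnwise data: one outer amax per column (length K)."""
    return [max(abs(v) for v in col) for col in zip(*x)]


x = [
    [0.5, -1.0, 0.25],   # an ordinary token
    [64.0, 2.0, -8.0],   # a token containing an outlier
]

print(per_tensor_amax(x))  # 64.0
print(per_row_amax(x))     # [1.0, 64.0]
print(per_col_amax(x))     # [64.0, 2.0, 8.0]
```

With a single per-tensor `amax`, the outlier in the second row sets the global scale for the first row as well; with a per-row `amax`, the first row is scaled against `1.0` instead of `64.0`.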

See the
[NVFP4 documentation](../../../docs/features/low_precision_training/nvfp4/nvfp4.rst)
("Per-token NVFP4") and `docs/envvars.rst` for full details. Equivalently, code
that constructs its own recipe can use the public
`transformer_engine.common.recipe.NVFP4PerTokenBlockScaling` class instead of the
env var.

Keeping the first/last transformer layers in BF16 is a Megatron-Core CLI feature
(`--first-last-layers-bf16 --num-layers-at-start-in-bf16 N
--num-layers-at-end-in-bf16 M`); those layers simply skip the FP4 autocast. This
is also supported with the per-token recipe.

## Prerequisites

- A Blackwell GPU (SM100+) — NVFP4 training requires it.
- Transformer Engine built from this repository **with per-token support**
(`NVTE_CUDA_ARCHS=100a NVTE_BUILD_THREADS_PER_JOB=8 NVTE_FRAMEWORK=pytorch pip install -e . --no-build-isolation`).
- A [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) checkout (provides
`pretrain_gpt.py` and the `megatron` package).
- A tokenized dataset and tokenizer (the scripts default to an OLMo-1124 corpus
with the Moonlight-16B-A3B tokenizer).
- For Weights & Biases logging: authenticate with `wandb login` or export
`WANDB_API_KEY` in your environment.

## Files

| File | Purpose |
| --- | --- |
| `run_moe_nvfp4_singlegpu.sh` | Core launcher. Run **inside** the container from a shell that can see `pretrain_gpt.py`. Takes one mode: `bf16`, `prod` (== `pertensor`), or `pertoken`. |
| `sbatch_moe_nvfp4_singlegpu.sh` | Slurm wrapper: starts the container, (re)installs the editable TE build, then runs one or more variants (currently one GPU each). |
| `submit_chain.sh` | Submits a chain of dependent Slurm jobs that auto-resume from the stable checkpoint dir. |

## Quick start (standalone, inside a container)

```bash
# Point at your Megatron-LM checkout, data, and tokenizer.
export MLM_DIR=/path/to/Megatron-LM
export DATA_PATH=/path/to/datasets/your_data
export TOKENIZER_MODEL=/path/to/tokenizers/Moonlight-16B-A3B
export TRAIN_ITERS=2000

bash run_moe_nvfp4_singlegpu.sh pertoken

# Compare against per-tensor NVFP4 and BF16:
bash run_moe_nvfp4_singlegpu.sh pertensor
bash run_moe_nvfp4_singlegpu.sh bf16
```

To enable per-token RHT / SR / 2D-weight, export the knobs before launching:

```bash
export NVTE_NVFP4_PER_TOKEN_RHT=1
export NVTE_NVFP4_PER_TOKEN_SR=1
export NVTE_NVFP4_PER_TOKEN_WEIGHT_2D=1
bash run_moe_nvfp4_singlegpu.sh pertoken
```

## Slurm

Edit the **host-side config block** at the top of
`sbatch_moe_nvfp4_singlegpu.sh` (Slurm account, container `IMAGE`, `HOST_MOUNT`,
`TE_DIR`, and `HOST_LOG_DIR` / the `#SBATCH --output/--error` paths) for your
cluster, then:

```bash
# One mode:
sbatch sbatch_moe_nvfp4_singlegpu.sh pertoken

# Up to 4 variants concurrently (one GPU each):
sbatch sbatch_moe_nvfp4_singlegpu.sh "bf16,pertensor+rht+sr,pertoken"

# Override knobs via --export:
sbatch --export=ALL,TRAIN_ITERS=2000,SEED=42 sbatch_moe_nvfp4_singlegpu.sh pertoken
```

Spec syntax: `<mode>[+rht][+sr][+1d][+2d][+fb]` where `mode` is
`bf16 | prod (== pertensor) | pertoken`. `+rht`/`+sr` turn those features on,
`+1d` forces 1D weights (per-tensor only), `+2d` enables the per-token 2D-weight
route, and `+fb` keeps the first/last layers in BF16.
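
The spec syntax can be read as a small grammar. The sketch below is a hypothetical Python parser for it, not the logic inside `sbatch_moe_nvfp4_singlegpu.sh`. The names `Variant`, `parse_spec`, `parse_variants`, and `per_token_env` are assumptions made for illustration. The environment mapping covers only the `pertoken` mode, because that is the only mapping the table above documents.

```python
from dataclasses import dataclass

# Mode aliases and flags as described in the spec syntax; `prod` == `pertensor`.
MODES = {"bf16": "bf16", "prod": "pertensor", "pertensor": "pertensor",
         "pertoken": "pertoken"}
FLAGS = {"rht", "sr", "1d", "2d", "fb"}


@dataclass(frozen=True)
class Variant:
    mode: str
    flags: frozenset


def parse_spec(spec):
    """Parse one `<mode>[+rht][+sr][+1d][+2d][+fb]` spec (hypothetical helper)."""
    head, *rest = spec.strip().split("+")
    if head not in MODES:
        raise ValueError(f"unknown mode: {head!r}")
    unknown = set(rest) - FLAGS
    if unknown:
        raise ValueError(f"unknown flags: {sorted(unknown)}")
    return Variant(MODES[head], frozenset(rest))


def parse_variants(arg):
    """Parse a comma-separated list of specs, e.g. "bf16,pertoken+rht"."""
    return [parse_spec(s) for s in arg.split(",") if s.strip()]


def per_token_env(variant):
    """Map a `pertoken` variant to the TE env vars listed in the table above.

    Assumption: only per-token knobs are mapped here; how the real script
    handles per-tensor flags, `+1d`, and `+fb` is not shown in this README.
    """
    if variant.mode != "pertoken":
        raise ValueError("per_token_env only applies to pertoken variants")
    env = {"NVTE_NVFP4_PER_TOKEN": "1", "NVTE_NORM_FWD_USE_CUDNN": "1"}
    if "rht" in variant.flags:
        env["NVTE_NVFP4_PER_TOKEN_RHT"] = "1"
    if "sr" in variant.flags:
        env["NVTE_NVFP4_PER_TOKEN_SR"] = "1"
    if "2d" in variant.flags:
        env["NVTE_NVFP4_PER_TOKEN_WEIGHT_2D"] = "1"
    return env


if __name__ == "__main__":
    for v in parse_variants("bf16,pertensor+rht+sr,pertoken+rht+2d"):
        print(v.mode, sorted(v.flags))
    print(per_token_env(parse_spec("pertoken+rht+2d")))
```

Rejecting unknown modes and flags up front is a deliberate choice in this sketch: a typo such as `+rth` raises an error instead of launching a run with the feature silently off.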

For runs that exceed one Slurm wall-clock window, chain dependent jobs that
resume from the stable per-variant checkpoint dir:

```bash
CHAIN=3 bash submit_chain.sh \
--export=ALL,IMAGE=/path/to/te_pertoken.sqsh,SKIP_BUILD=1,TRAIN_ITERS=60000 \
sbatch_moe_nvfp4_singlegpu.sh pertoken
```
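
For readers unfamiliar with dependent Slurm jobs, the following Python sketch shows the general chaining pattern: each job is submitted with a dependency on the previous job's ID. It is an assumption about the pattern, not a transcription of `submit_chain.sh`. In particular, the choice of `afterany` (rather than, say, `afterok`) and the helper names `run_sbatch` and `submit_chain` are hypothetical. `sbatch --parsable` and `--dependency=afterany:<jobid>` are standard Slurm options.

```python
import subprocess


def run_sbatch(argv):
    """Submit via sbatch and return the job ID.

    With --parsable, sbatch prints "jobid" or "jobid;cluster".
    """
    out = subprocess.run(argv, check=True, capture_output=True, text=True).stdout
    return out.strip().split(";")[0]


def submit_chain(sbatch_args, chain_len, submit=run_sbatch):
    """Submit `chain_len` jobs where each one waits for the previous to end.

    `submit` takes an argv list and returns the new job ID; it is injectable
    so the chaining logic can be exercised without a Slurm installation.
    """
    job_ids = []
    for _ in range(chain_len):
        argv = ["sbatch", "--parsable"]
        if job_ids:
            # Assumption: resume regardless of how the previous job ended.
            argv.append(f"--dependency=afterany:{job_ids[-1]}")
        argv += list(sbatch_args)
        job_ids.append(submit(argv))
    return job_ids


if __name__ == "__main__":
    # Dry run with a fake submitter, so this works without Slurm.
    fake_ids = iter(["101", "102", "103"])
    calls = []

    def fake_submit(argv):
        calls.append(argv)
        return next(fake_ids)

    print(submit_chain(["sbatch_moe_nvfp4_singlegpu.sh", "pertoken"], 3,
                       submit=fake_submit))
    for argv in calls:
        print(" ".join(argv))
```

Because every job in the chain runs the same command, resumption relies on the launcher loading the latest checkpoint from the stable per-variant checkpoint dir, as described above.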

## Notes and current limitations

The per-token recipe is currently intended for **accuracy evaluation and
comparison** (per-token vs per-tensor vs BF16), **not** for optimized production
deployment. Concretely:

- **Requires `NVTE_NORM_FWD_USE_CUDNN=1`** (the unfused cuDNN norm forward).
The fused norm+amax path (`NVTE_NORM_FWD_USE_CUDNN=0`, the default) does **not**
support per-token; the combination is rejected in the C++ quantizer. The launcher
sets this variable for you in `pertoken` mode.
- **Not tested with CUDA graphs.** The per-token path has not been validated under
Megatron's CUDA graph capture; leave CUDA graphs disabled for now.
- **Kernels are not yet performance-optimal.** Several per-token cast / GEMM
kernels are functional but not tuned, so wall-clock throughput is not
representative of the recipe's eventual performance. Use this example for
numerical/accuracy comparison, not perf benchmarking.
- `--no-gradient-accumulation-fusion` is required: the per-token kernel does not
yet support fused wgrad accumulation. The scripts set it for every mode so
only the GEMM precision differs.
- The example reduces the MoE expert count to 64 so all experts stay local at
EP=1 on a single GPU (TE's grouped-NVFP4 kernels cap at 64 tensors per launch).
Real training shards experts via EP>1.