NVIDIA · cael-ling · May 26, 2026 · May 27, 2026 · May 28, 2026 · May 28, 2026
diff --git a/docs/api/common.rst b/docs/api/common.rst
@@ -14,6 +14,8 @@ Common API
 
 .. autoapiclass:: transformer_engine.common.recipe.NVFP4BlockScaling(fp4_format=Format.E2M1)
 
+.. autoapiclass:: transformer_engine.common.recipe.NVFP4PerTokenBlockScaling(fp4_format=Format.E2M1)
+
 .. autoapiclass:: transformer_engine.common.recipe.Float8CurrentScaling(fp8_format=Format.HYBRID)
 
 .. autoapiclass:: transformer_engine.common.recipe.Float8BlockScaling(fp8_format=Format.E4M3)

diff --git a/docs/envvars.rst b/docs/envvars.rst
@@ -287,6 +287,48 @@ Kernel Configuration
    :Default: ``0``
    :Description: Enable row-scaled NVFP4 tensors for forward activation quantizers in the ``NVFP4BlockScaling`` recipe. When set to ``1`` (or when ``NVFP4BlockScaling(row_scaled_activation=True)`` is used), rowwise ``amax`` metadata is stored as one FP32 value per tensor row instead of a single scalar.
 
+.. envvar:: NVTE_NVFP4_DISABLE_RHT
+
+   :Type: ``int`` (0 or 1)
+   :Default: ``0``
+   :Description: Opt out of the random Hadamard transform (RHT) in the per-tensor ``NVFP4BlockScaling`` recipe. RHT is applied by default to the forward activation and backward gradient quantizers. Set to ``1`` (or use ``NVFP4BlockScaling(disable_rht=True)``) to disable it. No effect on the per-token path (see :envvar:`NVTE_NVFP4_PER_TOKEN_RHT`).
+
+.. envvar:: NVTE_NVFP4_DISABLE_STOCHASTIC_ROUNDING
+
+   :Type: ``int`` (0 or 1)
+   :Default: ``0``
+   :Description: Opt out of stochastic rounding (SR) in the per-tensor ``NVFP4BlockScaling`` recipe. SR is applied by default to the backward gradient quantizer. Set to ``1`` (or use ``NVFP4BlockScaling(disable_stochastic_rounding=True)``) to disable it. No effect on the per-token path (see :envvar:`NVTE_NVFP4_PER_TOKEN_SR`).
+
+.. envvar:: NVTE_NVFP4_DISABLE_2D_QUANTIZATION
+
+   :Type: ``int`` (0 or 1)
+   :Default: ``0``
+   :Description: Opt out of 2D (16x16 inner tile + scalar outer amax) weight quantization in the per-tensor ``NVFP4BlockScaling`` recipe. 2D weight quantization is enabled by default. Set to ``1`` (or use ``NVFP4BlockScaling(disable_2d_quantization=True)``) to fall back to 1D (16-element block) weight quantization. On the per-token path, 2D weight quantization is always disabled regardless of this variable (the per-token cast hard-disables 2D); see :envvar:`NVTE_NVFP4_PER_TOKEN_WEIGHT_2D` for the per-token weight-2D route.
+
+.. envvar:: NVTE_NVFP4_PER_TOKEN
+
+   :Type: ``int`` (0 or 1)
+   :Default: ``0``
+   :Description: Flip a plain ``NVFP4BlockScaling`` recipe into per-token mode (per-row / per-col outer ``amax`` cast plus the fused-EVT CUTLASS GEMM) without changing the recipe class. This lets frameworks that already construct a default ``NVFP4BlockScaling`` (e.g. Megatron-Core with ``--fp4-format e2m1``) opt into per-token purely from the launch environment. Equivalent to constructing the explicit ``NVFP4PerTokenBlockScaling`` recipe. The per-token forward path currently requires the unfused norm+amax path: also set ``NVTE_NORM_FWD_USE_CUDNN=1`` (the fused norm+amax path rejects per-token quantizers).
+
+.. envvar:: NVTE_NVFP4_PER_TOKEN_RHT
+
+   :Type: ``int`` (0 or 1)
+   :Default: ``0``
+   :Description: Per-token only. Opt into the random Hadamard transform (RHT) on the per-token forward activation and backward gradient quantizers. Per-token disables RHT by default (its per-row outer amax already mitigates the long-tail outliers RHT targets); set to ``1`` (or use ``NVFP4PerTokenBlockScaling(per_token_rht=True)``) to re-enable it. No effect on the per-tensor path.
+
+.. envvar:: NVTE_NVFP4_PER_TOKEN_SR
+
+   :Type: ``int`` (0 or 1)
+   :Default: ``0``
+   :Description: Per-token only. Opt into stochastic rounding (SR) on the per-token backward gradient quantizer (the K2 encode kernel implements a Philox-dithered FP4 cast). Per-token disables SR by default; set to ``1`` (or use ``NVFP4PerTokenBlockScaling(per_token_sr=True)``) to re-enable it. No effect on the per-tensor path.
+
+.. envvar:: NVTE_NVFP4_PER_TOKEN_WEIGHT_2D
+
+   :Type: ``int`` (0 or 1)
+   :Default: ``0``
+   :Description: Per-token only. Quantize the forward weight with the per-tensor 2D cast (16x16 inner tile + scalar outer amax) emitted in per-token layout, instead of the per-token 1D weight cast. 2D weight quantization is transposition-invariant, so forward (rowwise) and dgrad (columnwise) see the same weight, removing the 1D path's weight-gradient bias. Activations and gradients stay on the standard per-token 1D cast. Set to ``1`` (or use ``NVFP4PerTokenBlockScaling(per_token_weight_2d=True)``). No effect on the per-tensor path.
+
 .. envvar:: NVTE_NVFP4_4OVER6
 
    :Type: ``str`` (``none``, ``weights``, ``activations``, or ``all``)

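The defaults documented in the `docs/envvars.rst` hunk above differ between the two paths: the per-tensor path treats RHT, SR, and 2D weights as opt-out, while the per-token path treats them as opt-in and ignores the `DISABLE_*` variables. The sketch below restates those rules as a plain Python function. `resolve_nvfp4_flags` and the keys it returns are hypothetical names used only for illustration; they are not part of Transformer Engine, whose recipe classes do the real resolution.

```python
def _flag(env, name):
    """Interpret an int-valued (0 or 1) variable; unset means 0."""
    return env.get(name, "0").strip() == "1"


def resolve_nvfp4_flags(env):
    """Hypothetical helper: effective NVFP4 settings as documented above.

    `env` is any mapping of variable names to strings, e.g. os.environ.
    """
    if _flag(env, "NVTE_NVFP4_PER_TOKEN"):
        # Per-token path: features are opt-in; DISABLE_* variables are ignored.
        return {
            "per_token": True,
            "rht": _flag(env, "NVTE_NVFP4_PER_TOKEN_RHT"),
            "stochastic_rounding": _flag(env, "NVTE_NVFP4_PER_TOKEN_SR"),
            "weight_2d": _flag(env, "NVTE_NVFP4_PER_TOKEN_WEIGHT_2D"),
        }
    # Per-tensor path: features are on by default; PER_TOKEN_* knobs are ignored.
    return {
        "per_token": False,
        "rht": not _flag(env, "NVTE_NVFP4_DISABLE_RHT"),
        "stochastic_rounding": not _flag(env, "NVTE_NVFP4_DISABLE_STOCHASTIC_ROUNDING"),
        "weight_2d": not _flag(env, "NVTE_NVFP4_DISABLE_2D_QUANTIZATION"),
    }
```

Passing `os.environ` as `env` shows what a given launch environment would select under these documented rules. For example, setting `NVTE_NVFP4_DISABLE_RHT=1` together with `NVTE_NVFP4_PER_TOKEN=1` leaves RHT off because of the per-token default, not because of the disable variable.
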
diff --git a/docs/features/low_precision_training/nvfp4/nvfp4.rst b/docs/features/low_precision_training/nvfp4/nvfp4.rst
@@ -207,6 +207,89 @@ NVFP4 all-gather is supported.
 
 *Figure 6. Quantization and all-gather flow for NVFP4 showing amax synchronization and hierarchical scaling.*
 
+Per-token NVFP4
+---------------
+
+The default ``NVFP4BlockScaling`` recipe computes a single per-tensor outer
+``amax`` (``s_global``) for each tensor. The **per-token** variant instead
+computes a per-row outer ``amax`` (length ``M``) for rowwise data and a per-col
+outer ``amax`` (length ``K``) for columnwise data, giving each token/row its own
+global scale. This finer outer-scale granularity can improve accuracy. The
+per-token cast feeds a dedicated fused-EVT CUTLASS GEMM that consumes the vector
+outer ``amax`` directly, which cuBLASLt cannot do.
+
+There are two equivalent ways to select per-token mode:
+
+* **Explicit recipe class** ``NVFP4PerTokenBlockScaling`` (recommended for code
+  that constructs its own recipe).
+* **Environment variable** ``NVTE_NVFP4_PER_TOKEN=1`` on a plain
+  ``NVFP4BlockScaling``. This lets frameworks that only ever build a default
+  ``NVFP4BlockScaling`` (for example Megatron-Core) opt into per-token purely
+  from the launch environment, with no framework-side code change.
+
+.. code-block:: python
+
+   from transformer_engine.common.recipe import NVFP4PerTokenBlockScaling
+   import transformer_engine.pytorch as te
+
+   # RHT and SR are OFF by default on the per-token path; opt in as needed.
+   recipe = NVFP4PerTokenBlockScaling(per_token_rht=True, per_token_sr=True)
+   with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
+       out = model(inp)
+
+**Differences from the per-tensor default**
+
+* RHT and stochastic rounding are **off by default** on the per-token path (the
+  per-row outer ``amax`` already mitigates the long-tail outliers RHT targets).
+  Opt in with ``per_token_rht=True`` / ``per_token_sr=True`` (env vars
+  :envvar:`NVTE_NVFP4_PER_TOKEN_RHT` / :envvar:`NVTE_NVFP4_PER_TOKEN_SR`).
+* 2D weight quantization is disabled by default. The per-token weight-2D route
+  (``per_token_weight_2d=True`` / :envvar:`NVTE_NVFP4_PER_TOKEN_WEIGHT_2D`)
+  quantizes the forward weight with the transposition-invariant 2D cast emitted
+  in per-token layout, removing the 1D weight-gradient bias.
+* ``row_scaled_activation`` and 4over6 are forced off (mutually exclusive with
+  the per-token amax layout).
+
+**Requirement: unfused norm forward.** The per-token forward path requires the
+unfused norm+amax implementation; the fused norm+amax path rejects per-token
+quantizers. When the first GEMM consumes a fused norm output (for example
+``LayerNormLinear``), also set ``NVTE_NORM_FWD_USE_CUDNN=1``.
+
+**Currently unsupported on the per-token path**: ``fuse_wgrad_accumulation=True``,
+forward/backward output quantization, and communication/bulk overlap.
+
+Running per-token NVFP4 with Megatron-Core
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Megatron-Core builds a plain ``NVFP4BlockScaling`` for ``--fp4-format e2m1`` and
+has no CLI for per-token, so per-token is selected entirely through TE
+environment variables. A minimal launch looks like:
+
+.. code-block:: bash
+
+   # Select NVFP4 per-token via TE env vars (read at recipe construction).
+   export NVTE_NVFP4_PER_TOKEN=1
+   export NVTE_NORM_FWD_USE_CUDNN=1          # required: unfused norm forward
+   # Optional per-token knobs:
+   # export NVTE_NVFP4_PER_TOKEN_RHT=1
+   # export NVTE_NVFP4_PER_TOKEN_SR=1
+   # export NVTE_NVFP4_PER_TOKEN_WEIGHT_2D=1
+
+   python pretrain_gpt.py \
+       --transformer-impl transformer_engine \
+       --fp4-format e2m1 \
+       --no-gradient-accumulation-fusion \
+       ...   # remaining model / data / optimizer args
+
+Notes:
+
+* ``--no-gradient-accumulation-fusion`` is required because the per-token kernel
+  does not yet support fused wgrad accumulation.
+* To keep the first/last transformer layers in BF16, use Megatron's
+  ``--first-last-layers-bf16 --num-layers-at-start-in-bf16 N
+  --num-layers-at-end-in-bf16 M`` flags (those layers simply skip the FP4
+  autocast; the recipe is unchanged).
+
 Examples
 --------
 

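The "Per-token NVFP4" section in the hunk above contrasts a single per-tensor outer `amax` with a per-row vector of length `M` and a per-col vector of length `K`. A minimal pure-Python sketch of just those three reductions follows. `outer_amax` is a hypothetical helper for illustration, not a Transformer Engine function, and it leaves out the block scales and the FP4 cast entirely.

```python
def outer_amax(x):
    """Return (per_tensor, per_row, per_col) absolute maxima of a 2D list."""
    per_row = [max(abs(v) for v in row) for row in x]        # length M (rowwise data)
    per_col = [max(abs(v) for v in col) for col in zip(*x)]  # length K (columnwise data)
    per_tensor = max(per_row)                                # single scalar
    return per_tensor, per_row, per_col


# M = 3 tokens, K = 4 features; the second token carries an outlier.
x = [
    [0.5, -1.0, 0.25, 0.75],
    [2.0, -64.0, 1.0, 4.0],
    [-0.125, 0.5, -2.0, 1.5],
]
per_tensor, per_row, per_col = outer_amax(x)
print(per_tensor)  # 64.0
print(per_row)     # [1.0, 64.0, 2.0]
print(per_col)     # [2.0, 64.0, 2.0, 4.0]
```

With the per-tensor scalar, the first and third tokens are scaled against the outlier value of 64.0. With the per-row vector, each token is scaled against its own maximum (1.0 and 2.0 here), which is the finer granularity the section describes.
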
diff --git a/examples/pytorch/nvfp4_per_token_megatron/README.md b/examples/pytorch/nvfp4_per_token_megatron/README.md
@@ -0,0 +1,140 @@
+# NVFP4 per-token training with Megatron-Core
+
+This example shows how to train a small Mixture-of-Experts (MoE) model with
+the **NVFP4 per-token** quantization recipe on a single GPU using
+[Megatron-Core](https://github.com/NVIDIA/Megatron-LM), and how to compare it
+against the per-tensor NVFP4 recipe and an unquantized BF16 baseline.
+
+The same model / data / seed are used across all modes; only the GEMM precision
+changes, so the runs are directly comparable.
+
+## How per-token interacts with Megatron-Core
+
+Megatron-Core builds a plain `transformer_engine.common.recipe.NVFP4BlockScaling`
+for `--fp4-format e2m1` and has **no CLI flag for per-token**. Per-token is
+selected entirely through Transformer Engine environment variables, read when the
+recipe is constructed:
+
+| Variable | Effect |
+| --- | --- |
+| `NVTE_NVFP4_PER_TOKEN=1` | **Required**: Flip the recipe into per-token mode (per-row/per-col outer amax + fused CUTLASS GEMM) |
+| `NVTE_NORM_FWD_USE_CUDNN=1` | **Required** with per-token: forces the unfused norm forward (the fused norm+amax path currently rejects per-token quantizers) |
+| `NVTE_NVFP4_PER_TOKEN_RHT=1` | Opt into the random Hadamard transform (off by default) |
+| `NVTE_NVFP4_PER_TOKEN_SR=1` | Opt into stochastic rounding (off by default) |
+| `NVTE_NVFP4_PER_TOKEN_WEIGHT_2D=1` | Use the transposition-invariant 2D weight cast in per-token layout |
+
+For the per-tensor recipe, the analogous knobs are
+`NVTE_NVFP4_DISABLE_RHT`, `NVTE_NVFP4_DISABLE_STOCHASTIC_ROUNDING`, and
+`NVTE_NVFP4_DISABLE_2D_QUANTIZATION`.
+
+See the
+[NVFP4 documentation](../../../docs/features/low_precision_training/nvfp4/nvfp4.rst)
+("Per-token NVFP4") and `docs/envvars.rst` for full details. Equivalently, code
+that constructs its own recipe can use the public
+`transformer_engine.common.recipe.NVFP4PerTokenBlockScaling` class instead of the
+env var.
+
+Keeping the first/last transformer layers in BF16 is a Megatron-Core CLI feature
+(`--first-last-layers-bf16 --num-layers-at-start-in-bf16 N
+--num-layers-at-end-in-bf16 M`); those layers simply skip the FP4 autocast. This
+is also supported with the per-token recipe.
+
+## Prerequisites
+
+- A Blackwell GPU (SM100+), which NVFP4 training requires.
+- Transformer Engine built from this repository **with per-token support**
+  (`NVTE_CUDA_ARCHS=100a NVTE_BUILD_THREADS_PER_JOB=8 NVTE_FRAMEWORK=pytorch pip install -e . --no-build-isolation`).
+- A [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) checkout (provides
+  `pretrain_gpt.py` and the `megatron` package).
+- A tokenized dataset and tokenizer (the scripts default to an OLMo-1124 corpus
+  with the Moonlight-16B-A3B tokenizer).
+- For Weights & Biases logging: authenticate with `wandb login` or export
+  `WANDB_API_KEY` in your environment.
+
+## Files
+
+| File | Purpose |
+| --- | --- |
+| `run_moe_nvfp4_singlegpu.sh` | Core launcher. Run **inside** the container from a shell that can see `pretrain_gpt.py`. Takes one mode: `bf16`, `prod` (== `pertensor`), or `pertoken`. |
+| `sbatch_moe_nvfp4_singlegpu.sh` | Slurm wrapper: starts the container, (re)installs the editable TE build, then runs one or more variants (currently one GPU each). |
+| `submit_chain.sh` | Submits a chain of dependent Slurm jobs that auto-resume from the stable checkpoint dir. |
+
+## Quick start (standalone, inside a container)
+
+```bash
+# Point at your Megatron-LM checkout, data, and tokenizer.
+export MLM_DIR=/path/to/Megatron-LM
+export DATA_PATH=/path/to/datasets/your_data
+export TOKENIZER_MODEL=/path/to/tokenizers/Moonlight-16B-A3B
+export TRAIN_ITERS=2000
+
+bash run_moe_nvfp4_singlegpu.sh pertoken
+
+# Compare against per-tensor NVFP4 and BF16:
+bash run_moe_nvfp4_singlegpu.sh pertensor
+bash run_moe_nvfp4_singlegpu.sh bf16
+```
+
+To enable per-token RHT / SR / 2D-weight, export the knobs before launching:
+
+```bash
+export NVTE_NVFP4_PER_TOKEN_RHT=1
+export NVTE_NVFP4_PER_TOKEN_SR=1
+export NVTE_NVFP4_PER_TOKEN_WEIGHT_2D=1
+bash run_moe_nvfp4_singlegpu.sh pertoken
+```
+
+## Slurm
+
+Edit the **host-side config block** at the top of
+`sbatch_moe_nvfp4_singlegpu.sh` (Slurm account, container `IMAGE`, `HOST_MOUNT`,
+`TE_DIR`, and `HOST_LOG_DIR` / the `#SBATCH --output/--error` paths) for your
+cluster, then:
+
+```bash
+# One mode:
+sbatch sbatch_moe_nvfp4_singlegpu.sh pertoken
+
+# Up to 4 variants concurrently (one GPU each):
+sbatch sbatch_moe_nvfp4_singlegpu.sh "bf16,pertensor+rht+sr,pertoken"
+
+# Override knobs via --export:
+sbatch --export=ALL,TRAIN_ITERS=2000,SEED=42 sbatch_moe_nvfp4_singlegpu.sh pertoken
+```
+
+Spec syntax: `<mode>[+rht][+sr][+1d][+2d][+fb]` where `mode` is
+`bf16 | prod (== pertensor) | pertoken`. `+rht`/`+sr` turn those features on,
+`+1d` forces 1D weights (per-tensor only), `+2d` enables the per-token 2D-weight
+route, and `+fb` keeps the first/last layers in BF16.
+
+For runs that exceed one Slurm wall-clock window, chain dependent jobs that
+resume from the stable per-variant checkpoint dir:
+
+```bash
+CHAIN=3 bash submit_chain.sh \
+    --export=ALL,IMAGE=/path/to/te_pertoken.sqsh,SKIP_BUILD=1,TRAIN_ITERS=60000 \
+    sbatch_moe_nvfp4_singlegpu.sh pertoken
+```
+
+## Notes and current limitations
+
+The per-token recipe is currently intended for **accuracy evaluation and
+comparison** (per-token vs per-tensor vs BF16), **not** for optimized production
+deployment. Concretely:
+
+- **Requires `NVTE_NORM_FWD_USE_CUDNN=1`** (the unfused cuDNN norm forward).
+  The fused norm+amax path (`NVTE_NORM_FWD_USE_CUDNN=0`, the default) does **not**
+  support per-token; the combination is rejected in the C++ quantizer. The
+  launcher sets `NVTE_NORM_FWD_USE_CUDNN=1` for you in `pertoken` mode.
+- **Not tested with CUDA graphs.** The per-token path has not been validated under
+  Megatron's CUDA graph capture; leave CUDA graphs disabled for now.
+- **Kernels are not yet performance-optimal.** Several per-token cast / GEMM
+  kernels are functional but not tuned, so wall-clock throughput is not
+  representative of the recipe's eventual performance. Use this example for
+  numerical/accuracy comparison, not perf benchmarking.
+- `--no-gradient-accumulation-fusion` is required: the per-token kernel does not
+  yet support fused wgrad accumulation. The scripts set it for every mode so
+  only the GEMM precision differs.
+- The example reduces the MoE expert count to 64 so all experts stay local at
+  EP=1 on a single GPU (TE's grouped-NVFP4 kernels cap at 64 tensors per launch).
+  Real training shards experts via EP>1.
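The variant spec syntax from the Slurm section of the README, `<mode>[+rht][+sr][+1d][+2d][+fb]` with comma-separated variants, can be sketched as a small parser. This is an illustrative Python sketch only: the shipped launchers are Bash scripts, `parse_variant` and `parse_variants` are hypothetical names, and the real scripts may validate specs differently (for example, whether `+1d` is rejected outside per-tensor mode).

```python
# Accepted modes; `prod` is documented as an alias of `pertensor`.
MODES = {
    "bf16": "bf16",
    "prod": "pertensor",
    "pertensor": "pertensor",
    "pertoken": "pertoken",
}
FLAGS = {"rht", "sr", "1d", "2d", "fb"}


def parse_variant(spec):
    """Parse one `<mode>[+rht][+sr][+1d][+2d][+fb]` spec into (mode, flags)."""
    mode, *flags = spec.strip().split("+")
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    unknown = set(flags) - FLAGS
    if unknown:
        raise ValueError(f"unknown flags: {sorted(unknown)}")
    return MODES[mode], set(flags)


def parse_variants(arg):
    """Split a comma-separated list such as "bf16,pertensor+rht+sr,pertoken"."""
    return [parse_variant(s) for s in arg.split(",")]


variants = parse_variants("bf16,pertensor+rht+sr,pertoken")
```

Here `variants` holds three `(mode, flags)` pairs, with `prod` and `pertensor` normalized to the same mode name so downstream logic only needs to handle one spelling.
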