Commit 2bc8a54

[Quantization] Shrink FP8 sweep parity matrix from 27 to 12 cases
Trim the parity grid to keep all three axes but with smaller per-axis ranges: 2 seeds × 2 num_blocks × 3 dtypes = 12 parametrized cases (down from 3×3×3 = 27). This still exercises every supported dtype and the small/large num_blocks extremes that drive different autotune choices, while roughly halving the cold-compile cost on hosts where Triton compilation is expensive.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

1 parent 8f04a9a commit 2bc8a54

1 file changed, 2 additions & 2 deletions

tests/gpu/torch/quantization/test_nvfp4_fp8_sweep_kernel.py
@@ -86,8 +86,8 @@ def _run_triton(x, per_block_amax, global_amax):
 
 @requires_triton
 @pytest.mark.parametrize("dtype", [torch.float32, torch.float16, torch.bfloat16])
-@pytest.mark.parametrize("seed", [0, 1, 2])
-@pytest.mark.parametrize("num_blocks", [4, 64, 1024])
+@pytest.mark.parametrize("num_blocks", [4, 1024])
+@pytest.mark.parametrize("seed", [0, 1])
 def test_parity_random_weights(seed, num_blocks, dtype):
     """Triton sweep must produce the exact same per-block amax as the reference,
     across every dtype supported by the NVFP4 quantizer (fp32, fp16, bf16)."""
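For context on the case count: stacked `@pytest.mark.parametrize` decorators expand into the Cartesian product of all axes, which is where 2 × 2 × 3 = 12 comes from. A minimal sketch of that counting using plain `itertools` (the dtype strings here are illustrative stand-ins for the torch dtypes):

```python
from itertools import product

# Trimmed per-axis ranges from this commit.
seeds = [0, 1]
num_blocks = [4, 1024]          # small/large extremes that drive autotune choices
dtypes = ["float32", "float16", "bfloat16"]

# Stacked parametrize decorators behave like this Cartesian product.
cases = list(product(seeds, num_blocks, dtypes))
print(len(cases))  # → 12 (down from 3 * 3 * 3 = 27)
```

Each tuple in `cases` corresponds to one parametrized invocation of `test_parity_random_weights`, so trimming any axis shrinks the grid multiplicatively.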
