[PyTorch] Refactor grouped linear and grouped MLP tests by timmoon10 · Pull Request #3111 · NVIDIA/TransformerEngine

timmoon10 · 2026-06-10T04:49:36Z

Description

test_fusible_ops.py was becoming a dumping ground for random grouped MLP tests, including tests that didn't involve fusible ops at all. This PR reorganizes the tests so that test_fusible_ops.py holds basic tests for te.ops.GroupedLinear, while test_grouped_linear.py holds the exhaustive tests for all the various grouped MLP fused ops. I've also tried trimming down excessive test parametrization to bring down the test time from ~20 min to ~1 min.

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring
Testing

Changes

Move exhaustive grouped linear and grouped MLP tests from test_fusible_ops.py to test_grouped_linear.py.
Reorganize test_grouped_linear.py into test suites.
Add util functions.
Reduce redundant test cases.

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

…st_grouped_linear

test_fusible_ops.py is the general op-fuser test suite; detailed tests for grouped-linear-specific features belong in test_grouped_linear.py.

Moved to test_grouped_linear.py:
- test_grouped_linear_cuda_graph_safe (CUDA graph capture)
- test_grouped_mlp_single_weight_numerics (single_grouped_weight equivalence)
- test_grouped_mlp_overwrite_main_grad (MegatronFSDP overwrite convention)
- test_grouped_mlp_cuda_graph_safe_mxfp8 (CUDA graph + MXFP8)

Kept in test_fusible_ops.py with reduced parametrization:
- test_grouped_linear: dropped single_grouped_weight/bias and delay_wgrad_compute axes (hardcoded to False)
- test_grouped_mlp: dropped single_grouped_weight/bias, accumulate_into_main_grad, and delay_wgrad_compute axes; reduced hidden_size to a single value

Also adds NVTE_GROUPED_LINEAR_SINGLE_PARAM=1 and NVTE_CUTEDSL_FUSED_GROUPED_MLP=1 to the test_grouped_linear.py invocation in the QA script, matching the env vars already set for test_fusible_ops.py and required by the moved tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>

- Simplify test_grouped_mlp to a focused sanity test:
  - Hardcode ScaledSwiGLU (drop 4-way activation + glu_interleave_size axes)
  - Drop single_grouped_weight/bias, accumulate_into_main_grad, delay_wgrad_compute branches (all were defaulted False and never parametrized)
  - Remove fused-op dispatch assertions (ForwardGroupedMLP_CuTeGEMMGLU etc.) that required NVTE_CUTEDSL_FUSED_GROUPED_MLP=1
  - Switch quantization list from _grouped_mlp_quantization_list to _quantization_list (drops nvfp4_rht which was always skipped for SwiGLU)
- Clean up test_grouped_linear:
  - Remove env-var skip for single_grouped_weight/bias (dead code: those params are not parametrized here, so the condition never triggered)
  - Switch assertions from torch.testing.assert_close with manual .to(float64) to assert_close / assert_close_grads utilities
- Remove now-unused imports: _cudnn_frontend_*, is_glu_activation, MegatronTrainingHelper, _grouped_mlp_quantization_list
- QA script: drop NVTE_GROUPED_LINEAR_SINGLE_PARAM=1 and NVTE_CUTEDSL_FUSED_GROUPED_MLP=1 from test_fusible_ops.py invocation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>

The test validates that grouped_gemm_quant_wrapper_sm100 (a cuTE DSL kernel internal to the grouped MLP fusion) matches MXFP8 quantizer output. It does not exercise the op-fuser infrastructure at all, so it belongs in test_grouped_linear.py alongside the other grouped-MLP-specific tests.

Also removes the now-unused `import transformer_engine_torch as tex` from test_fusible_ops.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>

test_grouped_linear.py only tested te.GroupedLinear (the high-level module API). The test_grouped_linear test in test_fusible_ops.py covers te.ops.GroupedLinear (the fuser/ops API), a different thing. This commit brings that coverage into test_grouped_linear.py.

Additions:
- maybe_skip_quantization: skip helper for hardware/dim/dtype checks
- make_reference_and_test_tensors: paired float64/CPU reference and target-dtype/CUDA test tensor construction (with quantization)
- _ops_quantization_list: quantization parameter list for ops tests
- test_ops_grouped_linear: full port of test_grouped_linear from test_fusible_ops.py, including the three axes that were stripped during cleanup (delay_wgrad_compute, single_grouped_weight, single_grouped_bias); uses te.ops.GroupedLinear throughout

Also adds Float8CurrentScalingQuantizer, QuantizedTensor, QuantizerRole to imports, and import math.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>

Structural changes:
- Organize into three labeled sections: te.GroupedLinear (module API), raw GEMM kernels (cpp_extensions), te.ops.GroupedLinear (ops/fuser API)
- Move fused-path helpers and tests (_reset_fp8_state, _run_grouped_linear_path, test_grouped_linear_grouped_tensor_path_*, test_grouped_linear_fused_path_*) before the GEMM section so all te.GroupedLinear tests are contiguous

Fold redundant tests:
- test_grouped_linear_accuracy_single_gemm → num_gemms=[1,3,6]
- test_padding_grouped_linear_accuracy_save_original_input → save_original_input parametrize axis on test_padding_grouped_linear_accuracy
- test_grouped_linear_grouped_tensor_path_single_grouped_bias_delay_wgrad → single_grouped_bias parametrize axis on test_grouped_linear_grouped_tensor_path_matches_legacy

Fix test_grouped_mlp reference precision and tolerances:
- Replace float32 reference tensors with make_reference_and_test_tensors (float64 CPU reference, target dtype CUDA test)
- Replace loose rtol=0.125/0.25 with dtype_tols()/quantization_tols()

Promote shared helpers to utils.py:
- maybe_skip_quantization: was duplicated in test_fusible_ops.py and test_grouped_linear.py; now in utils.py
- make_reference_and_test_tensors: same
- dtype_tols: was duplicated locally in test_grouped_linear.py; now imported from utils.py (where it already existed)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
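For readers unfamiliar with the folding pattern in this commit, here is a minimal sketch of turning a dedicated single-GEMM test into one value on a parametrize axis. The function names and the stand-in body are hypothetical and are not the code in test_grouped_linear.py; only the num_gemms values [1, 3, 6] come from the commit message.

```python
import pytest


def run_grouped_linear_accuracy(num_gemms: int) -> int:
    """Hypothetical stand-in for the real accuracy check.

    Returns how many GEMMs were "exercised" so the sketch has something to assert.
    """
    return len([object() for _ in range(num_gemms)])


# Before: a separate test covered only the num_gemms == 1 case.
# After: the single-GEMM case is just one value on the num_gemms axis.
@pytest.mark.parametrize("num_gemms", [1, 3, 6])
def test_grouped_linear_accuracy_sketch(num_gemms: int) -> None:
    assert run_grouped_linear_accuracy(num_gemms) == num_gemms
```

Under pytest this collects three cases from one function, so shared setup lives in a single place.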

Group the four logical sections into test classes so structure is enforced by Python rather than relying on comments:
- TestGroupedLinearModule: te.GroupedLinear module API tests
- TestGroupedGemm: raw grouped GEMM kernel (cpp_extensions) tests
- TestOpsGroupedLinear: te.ops.GroupedLinear tests
- TestGroupedMLP: grouped MLP pattern tests

Each class carries autouse fixtures for the environment variables it needs (NVTE_GROUPED_LINEAR_USE_FUSED_GROUPED_GEMM, NVTE_GROUPED_LINEAR_SINGLE_PARAM, NVTE_CUTEDSL_FUSED_GROUPED_MLP), replacing runtime os.environ skip checks.

Also fix three coverage/correctness issues flagged in review:
- _ALL_BOOLEAN / _mxfp8_available were aliases defined after the class that used them in decorators, causing NameError at collection time; replaced with the originals (all_boolean, mxfp8_available, etc.)
- test_grouped_mlp was missing hidden_size=256 from parametrization
- test_grouped_mlp weight tensors were missing quantizer_role="weight", which matters for NVFP4 RHT quantization behavior

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>

Pull repeated grouped-linear test setup into shared helpers for environment toggles, split-size construction, grouped parameter copying, and grad collection. This keeps the module, ops, and grouped-MLP coverage aligned around the same single-grouped-parameter conventions instead of duplicating local ad hoc loops.

Also make the grouped tensor path comparison look at the actual single grouped bias parameter when that mode is enabled, matching the idiom already used by the fusible ops tests.

Signed-off-by: Tim Moon <tmoon@nvidia.com>

Test was inadvertently disabled, and enabling triggered test failures beyond the scope of this PR. Grouped tensor params are still highly experimental, and it was quite strange to only test grouped bias without grouped weights.

Signed-off-by: Tim Moon <tmoon@nvidia.com>

Prioritize configs that trigger op fusion. Separate parametrized cases for advanced Mcore integrations.

Signed-off-by: Tim Moon <tmoon@nvidia.com>

for more information, see https://pre-commit.ci

greptile-apps · 2026-06-10T04:57:37Z

Greptile Summary

This PR reorganizes grouped linear and grouped MLP tests: exhaustive tests are moved from test_fusible_ops.py into test_grouped_linear.py, which is restructured into four focused test classes (TestGroupedLinearModule, TestGroupedGemm, TestGroupedLinearOps, TestGroupedMLP). Helper utilities (make_reference_and_test_tensors, maybe_skip_quantization) are promoted to shared utils.py.

Test consolidation: test_fusible_ops.py is slimmed down to basic te.ops.GroupedLinear smoke tests; all exhaustive grouped MLP variants (quantization, activation, CUDA graphs, Megatron integrations) live in test_grouped_linear.py.
Env-var management: NVTE_GROUPED_LINEAR_SINGLE_PARAM=1 and NVTE_CUTEDSL_FUSED_GROUPED_MLP=1 are now set per-class via monkeypatch autouse fixtures instead of being global to the test.sh invocation, giving better test isolation.
Reduced parametrization: redundant cross-product test cases are trimmed to bring CI time down from ~20 min to ~1 min.

Confidence Score: 5/5

Pure test refactoring with no production code changes; all coverage is preserved and correctly reorganized.

The changes are limited to test files and the CI shell script. All previously tested configurations remain covered — either directly in the reorganized classes or merged via new parametrization (e.g., save_original_input folded into test_padding_grouped_linear_accuracy). The only minor issue found is an unused is_glu_activation import introduced during the move.

tests/pytorch/test_grouped_linear.py has a dead import (is_glu_activation) that can be dropped.

Important Files Changed

Filename	Overview
tests/pytorch/test_grouped_linear.py	Exhaustive grouped linear/MLP tests reorganized into four classes with env-var management moved from module-level fixtures to per-class monkeypatch fixtures; one unused import (is_glu_activation) was introduced.
tests/pytorch/utils.py	Extracts make_reference_and_test_tensors and maybe_skip_quantization from test_fusible_ops.py into shared utils; adds quantizer availability globals and a BF16 hardware availability guard.
tests/pytorch/test_fusible_ops.py	Removes exhaustive grouped MLP tests and local helper functions now in test_grouped_linear.py/utils.py; keeps a slim smoke-test for te.ops.GroupedLinear basics.
qa/L0_pytorch_unittest/test.sh	Drops global NVTE_GROUPED_LINEAR_SINGLE_PARAM=1 and NVTE_CUTEDSL_FUSED_GROUPED_MLP=1 from the test_fusible_ops.py invocation; those env vars are now managed per-class via monkeypatch fixtures.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    subgraph Before
        A[test_fusible_ops.py\nGroupedLinear + GroupedMLP tests\n+ helper functions]
        B[test_grouped_linear.py\nModule-level tests\n_reset_fp8_state autouse fixture]
        C[utils.py\nNo quantization helpers]
    end
    subgraph After
        D[test_fusible_ops.py\nBasic GroupedLinear smoke tests only]
        E[test_grouped_linear.py\nTestGroupedLinearModule\nTestGroupedGemm\nTestGroupedLinearOps\nTestGroupedMLP]
        F[utils.py\nmaybe_skip_quantization\nmake_reference_and_test_tensors\nQuantizer availability globals]
    end
    A -->|move exhaustive tests| E
    B -->|reorganize into classes| E
    A -->|extract helpers| F
    C -->|add helpers| F
    A -->|keep slim| D

Reviews (2): Last reviewed commit: "Merge branch 'main' into tmoon/refactor-..."

greptile-apps · 2026-06-10T04:57:40Z

+    @pytest.mark.parametrize("quantization", _quantization_list)
+    @pytest.mark.parametrize("quantized_compute", (False, True))
+    @pytest.mark.parametrize("quantized_weight", (False, True))
+    @pytest.mark.parametrize("input_requires_grad", (False, True))
+    @pytest.mark.parametrize("weight_requires_grad", (False, True))
+    def test_grouped_linear(
+        self,
+        *,
+        group_size: int = 4,
+        bias: bool,
+        weight_shape: tuple = (128, 128),
+        split_alignment: int = 128,
+        dtype: torch.dtype,
+        device: torch.device = "cuda",
+        quantization: Optional[str],
+        quantized_compute: bool,
+        quantized_weight: bool,
+        input_requires_grad: bool,
+        weight_requires_grad: bool,
+        delay_wgrad_compute: bool,
+        single_grouped_weight: bool,
+        single_grouped_bias: bool,
+    ) -> None:
+        """te.ops.GroupedLinear forward+backward accuracy"""

+        # Split sizes
+        split_sizes = _make_grouped_split_sizes(
+            group_size,
+            split_alignment,
+            dtype=torch.int,
+            device=device,
+        )

-def test_grouped_linear_grouped_tensor_path_single_grouped_bias_delay_wgrad(monkeypatch):
-    if torch.cuda.get_device_capability() < (10, 0):
-        pytest.skip("GroupedTensor grouped GEMM path requires SM100+")
+        # Make input and weight shapes consistent
+        out_features, in_features = weight_shape
+        in_shape = (split_sizes.sum().item(), in_features)
+        out_shape = (in_shape[0], out_features)
+
+        # Skip invalid configurations
+        maybe_skip_quantization(quantization, dims=in_shape, device=device, dtype=dtype)
+        maybe_skip_quantization(quantization, dims=out_shape)
+        if quantization is None and (quantized_compute or quantized_weight):
+            pytest.skip("Quantization scheme is not specified")
+        if quantization is not None and not (quantized_compute or quantized_weight):
+            pytest.skip("Quantization scheme is not used")
+        if quantization is not None and dtype not in (torch.bfloat16, torch.float16):
+            pytest.skip("Quantized group GEMM is only supported with BF16/FP16")
+        if single_grouped_bias and not bias:
+            pytest.skip("single_grouped_bias requires bias=True")
+        if (
+            single_grouped_weight
+            and quantized_weight
+            and quantization in ("fp8_delayed_scaling", "fp8_current_scaling")
+        ):
+            pytest.skip(
+                "single_grouped_weight does not support FP8 delayed/current scaling "
+                "with quantized_model_init"
+            )


Missing nvfp4_4over6 skip in TestGroupedLinearOps.test_grouped_linear

The old test_fusible_ops.py::TestBasicOps::test_grouped_linear explicitly skipped nvfp4_4over6 with the message "NVFP4 4over6 grouped quantization is not supported". This skip was dropped when the test was moved here. Meanwhile, the skip is still present in _skip_invalid_grouped_mlp_case (line 540) for the grouped MLP tests. If this limitation still applies to the basic te.ops.GroupedLinear path, then on systems with NVFP4 support this test will fail at runtime rather than being skipped.

It turns out the grouped linear op does actually support NVFP4 4over6, so there's no harm in including it in tests. However, the grouped MLP fused op definitely does not support NVFP4 4over6 and test_grouped_mlp is focused on cases that involve the fused op.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
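To make the skip logic in the hunk above easier to reason about, here is a sketch that restates its inline conditions as a pure function. The function name `skip_reason` and the use of dtype strings instead of torch dtypes are assumptions for illustration; the hardware and shape checks done by maybe_skip_quantization are deliberately left out.

```python
from typing import Optional


def skip_reason(
    *,
    quantization: Optional[str],
    quantized_compute: bool,
    quantized_weight: bool,
    dtype: str,
    bias: bool,
    single_grouped_weight: bool,
    single_grouped_bias: bool,
) -> Optional[str]:
    """Return the skip message for an invalid combination, or None if the case should run."""
    uses_quantization = quantized_compute or quantized_weight
    if quantization is None and uses_quantization:
        return "Quantization scheme is not specified"
    if quantization is not None and not uses_quantization:
        return "Quantization scheme is not used"
    if quantization is not None and dtype not in ("bfloat16", "float16"):
        return "Quantized group GEMM is only supported with BF16/FP16"
    if single_grouped_bias and not bias:
        return "single_grouped_bias requires bias=True"
    if (
        single_grouped_weight
        and quantized_weight
        and quantization in ("fp8_delayed_scaling", "fp8_current_scaling")
    ):
        return (
            "single_grouped_weight does not support FP8 delayed/current scaling "
            "with quantized_model_init"
        )
    return None


# Baseline configuration used for the examples; values are illustrative.
base = dict(
    quantization=None,
    quantized_compute=False,
    quantized_weight=False,
    dtype="bfloat16",
    bias=True,
    single_grouped_weight=False,
    single_grouped_bias=False,
)
```

Consistent with the reply above, `nvfp4_4over6` has no dedicated branch here, so a case such as `skip_reason(**{**base, "quantization": "nvfp4_4over6", "quantized_compute": True})` is not skipped by these conditions.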

timmoon10 · 2026-06-10T05:13:15Z

/te-ci pytorch

timmoon10 and others added 9 commits June 8, 2026 22:01

Reduce grouped MLP test cases

4368cb9

Prioritize configs that trigger op fusion. Separate parametrized cases for advanced Mcore integrations.

Signed-off-by: Tim Moon <tmoon@nvidia.com>

timmoon10 requested review from ksivaman and vthumbe1503 June 10, 2026 04:49

timmoon10 added the testing (Improvements to tests or testing infrastructure), refactor, and MoE labels Jun 10, 2026

[pre-commit.ci] auto fixes from pre-commit.com hooks

168a9ef

for more information, see https://pre-commit.ci

greptile-apps Bot reviewed Jun 10, 2026

timmoon10 added 2 commits June 10, 2026 05:11

Review suggestion from @greptile-apps

90586fb

Signed-off-by: Tim Moon <tmoon@nvidia.com>

Merge branch 'main' into tmoon/refactor-grouped-mlp-tests

52ad683

timmoon10 wants to merge 12 commits into NVIDIA:main from timmoon10:tmoon/refactor-grouped-mlp-tests
