
[PyTorch] Refactor grouped linear and grouped MLP tests #3111

Open
timmoon10 wants to merge 12 commits into NVIDIA:main from timmoon10:tmoon/refactor-grouped-mlp-tests

@timmoon10

Member

Description

test_fusible_ops.py was becoming a dumping ground for random grouped MLP tests, including tests that didn't involve fusible ops at all. This PR reorganizes the tests so that test_fusible_ops.py holds basic tests for te.ops.GroupedLinear, while test_grouped_linear.py holds the exhaustive tests for all the various grouped MLP fused ops. I've also tried trimming down excessive test parametrization to bring down the test time from ~20 min to ~1 min.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring
  • Testing

Changes

  • Move exhaustive grouped linear and grouped MLP tests from test_fusible_ops.py to test_grouped_linear.py.
  • Reorganize test_grouped_linear.py into test suites.
  • Add util functions.
  • Reduce redundant test cases.

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

timmoon10 and others added 9 commits June 8, 2026 22:01
…st_grouped_linear

test_fusible_ops.py is the general op-fuser test suite; detailed tests
for grouped-linear-specific features belong in test_grouped_linear.py.

Moved to test_grouped_linear.py:
- test_grouped_linear_cuda_graph_safe (CUDA graph capture)
- test_grouped_mlp_single_weight_numerics (single_grouped_weight equivalence)
- test_grouped_mlp_overwrite_main_grad (MegatronFSDP overwrite convention)
- test_grouped_mlp_cuda_graph_safe_mxfp8 (CUDA graph + MXFP8)

Kept in test_fusible_ops.py with reduced parametrization:
- test_grouped_linear: dropped single_grouped_weight/bias and
  delay_wgrad_compute axes (hardcoded to False)
- test_grouped_mlp: dropped single_grouped_weight/bias,
  accumulate_into_main_grad, and delay_wgrad_compute axes; reduced
  hidden_size to a single value
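
To make the effect of dropping parametrization axes concrete, here is a minimal sketch that counts cross-product test cases before and after hardcoding three boolean axes to False. The axis set is illustrative only (an assumption, not the real parameter lists in test_fusible_ops.py), and `count_cases` is a hypothetical helper.

```python
from itertools import product

# Illustrative axes only: the real parameter lists live in the test files.
full_axes = {
    "bias": (False, True),
    "quantized_compute": (False, True),
    "single_grouped_weight": (False, True),
    "single_grouped_bias": (False, True),
    "delay_wgrad_compute": (False, True),
}

# Axes that this commit hardcodes to False instead of parametrizing.
dropped = ("single_grouped_weight", "single_grouped_bias", "delay_wgrad_compute")


def count_cases(axes):
    """Count the test cases generated by stacking one parametrize per axis."""
    return sum(1 for _ in product(*axes.values()))


reduced_axes = {name: vals for name, vals in full_axes.items() if name not in dropped}

full_count = count_cases(full_axes)        # 2**5 = 32
reduced_count = count_cases(reduced_axes)  # 2**2 = 4
print(f"{full_count} cases -> {reduced_count} cases")
```

Each dropped boolean axis halves the case count, which is why removing axes that are already covered elsewhere is an effective way to cut test time.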

Also adds NVTE_GROUPED_LINEAR_SINGLE_PARAM=1 and
NVTE_CUTEDSL_FUSED_GROUPED_MLP=1 to the test_grouped_linear.py
invocation in the QA script, matching the env vars already set for
test_fusible_ops.py and required by the moved tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>

- Simplify test_grouped_mlp to a focused sanity test:
  - Hardcode ScaledSwiGLU (drop 4-way activation + glu_interleave_size axes)
  - Drop single_grouped_weight/bias, accumulate_into_main_grad, delay_wgrad_compute
    branches (all were defaulted False and never parametrized)
  - Remove fused-op dispatch assertions (ForwardGroupedMLP_CuTeGEMMGLU etc.)
    that required NVTE_CUTEDSL_FUSED_GROUPED_MLP=1
  - Switch quantization list from _grouped_mlp_quantization_list to
    _quantization_list (drops nvfp4_rht which was always skipped for SwiGLU)
- Clean up test_grouped_linear:
  - Remove env-var skip for single_grouped_weight/bias (dead code: those params
    are not parametrized here, so the condition never triggered)
  - Switch assertions from torch.testing.assert_close with manual .to(float64)
    to assert_close / assert_close_grads utilities
- Remove now-unused imports: _cudnn_frontend_*, is_glu_activation,
  MegatronTrainingHelper, _grouped_mlp_quantization_list
- QA script: drop NVTE_GROUPED_LINEAR_SINGLE_PARAM=1 and
  NVTE_CUTEDSL_FUSED_GROUPED_MLP=1 from test_fusible_ops.py invocation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>

The test validates that grouped_gemm_quant_wrapper_sm100 (a cuTE DSL
kernel internal to the grouped MLP fusion) matches MXFP8 quantizer
output. It does not exercise the op-fuser infrastructure at all, so
it belongs in test_grouped_linear.py alongside the other grouped-MLP-
specific tests.

Also removes the now-unused `import transformer_engine_torch as tex`
from test_fusible_ops.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>

test_grouped_linear.py only tested te.GroupedLinear (the high-level
module API). The test_grouped_linear test in test_fusible_ops.py
covers te.ops.GroupedLinear (the fuser/ops API) — a different thing.
This commit brings that coverage into test_grouped_linear.py.

Additions:
- maybe_skip_quantization: skip helper for hardware/dim/dtype checks
- make_reference_and_test_tensors: paired float64/CPU reference and
  target-dtype/CUDA test tensor construction (with quantization)
- _ops_quantization_list: quantization parameter list for ops tests
- test_ops_grouped_linear: full port of test_grouped_linear from
  test_fusible_ops.py, including the three axes that were stripped
  during cleanup (delay_wgrad_compute, single_grouped_weight,
  single_grouped_bias); uses te.ops.GroupedLinear throughout

Also adds Float8CurrentScalingQuantizer, QuantizedTensor, QuantizerRole
to imports, and import math.
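
As a rough illustration of the paired reference/test tensor idea described above, the following sketch builds a float64 CPU reference and a lower-precision test tensor that hold exactly the same values. `make_ref_and_test_pair` is a hypothetical, simplified stand-in: the real make_reference_and_test_tensors also handles quantization, and its actual signature is not shown in this PR. The test device defaults to CPU here only so the sketch runs without a GPU.

```python
import torch


def make_ref_and_test_pair(
    shape,
    *,
    test_dtype=torch.bfloat16,
    test_device="cpu",  # assumption: the real tests target "cuda"
    requires_grad=True,
):
    """Hypothetical helper: float64 CPU reference plus matching test tensor."""
    ref = torch.rand(shape, dtype=torch.float64, device="cpu")
    # copy=True guarantees a distinct tensor even if dtype and device already match.
    test = ref.to(device=test_device, dtype=test_dtype, copy=True)
    # Round-trip the values so the reference only holds numbers the test dtype can
    # represent; otherwise input rounding would be counted as kernel error.
    ref.copy_(test)
    ref.requires_grad_(requires_grad)
    test.requires_grad_(requires_grad)
    return ref, test


ref, test = make_ref_and_test_pair((4, 8))
print(ref.dtype, test.dtype)
```

Computing the expected result in float64 from `ref` and comparing it against the output computed from `test` keeps the comparison limited to the error introduced by the operation under test.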

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>

Structural changes:
- Organize into three labeled sections: te.GroupedLinear (module API),
  raw GEMM kernels (cpp_extensions), te.ops.GroupedLinear (ops/fuser API)
- Move fused-path helpers and tests (_reset_fp8_state, _run_grouped_linear_path,
  test_grouped_linear_grouped_tensor_path_*, test_grouped_linear_fused_path_*)
  before the GEMM section so all te.GroupedLinear tests are contiguous

Fold redundant tests:
- test_grouped_linear_accuracy_single_gemm → num_gemms=[1,3,6]
- test_padding_grouped_linear_accuracy_save_original_input →
  save_original_input parametrize axis on test_padding_grouped_linear_accuracy
- test_grouped_linear_grouped_tensor_path_single_grouped_bias_delay_wgrad →
  single_grouped_bias parametrize axis on test_grouped_linear_grouped_tensor_path_matches_legacy

Fix test_grouped_mlp reference precision and tolerances:
- Replace float32 reference tensors with make_reference_and_test_tensors
  (float64 CPU reference, target dtype CUDA test)
- Replace loose rtol=0.125/0.25 with dtype_tols()/quantization_tols()

Promote shared helpers to utils.py:
- maybe_skip_quantization: was duplicated in test_fusible_ops.py and
  test_grouped_linear.py; now in utils.py
- make_reference_and_test_tensors: same
- dtype_tols: was duplicated locally in test_grouped_linear.py; now
  imported from utils.py (where it already existed)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>

Group the four logical sections into test classes so structure is
enforced by Python rather than relying on comments:
- TestGroupedLinearModule: te.GroupedLinear module API tests
- TestGroupedGemm: raw grouped GEMM kernel (cpp_extensions) tests
- TestOpsGroupedLinear: te.ops.GroupedLinear tests
- TestGroupedMLP: grouped MLP pattern tests

Each class carries autouse fixtures for the environment variables it
needs (NVTE_GROUPED_LINEAR_USE_FUSED_GROUPED_GEMM,
NVTE_GROUPED_LINEAR_SINGLE_PARAM, NVTE_CUTEDSL_FUSED_GROUPED_MLP),
replacing runtime os.environ skip checks.

Also fix three coverage/correctness issues flagged in review:
- _ALL_BOOLEAN / _mxfp8_available were aliases defined after the class
  that used them in decorators, causing NameError at collection time;
  replaced with the originals (all_boolean, mxfp8_available, etc.)
- test_grouped_mlp was missing hidden_size=256 from parametrization
- test_grouped_mlp weight tensors were missing quantizer_role="weight",
  which matters for NVFP4 RHT quantization behavior
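
The NameError mentioned in the first item comes from decorator arguments being evaluated while the class body executes, which happens at import (collection) time. A minimal sketch, using a hypothetical `parametrize` stand-in rather than pytest itself so that it runs with the standard library only:

```python
def parametrize(values):
    """Hypothetical stand-in for a decorator factory like pytest.mark.parametrize."""

    def decorator(fn):
        fn.values = tuple(values)
        return fn

    return decorator


# The alias is defined after the class whose decorators reference it.
broken_source = """
class TestBroken:
    @parametrize(_ALL_BOOLEAN)  # evaluated while the class body runs
    def test_case(self): ...

_ALL_BOOLEAN = (False, True)  # too late: the class statement already failed
"""

try:
    exec(broken_source, {"parametrize": parametrize})
    raised = None
except NameError as err:
    raised = err

print(type(raised).__name__)

# Fix: reference a name that already exists when the class body executes.
all_boolean = (False, True)


class TestFixed:
    @parametrize(all_boolean)
    def test_case(self): ...
```

Function bodies can refer to names defined later in the module, but decorator expressions cannot, so aliases used in decorators must be defined above the class.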

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>

Pull repeated grouped-linear test setup into shared helpers for environment toggles, split-size construction, grouped parameter copying, and grad collection. This keeps the module, ops, and grouped-MLP coverage aligned around the same single-grouped-parameter conventions instead of duplicating local ad hoc loops.

Also make the grouped tensor path comparison look at the actual single grouped bias parameter when that mode is enabled, matching the idiom already used by the fusible ops tests.

Signed-off-by: Tim Moon <tmoon@nvidia.com>

Test was inadvertently disabled, and enabling it triggered test failures beyond the scope of this PR. Grouped tensor params are still highly experimental, and it was quite strange to only test grouped bias without grouped weights.

Signed-off-by: Tim Moon <tmoon@nvidia.com>

Prioritize configs that trigger op fusion. Separate parametrized cases for advanced Mcore integrations.

Signed-off-by: Tim Moon <tmoon@nvidia.com>

@timmoon10 timmoon10 added the testing (Improvements to tests or testing infrastructure), refactor, and MoE labels Jun 10, 2026
@greptile-apps

greptile-apps Bot commented Jun 10, 2026

Contributor

Greptile Summary

This PR reorganizes grouped linear and grouped MLP tests: exhaustive tests are moved from test_fusible_ops.py into test_grouped_linear.py, which is restructured into four focused test classes (TestGroupedLinearModule, TestGroupedGemm, TestGroupedLinearOps, TestGroupedMLP). Helper utilities (make_reference_and_test_tensors, maybe_skip_quantization) are promoted to shared utils.py.

  • Test consolidation: test_fusible_ops.py is slimmed down to basic te.ops.GroupedLinear smoke tests; all exhaustive grouped MLP variants (quantization, activation, CUDA graphs, Megatron integrations) live in test_grouped_linear.py.
  • Env-var management: NVTE_GROUPED_LINEAR_SINGLE_PARAM=1 and NVTE_CUTEDSL_FUSED_GROUPED_MLP=1 are now set per-class via monkeypatch autouse fixtures instead of being global to the test.sh invocation, giving better test isolation.
  • Reduced parametrization: redundant cross-product test cases are trimmed to bring CI time down from ~20 min to ~1 min.

Confidence Score: 5/5

Pure test refactoring with no production code changes; all coverage is preserved and correctly reorganized.

The changes are limited to test files and the CI shell script. All previously tested configurations remain covered — either directly in the reorganized classes or merged via new parametrization (e.g., save_original_input folded into test_padding_grouped_linear_accuracy). The only minor issue found is an unused is_glu_activation import introduced during the move.

tests/pytorch/test_grouped_linear.py has a dead import (is_glu_activation) that can be dropped.

Important Files Changed

| Filename | Overview |
|---|---|
| tests/pytorch/test_grouped_linear.py | Exhaustive grouped linear/MLP tests reorganized into four classes with env-var management moved from module-level fixtures to per-class monkeypatch fixtures; one unused import (is_glu_activation) was introduced. |
| tests/pytorch/utils.py | Extracts make_reference_and_test_tensors and maybe_skip_quantization from test_fusible_ops.py into shared utils; adds quantizer availability globals and a BF16 hardware availability guard. |
| tests/pytorch/test_fusible_ops.py | Removes exhaustive grouped MLP tests and local helper functions now in test_grouped_linear.py/utils.py; keeps a slim smoke-test for te.ops.GroupedLinear basics. |
| qa/L0_pytorch_unittest/test.sh | Drops global NVTE_GROUPED_LINEAR_SINGLE_PARAM=1 and NVTE_CUTEDSL_FUSED_GROUPED_MLP=1 from the test_fusible_ops.py invocation; those env vars are now managed per-class via monkeypatch fixtures. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    subgraph Before
        A[test_fusible_ops.py\nGroupedLinear + GroupedMLP tests\n+ helper functions]
        B[test_grouped_linear.py\nModule-level tests\n_reset_fp8_state autouse fixture]
        C[utils.py\nNo quantization helpers]
    end
    subgraph After
        D[test_fusible_ops.py\nBasic GroupedLinear smoke tests only]
        E[test_grouped_linear.py\nTestGroupedLinearModule\nTestGroupedGemm\nTestGroupedLinearOps\nTestGroupedMLP]
        F[utils.py\nmaybe_skip_quantization\nmake_reference_and_test_tensors\nQuantizer availability globals]
    end
    A -->|move exhaustive tests| E
    B -->|reorganize into classes| E
    A -->|extract helpers| F
    C -->|add helpers| F
    A -->|keep slim| D

Reviews (2): Last reviewed commit: "Merge branch 'main' into tmoon/refactor-..."

Comment on lines +2213 to +2270
+    @pytest.mark.parametrize("quantization", _quantization_list)
+    @pytest.mark.parametrize("quantized_compute", (False, True))
+    @pytest.mark.parametrize("quantized_weight", (False, True))
+    @pytest.mark.parametrize("input_requires_grad", (False, True))
+    @pytest.mark.parametrize("weight_requires_grad", (False, True))
+    def test_grouped_linear(
+        self,
+        *,
+        group_size: int = 4,
+        bias: bool,
+        weight_shape: tuple = (128, 128),
+        split_alignment: int = 128,
+        dtype: torch.dtype,
+        device: torch.device = "cuda",
+        quantization: Optional[str],
+        quantized_compute: bool,
+        quantized_weight: bool,
+        input_requires_grad: bool,
+        weight_requires_grad: bool,
+        delay_wgrad_compute: bool,
+        single_grouped_weight: bool,
+        single_grouped_bias: bool,
+    ) -> None:
+        """te.ops.GroupedLinear forward+backward accuracy"""
+
+        # Split sizes
+        split_sizes = _make_grouped_split_sizes(
+            group_size,
+            split_alignment,
+            dtype=torch.int,
+            device=device,
+        )
+
-def test_grouped_linear_grouped_tensor_path_single_grouped_bias_delay_wgrad(monkeypatch):
-    if torch.cuda.get_device_capability() < (10, 0):
-        pytest.skip("GroupedTensor grouped GEMM path requires SM100+")
+        # Make input and weight shapes consistent
+        out_features, in_features = weight_shape
+        in_shape = (split_sizes.sum().item(), in_features)
+        out_shape = (in_shape[0], out_features)
+
+        # Skip invalid configurations
+        maybe_skip_quantization(quantization, dims=in_shape, device=device, dtype=dtype)
+        maybe_skip_quantization(quantization, dims=out_shape)
+        if quantization is None and (quantized_compute or quantized_weight):
+            pytest.skip("Quantization scheme is not specified")
+        if quantization is not None and not (quantized_compute or quantized_weight):
+            pytest.skip("Quantization scheme is not used")
+        if quantization is not None and dtype not in (torch.bfloat16, torch.float16):
+            pytest.skip("Quantized group GEMM is only supported with BF16/FP16")
+        if single_grouped_bias and not bias:
+            pytest.skip("single_grouped_bias requires bias=True")
+        if (
+            single_grouped_weight
+            and quantized_weight
+            and quantization in ("fp8_delayed_scaling", "fp8_current_scaling")
+        ):
+            pytest.skip(
+                "single_grouped_weight does not support FP8 delayed/current scaling "
+                "with quantized_model_init"
+            )

Contributor

P1 Missing nvfp4_4over6 skip in TestGroupedLinearOps.test_grouped_linear

The old test_fusible_ops.py::TestBasicOps::test_grouped_linear explicitly skipped nvfp4_4over6 with the message "NVFP4 4over6 grouped quantization is not supported". This skip was dropped when the test was moved here. Meanwhile, the skip is still present in _skip_invalid_grouped_mlp_case (line 540) for the grouped MLP tests. If this limitation still applies to the basic te.ops.GroupedLinear path, then on systems with NVFP4 support this test will fail at runtime rather than being skipped.

@timmoon10 timmoon10 Jun 10, 2026

Member Author

It turns out the grouped linear op does actually support NVFP4 4over6, so there's no harm in including it in tests. However, the grouped MLP fused op definitely does not support NVFP4 4over6 and test_grouped_mlp is focused on cases that involve the fused op.
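
The distinction in this thread can be sketched as two separate skip predicates. Both functions below are hypothetical, standard-library-only illustrations: the first mirrors the pure-Python conditions visible in the quoted hunk (hardware checks via maybe_skip_quantization are omitted, and dtype is a plain string here), and the second reflects the limitation of the grouped MLP fused op described above.

```python
from typing import Optional


def grouped_linear_skip_reason(
    *,
    quantization: Optional[str],
    quantized_compute: bool,
    quantized_weight: bool,
    dtype: str,
    bias: bool,
    single_grouped_weight: bool,
    single_grouped_bias: bool,
) -> Optional[str]:
    """Hypothetical: return a skip message, or None if the case should run."""
    uses_quantization = quantized_compute or quantized_weight
    if quantization is None and uses_quantization:
        return "Quantization scheme is not specified"
    if quantization is not None and not uses_quantization:
        return "Quantization scheme is not used"
    if quantization is not None and dtype not in ("bfloat16", "float16"):
        return "Quantized group GEMM is only supported with BF16/FP16"
    if single_grouped_bias and not bias:
        return "single_grouped_bias requires bias=True"
    if (
        single_grouped_weight
        and quantized_weight
        and quantization in ("fp8_delayed_scaling", "fp8_current_scaling")
    ):
        return (
            "single_grouped_weight does not support FP8 delayed/current scaling "
            "with quantized_model_init"
        )
    return None  # nvfp4_4over6 is deliberately not skipped for the basic op


def grouped_mlp_skip_reason(quantization: Optional[str]) -> Optional[str]:
    """Hypothetical: the fused grouped MLP op does not support NVFP4 4over6."""
    if quantization == "nvfp4_4over6":
        return "NVFP4 4over6 grouped quantization is not supported"
    return None
```

In a test, a non-None result would be passed to pytest.skip; keeping the predicates separate makes it explicit that the NVFP4 4over6 restriction applies only to the fused grouped MLP path.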

Comment thread tests/pytorch/utils.py
@timmoon10

Member Author

/te-ci pytorch


Labels

MoE, refactor, testing (Improvements to tests or testing infrastructure)


1 participant