[PyTorch] Refactor grouped linear and grouped MLP tests #3111
timmoon10 wants to merge 12 commits into
…st_grouped_linear

test_fusible_ops.py is the general op-fuser test suite; detailed tests for grouped-linear-specific features belong in test_grouped_linear.py.

Moved to test_grouped_linear.py:
- test_grouped_linear_cuda_graph_safe (CUDA graph capture)
- test_grouped_mlp_single_weight_numerics (single_grouped_weight equivalence)
- test_grouped_mlp_overwrite_main_grad (MegatronFSDP overwrite convention)
- test_grouped_mlp_cuda_graph_safe_mxfp8 (CUDA graph + MXFP8)

Kept in test_fusible_ops.py with reduced parametrization:
- test_grouped_linear: dropped single_grouped_weight/bias and delay_wgrad_compute axes (hardcoded to False)
- test_grouped_mlp: dropped single_grouped_weight/bias, accumulate_into_main_grad, and delay_wgrad_compute axes; reduced hidden_size to a single value

Also adds NVTE_GROUPED_LINEAR_SINGLE_PARAM=1 and NVTE_CUTEDSL_FUSED_GROUPED_MLP=1 to the test_grouped_linear.py invocation in the QA script, matching the env vars already set for test_fusible_ops.py and required by the moved tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
- Simplify test_grouped_mlp to a focused sanity test:
  - Hardcode ScaledSwiGLU (drop 4-way activation + glu_interleave_size axes)
  - Drop single_grouped_weight/bias, accumulate_into_main_grad, delay_wgrad_compute
    branches (all were defaulted False and never parametrized)
  - Remove fused-op dispatch assertions (ForwardGroupedMLP_CuTeGEMMGLU etc.)
    that required NVTE_CUTEDSL_FUSED_GROUPED_MLP=1
  - Switch quantization list from _grouped_mlp_quantization_list to
    _quantization_list (drops nvfp4_rht which was always skipped for SwiGLU)
- Clean up test_grouped_linear:
  - Remove env-var skip for single_grouped_weight/bias (dead code: those params
    are not parametrized here, so the condition never triggered)
  - Switch assertions from torch.testing.assert_close with manual .to(float64)
    to assert_close / assert_close_grads utilities
- Remove now-unused imports: _cudnn_frontend_*, is_glu_activation,
  MegatronTrainingHelper, _grouped_mlp_quantization_list
- QA script: drop NVTE_GROUPED_LINEAR_SINGLE_PARAM=1 and
  NVTE_CUTEDSL_FUSED_GROUPED_MLP=1 from test_fusible_ops.py invocation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
The test validates that grouped_gemm_quant_wrapper_sm100 (a cuTE DSL kernel internal to the grouped MLP fusion) matches MXFP8 quantizer output. It does not exercise the op-fuser infrastructure at all, so it belongs in test_grouped_linear.py alongside the other grouped-MLP-specific tests.

Also removes the now-unused `import transformer_engine_torch as tex` from test_fusible_ops.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
test_grouped_linear.py only tested te.GroupedLinear (the high-level module API). The test_grouped_linear test in test_fusible_ops.py covers te.ops.GroupedLinear (the fuser/ops API), which is a different thing. This commit brings that coverage into test_grouped_linear.py.

Additions:
- maybe_skip_quantization: skip helper for hardware/dim/dtype checks
- make_reference_and_test_tensors: paired float64/CPU reference and target-dtype/CUDA test tensor construction (with quantization)
- _ops_quantization_list: quantization parameter list for ops tests
- test_ops_grouped_linear: full port of test_grouped_linear from test_fusible_ops.py, including the three axes that were stripped during cleanup (delay_wgrad_compute, single_grouped_weight, single_grouped_bias); uses te.ops.GroupedLinear throughout

Also adds Float8CurrentScalingQuantizer, QuantizedTensor, QuantizerRole to imports, and import math.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Structural changes:
- Organize into three labeled sections: te.GroupedLinear (module API), raw GEMM kernels (cpp_extensions), te.ops.GroupedLinear (ops/fuser API)
- Move fused-path helpers and tests (_reset_fp8_state, _run_grouped_linear_path, test_grouped_linear_grouped_tensor_path_*, test_grouped_linear_fused_path_*) before the GEMM section so all te.GroupedLinear tests are contiguous

Fold redundant tests:
- test_grouped_linear_accuracy_single_gemm → num_gemms=[1,3,6]
- test_padding_grouped_linear_accuracy_save_original_input → save_original_input parametrize axis on test_padding_grouped_linear_accuracy
- test_grouped_linear_grouped_tensor_path_single_grouped_bias_delay_wgrad → single_grouped_bias parametrize axis on test_grouped_linear_grouped_tensor_path_matches_legacy

Fix test_grouped_mlp reference precision and tolerances:
- Replace float32 reference tensors with make_reference_and_test_tensors (float64 CPU reference, target dtype CUDA test)
- Replace loose rtol=0.125/0.25 with dtype_tols()/quantization_tols()

Promote shared helpers to utils.py:
- maybe_skip_quantization: was duplicated in test_fusible_ops.py and test_grouped_linear.py; now in utils.py
- make_reference_and_test_tensors: same
- dtype_tols: was duplicated locally in test_grouped_linear.py; now imported from utils.py (where it already existed)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
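To illustrate the reference-precision change described in this commit, here is a minimal standard-library sketch of the idea: the reference and test inputs hold identical representable values, the reference math stays in float64, and the comparison uses a dtype-dependent tolerance instead of a loose fixed rtol. The helper names (`to_float16`, `make_reference_and_test`, `is_close`) and the tolerance values are assumptions for illustration only; they are not the real `make_reference_and_test_tensors` or `dtype_tols` from utils.py, which operate on torch tensors.

```python
import struct


def to_float16(x: float) -> float:
    """Round a float64 value to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Assumed tolerances for illustration; not the values returned by dtype_tols().
ASSUMED_TOLS = {"float16": {"rtol": 1e-3, "atol": 1e-5}}


def make_reference_and_test(values):
    """Return (reference, test) lists holding identical, representable values."""
    test = [to_float16(v) for v in values]  # values as the low-precision path sees them
    ref = [float(v) for v in test]          # same values, kept at float64 precision
    return ref, test


def is_close(actual, expected, *, rtol, atol):
    """Tolerance check of the form |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


x_ref, x_test = make_reference_and_test([0.1, 0.2, 0.3, 0.4])
w_ref, w_test = make_reference_and_test([1.5, -0.5, 0.25, 2.0])

# Reference path: full float64 arithmetic.
y_ref = sum(a * b for a, b in zip(x_ref, w_ref))
# Test path: accumulate at high precision, round only the output to float16.
y_test = to_float16(sum(a * b for a, b in zip(x_test, w_test)))

print(is_close(y_test, y_ref, **ASSUMED_TOLS["float16"]))
```

Because both paths start from the same rounded inputs, the only error left to absorb is the rounding in the test path itself, which is why a tight dtype-based tolerance can replace a loose one.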
Group the four logical sections into test classes so structure is enforced by Python rather than relying on comments:
- TestGroupedLinearModule: te.GroupedLinear module API tests
- TestGroupedGemm: raw grouped GEMM kernel (cpp_extensions) tests
- TestOpsGroupedLinear: te.ops.GroupedLinear tests
- TestGroupedMLP: grouped MLP pattern tests

Each class carries autouse fixtures for the environment variables it needs (NVTE_GROUPED_LINEAR_USE_FUSED_GROUPED_GEMM, NVTE_GROUPED_LINEAR_SINGLE_PARAM, NVTE_CUTEDSL_FUSED_GROUPED_MLP), replacing runtime os.environ skip checks.

Also fix three coverage/correctness issues flagged in review:
- _ALL_BOOLEAN / _mxfp8_available were aliases defined after the class that used them in decorators, causing NameError at collection time; replaced with the originals (all_boolean, mxfp8_available, etc.)
- test_grouped_mlp was missing hidden_size=256 from parametrization
- test_grouped_mlp weight tensors were missing quantizer_role="weight", which matters for NVFP4 RHT quantization behavior

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
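The collection-time NameError mentioned in this commit follows from how Python evaluates decorator arguments: they run while the class body executes, so any name they use must already be bound. The sketch below reproduces the failure mode without pytest. The `parametrize` function is a hypothetical stand-in for `pytest.mark.parametrize`, and the two module sources are made up for illustration.

```python
def parametrize(values):
    """Hypothetical stand-in for pytest.mark.parametrize: records the values."""
    def decorator(func):
        func.params = tuple(values)
        return func
    return decorator


# Alias is bound only after the class body that uses it has already run.
BROKEN_MODULE = """
class TestGroupedMLP:
    @parametrize(_ALL_BOOLEAN)
    def test_case(self, flag):
        pass

_ALL_BOOLEAN = (False, True)
"""

# Name is bound before the class body runs.
FIXED_MODULE = """
all_boolean = (False, True)

class TestGroupedMLP:
    @parametrize(all_boolean)
    def test_case(self, flag):
        pass
"""


def try_import(source):
    """Execute module source the way an import would; report a NameError if raised."""
    namespace = {"parametrize": parametrize}
    try:
        exec(source, namespace)
    except NameError as exc:
        return f"NameError: {exc}"
    return "ok"


print(try_import(BROKEN_MODULE))
print(try_import(FIXED_MODULE))
```

The broken variant fails before any test body runs, which matches the "at collection time" wording: the test runner has to import the module to discover tests, and the import itself raises.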
Pull repeated grouped-linear test setup into shared helpers for environment toggles, split-size construction, grouped parameter copying, and grad collection. This keeps the module, ops, and grouped-MLP coverage aligned around the same single-grouped-parameter conventions instead of duplicating local ad hoc loops.

Also make the grouped tensor path comparison look at the actual single grouped bias parameter when that mode is enabled, matching the idiom already used by the fusible ops tests.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
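As a rough picture of what shared helpers for environment toggles and split-size construction can look like, here is a standard-library sketch. Both helpers are hypothetical: `env_override` and `make_split_sizes` are names invented here, and the real helpers in the PR (for example `_make_grouped_split_sizes`, which also takes `dtype` and `device` and returns a tensor) may behave differently.

```python
import contextlib
import os
import random


@contextlib.contextmanager
def env_override(**values):
    """Temporarily set environment variables, restoring the prior state on exit."""
    saved = {name: os.environ.get(name) for name in values}
    try:
        for name, value in values.items():
            os.environ[name] = str(value)
        yield
    finally:
        for name, old in saved.items():
            if old is None:
                os.environ.pop(name, None)  # variable did not exist before
            else:
                os.environ[name] = old      # put the previous value back


def make_split_sizes(group_size, alignment, max_blocks=3, seed=0):
    """Hypothetical helper: one size per group, each a multiple of `alignment`."""
    rng = random.Random(seed)
    return [alignment * rng.randint(1, max_blocks) for _ in range(group_size)]


with env_override(NVTE_GROUPED_LINEAR_SINGLE_PARAM=1):
    inside = os.environ["NVTE_GROUPED_LINEAR_SINGLE_PARAM"]

sizes = make_split_sizes(group_size=4, alignment=128)
print(inside, sizes)
```

Restoring the previous value in a `finally` block keeps one test's toggle from leaking into the next, which is the same guarantee an autouse fixture gives per test class.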
Test was inadvertently disabled, and enabling triggered test failures beyond the scope of this PR. Grouped tensor params are still highly experimental, and it was quite strange to only test grouped bias without grouped weights.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Prioritize configs that trigger op fusion. Separate parametrized cases for advanced Mcore integrations.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
for more information, see https://pre-commit.ci
Greptile Summary
This PR reorganizes grouped linear and grouped MLP tests: exhaustive tests are moved from
Confidence Score: 5/5
Pure test refactoring with no production code changes; all coverage is preserved and correctly reorganized.
The changes are limited to test files and the CI shell script. All previously tested configurations remain covered, either directly in the reorganized classes or merged via new parametrization (e.g., save_original_input folded into test_padding_grouped_linear_accuracy). The only minor issue found is an unused is_glu_activation import introduced during the move.
tests/pytorch/test_grouped_linear.py has a dead import (is_glu_activation) that can be dropped.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
subgraph Before
A[test_fusible_ops.py\nGroupedLinear + GroupedMLP tests\n+ helper functions]
B[test_grouped_linear.py\nModule-level tests\n_reset_fp8_state autouse fixture]
C[utils.py\nNo quantization helpers]
end
subgraph After
D[test_fusible_ops.py\nBasic GroupedLinear smoke tests only]
E[test_grouped_linear.py\nTestGroupedLinearModule\nTestGroupedGemm\nTestGroupedLinearOps\nTestGroupedMLP]
F[utils.py\nmaybe_skip_quantization\nmake_reference_and_test_tensors\nQuantizer availability globals]
end
A -->|move exhaustive tests| E
B -->|reorganize into classes| E
A -->|extract helpers| F
C -->|add helpers| F
A -->|keep slim| D
Reviews (2): Last reviewed commit: "Merge branch 'main' into tmoon/refactor-..."
+    @pytest.mark.parametrize("quantization", _quantization_list)
+    @pytest.mark.parametrize("quantized_compute", (False, True))
+    @pytest.mark.parametrize("quantized_weight", (False, True))
+    @pytest.mark.parametrize("input_requires_grad", (False, True))
+    @pytest.mark.parametrize("weight_requires_grad", (False, True))
+    def test_grouped_linear(
+        self,
+        *,
+        group_size: int = 4,
+        bias: bool,
+        weight_shape: tuple = (128, 128),
+        split_alignment: int = 128,
+        dtype: torch.dtype,
+        device: torch.device = "cuda",
+        quantization: Optional[str],
+        quantized_compute: bool,
+        quantized_weight: bool,
+        input_requires_grad: bool,
+        weight_requires_grad: bool,
+        delay_wgrad_compute: bool,
+        single_grouped_weight: bool,
+        single_grouped_bias: bool,
+    ) -> None:
+        """te.ops.GroupedLinear forward+backward accuracy"""

+        # Split sizes
+        split_sizes = _make_grouped_split_sizes(
+            group_size,
+            split_alignment,
+            dtype=torch.int,
+            device=device,
+        )

-def test_grouped_linear_grouped_tensor_path_single_grouped_bias_delay_wgrad(monkeypatch):
-    if torch.cuda.get_device_capability() < (10, 0):
-        pytest.skip("GroupedTensor grouped GEMM path requires SM100+")
+        # Make input and weight shapes consistent
+        out_features, in_features = weight_shape
+        in_shape = (split_sizes.sum().item(), in_features)
+        out_shape = (in_shape[0], out_features)

+        # Skip invalid configurations
+        maybe_skip_quantization(quantization, dims=in_shape, device=device, dtype=dtype)
+        maybe_skip_quantization(quantization, dims=out_shape)
+        if quantization is None and (quantized_compute or quantized_weight):
+            pytest.skip("Quantization scheme is not specified")
+        if quantization is not None and not (quantized_compute or quantized_weight):
+            pytest.skip("Quantization scheme is not used")
+        if quantization is not None and dtype not in (torch.bfloat16, torch.float16):
+            pytest.skip("Quantized group GEMM is only supported with BF16/FP16")
+        if single_grouped_bias and not bias:
+            pytest.skip("single_grouped_bias requires bias=True")
+        if (
+            single_grouped_weight
+            and quantized_weight
+            and quantization in ("fp8_delayed_scaling", "fp8_current_scaling")
+        ):
+            pytest.skip(
+                "single_grouped_weight does not support FP8 delayed/current scaling "
+                "with quantized_model_init"
+            )
Missing nvfp4_4over6 skip in TestGroupedLinearOps.test_grouped_linear
The old test_fusible_ops.py::TestBasicOps::test_grouped_linear explicitly skipped nvfp4_4over6 with the message "NVFP4 4over6 grouped quantization is not supported". This skip was dropped when the test was moved here. Meanwhile, the skip is still present in _skip_invalid_grouped_mlp_case (line 540) for the grouped MLP tests. If this limitation still applies to the basic te.ops.GroupedLinear path, then on systems with NVFP4 support this test will fail at runtime rather than being skipped.
It turns out the grouped linear op does actually support NVFP4 4over6, so there's no harm in including it in tests. However, the grouped MLP fused op definitely does not support NVFP4 4over6 and test_grouped_mlp is focused on cases that involve the fused op.
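The outcome of this thread can be pictured as two skip predicates that differ only in the NVFP4 4over6 rule. The sketch below is a plain-Python illustration, not code from the PR: the function names are hypothetical, the conditions mirror the ones visible in the diff excerpt (minus the hardware and dtype checks), and the grouped MLP skip message is the one quoted in the review comment for the old test.

```python
from typing import Optional


def grouped_linear_skip_reason(
    *,
    quantization: Optional[str],
    quantized_compute: bool,
    quantized_weight: bool,
    bias: bool,
    single_grouped_weight: bool = False,
    single_grouped_bias: bool = False,
) -> Optional[str]:
    """Return a skip message for an invalid config, or None if it should run."""
    if quantization is None and (quantized_compute or quantized_weight):
        return "Quantization scheme is not specified"
    if quantization is not None and not (quantized_compute or quantized_weight):
        return "Quantization scheme is not used"
    if single_grouped_bias and not bias:
        return "single_grouped_bias requires bias=True"
    if (
        single_grouped_weight
        and quantized_weight
        and quantization in ("fp8_delayed_scaling", "fp8_current_scaling")
    ):
        return (
            "single_grouped_weight does not support FP8 delayed/current scaling "
            "with quantized_model_init"
        )
    return None  # note: no nvfp4_4over6 rule for the plain grouped linear op


def grouped_mlp_skip_reason(*, quantization: Optional[str], **rest) -> Optional[str]:
    """Same rules, plus the NVFP4 4over6 restriction of the fused grouped MLP op."""
    if quantization == "nvfp4_4over6":
        return "NVFP4 4over6 grouped quantization is not supported"
    return grouped_linear_skip_reason(quantization=quantization, **rest)


config = dict(
    quantization="nvfp4_4over6",
    quantized_compute=True,
    quantized_weight=False,
    bias=True,
)
print(grouped_linear_skip_reason(**config))
print(grouped_mlp_skip_reason(**config))
```

Returning the reason as a value, rather than calling `pytest.skip` inline, makes the rules themselves easy to unit-test; a test body would then skip only when the returned reason is not None.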
Signed-off-by: Tim Moon <tmoon@nvidia.com>
/te-ci pytorch
Description
test_fusible_ops.py was becoming a dumping ground for random grouped MLP tests, including tests that didn't involve fusible ops at all. This PR reorganizes the tests so that test_fusible_ops.py holds basic tests for te.ops.GroupedLinear, while test_grouped_linear.py holds the exhaustive tests for all the various grouped MLP fused ops. I've also tried trimming down excessive test parametrization to bring down the test time from ~20 min to ~1 min.
Type of change
Changes
- Move tests from test_fusible_ops.py to test_grouped_linear.py.
- Organize test_grouped_linear.py into test suites.
Checklist: