Commit 358ee83
updated bmm and matmul for GPT-OSS (#999)
### What does this PR do?
This PR fixes a maximum recursion depth bug for GPT-OSS. It replaces calls to
`torch._bmm` and `torch.matmul` with `torch.ops.aten.bmm` and
`torch.ops.aten.matmul` so the quantized attention path no longer recurses.
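For context, here is a minimal, hypothetical sketch of the failure mode and the fix. This is not the actual ModelOpt code; the names `quantized_matmul` and `fake_quant` are illustrative stand-ins. The point is that a quantizing wrapper patched over `torch.matmul` must not call `torch.matmul` itself, while dispatching through the ATen op bypasses the Python-level patch:

```python
import torch

# Hypothetical minimal sketch of the recursion bug -- NOT the actual ModelOpt code.
# A quantizing wrapper is patched over torch.matmul; if the wrapper calls
# torch.matmul again, it re-enters itself until Python raises RecursionError.

_original_matmul = torch.matmul


def fake_quant(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for the real quantizer; identity for illustration.
    return x


def quantized_matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    a, b = fake_quant(a), fake_quant(b)
    # return torch.matmul(a, b)        # BAD: torch.matmul now points here -> infinite recursion
    return torch.ops.aten.matmul(a, b)  # GOOD: dispatch to the ATen op, bypassing the patch


torch.matmul = quantized_matmul  # install the patch
out = torch.matmul(torch.randn(2, 3), torch.randn(3, 4))  # runs without recursion
torch.matmul = _original_matmul  # restore the original
```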
### Usage
```shell
# Docker image: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc4
# Repro steps (gpt-oss):

# Step 1: SFT
accelerate launch --config_file configs/zero3.yaml sft.py --config configs/sft_full.yaml --model_name_or_path openai/gpt-oss-20b --output_dir /tmp/pytest-of-root/pytest-0/test_gpt_oss_complete_pipeline0/gpt-oss-20b-sft
# Step 1 completed: SFT checkpoint at /tmp/pytest-of-root/pytest-0/test_gpt_oss_complete_pipeline0/gpt-oss-20b-sft

# Step 2: QAT from the Step 1 checkpoint with MXFP4_MLP_WEIGHT_ONLY_CFG
accelerate launch --config_file configs/zero3.yaml sft.py --config configs/sft_full.yaml --model_name_or_path /tmp/pytest-of-root/pytest-0/test_gpt_oss_complete_pipeline0/gpt-oss-20b-sft --quant_cfg MXFP4_MLP_WEIGHT_ONLY_CFG --output_dir /tmp/pytest-of-root/pytest-0/test_gpt_oss_complete_pipeline0/gpt-oss-20b-qat
```
### Testing
```shell
pytest tests/examples/gpt_oss/test_gpt_oss_qat.py
```
### Before your PR is "*Ready for review*"
Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).
Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).
- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`?: N/A
- Did you write any new necessary tests?: N/A (tests already exist)
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A
### Additional Information
## Summary by CodeRabbit
* **Bug Fixes**
  * Fixed a recursion-related instability in attention quantization that could cause errors during certain matrix operations, improving reliability.
* **Performance**
  * Improved handling of batched and matrix-multiplication operations under quantization for more consistent and efficient runtime behavior, including better support for caller-specified output tensors (see the sketch after this list).
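To illustrate the caller-specified output handling, here is a hedged, hypothetical sketch (the names `quantized_bmm` and `fake_quant` are illustrative only, not the actual ModelOpt implementation) of forwarding an `out` tensor to the ATen overload while still avoiding the patched Python-level op:

```python
import torch

# Hypothetical sketch -- NOT the actual ModelOpt code. A quantized bmm wrapper
# that honors a caller-provided `out` tensor while dispatching to ATen.


def fake_quant(x: torch.Tensor) -> torch.Tensor:
    return x  # stand-in quantizer


def quantized_bmm(a: torch.Tensor, b: torch.Tensor, *, out=None) -> torch.Tensor:
    a, b = fake_quant(a), fake_quant(b)
    if out is None:
        return torch.ops.aten.bmm(a, b)
    # aten::bmm.out writes the result into the caller-supplied tensor.
    return torch.ops.aten.bmm.out(a, b, out=out)


a, b = torch.randn(2, 3, 4), torch.randn(2, 4, 5)
buf = torch.empty(2, 3, 5)
quantized_bmm(a, b, out=buf)  # result lands in buf
```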
---------
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>

1 parent a5d46ff · commit 358ee83
1 file changed: 10 additions & 3 deletions