[ONNX][Autotune] Replace CUDA memory management from CUDART to PyTorch (#998)
### What does this PR do?
**Type of change**: Bug fix
**Overview**: Migrate CUDA memory management from the CUDART bindings to
PyTorch's higher-level API.
### Testing
1. Added unit tests.
2. Verified that this PR does not break #951 or #978.
### Before your PR is "*Ready for review*"
Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).
Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, using
`torch.load(..., weights_only=True)`, avoiding `pickle`, etc.).
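The `weights_only=True` guideline above can be illustrated with a minimal checkpoint round-trip (illustrative only; the checkpoint contents are made up and not part of this PR):

```python
import io
import torch

# A made-up state dict standing in for a real checkpoint.
state = {"w": torch.arange(4, dtype=torch.float32)}

buf = io.BytesIO()
torch.save(state, buf)
buf.seek(0)

# weights_only=True restricts unpickling to tensors and plain containers,
# refusing arbitrary Python objects that could execute code on load.
loaded = torch.load(buf, weights_only=True)
assert torch.equal(loaded["w"], state["w"])
```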
- Is this change backward compatible?: ✅
- If you copied code from any other source, did you follow IP policy in
[CONTRIBUTING.md](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md#-copying-code-from-other-sources)?:
N/A <!--- Mandatory -->
- Did you write any new necessary tests?: ✅ <!--- Mandatory for new
features or examples. -->
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A <!--- Only for new features, API changes, critical bug fixes or
backward incompatible changes. -->
### Additional Information
Summary of changes in `benchmark.py` (`TensorRTPyBenchmark`):

| What changed | Before | After |
|---|---|---|
| Imports | `contextlib` + `from cuda.bindings import runtime as cudart` | `import torch` (conditional) |
| Availability flag | `CUDART_AVAILABLE` | `TORCH_CUDA_AVAILABLE = torch.cuda.is_available()` |
| `__init__` guard | checks `CUDART_AVAILABLE or cudart is None` | checks `TORCH_CUDA_AVAILABLE` |
| `_alloc_pinned_host` | `cudaMallocHost` + ctypes address hack, returns `(ptr, arr, err)` | `torch.empty(...).pin_memory()`, returns `(tensor, tensor.numpy())` |
| `_free_buffers` | `cudaFreeHost` + `cudaFree` per buffer | `bufs.clear()`; PyTorch GC handles deallocation |
| `_allocate_buffers` | raw `device_ptr` integers, error-code returns | `torch.empty(..., device="cuda")`, `tensor.data_ptr()` for TRT address |
| `_run_warmup` | `cudaMemcpyAsync` + `cudaStreamSynchronize` | `tensor.copy_(non_blocking=True)` inside `torch.cuda.stream()` |
| `_run_timing` | same cudart pattern | same torch pattern |
| `run` (stream lifecycle) | `cudaStreamCreate()` / `cudaStreamDestroy()` | `torch.cuda.Stream()` / `del stream` |
| `run` (stream arg to TRT) | raw integer handle | `stream.cuda_stream` (integer property) |
| Error handling | `cudaError_t` return codes | PyTorch raises `RuntimeError`, caught by existing `except Exception` |
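As a rough sketch of the PyTorch-side pattern the table describes (helper names are illustrative, not the PR's actual code; the CUDA branch is guarded so the sketch also runs on CPU-only machines):

```python
import torch

TORCH_CUDA_AVAILABLE = torch.cuda.is_available()

def alloc_pinned_host(shape, dtype=torch.float32):
    # cudaMallocHost replacement: a host tensor (page-locked when CUDA is
    # present) plus a NumPy view sharing the same storage.
    t = torch.empty(shape, dtype=dtype)
    if TORCH_CUDA_AVAILABLE:
        t = t.pin_memory()  # pinned memory enables truly async H2D copies
    return t, t.numpy()

def host_to_device(pinned, stream):
    # cudaMemcpyAsync + cudaStreamSynchronize replacement: an async copy_
    # enqueued on a torch.cuda.Stream, then a synchronize.
    dev = torch.empty_like(pinned, device="cuda")
    with torch.cuda.stream(stream):
        dev.copy_(pinned, non_blocking=True)
    stream.synchronize()
    return dev

host_t, host_np = alloc_pinned_host((4,))
host_np[:] = range(4)  # NumPy writes land in the tensor's storage
assert torch.equal(host_t, torch.arange(4, dtype=torch.float32))

if TORCH_CUDA_AVAILABLE:
    stream = torch.cuda.Stream()
    dev = host_to_device(host_t, stream)
    # TensorRT APIs take a raw integer stream handle:
    trt_handle = stream.cuda_stream
    assert dev.device.type == "cuda" and isinstance(trt_handle, int)
```

Freeing then reduces to dropping references (`bufs.clear()`): PyTorch's caching allocator reclaims device memory once the tensors are garbage-collected, replacing the explicit `cudaFree`/`cudaFreeHost` calls.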
Related to #961
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Refactor**
* TensorRT benchmarking migrated from direct CUDA runtime calls to
PyTorch CUDA tensors, pinned memory, and CUDA stream primitives —
simplifying buffer management, transfers, and timing semantics.
* **Tests**
* Expanded GPU autotune benchmark tests with broader unit and
integration coverage for CUDA/TensorRT paths, pinned-host/device
buffering, stream behavior, warmup/timing, and end-to-end latency
scenarios.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>