get memory info#249
Conversation
Signed-off-by: mayuyuace <qiming1.zhang@intel.com>
There was a problem hiding this comment.
Pull request overview
Adds a new XPU memory info operator backed by Level Zero and exposes it through Torch bindings, with an accompanying pytest.
Changes:
- Implement
getMemoryInfo(device_index)using Level Zero device properties. - Register the new op in the Torch extension and wire it into the build.
- Add a pytest validating returned values against
torch.xpu.mem_get_info.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_get_memory_info.py | Adds coverage for the new getMemoryInfo op vs PyTorch’s XPU memory API |
| csrc/utils/mem_info.cpp | Implements Level Zero-based total/usable memory queries |
| csrc/torch_bindings.cpp | Registers the new Torch op schema + implementation |
| csrc/ops.h | Declares getMemoryInfo for binding |
| CMakeLists.txt | Adds build/link flags and compiles the new source file |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| #include <level_zero/ze_api.h> | ||
| #include <sycl/sycl.hpp> | ||
|
|
||
| #include <iostream> |
There was a problem hiding this comment.
std::tuple and std::numeric_limits are used in this file but the required standard headers are not included. This can fail to compile depending on transitive includes. Add the missing includes (<tuple> and <limits>) explicitly in this file.
| #include <iostream> | |
| #include <iostream> | |
| #include <tuple> | |
| #include <limits> |
| zeDeviceGetMemoryProperties(device, &memoryCount, nullptr); | ||
| auto pMemoryProperties = new ze_device_memory_properties_t[memoryCount]; |
There was a problem hiding this comment.
The Level Zero APIs return a ze_result_t, but the results are currently ignored. If any of these calls fail, the code may return incorrect sizes (e.g., memoryCount staying 0) without surfacing an error to Python. Capture the return values and raise a proper Torch error (e.g., TORCH_CHECK(result == ZE_RESULT_SUCCESS, ...)) so failures are actionable.
| pMemoryProperties[mem].stype = ZE_STRUCTURE_TYPE_DEVICE_MEMORY_PROPERTIES; | ||
| pMemoryProperties[mem].pNext = nullptr; | ||
| } | ||
| zeDeviceGetMemoryProperties(device, &memoryCount, pMemoryProperties); |
There was a problem hiding this comment.
The Level Zero APIs return a ze_result_t, but the results are currently ignored. If any of these calls fail, the code may return incorrect sizes (e.g., memoryCount staying 0) without surfacing an error to Python. Capture the return values and raise a proper Torch error (e.g., TORCH_CHECK(result == ZE_RESULT_SUCCESS, ...)) so failures are actionable.
| deviceProperties.stype = ZE_STRUCTURE_TYPE_DEVICE_PROPERTIES; | ||
| deviceProperties.pNext = &usableMemProps; | ||
|
|
||
| zeDeviceGetProperties(device, &deviceProperties); |
There was a problem hiding this comment.
The Level Zero APIs return a ze_result_t, but the results are currently ignored. If any of these calls fail, the code may return incorrect sizes (e.g., memoryCount staying 0) without surfacing an error to Python. Capture the return values and raise a proper Torch error (e.g., TORCH_CHECK(result == ZE_RESULT_SUCCESS, ...)) so failures are actionable.
|
|
||
| size_t getTotalMemory(ze_device_handle_t& device) { | ||
| uint32_t memoryCount = 0; | ||
| zeDeviceGetMemoryProperties(device, &memoryCount, nullptr); | ||
| auto pMemoryProperties = new ze_device_memory_properties_t[memoryCount]; | ||
| for (uint32_t mem = 0; mem < memoryCount; ++mem) { | ||
| pMemoryProperties[mem].stype = ZE_STRUCTURE_TYPE_DEVICE_MEMORY_PROPERTIES; | ||
| pMemoryProperties[mem].pNext = nullptr; | ||
| } | ||
| zeDeviceGetMemoryProperties(device, &memoryCount, pMemoryProperties); | ||
| size_t totalMemory = 0; | ||
| for (uint32_t mem = 0; mem < memoryCount; ++mem) { | ||
| totalMemory += pMemoryProperties[mem].totalSize; | ||
| } | ||
| delete[] pMemoryProperties; |
There was a problem hiding this comment.
Manual new[]/delete[] makes this code harder to maintain and less exception-safe. Prefer std::vector<ze_device_memory_properties_t> sized to memoryCount and pass .data() to Level Zero. This also simplifies cleanup and reduces the chance of leaks on early returns/errors.
| size_t getTotalMemory(ze_device_handle_t& device) { | |
| uint32_t memoryCount = 0; | |
| zeDeviceGetMemoryProperties(device, &memoryCount, nullptr); | |
| auto pMemoryProperties = new ze_device_memory_properties_t[memoryCount]; | |
| for (uint32_t mem = 0; mem < memoryCount; ++mem) { | |
| pMemoryProperties[mem].stype = ZE_STRUCTURE_TYPE_DEVICE_MEMORY_PROPERTIES; | |
| pMemoryProperties[mem].pNext = nullptr; | |
| } | |
| zeDeviceGetMemoryProperties(device, &memoryCount, pMemoryProperties); | |
| size_t totalMemory = 0; | |
| for (uint32_t mem = 0; mem < memoryCount; ++mem) { | |
| totalMemory += pMemoryProperties[mem].totalSize; | |
| } | |
| delete[] pMemoryProperties; | |
| #include <vector> | |
| size_t getTotalMemory(ze_device_handle_t& device) { | |
| uint32_t memoryCount = 0; | |
| zeDeviceGetMemoryProperties(device, &memoryCount, nullptr); | |
| std::vector<ze_device_memory_properties_t> memoryProperties(memoryCount); | |
| for (uint32_t mem = 0; mem < memoryCount; ++mem) { | |
| memoryProperties[mem].stype = ZE_STRUCTURE_TYPE_DEVICE_MEMORY_PROPERTIES; | |
| memoryProperties[mem].pNext = nullptr; | |
| } | |
| zeDeviceGetMemoryProperties(device, &memoryCount, memoryProperties.data()); | |
| size_t totalMemory = 0; | |
| for (uint32_t mem = 0; mem < memoryCount; ++mem) { | |
| totalMemory += memoryProperties[mem].totalSize; | |
| } |
|
|
||
| size_t getTotalMemory(ze_device_handle_t& device) { | ||
| uint32_t memoryCount = 0; | ||
| zeDeviceGetMemoryProperties(device, &memoryCount, nullptr); | ||
| auto pMemoryProperties = new ze_device_memory_properties_t[memoryCount]; | ||
| for (uint32_t mem = 0; mem < memoryCount; ++mem) { | ||
| pMemoryProperties[mem].stype = ZE_STRUCTURE_TYPE_DEVICE_MEMORY_PROPERTIES; | ||
| pMemoryProperties[mem].pNext = nullptr; | ||
| } | ||
| zeDeviceGetMemoryProperties(device, &memoryCount, pMemoryProperties); | ||
| size_t totalMemory = 0; | ||
| for (uint32_t mem = 0; mem < memoryCount; ++mem) { | ||
| totalMemory += pMemoryProperties[mem].totalSize; | ||
| } | ||
| delete[] pMemoryProperties; |
There was a problem hiding this comment.
Manual new[]/delete[] makes this code harder to maintain and less exception-safe. Prefer std::vector<ze_device_memory_properties_t> sized to memoryCount and pass .data() to Level Zero. This also simplifies cleanup and reduces the chance of leaks on early returns/errors.
| size_t getTotalMemory(ze_device_handle_t& device) { | |
| uint32_t memoryCount = 0; | |
| zeDeviceGetMemoryProperties(device, &memoryCount, nullptr); | |
| auto pMemoryProperties = new ze_device_memory_properties_t[memoryCount]; | |
| for (uint32_t mem = 0; mem < memoryCount; ++mem) { | |
| pMemoryProperties[mem].stype = ZE_STRUCTURE_TYPE_DEVICE_MEMORY_PROPERTIES; | |
| pMemoryProperties[mem].pNext = nullptr; | |
| } | |
| zeDeviceGetMemoryProperties(device, &memoryCount, pMemoryProperties); | |
| size_t totalMemory = 0; | |
| for (uint32_t mem = 0; mem < memoryCount; ++mem) { | |
| totalMemory += pMemoryProperties[mem].totalSize; | |
| } | |
| delete[] pMemoryProperties; | |
| #include <vector> | |
| size_t getTotalMemory(ze_device_handle_t& device) { | |
| uint32_t memoryCount = 0; | |
| zeDeviceGetMemoryProperties(device, &memoryCount, nullptr); | |
| std::vector<ze_device_memory_properties_t> memoryProperties(memoryCount); | |
| for (uint32_t mem = 0; mem < memoryCount; ++mem) { | |
| memoryProperties[mem].stype = ZE_STRUCTURE_TYPE_DEVICE_MEMORY_PROPERTIES; | |
| memoryProperties[mem].pNext = nullptr; | |
| } | |
| zeDeviceGetMemoryProperties(device, &memoryCount, memoryProperties.data()); | |
| size_t totalMemory = 0; | |
| for (uint32_t mem = 0; mem < memoryCount; ++mem) { | |
| totalMemory += memoryProperties[mem].totalSize; | |
| } |
| if (total > static_cast<size_t>(std::numeric_limits<int64_t>::max()) || | ||
| free > static_cast<size_t>(std::numeric_limits<int64_t>::max())) { | ||
| std::cerr << "Memory size exceeds int64_t max value!" << std::endl; | ||
| return {-1, -1}; // or handle this case as appropriate | ||
| } |
There was a problem hiding this comment.
Returning (-1, -1) on overflow and printing to stderr is not a great contract for a Torch op (callers can’t reliably distinguish errors vs real values, and stderr may be lost). Prefer raising a Torch exception (e.g., TORCH_CHECK(false, ...)) so Python callers get a clear failure with a message.
| list(APPEND SYCL_FLAGS "-fsycl") | ||
| # For l0 ops like getMemoryInfo | ||
| list(APPEND SYCL_FLAGS "-lze_loader") | ||
| list(APPEND SYCL_FLAGS "-Wno-unused-command-line-argument") |
There was a problem hiding this comment.
Linker flags (-lze_loader) should not be added to compile flags (SYCL_FLAGS). The follow-up -Wno-unused-command-line-argument looks like it’s masking the resulting warning rather than fixing the root cause. Move -lze_loader to link settings only (or preferably use target_link_libraries(... ze_loader) / target_link_options) and drop the unused-command-line suppression if it becomes unnecessary.
|
|
||
| if not torch.ops._xpu_C.is_pvc(device): | ||
| assert total == ref_total | ||
| assert free == 0 |
There was a problem hiding this comment.
The test asserts free == 0, but getMemoryInfo() returns the device’s usable/free memory, which should generally be non-zero on an idle device. This makes the test fail (or pass only by accident if the API returns 0 due to unsupported properties). Instead, assert free against ref_free (and total against ref_total, including the PVC case if behavior is expected to match), or assert a sensible relation like 0 <= free <= total if exact match isn’t guaranteed.
| assert free == 0 | |
| assert free == ref_free | |
| else: | |
| assert 0 <= free <= total |
|
|
||
| @pytest.mark.parametrize("device", DEVICES) | ||
| def test_get_memory_info(device) -> None: | ||
| free, total = torch.ops._C_cache_ops.getMemoryInfo(device) |
There was a problem hiding this comment.
why place it in _C_cache_ops? I prefer _xpu_C
There was a problem hiding this comment.
Three files named mem_XXX are put at csrc/utils and placed in _C_cache_ops or _C.
What should it be?
There was a problem hiding this comment.
I feel such mem api is for memory copy which will be used in cache context. @chaojun-zhang any comments?
|
|
||
| if not torch.ops._xpu_C.is_pvc(device): | ||
| assert total == ref_total | ||
| assert free == 0 |
There was a problem hiding this comment.
free==0 because current UMD still return 0 and it will return correct usable memory after UMD upgrade?
There was a problem hiding this comment.
Yes, and this PR should not be merge until UMD is ready.
Signed-off-by: mayuyuace <qiming1.zhang@intel.com>
Should update l0 as below:
and update NEO after this version:
https://github.com/intel/compute-runtime/releases/tag/26.18.38308.1