get memory info #249

Draft
mayuyuace wants to merge 5 commits into vllm-project:main from mayuyuace:qiming/getMemoryInfo

Conversation

Collaborator

@mayuyuace mayuyuace commented Apr 2, 2026

Level Zero (L0) should be updated as follows:

wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null && \
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | tee /etc/apt/sources.list.d/oneAPI.list && \
add-apt-repository -y ppa:kobuk-team/intel-graphics
apt-get remove level-zero
apt install -y libze1 libze-dev

Also update NEO to a release after this version:
https://github.com/intel/compute-runtime/releases/tag/26.18.38308.1

Signed-off-by: mayuyuace <qiming1.zhang@intel.com>
Signed-off-by: mayuyuace <qiming1.zhang@intel.com>
Copilot AI review requested due to automatic review settings April 2, 2026 09:04
Contributor

Copilot AI left a comment

Pull request overview

Adds a new XPU memory info operator backed by Level Zero and exposes it through Torch bindings, with an accompanying pytest.

Changes:

  • Implement getMemoryInfo(device_index) using Level Zero device properties.
  • Register the new op in the Torch extension and wire it into the build.
  • Add a pytest validating returned values against torch.xpu.mem_get_info.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| tests/test_get_memory_info.py | Adds coverage for the new getMemoryInfo op vs PyTorch’s XPU memory API |
| csrc/utils/mem_info.cpp | Implements Level Zero-based total/usable memory queries |
| csrc/torch_bindings.cpp | Registers the new Torch op schema + implementation |
| csrc/ops.h | Declares getMemoryInfo for binding |
| CMakeLists.txt | Adds build/link flags and compiles the new source file |

Comment thread csrc/utils/mem_info.cpp
#include <level_zero/ze_api.h>
#include <sycl/sycl.hpp>

#include <iostream>

Copilot AI Apr 2, 2026

std::tuple and std::numeric_limits are used in this file but the required standard headers are not included. This can fail to compile depending on transitive includes. Add the missing includes (<tuple> and <limits>) explicitly in this file.

Suggested change
- #include <iostream>
+ #include <iostream>
+ #include <tuple>
+ #include <limits>

Copilot uses AI. Check for mistakes.
Comment thread csrc/utils/mem_info.cpp
Comment on lines +9 to +10
zeDeviceGetMemoryProperties(device, &memoryCount, nullptr);
auto pMemoryProperties = new ze_device_memory_properties_t[memoryCount];

Copilot AI Apr 2, 2026

The Level Zero APIs return a ze_result_t, but the results are currently ignored. If any of these calls fail, the code may return incorrect sizes (e.g., memoryCount staying 0) without surfacing an error to Python. Capture the return values and raise a proper Torch error (e.g., TORCH_CHECK(result == ZE_RESULT_SUCCESS, ...)) so failures are actionable.
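
A minimal sketch of the check-every-result pattern described above. To stay compilable without Level Zero or Torch installed, it uses hypothetical stand-ins: `FakeResult` and `kFakeSuccess` play the role of `ze_result_t` and `ZE_RESULT_SUCCESS`, `fake_get_memory_properties` imitates the two-call shape of `zeDeviceGetMemoryProperties`, and `std::runtime_error` stands in for what `TORCH_CHECK` would raise. None of these names come from the PR.

```cpp
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Stand-ins (assumptions) for ze_result_t / ZE_RESULT_SUCCESS.
using FakeResult = std::uint32_t;
constexpr FakeResult kFakeSuccess = 0;

// Throw on any non-success result so the failure reaches the caller.
// In a Torch op this is where TORCH_CHECK would be used instead.
inline void check_result(FakeResult result, const std::string& call) {
  if (result != kFakeSuccess) {
    std::ostringstream msg;
    msg << call << " failed with result 0x" << std::hex << result;
    throw std::runtime_error(msg.str());
  }
}

// Hypothetical property struct with only the field the sketch needs.
struct FakeMemoryProps {
  std::uint64_t totalSize;
};

// Hypothetical query: pass nullptr to obtain the count, then a buffer to fill.
inline FakeResult fake_get_memory_properties(bool fail, std::uint32_t* count,
                                             FakeMemoryProps* out) {
  if (fail) return 0x78000001u;  // arbitrary non-zero failure code
  if (out == nullptr) {
    *count = 2;
    return kFakeSuccess;
  }
  for (std::uint32_t i = 0; i < *count; ++i) out[i].totalSize = 1024;
  return kFakeSuccess;
}

// Sums totalSize over all memories, checking both calls.
inline std::uint64_t total_memory(bool fail) {
  std::uint32_t count = 0;
  check_result(fake_get_memory_properties(fail, &count, nullptr),
               "fake_get_memory_properties (count query)");
  std::vector<FakeMemoryProps> props(count);
  check_result(fake_get_memory_properties(fail, &count, props.data()),
               "fake_get_memory_properties (fill)");
  std::uint64_t total = 0;
  for (const auto& p : props) total += p.totalSize;
  return total;
}
```

With this shape, a failed count query can no longer leave `count` at 0 and produce a silently wrong total; the first failing call aborts the computation with a message naming it.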

Comment thread csrc/utils/mem_info.cpp
pMemoryProperties[mem].stype = ZE_STRUCTURE_TYPE_DEVICE_MEMORY_PROPERTIES;
pMemoryProperties[mem].pNext = nullptr;
}
zeDeviceGetMemoryProperties(device, &memoryCount, pMemoryProperties);

Copilot AI Apr 2, 2026

The Level Zero APIs return a ze_result_t, but the results are currently ignored. If any of these calls fail, the code may return incorrect sizes (e.g., memoryCount staying 0) without surfacing an error to Python. Capture the return values and raise a proper Torch error (e.g., TORCH_CHECK(result == ZE_RESULT_SUCCESS, ...)) so failures are actionable.

Comment thread csrc/utils/mem_info.cpp
deviceProperties.stype = ZE_STRUCTURE_TYPE_DEVICE_PROPERTIES;
deviceProperties.pNext = &usableMemProps;

zeDeviceGetProperties(device, &deviceProperties);

Copilot AI Apr 2, 2026

The Level Zero APIs return a ze_result_t, but the results are currently ignored. If any of these calls fail, the code may return incorrect sizes (e.g., memoryCount staying 0) without surfacing an error to Python. Capture the return values and raise a proper Torch error (e.g., TORCH_CHECK(result == ZE_RESULT_SUCCESS, ...)) so failures are actionable.

Comment thread csrc/utils/mem_info.cpp
Comment on lines +6 to +20

size_t getTotalMemory(ze_device_handle_t& device) {
uint32_t memoryCount = 0;
zeDeviceGetMemoryProperties(device, &memoryCount, nullptr);
auto pMemoryProperties = new ze_device_memory_properties_t[memoryCount];
for (uint32_t mem = 0; mem < memoryCount; ++mem) {
pMemoryProperties[mem].stype = ZE_STRUCTURE_TYPE_DEVICE_MEMORY_PROPERTIES;
pMemoryProperties[mem].pNext = nullptr;
}
zeDeviceGetMemoryProperties(device, &memoryCount, pMemoryProperties);
size_t totalMemory = 0;
for (uint32_t mem = 0; mem < memoryCount; ++mem) {
totalMemory += pMemoryProperties[mem].totalSize;
}
delete[] pMemoryProperties;

Copilot AI Apr 2, 2026

Manual new[]/delete[] makes this code harder to maintain and less exception-safe. Prefer std::vector<ze_device_memory_properties_t> sized to memoryCount and pass .data() to Level Zero. This also simplifies cleanup and reduces the chance of leaks on early returns/errors.

Suggested change
- size_t getTotalMemory(ze_device_handle_t& device) {
-   uint32_t memoryCount = 0;
-   zeDeviceGetMemoryProperties(device, &memoryCount, nullptr);
-   auto pMemoryProperties = new ze_device_memory_properties_t[memoryCount];
-   for (uint32_t mem = 0; mem < memoryCount; ++mem) {
-     pMemoryProperties[mem].stype = ZE_STRUCTURE_TYPE_DEVICE_MEMORY_PROPERTIES;
-     pMemoryProperties[mem].pNext = nullptr;
-   }
-   zeDeviceGetMemoryProperties(device, &memoryCount, pMemoryProperties);
-   size_t totalMemory = 0;
-   for (uint32_t mem = 0; mem < memoryCount; ++mem) {
-     totalMemory += pMemoryProperties[mem].totalSize;
-   }
-   delete[] pMemoryProperties;
+ #include <vector>
+ size_t getTotalMemory(ze_device_handle_t& device) {
+   uint32_t memoryCount = 0;
+   zeDeviceGetMemoryProperties(device, &memoryCount, nullptr);
+   std::vector<ze_device_memory_properties_t> memoryProperties(memoryCount);
+   for (uint32_t mem = 0; mem < memoryCount; ++mem) {
+     memoryProperties[mem].stype = ZE_STRUCTURE_TYPE_DEVICE_MEMORY_PROPERTIES;
+     memoryProperties[mem].pNext = nullptr;
+   }
+   zeDeviceGetMemoryProperties(device, &memoryCount, memoryProperties.data());
+   size_t totalMemory = 0;
+   for (uint32_t mem = 0; mem < memoryCount; ++mem) {
+     totalMemory += memoryProperties[mem].totalSize;
+   }

Comment thread csrc/utils/mem_info.cpp
Comment on lines +44 to +48
if (total > static_cast<size_t>(std::numeric_limits<int64_t>::max()) ||
free > static_cast<size_t>(std::numeric_limits<int64_t>::max())) {
std::cerr << "Memory size exceeds int64_t max value!" << std::endl;
return {-1, -1}; // or handle this case as appropriate
}

Copilot AI Apr 2, 2026

Returning (-1, -1) on overflow and printing to stderr is not a great contract for a Torch op (callers can’t reliably distinguish errors vs real values, and stderr may be lost). Prefer raising a Torch exception (e.g., TORCH_CHECK(false, ...)) so Python callers get a clear failure with a message.
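
A minimal sketch of that contract, assuming the op keeps returning a (free, total) pair of `int64_t`. The helper names `to_int64_checked` and `pack_memory_info` are hypothetical, and `std::overflow_error` stands in for the exception a `TORCH_CHECK` failure would raise, so the block compiles without Torch.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>
#include <tuple>

// Converts an unsigned byte count to int64_t, throwing instead of returning
// a sentinel value when the count does not fit.
inline std::int64_t to_int64_checked(std::uint64_t value, const char* name) {
  constexpr std::uint64_t kMax =
      static_cast<std::uint64_t>(std::numeric_limits<std::int64_t>::max());
  if (value > kMax) {
    throw std::overflow_error(std::string(name) + " exceeds int64_t range");
  }
  return static_cast<std::int64_t>(value);
}

// Packs (free, total) in the order the test unpacks them.
inline std::tuple<std::int64_t, std::int64_t> pack_memory_info(
    std::uint64_t free_bytes, std::uint64_t total_bytes) {
  return std::make_tuple(to_int64_checked(free_bytes, "free"),
                         to_int64_checked(total_bytes, "total"));
}
```

Because the error path throws, every value the caller actually receives is a real byte count, and no negative sentinel has to be special-cased on the Python side.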

Comment thread CMakeLists.txt
Comment on lines 174 to +177
list(APPEND SYCL_FLAGS "-fsycl")
# For l0 ops like getMemoryInfo
list(APPEND SYCL_FLAGS "-lze_loader")
list(APPEND SYCL_FLAGS "-Wno-unused-command-line-argument")

Copilot AI Apr 2, 2026

Linker flags (-lze_loader) should not be added to compile flags (SYCL_FLAGS). The follow-up -Wno-unused-command-line-argument looks like it’s masking the resulting warning rather than fixing the root cause. Move -lze_loader to link settings only (or preferably use target_link_libraries(... ze_loader) / target_link_options) and drop the unused-command-line suppression if it becomes unnecessary.

Comment thread tests/test_get_memory_info.py Outdated

if not torch.ops._xpu_C.is_pvc(device):
assert total == ref_total
assert free == 0

Copilot AI Apr 2, 2026

The test asserts free == 0, but getMemoryInfo() returns the device’s usable/free memory, which should generally be non-zero on an idle device. This makes the test fail (or pass only by accident if the API returns 0 due to unsupported properties). Instead, assert free against ref_free (and total against ref_total, including the PVC case if behavior is expected to match), or assert a sensible relation like 0 <= free <= total if exact match isn’t guaranteed.

Suggested change
-     assert free == 0
+     assert free == ref_free
+ else:
+     assert 0 <= free <= total


@pytest.mark.parametrize("device", DEVICES)
def test_get_memory_info(device) -> None:
free, total = torch.ops._C_cache_ops.getMemoryInfo(device)

Collaborator

Why place it in _C_cache_ops? I would prefer _xpu_C.

Collaborator Author

The three files named mem_XXX live in csrc/utils, and their ops are placed in _C_cache_ops or _C.
Which one should this op use?

Collaborator

I think those mem APIs are for memory copies, which are used in the cache context. @chaojun-zhang, any comments?

Comment thread tests/test_get_memory_info.py Outdated

if not torch.ops._xpu_C.is_pvc(device):
assert total == ref_total
assert free == 0

Collaborator

Is free == 0 because the current UMD still returns 0, and will it return the correct usable memory after the UMD upgrade?

Collaborator Author

Yes, and this PR should not be merged until the UMD is ready.

@mayuyuace mayuyuace marked this pull request as draft April 17, 2026 01:07
jikunshang and others added 3 commits May 13, 2026 11:55
Signed-off-by: mayuyuace <qiming1.zhang@intel.com>
Signed-off-by: mayuyuace <qiming1.zhang@intel.com>

4 participants