Add buffer-count introspection probes (get_active_resource_count, get_cache_count) #3464
Ogilthorp3 wants to merge 1 commit into ml-explore:main
Conversation
…r pressure

Adds two read-only diagnostic functions to mlx.core (and the matching C++ API):

- get_active_resource_count(): total GPU buffers (active + cached)
- get_cache_count(): buffers currently in the buffer cache

On Metal these return the MTLBuffer count, which is the dimension that hits the per-process resource limit (~499000 on Apple silicon) when descriptor pressure builds up; see ml-explore/mlx-lm#1185, where the leak was diagnosed only post-mortem because there was no way to observe the count from Python. CUDA returns 0 from both probes (the CUDA resource limit is bytes, not handles), mirroring how set_wired_limit is stubbed there.

No behavior change. The cache eviction policy is unchanged; this commit only exposes counters that already existed internally (num_resources_ and buffer_pool_.size()).

Tests: extended tests/allocator_tests.cpp with monotonicity checks (active count grows after malloc, cache count grows after free, both drop to baseline after clear_cache).

Docs: added the two functions to docs/src/python/memory_management.rst.
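For reference, the monotonicity contract those tests check can also be sketched from Python. This assumes the two bindings proposed in this PR (get_active_resource_count, get_cache_count) are built in; they are not part of mainline mlx.core, and the exact counts are allocator-dependent, so treat this as an illustration rather than a test:

```python
import mlx.core as mx

# Sketch only: get_active_resource_count / get_cache_count are the bindings
# proposed in this PR, not existing mlx.core APIs.
baseline_total = mx.get_active_resource_count()   # active + cached buffers
baseline_cached = mx.get_cache_count()            # cached buffers only

a = mx.random.normal((1024, 1024))
mx.eval(a)                     # force a real buffer allocation
mx.synchronize()
assert mx.get_active_resource_count() > baseline_total

del a                          # the freed buffer should land in the cache
assert mx.get_cache_count() > baseline_cached

mx.clear_cache()               # flush the cache; counts fall back toward baseline
assert mx.get_cache_count() == 0
```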
zcbenz left a comment
For testing cache growth, the get_cache_memory API is already enough, so we don't need a new get_cache_count API.
get_active_resource_count is a bad name for returning the number of buffers, because things like Events are usually also considered resources.
For ml-explore/mlx-lm#1185 the root cause is simply that too many buffers are created. You can try adjusting the MLX_MAX_OPS_PER_BUFFER and MLX_BFS_MAX_WIDTH envs to work around it, but I think the ultimate solution is to improve the allocator (by creating a big MTLBuffer and allocating from it). So I think the APIs in this PR won't help solve the problem.
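(For concreteness, the suggested environment-variable workaround would look roughly like the following; the values are illustrative only, not recommendations, and should be tuned per workload.)

```python
import os

# Illustrative values only (not recommendations); MLX reads these settings
# internally, so configure them before importing mlx.core / before the first
# evaluation.
os.environ["MLX_MAX_OPS_PER_BUFFER"] = "20"
os.environ["MLX_BFS_MAX_WIDTH"] = "500"

import mlx.core as mx  # import after the environment is configured
```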
Fair points. Closing this out, thanks for the direction.
Closing per maintainer feedback — the proposed APIs don't address the root cause. |
Summary
Adds two read-only diagnostic functions to mlx.core (and the matching C++ API):

- get_active_resource_count(): total GPU buffers (active + cached)
- get_cache_count(): buffers currently in the buffer cache

On Metal these return the MTLBuffer count, which is the dimension that hits the per-process resource limit (~499000 on Apple silicon) when descriptor pressure builds up. CUDA returns 0 from both probes (the CUDA resource limit is bytes, not handles), mirroring how set_wired_limit is stubbed there.

No behavior change. The cache eviction policy is unchanged; this PR only exposes counters that already existed internally (num_resources_ and buffer_pool_.size()).

Why
Filed in response to a direct request from the ml-explore/mlx-lm#1185 thread.

That issue is the descriptor-count leak in LoRA training of Qwen3.5/3.6 and Gemma3: flat memory, crash at [metal::malloc] Resource limit (499000) exceeded rather than a byte OOM. Without these probes, users can only diagnose post-mortem by counting iterations until the crash; with them, you can see the count climbing in real time and try workarounds (mx.synchronize() + mx.clear_cache() at a coarser cadence, fixed-shape padding, etc.) deliberately rather than by trial.

I held back the eviction-policy commit from this PR; see "What's NOT in this PR" below.
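A rough sketch of that real-time monitoring, assuming the bindings proposed in this PR are built (get_active_resource_count / get_cache_count; get_cache_memory already exists in mlx.core), with a dummy step standing in for a real LoRA training loop:

```python
import mlx.core as mx

def train_step(it):
    # Stand-in for a real training step; varying shapes per iteration loosely
    # mimic the buffer growth reported in mlx-lm#1185.
    x = mx.random.normal((256 + it, 64))
    mx.eval(x @ x.T)

for it in range(1000):
    train_step(it)
    if it % 100 == 0:
        mx.synchronize()
        print(
            f"iter {it}: "
            f"buffers={mx.get_active_resource_count()} "  # proposed in this PR
            f"cached={mx.get_cache_count()} "             # proposed in this PR
            f"cache_bytes={mx.get_cache_memory()}"        # existing API
        )
        mx.clear_cache()  # the coarse-cadence workaround mentioned above
```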
What's in this PR
- mlx/memory.h: MLX_API declarations
- mlx/backend/common/buffer_cache.h: cache_count() accessor on BufferCache
- mlx/backend/metal/allocator.{h,cpp}
- mlx/backend/cuda/allocator.cpp
- python/src/memory.cpp: m.def bindings with docstrings
- docs/src/python/memory_management.rst
- tests/allocator_tests.cpp

87 lines added, 0 removed.
What's NOT in this PR
A separate count-aware eviction policy (a count-driven trigger alongside the existing byte trigger in MetalAllocator::malloc) gets the synthetic 1000-unique-size test passing, but in real LoRA training of Qwen3.6-35B-A3B-4bit, cache_limit values of 8/2/1 buffers all still crashed. So either (a) eviction isn't releasing buffers as fast as they're allocated, (b) there's a second leak source upstream of the cache (the compile cache?), or (c) something else. I have empirical per-iteration probe data showing 47k → 128k → 145k MTLBuffer jumps under that workload that I'd be happy to share separately if it's useful for diagnosis. I didn't want to put a behavior change in front of you while the root cause isn't fully nailed down.

This PR is therefore strictly a diagnostic foundation that helps users (and you) measure the bug in mlx-lm#1185 without committing to a specific fix.
Test plan
- cmake --build . --target tests builds clean (Metal backend, MLX_BUILD_TESTS=ON)
- ./tests/tests --test-case="*introspection*": 1 passed, 0 failed
- ./tests/tests --test-case="*allocation*": existing tests still pass
- clang-format -i produces no diff on changed files
- Python bindings follow the set_wired_limit pattern, but please flag if the wiring needs adjustment