
Add buffer-count introspection probes (get_active_resource_count, get_cache_count)#3464

Open
Ogilthorp3 wants to merge 1 commit into ml-explore:main from Ogilthorp3:probe/1185-introspection

Conversation

@Ogilthorp3

Summary

Adds two read-only diagnostic functions to mlx.core (and the matching C++ API):

  • get_active_resource_count() — total GPU buffers (active + cached)
  • get_cache_count() — buffers currently in the buffer cache

On Metal these return the MTLBuffer count, which is the dimension that hits the per-process resource limit (~499000 on Apple silicon) when descriptor pressure builds up. CUDA returns 0 from both probes (the CUDA resource limit is bytes, not handles), mirroring how set_wired_limit is stubbed there.

No behavior change. The cache eviction policy is unchanged; this PR only exposes counters that already existed internally (num_resources_ and buffer_pool_.size()).
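A minimal sketch of the intended usage from Python, assuming the two probes land at the top level of mlx.core alongside the existing memory functions (the exact module path is whatever the bindings in this PR expose):

```python
import mlx.core as mx

# Probes proposed in this PR; placement assumes the bindings in
# python/src/memory.cpp expose them at the top level of mlx.core,
# like the existing memory functions.
active = mx.get_active_resource_count()  # active + cached buffers
cached = mx.get_cache_count()            # buffers sitting in the cache
print(f"{active} buffers total, {cached} cached, {active - cached} in use")
```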

Why

Filed in response to a direct request from the ml-explore/mlx-lm#1185 thread:

An mx.metal.active_resource_count() probe would be incredibly useful — currently we're inferring the leak from crash iteration rather than measuring it.
@lawcontinue (comment)

That issue is the descriptor-count leak in LoRA training of Qwen3.5/3.6 and Gemma3 — flat memory, crash at [metal::malloc] Resource limit (499000) exceeded rather than byte OOM. Without these probes, users can only diagnose post-mortem by counting iterations until crash; with them, you can see the count climbing in real time and try workarounds (mx.synchronize() + mx.clear_cache() at a coarser cadence, fixed-shape padding, etc.) deliberately rather than by trial.
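For illustration, a rough monitoring loop of the kind the issue thread asks for: it samples the proposed counter each iteration and applies the coarse sync-and-clear workaround on a fixed cadence. The step() function, iteration count, and thresholds below are placeholders, not part of this PR.

```python
import mlx.core as mx

NUM_ITERS = 1000          # placeholder; the real loop is your LoRA trainer
CLEAR_EVERY = 50          # coarse cadence for the sync + clear workaround
RESOURCE_WARN = 400_000   # warn well below the ~499k per-process limit


def step():
    # Placeholder for one training iteration; stands in for the real
    # mlx-lm LoRA step that drives the descriptor growth.
    return mx.sum(mx.zeros((256, 256)))


for it in range(NUM_ITERS):
    loss = step()
    mx.eval(loss)

    count = mx.get_active_resource_count()   # probe added by this PR
    if count > RESOURCE_WARN:
        print(f"iter {it}: {count} buffers, nearing the resource limit")

    if it % CLEAR_EVERY == 0:
        # Coarse workaround from the issue thread: drain pending work,
        # then release cached buffers back to Metal.
        mx.synchronize()
        mx.clear_cache()
```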

I held back the eviction-policy commit from this PR — see "What's NOT in this PR" below.

What's in this PR

  • mlx/memory.h: two new MLX_API declarations
  • mlx/backend/common/buffer_cache.h: cache_count() accessor on BufferCache
  • mlx/backend/metal/allocator.{h,cpp}: member functions + free-function dispatchers
  • mlx/backend/cuda/allocator.cpp: stubs returning 0
  • python/src/memory.cpp: two m.def bindings with docstrings
  • docs/src/python/memory_management.rst: two new entries
  • tests/allocator_tests.cpp: monotonicity test (no eviction-policy assumptions)

87 lines added, 0 removed.

What's NOT in this PR

A separate count-aware eviction policy (a count-driven trigger alongside the existing byte trigger in MetalAllocator::malloc) makes the synthetic 1000-unique-size test pass, but in real LoRA training of Qwen3.6-35B-A3B-4bit it still crashed with cache_limit values of 8, 2, and 1 buffers. So either (a) eviction isn't releasing buffers as fast as they're allocated, (b) there's a second leak source upstream of the cache (the compile cache?), or (c) something else entirely. I have per-iteration probe data showing the MTLBuffer count jumping 47k → 128k → 145k under that workload, which I'd be happy to share separately if it's useful for diagnosis. I didn't want to put a behavior change in front of you while the root cause isn't fully nailed down.

This PR is therefore strictly a diagnostic foundation that helps users (and you) measure the bug in mlx-lm#1185 without committing to a specific fix.

Test plan

  • cmake --build . --target tests builds clean (Metal backend, MLX_BUILD_TESTS=ON)
  • ./tests/tests --test-case="*introspection*" — 1 passed, 0 failed
  • ./tests/tests --test-case="*allocation*" — existing tests still pass
  • clang-format -i produces no diff on changed files
  • CUDA build — I don't have a CUDA box; the stubs follow the set_wired_limit pattern but please flag if the wiring needs adjustment

…r pressure

Adds two read-only diagnostic functions to `mlx.core` (and the matching C++
API):

  - get_active_resource_count() — total GPU buffers (active + cached)
  - get_cache_count()           — buffers currently in the buffer cache

On Metal these return the MTLBuffer count, which is the dimension that hits
the per-process resource limit (~499000 on Apple silicon) when descriptor
pressure builds up — see ml-explore/mlx-lm#1185, where the leak was
diagnosed only post-mortem because there was no way to observe the count
from Python.

CUDA returns 0 from both probes (the CUDA resource limit is bytes, not
handles), mirroring how set_wired_limit is stubbed there.

No behavior change. The cache eviction policy is unchanged; this commit
only exposes counters that already existed internally (num_resources_ and
buffer_pool_.size()).

Tests: extended tests/allocator_tests.cpp with monotonicity checks (active
count grows after malloc, cache count grows after free, both drop to
baseline after clear_cache). Docs: added the two functions to
docs/src/python/memory_management.rst.
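A Python analogue of those monotonicity checks, assuming the proposed bindings and the default allocator behavior of returning freed buffers to the cache (the actual test in this PR is the C++ one in tests/allocator_tests.cpp):

```python
import mlx.core as mx

mx.clear_cache()
base_active = mx.get_active_resource_count()

a = mx.zeros((1024, 1024))
mx.eval(a)                                    # force the allocation
assert mx.get_active_resource_count() > base_active

del a                                         # freed buffer should land in
assert mx.get_cache_count() > 0               # the cache, not go to Metal

mx.clear_cache()                              # drain the cache
assert mx.get_cache_count() == 0
# The total count drops back toward base_active once the cache is empty.
```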
@zcbenz (Collaborator) left a comment


For testing cache growth, the get_cache_memory API is already enough, so we don't need a new get_cache_count API.

get_active_resource_count is a bad name for returning the number of buffers, because things like Events are usually also considered resources.

For ml-explore/mlx-lm#1185 the root cause is simply that too many buffers are created. You can try adjusting the MLX_MAX_OPS_PER_BUFFER and MLX_BFS_MAX_WIDTH envs to work around it, but I think the ultimate solution is to improve the allocator (by creating a big MTLBuffer and allocating from it). So I think the APIs in this PR won't help solve the problem.
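A minimal sketch of the suggested workaround together with the existing probe. The environment variable values below are arbitrary illustrations, not recommendations, and mx.get_cache_memory() is the API already available in mlx.core:

```python
import os

# Arbitrary illustration values; tune per workload. Set them before
# importing mlx in case they are only read once at startup.
os.environ["MLX_MAX_OPS_PER_BUFFER"] = "5"
os.environ["MLX_BFS_MAX_WIDTH"] = "10"

import mlx.core as mx

# Existing byte-level probe that already covers cache-growth diagnostics.
print(f"cache: {mx.get_cache_memory() / 1e6:.1f} MB")
```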

@lawcontinue

Fair points — get_cache_memory does cover the diagnostic need. The naming concern is valid too; Events are resources in the Metal model.

We'll go with the MLX_MAX_OPS_PER_BUFFER workaround for now. The big-allocator approach sounds like the right long-term fix — will keep an eye on that.

Closing this out, thanks for the direction.

@lawcontinue

Closing per maintainer feedback — the proposed APIs don't address the root cause.
