Skip to content

[Core] Fix OOM kill msg wrong threshold w/ resource isolation#62948

Merged
edoakes merged 11 commits into
ray-project:masterfrom
owenowenisme:`core/fix-oom-threshold-message`
May 14, 2026
Merged

[Core] Fix OOM kill msg wrong threshold w/ resource isolation#62948
edoakes merged 11 commits into
ray-project:masterfrom
owenowenisme:`core/fix-oom-threshold-message`

Conversation

@owenowenisme
Copy link
Copy Markdown
Member

@owenowenisme owenowenisme commented Apr 26, 2026

Description

The OOM kill message re-computed the memory threshold at kill time via GetMemoryThreshold(), which under --enable-resource-isolation could read a different cgroup memory.max value than at init time (e.g. "max" instead of digits), silently falling back to 0.95. Fix: have each monitor pass the threshold it actually fired on through the callback instead of letting NodeManager guess.

Now the OOM msg should be something like this

  Memory on the node (IP: 10.0.0.1, ID: abc123) was 7.20GB / 8.00GB (0.900000);
  OOM kill reason: Memory usage 7728742400B exceeded threshold of 6871947674B (85.9% of 8589934592B total);
  Object store memory usage: [...];
  Ray killed 1 worker(s) based on the killing policy: [...];

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

@owenowenisme owenowenisme requested a review from a team as a code owner April 26, 2026 03:52
@owenowenisme owenowenisme added core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests labels Apr 26, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the KillWorkersCallback signature to include the memory threshold that triggered the worker killing. This change allows memory monitors to propagate the specific threshold used (or a null value for event-driven monitors) directly to the NodeManager. Consequently, the NodeManager no longer needs to re-compute the threshold when generating OOM kill detail messages, leading to more accurate reporting of the triggering conditions. Various monitor implementations, tests, and the NodeManager's message formatting logic have been updated to support this new parameter. I have no feedback to provide as there are no review comments.

Copy link
Copy Markdown
Contributor

@Kunchd Kunchd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for helping with the kill worker call path! Like you mentioned, the disconnect between when the memory monitors decides to trigger an oom kill and when the node manager execute the killing policy to decide what workers to kill can lead to very confusing log messages.

We've actually decided to move the final decision for whether or not we decide to kill workers completely into the node manager as it has a more up-to-date view on the system memory utilization at the time of the kill callback execution compared to when the memory monitor decides to trigger the callback. This lead us to eliminate all arguments from the kill worker callback: https://github.com/ray-project/ray/blob/master/src/ray/raylet/node_manager.cc#L3076.

However, I do believe it's important to make it clear why the monitor decided to trigger the callback and how the node manager separately determined how many nodes to kill. Do you think passing a reason for kill message from the memory monitor and logging it in the OOM kill message would help with the issue you observed?

Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
@owenowenisme owenowenisme force-pushed the `core/fix-oom-threshold-message` branch from 3d785b2 to d3e631d Compare May 8, 2026 08:23
@owenowenisme
Copy link
Copy Markdown
Member Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the KillWorkersCallback signature to include a trigger_reason string, allowing for more descriptive OOM kill messages across the system. The changes propagate through the event, pressure, and threshold memory monitors, as well as the NodeManager. A potential division by zero was identified in the ThresholdMemoryMonitor when calculating the memory usage percentage for the log message.

Comment thread src/ray/common/threshold_memory_monitor.cc Outdated
Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
Copy link
Copy Markdown
Contributor

@Kunchd Kunchd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the change! This looks a lot cleaner. I left some minor nits/personal preferences.

Comment thread src/ray/raylet/node_manager.cc Outdated
float computed_threshold_fraction =
static_cast<float>(computed_threshold_bytes) /
static_cast<float>(total_memory_bytes);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! This is a lot cleaner!

Comment thread src/ray/raylet/node_manager.cc Outdated
Comment thread src/ray/common/event_memory_monitor.cc Outdated
Comment thread src/ray/common/pressure_memory_monitor.cc Outdated
Comment thread src/ray/common/memory_monitor_interface.h Outdated
owenowenisme and others added 2 commits May 13, 2026 12:07
Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit 67b09d4. Configure here.

Comment thread src/ray/raylet/node_manager.cc
owenowenisme and others added 3 commits May 13, 2026 12:45
Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
Copy link
Copy Markdown
Contributor

@Kunchd Kunchd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for making the oom messages more clear. One last thing, could you add an example of what the new OOM message looks like in the PR description.

Will approve once premerge passes.

Comment thread src/ray/common/tests/event_memory_monitor_test.cc Outdated
Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
Copy link
Copy Markdown
Contributor

@Kunchd Kunchd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this!

@edoakes edoakes merged commit 76fca96 into ray-project:master May 14, 2026
6 checks passed
TruongQuangPhat pushed a commit to cyhapun/ray-fix-issue that referenced this pull request May 27, 2026
…oject#62948)

## Description
The OOM kill message re-computed the memory threshold at kill time via
`GetMemoryThreshold()`, which under `--enable-resource-isolation` could
read a different cgroup `memory.max` value than at init time (e.g. "max"
instead of digits), silently falling back to 0.95. Fix: have each
monitor pass the threshold it actually fired on through the callback
instead of letting NodeManager guess.

Now the OOM msg should be something like this

```
  Memory on the node (IP: 10.0.0.1, ID: abc123) was 7.20GB / 8.00GB (0.900000);
  OOM kill reason: Memory usage 7728742400B exceeded threshold of 6871947674B (85.9% of 8589934592B total);
  Object store memory usage: [...];
  Ray killed 1 worker(s) based on the killing policy: [...];
```

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
Signed-off-by: phattruong <23120318@student.hcmus.edu.vn>
alexandrplashchinsky pushed a commit to alexandrplashchinsky/ray-alex that referenced this pull request May 29, 2026
…oject#62948)

## Description
The OOM kill message re-computed the memory threshold at kill time via
`GetMemoryThreshold()`, which under `--enable-resource-isolation` could
read a different cgroup `memory.max` value than at init time (e.g. "max"
instead of digits), silently falling back to 0.95. Fix: have each
monitor pass the threshold it actually fired on through the callback
instead of letting NodeManager guess.

Now the OOM msg should be something like this

```
  Memory on the node (IP: 10.0.0.1, ID: abc123) was 7.20GB / 8.00GB (0.900000);
  OOM kill reason: Memory usage 7728742400B exceeded threshold of 6871947674B (85.9% of 8589934592B total);
  Object store memory usage: [...];
  Ray killed 1 worker(s) based on the killing policy: [...];
```

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ray fails to serialize self-reference objects

4 participants