[Core] Fix OOM kill msg wrong threshold w/ resource isolation#62948
Conversation
There was a problem hiding this comment.
Code Review
This pull request updates the KillWorkersCallback signature to include the memory threshold that triggered the worker killing. This change allows memory monitors to propagate the specific threshold used (or a null value for event-driven monitors) directly to the NodeManager. Consequently, the NodeManager no longer needs to re-compute the threshold when generating OOM kill detail messages, leading to more accurate reporting of the triggering conditions. Various monitor implementations, tests, and the NodeManager's message formatting logic have been updated to support this new parameter. I have no feedback to provide as there are no review comments.
Kunchd
left a comment
There was a problem hiding this comment.
Thanks for helping with the kill worker call path! Like you mentioned, the disconnect between when the memory monitors decides to trigger an oom kill and when the node manager execute the killing policy to decide what workers to kill can lead to very confusing log messages.
We've actually decided to move the final decision for whether or not we decide to kill workers completely into the node manager as it has a more up-to-date view on the system memory utilization at the time of the kill callback execution compared to when the memory monitor decides to trigger the callback. This lead us to eliminate all arguments from the kill worker callback: https://github.com/ray-project/ray/blob/master/src/ray/raylet/node_manager.cc#L3076.
However, I do believe it's important to make it clear why the monitor decided to trigger the callback and how the node manager separately determined how many nodes to kill. Do you think passing a reason for kill message from the memory monitor and logging it in the OOM kill message would help with the issue you observed?
3d785b2 to
d3e631d
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request updates the KillWorkersCallback signature to include a trigger_reason string, allowing for more descriptive OOM kill messages across the system. The changes propagate through the event, pressure, and threshold memory monitors, as well as the NodeManager. A potential division by zero was identified in the ThresholdMemoryMonitor when calculating the memory usage percentage for the log message.
Kunchd
left a comment
There was a problem hiding this comment.
Thanks for the change! This looks a lot cleaner. I left some minor nits/personal preferences.
| float computed_threshold_fraction = | ||
| static_cast<float>(computed_threshold_bytes) / | ||
| static_cast<float>(total_memory_bytes); | ||
|
|
There was a problem hiding this comment.
Nice! This is a lot cleaner!
Signed-off-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 67b09d4. Configure here.
Kunchd
left a comment
There was a problem hiding this comment.
LGTM! Thanks for making the oom messages more clear. One last thing, could you add an example of what the new OOM message looks like in the PR description.
Will approve once premerge passes.
…oject#62948) ## Description The OOM kill message re-computed the memory threshold at kill time via `GetMemoryThreshold()`, which under `--enable-resource-isolation` could read a different cgroup `memory.max` value than at init time (e.g. "max" instead of digits), silently falling back to 0.95. Fix: have each monitor pass the threshold it actually fired on through the callback instead of letting NodeManager guess. Now the OOM msg should be something like this ``` Memory on the node (IP: 10.0.0.1, ID: abc123) was 7.20GB / 8.00GB (0.900000); OOM kill reason: Memory usage 7728742400B exceeded threshold of 6871947674B (85.9% of 8589934592B total); Object store memory usage: [...]; Ray killed 1 worker(s) based on the killing policy: [...]; ``` ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: You-Cheng Lin <mses010108@gmail.com> Signed-off-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com> Signed-off-by: phattruong <23120318@student.hcmus.edu.vn>
…oject#62948) ## Description The OOM kill message re-computed the memory threshold at kill time via `GetMemoryThreshold()`, which under `--enable-resource-isolation` could read a different cgroup `memory.max` value than at init time (e.g. "max" instead of digits), silently falling back to 0.95. Fix: have each monitor pass the threshold it actually fired on through the callback instead of letting NodeManager guess. Now the OOM msg should be something like this ``` Memory on the node (IP: 10.0.0.1, ID: abc123) was 7.20GB / 8.00GB (0.900000); OOM kill reason: Memory usage 7728742400B exceeded threshold of 6871947674B (85.9% of 8589934592B total); Object store memory usage: [...]; Ray killed 1 worker(s) based on the killing policy: [...]; ``` ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: You-Cheng Lin <mses010108@gmail.com> Signed-off-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com> Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>

Description
The OOM kill message re-computed the memory threshold at kill time via
GetMemoryThreshold(), which under--enable-resource-isolationcould read a different cgroupmemory.maxvalue than at init time (e.g. "max" instead of digits), silently falling back to 0.95. Fix: have each monitor pass the threshold it actually fired on through the callback instead of letting NodeManager guess.Now the OOM msg should be something like this
Related issues
Additional information