[Core] deflake test_task_events_2 TestIsActorTaskRunning#63222
Conversation
Signed-off-by: sampan <sampan@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request updates several test cases in python/ray/tests/test_task_events_2.py by increasing the timeout for wait_for_condition calls to 30 seconds. These changes are likely intended to improve test reliability and reduce flakiness in environments where asynchronous process checks may take longer to complete. I have no feedback to provide as there were no review comments to evaluate.
|
@sampan-s-nayak do we know if setproctitle is the slow part of if it's something else? |
|
I dug a bit deeper and it looks like by the time the actor task runs, proctitle should have been set. the fact that even after 10seconds it is not set means that our change proctitle call never took effect. let me see if there is something wrong in the way we are changing proctitle. |
|
@Kunchd updated pr and description with findings, I see improvements when running the test multiple times locally but I am not sure if this will fully solve the problem (we may need to look deeper into the setproctitle code if we want to truly fix this) |
| _spt_setup_warning_logged = False | ||
| # spt_setup walks environ to recover the original argv memory region; see | ||
| # src/ray/thirdparty/setproctitle/spt_setup.c and spt_status.c. | ||
| _spt_setup_result = spt_setup() |
There was a problem hiding this comment.
I'm a little confused here. Why would invoking spt_setup outside setproctitle speed things up? Doesn't spt_setup's implementation already cache its result to ensure it only actually sets up once.
There was a problem hiding this comment.
from what I could gather from looking at the setproctitle codebase (which we vendor) it looks like calling this early helps as it collects process args env etc which we later modify from within the process (comment: https://github.com/ray-project/ray/blob/master/src/ray/thirdparty/setproctitle/spt_status.c#L158 )
There was a problem hiding this comment.
I dont think this will fully solve the problem, but I hope it reduces the flakiness
There was a problem hiding this comment.
So looking at this run where the test was flaky. It looks like the assertion for _is_actor_task_running is flaky. Are we saying that the assertion failed because setproctitle failed to successfully find the original argv array to set the proctitle, and that's why we need to do the setup earlier?
If we want to patch this quickly, I would love if we could link a couple of CI runs showing the change deflakes the test. I couldn't repro the flake using master on my local environment.
Otherwise, it might be worth sometime to look deeper into why setproctitle is failing, or if some other thing is failing.
There was a problem hiding this comment.
failed to successfully find the original argv array to set the proctitle, and that's why we need to do the setup earlier?
yeah I feel there is a race between raylets setEnv() and setproctitle reading argv (it reads by walking back from env[0]
. if env changes after pointer is created then the backward walk may fail leading to setproctitle failure) (the pathway:spt_setup → get_argc_argv → find_argv_from_env walks backward from environ[0]. when find_argv_from_env fails, spt_setup also fails and subsequent setProctitle() calls are no-ops)
so the fix here is to call spt_setup during raylet import time so that we reduce the possibility of race. I am also improving how we handle failure. previously even when setproctitle failed we would update local raylet cache , instead I just log the failure and skip cache update.
Kunchd
left a comment
There was a problem hiding this comment.
Looks good. Please attach runs showing the test is de-flaked on CI before merging. Thanks!
Its hard to prove the deflake, when I tried reproing this it took ~6runs t get it to flake. I feel we can just merge this and then monitor and see if we notice flakes in the future. the changes themselves are not very intrusive and should not cause issues |
Bump the inner wait_for_condition(_check) timeout from the 10s default
to 30s. Under loaded CI runners (especially the [aggregator] parametrize
variant), proctitle propagation + psutil reads on /proc/PID/{comm,cmdline}
can race past 10s, causing the actor task to hang on _check and
ray.get(...) to time out.
Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: sampan <sampan@anyscale.com>
7cbfd64 to
aebe6f1
Compare
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
|
the previous approach of doing the setup during import time was affecting all importers of ray, I am now using a different approach where is the walk back from environ fails then I fall back to reading the argv location from @Kunchd could you please review again |
Kunchd
left a comment
There was a problem hiding this comment.
The changes looks reasonable. But verifying pointer arithmetic by looking is never as good as actually running it. Please add tests for the added logic.
| exit: | ||
| if (buf) { free(buf); } | ||
| return rv; | ||
| } |
There was a problem hiding this comment.
We're adding a lot to the vendored logic at this point, Could we add a test to make sure this code path actually works?
There was a problem hiding this comment.
Sure, but this path will only trigger when the original fails so it's largely just a fallback
There was a problem hiding this comment.
added a test, merging now
Verifies that ray._raylet.setproctitle() still updates /proc/self/cmdline when environ has been cleared, exercising the find_argv_from_proc_stat() fallback in vendored setproctitle. Without the fallback, this scenario would silently no-op. Signed-off-by: sampan <sampan@anyscale.com>
setproctile sometimes fails to update proctitle (but the failure is ignored and raylet updates its internal cache). skipping cache update in case of setproctitle failures, also calling
spt_setup()early to try and make it more reliable (The vendored setproctitle implementation documents that argv/environ setup should happen early, and its setup path may recover argv by walking backward from environ. This change moves setup to _raylet import time, before worker startup does more process initialization.)