Skip to content

[Core] deflake test_task_events_2 TestIsActorTaskRunning#63222

Merged
sampan-s-nayak merged 11 commits into
ray-project:masterfrom
sampan-s-nayak:deflake-test-task-events-2
Jun 2, 2026
Merged

[Core] deflake test_task_events_2 TestIsActorTaskRunning#63222
sampan-s-nayak merged 11 commits into
ray-project:masterfrom
sampan-s-nayak:deflake-test-task-events-2

Conversation

@sampan-s-nayak
Copy link
Copy Markdown
Contributor

@sampan-s-nayak sampan-s-nayak commented May 8, 2026

setproctile sometimes fails to update proctitle (but the failure is ignored and raylet updates its internal cache). skipping cache update in case of setproctitle failures, also calling spt_setup() early to try and make it more reliable (The vendored setproctitle implementation documents that argv/environ setup should happen early, and its setup path may recover argv by walking backward from environ. This change moves setup to _raylet import time, before worker startup does more process initialization.)

Signed-off-by: sampan <sampan@anyscale.com>
@sampan-s-nayak sampan-s-nayak requested a review from a team as a code owner May 8, 2026 08:48
@sampan-s-nayak sampan-s-nayak self-assigned this May 8, 2026
@sampan-s-nayak sampan-s-nayak added the go add ONLY when ready to merge, run all tests label May 8, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates several test cases in python/ray/tests/test_task_events_2.py by increasing the timeout for wait_for_condition calls to 30 seconds. These changes are likely intended to improve test reliability and reduce flakiness in environments where asynchronous process checks may take longer to complete. I have no feedback to provide as there were no review comments to evaluate.

@ray-gardener ray-gardener Bot added the core Issues that should be addressed in Ray Core label May 8, 2026
@Kunchd
Copy link
Copy Markdown
Contributor

Kunchd commented May 8, 2026

@sampan-s-nayak do we know if setproctitle is the slow part of if it's something else?

@sampan-s-nayak
Copy link
Copy Markdown
Contributor Author

I dug a bit deeper and it looks like by the time the actor task runs, proctitle should have been set. the fact that even after 10seconds it is not set means that our change proctitle call never took effect. let me see if there is something wrong in the way we are changing proctitle.

@sampan-s-nayak
Copy link
Copy Markdown
Contributor Author

@Kunchd updated pr and description with findings, I see improvements when running the test multiple times locally but I am not sure if this will fully solve the problem (we may need to look deeper into the setproctitle code if we want to truly fix this)

Comment thread python/ray/includes/setproctitle.pxi Outdated
_spt_setup_warning_logged = False
# spt_setup walks environ to recover the original argv memory region; see
# src/ray/thirdparty/setproctitle/spt_setup.c and spt_status.c.
_spt_setup_result = spt_setup()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little confused here. Why would invoking spt_setup outside setproctitle speed things up? Doesn't spt_setup's implementation already cache its result to ensure it only actually sets up once.

Copy link
Copy Markdown
Contributor Author

@sampan-s-nayak sampan-s-nayak May 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from what I could gather from looking at the setproctitle codebase (which we vendor) it looks like calling this early helps as it collects process args env etc which we later modify from within the process (comment: https://github.com/ray-project/ray/blob/master/src/ray/thirdparty/setproctitle/spt_status.c#L158 )

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont think this will fully solve the problem, but I hope it reduces the flakiness

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So looking at this run where the test was flaky. It looks like the assertion for _is_actor_task_running is flaky. Are we saying that the assertion failed because setproctitle failed to successfully find the original argv array to set the proctitle, and that's why we need to do the setup earlier?

If we want to patch this quickly, I would love if we could link a couple of CI runs showing the change deflakes the test. I couldn't repro the flake using master on my local environment.

Otherwise, it might be worth sometime to look deeper into why setproctitle is failing, or if some other thing is failing.

Copy link
Copy Markdown
Contributor Author

@sampan-s-nayak sampan-s-nayak May 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

failed to successfully find the original argv array to set the proctitle, and that's why we need to do the setup earlier?

yeah I feel there is a race between raylets setEnv() and setproctitle reading argv (it reads by walking back from env[0]

. if env changes after pointer is created then the backward walk may fail leading to setproctitle failure) (the pathway: spt_setupget_argc_argvfind_argv_from_env walks backward from environ[0]. when find_argv_from_env fails, spt_setup also fails and subsequent setProctitle() calls are no-ops)

so the fix here is to call spt_setup during raylet import time so that we reduce the possibility of race. I am also improving how we handle failure. previously even when setproctitle failed we would update local raylet cache , instead I just log the failure and skip cache update.

@sampan-s-nayak sampan-s-nayak requested a review from Kunchd May 20, 2026 03:40
Copy link
Copy Markdown
Contributor

@Kunchd Kunchd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Please attach runs showing the test is de-flaked on CI before merging. Thanks!

@sampan-s-nayak
Copy link
Copy Markdown
Contributor Author

sampan-s-nayak commented May 21, 2026

Looks good. Please attach runs showing the test is de-flaked on CI before merging. Thanks!

Its hard to prove the deflake, when I tried reproing this it took ~6runs t get it to flake. I feel we can just merge this and then monitor and see if we notice flakes in the future. the changes themselves are not very intrusive and should not cause issues

@sampan-s-nayak sampan-s-nayak enabled auto-merge (squash) May 21, 2026 05:03
@github-actions github-actions Bot disabled auto-merge May 22, 2026 04:07
sampan added 8 commits May 28, 2026 20:48
Bump the inner wait_for_condition(_check) timeout from the 10s default
to 30s. Under loaded CI runners (especially the [aggregator] parametrize
variant), proctitle propagation + psutil reads on /proc/PID/{comm,cmdline}
can race past 10s, causing the actor task to hang on _check and
ray.get(...) to time out.

Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: sampan <sampan@anyscale.com>
@sampan-s-nayak sampan-s-nayak force-pushed the deflake-test-task-events-2 branch from 7cbfd64 to aebe6f1 Compare May 29, 2026 23:54
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
@sampan-s-nayak
Copy link
Copy Markdown
Contributor Author

the previous approach of doing the setup during import time was affecting all importers of ray, I am now using a different approach where is the walk back from environ fails then I fall back to reading the argv location from /proc/self/stat. this is a fallback and will only take affect when the existing approach fails so I dont expect any side effects due to this.

@Kunchd could you please review again

Copy link
Copy Markdown
Contributor

@Kunchd Kunchd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes looks reasonable. But verifying pointer arithmetic by looking is never as good as actually running it. Please add tests for the added logic.

exit:
if (buf) { free(buf); }
return rv;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're adding a lot to the vendored logic at this point, Could we add a test to make sure this code path actually works?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but this path will only trigger when the original fails so it's largely just a fallback

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a test, merging now

Verifies that ray._raylet.setproctitle() still updates /proc/self/cmdline
when environ has been cleared, exercising the find_argv_from_proc_stat()
fallback in vendored setproctitle. Without the fallback, this scenario
would silently no-op.

Signed-off-by: sampan <sampan@anyscale.com>
@sampan-s-nayak sampan-s-nayak merged commit 71a12c3 into ray-project:master Jun 2, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants