[RLlib] Handle the all-evaluation-workers-unhealthy case uniformly across modes by ArturNiederfahrenhorst · Pull Request #63128 · ray-project/ray

ArturNiederfahrenhorst · 2026-05-05T13:33:00Z

When all remote evaluation EnvRunners are unhealthy at the start of an evaluation step, Algorithm.evaluate() previously did one of two things:

evaluation_parallel_to_training=True: fall back to the local eval EnvRunner, which raises ValueError: Cannot run on local evaluation worker parallel to training!. This hard-crashes a long training run.
evaluation_parallel_to_training=False: silently fall back to the local eval EnvRunner. This "works", but the eval numbers are quietly produced by a different EnvRunner. That can be handy, but it happens silently, and the local EnvRunner may be configured differently from the evaluation EnvRunners. This is an example of us trying to be clever but possibly, silently, corrupting experiment results.

With the changes in this PR, both behaviors are gone. RLlib never silently falls back to local eval in the failure case anymore. Two new orthogonal config knobs on AlgorithmConfig.evaluation() control the behavior:

evaluation_unhealthy_workers_timeout_s (float, default 0): how long to wait for at least one remote evaluation EnvRunner to recover.
evaluation_error_after_n_consecutive_skips (bool, default False): configures whether, and after how many consecutive skipped evaluation steps, RLlib raises an error.
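To illustrate what waiting for "at least one remote evaluation EnvRunner to recover" could look like, here is a minimal sketch in plain Python. It is not RLlib's implementation: `wait_for_healthy_worker`, `num_healthy`, and `poll_interval_s` are hypothetical names chosen for this example, and the polling strategy is an assumption.

```python
import time
from typing import Callable


def wait_for_healthy_worker(
    num_healthy: Callable[[], int],
    timeout_s: float = 0.0,
    poll_interval_s: float = 0.01,
) -> bool:
    """Return True once at least one worker is healthy, False on timeout.

    Hypothetical helper: `num_healthy` stands in for whatever health probe
    the caller has. With timeout_s=0 the probe runs exactly once.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if num_healthy() > 0:
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            # Nobody recovered in time; the caller decides whether to skip.
            return False
        # Never sleep past the deadline.
        time.sleep(min(poll_interval_s, remaining))
```

With the default timeout of 0, the sketch probes once and returns immediately, which matches the idea that no waiting happens unless the user opts in.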

gemini-code-assist

Code Review

This pull request introduces a mechanism to handle scenarios where all configured remote evaluation workers are unhealthy. It adds two new configuration options: evaluation_unhealthy_workers_timeout_s, which allows the algorithm to wait for worker recovery, and evaluation_error_on_no_workers, which determines whether to raise a RuntimeError or skip evaluation if no workers become healthy. The Algorithm.evaluate method has been updated to incorporate these checks, and a comprehensive test suite has been added. Feedback highlights a potential UnboundLocalError when evaluation is skipped and suggests improving the efficiency of the recovery polling loop by triggering internal health checks.

…ross modes

When all *configured* remote evaluation EnvRunners are unhealthy at the start of an evaluation step, `Algorithm.evaluate()` previously did one of two things:

- `evaluation_parallel_to_training=True`: fall back to the local eval EnvRunner, which raises `ValueError: Cannot run on local evaluation worker parallel to training!`. Hard-crashes a long training run.
- `evaluation_parallel_to_training=False`: silently fall back to the local eval EnvRunner. "Works" but the eval numbers are quietly produced by a different EnvRunner from the one configured.

Both behaviors are gone. RLlib never silently falls back to local eval in the failure case anymore. Two new orthogonal config knobs on `AlgorithmConfig.evaluation()` control the behavior:

- `evaluation_unhealthy_workers_timeout_s` (float, default 0): how long to wait for at least one remote evaluation EnvRunner to recover.
- `evaluation_error_after_n_consecutive_skips` (Optional[int], default None): tolerate this many consecutive evaluation iterations in which all remote eval EnvRunners are unhealthy. On the next such iteration, `evaluate()` raises `RuntimeError`. The counter resets when an evaluation step actually runs on the remote workers, so transient preemption blips don't accumulate.

The threshold knob lets users distinguish transient (a few skips) from permanent (e.g. workers crash on init) failures: transient is fine, permanent escalates to Tune which can restart the trial per the trial's `max_failures` setting.

Both knobs apply uniformly regardless of `evaluation_parallel_to_training`. The intentional `evaluation_num_env_runners=0` case (user explicitly asked for local-only eval) is preserved -- this is not a fallback, it's the user's chosen configuration, and is recognized via `num_remote_env_runners() == 0`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
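The consecutive-skip semantics described in this commit message (tolerate N skips, raise on the next one, reset after a successful remote evaluation) can be sketched as follows. This is an illustrative model only: `ConsecutiveSkipTracker` and its method names are hypothetical and are not taken from the RLlib code.

```python
from typing import Optional


class ConsecutiveSkipTracker:
    """Hypothetical model of the skip counter; not RLlib's actual class."""

    def __init__(self, error_after_n_consecutive_skips: Optional[int] = None):
        # None means "never escalate": every skip is tolerated.
        self.threshold = error_after_n_consecutive_skips
        self.consecutive_skips = 0

    def on_evaluation_ran(self) -> None:
        # A real evaluation on remote workers resets the counter, so
        # transient blips do not accumulate over a long run.
        self.consecutive_skips = 0

    def on_all_workers_unhealthy(self) -> None:
        # Tolerate `threshold` skips; the next one raises.
        if self.threshold is not None and self.consecutive_skips >= self.threshold:
            raise RuntimeError(
                f"All remote evaluation workers unhealthy after "
                f"{self.consecutive_skips} consecutive skipped evaluations."
            )
        self.consecutive_skips += 1
```

Under this model, a threshold of 2 allows two skipped evaluations in a row and raises on the third, while a threshold of `None` keeps skipping indefinitely.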

ArturNiederfahrenhorst · 2026-05-07T20:40:02Z

                    kwargs=dict(algorithm=self, metrics_logger=self.metrics),
                )

+            eval_results: ResultDict = {}


Putting this here to make the code easier to read.
If we never get eval results, they will be {}, which is the default.

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

pseudo-rnd-thoughts

Looks good to me, I just have two comments

pseudo-rnd-thoughts · 2026-05-08T15:49:39Z

+    start = time.monotonic()
+    algo.evaluate()
+    elapsed = time.monotonic() - start
+    assert elapsed >= timeout_s


Is there a value to an upper bound?
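One way to act on this question is to bound the elapsed time from both sides. The sketch below is self-contained and does not touch RLlib: `fake_evaluate` is a hypothetical stand-in for an `evaluate()` call that waits out the timeout, and `slack_s` is an assumed tolerance that would need tuning for real CI machines.

```python
import time
from typing import Any, Callable, Tuple


def fake_evaluate(timeout_s: float) -> dict:
    """Hypothetical stand-in: waits out the timeout, then skips evaluation."""
    time.sleep(timeout_s)
    return {}


def timed_call(fn: Callable[..., Any], *args: Any) -> Tuple[Any, float]:
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.monotonic()
    result = fn(*args)
    return result, time.monotonic() - start


def within_bounds(elapsed_s: float, timeout_s: float, slack_s: float) -> bool:
    # Lower bound: the call really waited for the timeout.
    # Upper bound: the call did not hang far beyond it.
    return timeout_s <= elapsed_s <= timeout_s + slack_s
```

The upper bound catches a wait loop that overshoots or hangs. The trade-off is flakiness on slow machines, so the slack would have to be generous.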

pseudo-rnd-thoughts · 2026-05-08T15:51:32Z

                    kwargs=dict(algorithm=self, metrics_logger=self.metrics),
                )

+            eval_results: ResultDict = {}


Is {} the correct value for a failure? Is None a more helpful value to detect issues?

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

ArturNiederfahrenhorst · 2026-05-11T19:54:01Z

Is {} the correct value for a failure? Is None a more helpful value to detect issues?

Thanks! I agree that None would be more helpful, but None is not understood downstream. {} is also our default value in the step function. So I agree, but that would be a bigger change for another day, I suppose.
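A small sketch of the trade-off discussed here: an empty dict flows through dict-merging code unchanged, whereas None breaks it. `merge_results` and `evaluation_was_skipped` are hypothetical consumers written for this illustration. They are not RLlib functions, and the real downstream code may differ.

```python
from typing import Any, Dict, Optional


def merge_results(
    train_results: Dict[str, Any],
    eval_results: Optional[Dict[str, Any]],
) -> Dict[str, Any]:
    """Hypothetical downstream consumer that merges eval into train results."""
    merged = dict(train_results)
    # dict.update() accepts {} as a no-op but raises TypeError on None.
    merged.update(eval_results)
    return merged


def evaluation_was_skipped(eval_results: Dict[str, Any]) -> bool:
    # With {} as the "no results" value, callers detect a skip by emptiness.
    return not eval_results
```

With {}, a skipped evaluation is indistinguishable from one that produced no metrics. With None, every consumer like `merge_results` would first need a guard.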

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit beeeb73.

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

gemini-code-assist Bot reviewed May 5, 2026

ArturNiederfahrenhorst force-pushed the fix-eval-all-workers-unhealthy branch 3 times, most recently from 0a6b3a9 to 453923c Compare May 5, 2026 14:28

ArturNiederfahrenhorst marked this pull request as ready for review May 5, 2026 14:43

ArturNiederfahrenhorst requested a review from a team as a code owner May 5, 2026 14:43

ArturNiederfahrenhorst added the rllib and rllib-algorithms labels May 5, 2026

cursor Bot reviewed May 5, 2026

ArturNiederfahrenhorst force-pushed the fix-eval-all-workers-unhealthy branch from 9d50b63 to 113f33e Compare May 6, 2026 14:53

ArturNiederfahrenhorst force-pushed the fix-eval-all-workers-unhealthy branch from 113f33e to ab3c1d3 Compare May 6, 2026 15:37

ArturNiederfahrenhorst commented May 7, 2026

fixes

60e825f

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

cursor Bot reviewed May 7, 2026

re-sync weights if workers come back

32c4aea

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

ArturNiederfahrenhorst added the go label May 11, 2026

ArturNiederfahrenhorst added 2 commits May 11, 2026 15:08

fix test eval testing

8202bc1

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

polish

bb16352

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

pseudo-rnd-thoughts approved these changes May 11, 2026

mark's comment

beeeb73

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

cursor Bot reviewed May 11, 2026

ArturNiederfahrenhorst added 2 commits May 12, 2026 09:14

cursor comment

b12da30

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

polish test

c9c9deb

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

ArturNiederfahrenhorst enabled auto-merge (squash) May 12, 2026 07:39

github-actions Bot disabled auto-merge May 12, 2026 07:40

ArturNiederfahrenhorst enabled auto-merge (squash) May 12, 2026 08:36

Merge branch 'master' into fix-eval-all-workers-unhealthy

e8e94bf

github-actions Bot disabled auto-merge May 12, 2026 12:24

ArturNiederfahrenhorst merged commit 0074e78 into ray-project:master May 12, 2026
6 checks passed

ArturNiederfahrenhorst merged 9 commits into ray-project:master from ArturNiederfahrenhorst:fix-eval-all-workers-unhealthy
