[RLlib] Handle the all-evaluation-workers-unhealthy case uniformly across modes#63128
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a mechanism to handle scenarios where all configured remote evaluation workers are unhealthy. It adds two new configuration options: evaluation_unhealthy_workers_timeout_s, which allows the algorithm to wait for worker recovery, and evaluation_error_on_no_workers, which determines whether to raise a RuntimeError or skip evaluation if no workers become healthy. The Algorithm.evaluate method has been updated to incorporate these checks, and a comprehensive test suite has been added. Feedback highlights a potential UnboundLocalError when evaluation is skipped and suggests improving the efficiency of the recovery polling loop by triggering internal health checks.
0a6b3a9 to
453923c
Compare
9d50b63 to
113f33e
Compare
…ross modes When all *configured* remote evaluation EnvRunners are unhealthy at the start of an evaluation step, `Algorithm.evaluate()` previously did one of two things: - `evaluation_parallel_to_training=True`: fall back to the local eval EnvRunner, which raises `ValueError: Cannot run on local evaluation worker parallel to training!`. Hard-crashes a long training run. - `evaluation_parallel_to_training=False`: silently fall back to the local eval EnvRunner. "Works" but the eval numbers are quietly produced by a different EnvRunner from the one configured. Both behaviors are gone. RLlib never silently falls back to local eval in the failure case anymore. Two new orthogonal config knobs on `AlgorithmConfig.evaluation()` control the behavior: - `evaluation_unhealthy_workers_timeout_s` (float, default 0): how long to wait for at least one remote evaluation EnvRunner to recover. - `evaluation_error_after_n_consecutive_skips` (Optional[int], default None): tolerate this many consecutive evaluation iterations in which all remote eval EnvRunners are unhealthy. On the next such iteration, `evaluate()` raises `RuntimeError`. The counter resets when an evaluation step actually runs on the remote workers, so transient preemption blips don't accumulate. The threshold knob lets users distinguish transient (a few skips) from permanent (e.g. workers crash on init) failures: transient is fine, permanent escalates to Tune which can restart the trial per the trial's `max_failures` setting. Both knobs apply uniformly regardless of `evaluation_parallel_to_training`. The intentional `evaluation_num_env_runners=0` case (user explicitly asked for local-only eval) is preserved -- this is not a fallback, it's the user's chosen configuration, and is recognized via `num_remote_env_runners() == 0`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
113f33e to
ab3c1d3
Compare
| kwargs=dict(algorithm=self, metrics_logger=self.metrics), | ||
| ) | ||
|
|
||
| eval_results: ResultDict = {} |
There was a problem hiding this comment.
Butting this here to make the code easier to read.
If we never get eval results, they will be {}, it is the default.
There was a problem hiding this comment.
Is {} the correct value for a failure? Is None a more helpful value to detect issues?
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
pseudo-rnd-thoughts
left a comment
There was a problem hiding this comment.
Looks good to me, I just have two comments
| start = time.monotonic() | ||
| algo.evaluate() | ||
| elapsed = time.monotonic() - start | ||
| assert elapsed >= timeout_s |
There was a problem hiding this comment.
Is there a value to an upper bound?
| kwargs=dict(algorithm=self, metrics_logger=self.metrics), | ||
| ) | ||
|
|
||
| eval_results: ResultDict = {} |
There was a problem hiding this comment.
Is {} the correct value for a failure? Is None a more helpful value to detect issues?
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Thanks! I agree that it is more helpful but None is not understood downstream. It's also our default value in the |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit beeeb73. Configure here.
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
…ross modes (ray-project#63128) When all remote evaluation EnvRunners are unhealthy at the start of an evaluation step, `Algorithm.evaluate()` previously did one of two thing: - `evaluation_parallel_to_training=True`: fall back to the local eval EnvRunner, which raises `ValueError: Cannot run on local evaluation worker parallel to training!`. Hard-crashes a long training run. - `evaluation_parallel_to_training=False`: silently fall back to the local eval EnvRunner. "Works" but the eval numbers are quietly produced by a different EnvRunner. This can be handy, but it is also silent and the local env runner may be configured differently than evaluation runners. This is an example of us trying to be clever but possibly, silently, corrupting experiment results. With the changes in this PR, both behaviors are gone. RLlib never silently falls back to local eval in the failure case anymore. Two new orthogonal config knobs on `AlgorithmConfig.evaluation()` control the behavior: - `evaluation_unhealthy_workers_timeout_s` (float, default 0): how long to wait for at least one remote evaluation EnvRunner to recover. - `evaluation_error_after_n_consecutive_skips` (bool, default False): Configured if or after how many skips we raise an error. --------- Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ross modes (ray-project#63128) When all remote evaluation EnvRunners are unhealthy at the start of an evaluation step, `Algorithm.evaluate()` previously did one of two thing: - `evaluation_parallel_to_training=True`: fall back to the local eval EnvRunner, which raises `ValueError: Cannot run on local evaluation worker parallel to training!`. Hard-crashes a long training run. - `evaluation_parallel_to_training=False`: silently fall back to the local eval EnvRunner. "Works" but the eval numbers are quietly produced by a different EnvRunner. This can be handy, but it is also silent and the local env runner may be configured differently than evaluation runners. This is an example of us trying to be clever but possibly, silently, corrupting experiment results. With the changes in this PR, both behaviors are gone. RLlib never silently falls back to local eval in the failure case anymore. Two new orthogonal config knobs on `AlgorithmConfig.evaluation()` control the behavior: - `evaluation_unhealthy_workers_timeout_s` (float, default 0): how long to wait for at least one remote evaluation EnvRunner to recover. - `evaluation_error_after_n_consecutive_skips` (bool, default False): Configured if or after how many skips we raise an error. --------- Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: anindyam1969 <amukherjee@kinetica.com>
…ross modes (ray-project#63128) When all remote evaluation EnvRunners are unhealthy at the start of an evaluation step, `Algorithm.evaluate()` previously did one of two thing: - `evaluation_parallel_to_training=True`: fall back to the local eval EnvRunner, which raises `ValueError: Cannot run on local evaluation worker parallel to training!`. Hard-crashes a long training run. - `evaluation_parallel_to_training=False`: silently fall back to the local eval EnvRunner. "Works" but the eval numbers are quietly produced by a different EnvRunner. This can be handy, but it is also silent and the local env runner may be configured differently than evaluation runners. This is an example of us trying to be clever but possibly, silently, corrupting experiment results. With the changes in this PR, both behaviors are gone. RLlib never silently falls back to local eval in the failure case anymore. Two new orthogonal config knobs on `AlgorithmConfig.evaluation()` control the behavior: - `evaluation_unhealthy_workers_timeout_s` (float, default 0): how long to wait for at least one remote evaluation EnvRunner to recover. - `evaluation_error_after_n_consecutive_skips` (bool, default False): Configured if or after how many skips we raise an error. --------- Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ross modes (ray-project#63128) When all remote evaluation EnvRunners are unhealthy at the start of an evaluation step, `Algorithm.evaluate()` previously did one of two thing: - `evaluation_parallel_to_training=True`: fall back to the local eval EnvRunner, which raises `ValueError: Cannot run on local evaluation worker parallel to training!`. Hard-crashes a long training run. - `evaluation_parallel_to_training=False`: silently fall back to the local eval EnvRunner. "Works" but the eval numbers are quietly produced by a different EnvRunner. This can be handy, but it is also silent and the local env runner may be configured differently than evaluation runners. This is an example of us trying to be clever but possibly, silently, corrupting experiment results. With the changes in this PR, both behaviors are gone. RLlib never silently falls back to local eval in the failure case anymore. Two new orthogonal config knobs on `AlgorithmConfig.evaluation()` control the behavior: - `evaluation_unhealthy_workers_timeout_s` (float, default 0): how long to wait for at least one remote evaluation EnvRunner to recover. - `evaluation_error_after_n_consecutive_skips` (bool, default False): Configured if or after how many skips we raise an error. --------- Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>

When all remote evaluation EnvRunners are unhealthy at the start of an evaluation step,
Algorithm.evaluate()previously did one of two thing:evaluation_parallel_to_training=True: fall back to the local eval EnvRunner, which raisesValueError: Cannot run on local evaluation worker parallel to training!. Hard-crashes a long training run.evaluation_parallel_to_training=False: silently fall back to the local eval EnvRunner. "Works" but the eval numbers are quietly produced by a different EnvRunner. This can be handy, but it is also silent and the local env runner may be configured differently than evaluation runners. This is an example of us trying to be clever but possibly, silently, corrupting experiment results.With the changes in this PR, both behaviors are gone. RLlib never silently falls back to local eval in the failure case anymore. Two new orthogonal config knobs on
AlgorithmConfig.evaluation()control the behavior:evaluation_unhealthy_workers_timeout_s(float, default 0): how long to wait for at least one remote evaluation EnvRunner to recover.evaluation_error_after_n_consecutive_skips(bool, default False): Configured if or after how many skips we raise an error.