Skip to content

[RLlib] Handle the all-evaluation-workers-unhealthy case uniformly across modes#63128

Merged
ArturNiederfahrenhorst merged 9 commits into
ray-project:masterfrom
ArturNiederfahrenhorst:fix-eval-all-workers-unhealthy
May 12, 2026
Merged

[RLlib] Handle the all-evaluation-workers-unhealthy case uniformly across modes#63128
ArturNiederfahrenhorst merged 9 commits into
ray-project:masterfrom
ArturNiederfahrenhorst:fix-eval-all-workers-unhealthy

Conversation

@ArturNiederfahrenhorst
Copy link
Copy Markdown
Contributor

@ArturNiederfahrenhorst ArturNiederfahrenhorst commented May 5, 2026

When all remote evaluation EnvRunners are unhealthy at the start of an evaluation step, Algorithm.evaluate() previously did one of two thing:

  • evaluation_parallel_to_training=True: fall back to the local eval EnvRunner, which raises ValueError: Cannot run on local evaluation worker parallel to training!. Hard-crashes a long training run.
  • evaluation_parallel_to_training=False: silently fall back to the local eval EnvRunner. "Works" but the eval numbers are quietly produced by a different EnvRunner. This can be handy, but it is also silent and the local env runner may be configured differently than evaluation runners. This is an example of us trying to be clever but possibly, silently, corrupting experiment results.

With the changes in this PR, both behaviors are gone. RLlib never silently falls back to local eval in the failure case anymore. Two new orthogonal config knobs on AlgorithmConfig.evaluation() control the behavior:

  • evaluation_unhealthy_workers_timeout_s (float, default 0): how long to wait for at least one remote evaluation EnvRunner to recover.
  • evaluation_error_after_n_consecutive_skips (bool, default False): Configured if or after how many skips we raise an error.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a mechanism to handle scenarios where all configured remote evaluation workers are unhealthy. It adds two new configuration options: evaluation_unhealthy_workers_timeout_s, which allows the algorithm to wait for worker recovery, and evaluation_error_on_no_workers, which determines whether to raise a RuntimeError or skip evaluation if no workers become healthy. The Algorithm.evaluate method has been updated to incorporate these checks, and a comprehensive test suite has been added. Feedback highlights a potential UnboundLocalError when evaluation is skipped and suggests improving the efficiency of the recovery polling loop by triggering internal health checks.

Comment thread rllib/algorithms/algorithm.py
Comment thread rllib/algorithms/algorithm.py Outdated
@ArturNiederfahrenhorst ArturNiederfahrenhorst force-pushed the fix-eval-all-workers-unhealthy branch 3 times, most recently from 0a6b3a9 to 453923c Compare May 5, 2026 14:28
@ArturNiederfahrenhorst ArturNiederfahrenhorst marked this pull request as ready for review May 5, 2026 14:43
@ArturNiederfahrenhorst ArturNiederfahrenhorst requested a review from a team as a code owner May 5, 2026 14:43
@ArturNiederfahrenhorst ArturNiederfahrenhorst added rllib RLlib related issues rllib-algorithms An RLlib algorithm/Trainer is not learning. labels May 5, 2026
Comment thread rllib/algorithms/algorithm.py
@ArturNiederfahrenhorst ArturNiederfahrenhorst force-pushed the fix-eval-all-workers-unhealthy branch from 9d50b63 to 113f33e Compare May 6, 2026 14:53
…ross modes

When all *configured* remote evaluation EnvRunners are unhealthy at the
start of an evaluation step, `Algorithm.evaluate()` previously did one
of two things:

- `evaluation_parallel_to_training=True`: fall back to the local eval
  EnvRunner, which raises `ValueError: Cannot run on local evaluation
  worker parallel to training!`. Hard-crashes a long training run.
- `evaluation_parallel_to_training=False`: silently fall back to the
  local eval EnvRunner. "Works" but the eval numbers are quietly
  produced by a different EnvRunner from the one configured.

Both behaviors are gone. RLlib never silently falls back to local eval
in the failure case anymore. Two new orthogonal config knobs on
`AlgorithmConfig.evaluation()` control the behavior:

- `evaluation_unhealthy_workers_timeout_s` (float, default 0): how long
  to wait for at least one remote evaluation EnvRunner to recover.
- `evaluation_error_after_n_consecutive_skips` (Optional[int], default
  None): tolerate this many consecutive evaluation iterations in which
  all remote eval EnvRunners are unhealthy. On the next such iteration,
  `evaluate()` raises `RuntimeError`. The counter resets when an
  evaluation step actually runs on the remote workers, so transient
  preemption blips don't accumulate.

The threshold knob lets users distinguish transient (a few skips) from
permanent (e.g. workers crash on init) failures: transient is fine,
permanent escalates to Tune which can restart the trial per the trial's
`max_failures` setting.

Both knobs apply uniformly regardless of `evaluation_parallel_to_training`.

The intentional `evaluation_num_env_runners=0` case (user explicitly
asked for local-only eval) is preserved -- this is not a fallback, it's
the user's chosen configuration, and is recognized via
`num_remote_env_runners() == 0`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ArturNiederfahrenhorst ArturNiederfahrenhorst force-pushed the fix-eval-all-workers-unhealthy branch from 113f33e to ab3c1d3 Compare May 6, 2026 15:37
kwargs=dict(algorithm=self, metrics_logger=self.metrics),
)

eval_results: ResultDict = {}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Butting this here to make the code easier to read.
If we never get eval results, they will be {}, it is the default.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is {} the correct value for a failure? Is None a more helpful value to detect issues?

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Comment thread rllib/algorithms/algorithm.py
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
@ArturNiederfahrenhorst ArturNiederfahrenhorst added the go add ONLY when ready to merge, run all tests label May 11, 2026
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Copy link
Copy Markdown
Member

@pseudo-rnd-thoughts pseudo-rnd-thoughts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, I just have two comments

start = time.monotonic()
algo.evaluate()
elapsed = time.monotonic() - start
assert elapsed >= timeout_s
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a value to an upper bound?

kwargs=dict(algorithm=self, metrics_logger=self.metrics),
)

eval_results: ResultDict = {}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is {} the correct value for a failure? Is None a more helpful value to detect issues?

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
@ArturNiederfahrenhorst
Copy link
Copy Markdown
Contributor Author

Is {} the correct value for a failure? Is None a more helpful value to detect issues?

Thanks! I agree that it is more helpful but None is not understood downstream. It's also our default value in the step function. So I agree, but it will be a bigger change for another day I suppose.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit beeeb73. Configure here.

Comment thread rllib/algorithms/algorithm.py
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
@ArturNiederfahrenhorst ArturNiederfahrenhorst enabled auto-merge (squash) May 12, 2026 07:39
@github-actions github-actions Bot disabled auto-merge May 12, 2026 07:40
@ArturNiederfahrenhorst ArturNiederfahrenhorst enabled auto-merge (squash) May 12, 2026 08:36
@github-actions github-actions Bot disabled auto-merge May 12, 2026 12:24
@ArturNiederfahrenhorst ArturNiederfahrenhorst merged commit 0074e78 into ray-project:master May 12, 2026
6 checks passed
dancingactor pushed a commit to dancingactor/ray that referenced this pull request May 13, 2026
…ross modes (ray-project#63128)

When all remote evaluation EnvRunners are unhealthy at the start of an
evaluation step, `Algorithm.evaluate()` previously did one of two thing:

- `evaluation_parallel_to_training=True`: fall back to the local eval
EnvRunner, which raises `ValueError: Cannot run on local evaluation
worker parallel to training!`. Hard-crashes a long training run.
- `evaluation_parallel_to_training=False`: silently fall back to the
local eval EnvRunner. "Works" but the eval numbers are quietly produced
by a different EnvRunner. This can be handy, but it is also silent and
the local env runner may be configured differently than evaluation
runners. This is an example of us trying to be clever but possibly,
silently, corrupting experiment results.

With the changes in this PR, both behaviors are gone. RLlib never
silently falls back to local eval in the failure case anymore. Two new
orthogonal config knobs on `AlgorithmConfig.evaluation()` control the
behavior:

- `evaluation_unhealthy_workers_timeout_s` (float, default 0): how long
to wait for at least one remote evaluation EnvRunner to recover.
- `evaluation_error_after_n_consecutive_skips` (bool, default False):
Configured if or after how many skips we raise an error.

---------

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
am-kinetica pushed a commit to kineticadb/ray that referenced this pull request May 14, 2026
…ross modes (ray-project#63128)

When all remote evaluation EnvRunners are unhealthy at the start of an
evaluation step, `Algorithm.evaluate()` previously did one of two thing:

- `evaluation_parallel_to_training=True`: fall back to the local eval
EnvRunner, which raises `ValueError: Cannot run on local evaluation
worker parallel to training!`. Hard-crashes a long training run.
- `evaluation_parallel_to_training=False`: silently fall back to the
local eval EnvRunner. "Works" but the eval numbers are quietly produced
by a different EnvRunner. This can be handy, but it is also silent and
the local env runner may be configured differently than evaluation
runners. This is an example of us trying to be clever but possibly,
silently, corrupting experiment results.

With the changes in this PR, both behaviors are gone. RLlib never
silently falls back to local eval in the failure case anymore. Two new
orthogonal config knobs on `AlgorithmConfig.evaluation()` control the
behavior:

- `evaluation_unhealthy_workers_timeout_s` (float, default 0): how long
to wait for at least one remote evaluation EnvRunner to recover.
- `evaluation_error_after_n_consecutive_skips` (bool, default False):
Configured if or after how many skips we raise an error.

---------

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: anindyam1969 <amukherjee@kinetica.com>
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
…ross modes (ray-project#63128)

When all remote evaluation EnvRunners are unhealthy at the start of an
evaluation step, `Algorithm.evaluate()` previously did one of two thing:

- `evaluation_parallel_to_training=True`: fall back to the local eval
EnvRunner, which raises `ValueError: Cannot run on local evaluation
worker parallel to training!`. Hard-crashes a long training run.
- `evaluation_parallel_to_training=False`: silently fall back to the
local eval EnvRunner. "Works" but the eval numbers are quietly produced
by a different EnvRunner. This can be handy, but it is also silent and
the local env runner may be configured differently than evaluation
runners. This is an example of us trying to be clever but possibly,
silently, corrupting experiment results.

With the changes in this PR, both behaviors are gone. RLlib never
silently falls back to local eval in the failure case anymore. Two new
orthogonal config knobs on `AlgorithmConfig.evaluation()` control the
behavior:

- `evaluation_unhealthy_workers_timeout_s` (float, default 0): how long
to wait for at least one remote evaluation EnvRunner to recover.
- `evaluation_error_after_n_consecutive_skips` (bool, default False):
Configured if or after how many skips we raise an error.

---------

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
alexandrplashchinsky pushed a commit to alexandrplashchinsky/ray-alex that referenced this pull request May 29, 2026
…ross modes (ray-project#63128)

When all remote evaluation EnvRunners are unhealthy at the start of an
evaluation step, `Algorithm.evaluate()` previously did one of two thing:

- `evaluation_parallel_to_training=True`: fall back to the local eval
EnvRunner, which raises `ValueError: Cannot run on local evaluation
worker parallel to training!`. Hard-crashes a long training run.
- `evaluation_parallel_to_training=False`: silently fall back to the
local eval EnvRunner. "Works" but the eval numbers are quietly produced
by a different EnvRunner. This can be handy, but it is also silent and
the local env runner may be configured differently than evaluation
runners. This is an example of us trying to be clever but possibly,
silently, corrupting experiment results.

With the changes in this PR, both behaviors are gone. RLlib never
silently falls back to local eval in the failure case anymore. Two new
orthogonal config knobs on `AlgorithmConfig.evaluation()` control the
behavior:

- `evaluation_unhealthy_workers_timeout_s` (float, default 0): how long
to wait for at least one remote evaluation EnvRunner to recover.
- `evaluation_error_after_n_consecutive_skips` (bool, default False):
Configured if or after how many skips we raise an error.

---------

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests rllib RLlib related issues rllib-algorithms An RLlib algorithm/Trainer is not learning.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants