[LLM] Fix Multi-Host TPU gang-scheduling and replica lifecycle by ryanaoleary · Pull Request #63101 · ray-project/ray

ryanaoleary · 2026-05-04T12:41:37Z

Description

This PR is a follow-up to #62941. I realized we don't need explicit resource requests in the Ray remote options for TPU, because the placement group (PG) is already provisioned and the ray.remote call in _prepare_engine_config uses PlacementGroupSchedulingStrategy. The current code requests fractional TPUs, which triggers Ray Core validation errors and can cause a resource conflict or deadlock if the remote task doesn't release the TPU resource before the workers start to schedule. We now pass an empty resource dict and pin the task using a label_selector.

This PR also includes several other structural fixes I found necessary when testing the latest vllm-tpu image with Ray and Gemma 4. These are:

Removed the deferred placement group logic for TPUs in favor of appending the required bundle_label_selectors for TPU. Deferring the PG meant the replica actor could schedule on the head node, which broke node-local model loading and env var assumptions in tpu_inference. In some runs, the Serve replica would schedule to the CPU head pod, causing errors. We now instead inject the ray.io/tpu-topology and ray.io/accelerator-type labels into the deployment options, and _default_create_placement_group intercepts these to provision a SlicePlacementGroup upfront.
Fixed libtpu lockfile crashes via per-host bundles: TPUAccelerator.default_bundles() previously defaulted to 1 TPU per bundle. It now dynamically calculates and returns bundles grouped by chips_per_host; per the examples I see in tpu_inference, this will be the common pattern. Users can still request chip-level bundles by overriding bundle_per_worker.
Fixed a Pydantic v2 incompatibility in llm_server: I was seeing ValueError: "ChatCompletionRequest" object has no field "request_id" when sending a request to the model server. The fix wraps the request.request_id = request_id injection in a try/except ValueError: pass block to respect strict OpenAI API schemas.
Removed __del__ from TPUAccelerator because it resulted in nondeterministic garbage collection of the Slice PG while it was still being used (e.g., during child download tasks). We should just rely on shutdown being called when the replica scales down gracefully.
Updated SlicePlacementGroup to respect a user-provided bundle_label_selector and merge it with the dynamic TPU-slice-specific selectors.
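
The per-host bundle grouping described in the list above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the helper name `per_host_bundles`, its parameters, and the assumption that a topology string such as "4x4" multiplies out to the slice's total chip count are all hypothetical.

```python
from math import prod
from typing import Dict, List


def per_host_bundles(topology: str, chips_per_host: int) -> List[Dict[str, int]]:
    """Return one bundle per TPU host, each requesting every chip on that host.

    Hypothetical helper: assumes the topology string (e.g. "4x4") multiplies
    out to the total number of chips in the slice.
    """
    total_chips = prod(int(dim) for dim in topology.split("x"))
    if chips_per_host <= 0 or total_chips % chips_per_host != 0:
        raise ValueError(
            f"{total_chips} chips cannot be split evenly into hosts of "
            f"{chips_per_host} chips"
        )
    num_hosts = total_chips // chips_per_host
    # One whole-host bundle per host, instead of one bundle per chip.
    return [{"TPU": chips_per_host} for _ in range(num_hosts)]


bundles = per_host_bundles("4x4", chips_per_host=4)
print(bundles)
```

With a 16-chip slice and 4 chips per host, this yields four bundles of 4 TPUs each, whereas the previous default would have produced sixteen single-chip bundles.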

Related issues

Additional information

Signed-off-by: ryanaoleary <ryanaoleary@google.com>

ryanaoleary · 2026-05-04T12:42:06Z

@jeffreywang-anyscale some more bug fixes I came across in my manual testing with Ray LLM and vllm-tpu

gemini-code-assist

Code Review

This pull request updates the TPU accelerator configuration to use label selectors, removes explicit TPU resource requests, and improves request ID handling for Pydantic v2 compatibility. It also ensures placement groups are ready before initializing the vLLM engine. Reviewers identified that the hasattr check in the request ID logic may be redundant and could prevent necessary updates. Additionally, the removal of the accelerator_type key from the TPU options results in test failures, and the use of label_selector may cause issues in non-Kubernetes environments.
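
For context, the request ID handling under review can be sketched as below. This is a minimal sketch under stated assumptions: `StrictRequest` is a stdlib stand-in that mimics the ValueError quoted in the PR description, not a real Pydantic model, and `attach_request_id` is a hypothetical helper name.

```python
import uuid


class StrictRequest:
    """Stand-in for a strict schema model (not a real Pydantic class).

    Like the error quoted in the description, it raises ValueError when an
    undeclared attribute is assigned.
    """

    _fields = ("model", "messages")

    def __init__(self, model, messages):
        object.__setattr__(self, "model", model)
        object.__setattr__(self, "messages", messages)

    def __setattr__(self, name, value):
        if name not in self._fields:
            raise ValueError(
                f'"{type(self).__name__}" object has no field "{name}"'
            )
        object.__setattr__(self, name, value)


def attach_request_id(request, request_id: str) -> str:
    """Best-effort injection: keep the ID even if the schema rejects the field."""
    try:
        request.request_id = request_id
    except ValueError:
        # The schema has no request_id field; callers use the returned ID instead.
        pass
    return request_id


req = StrictRequest(model="example-model", messages=[])
rid = attach_request_id(req, f"chatcmpl-{uuid.uuid4().hex}")
print(rid, hasattr(req, "request_id"))
```

Unlike a `hasattr` guard, the try/except form always attempts the assignment, so models that do declare `request_id` still get updated.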

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Reviewed by Cursor Bugbot for commit 4d335a0.

cursor · 2026-05-05T11:57:27Z


-        request_id = request.request_id or f"chatcmpl-{random.randint(1000, 9999)}"
+        request_id = (
+            (raw_request_info.request_id if raw_request_info else None)


Mock accesses nonexistent request_id on RawRequestInfo

Medium Severity

RawRequestInfo is a dataclass with only a headers field — it does not have a request_id attribute. Accessing raw_request_info.request_id will raise AttributeError whenever raw_request_info is not None. This occurs in the HTTP ingress path where RawRequestInfo.from_starlette_request() creates an instance and passes it to the engine's chat method.
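
The failure mode described here can be reproduced with a small stand-in. This is a sketch only: the `RawRequestInfo` below is a local dataclass modeled on the comment's description (a single `headers` field), and `safe_request_id` shows one possible guard, not necessarily the fix the author adopted.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class RawRequestInfo:
    """Local stand-in with only a headers field, as the comment describes."""

    headers: Dict[str, str] = field(default_factory=dict)


def unsafe_request_id(info: Optional[RawRequestInfo]) -> Optional[str]:
    # Mirrors the pattern in the diff: raises whenever info is not None.
    return info.request_id if info else None


def safe_request_id(info: Optional[RawRequestInfo]) -> Optional[str]:
    # getattr with a default tolerates the missing attribute.
    return getattr(info, "request_id", None) if info is not None else None


info = RawRequestInfo(headers={"content-type": "application/json"})
try:
    unsafe_request_id(info)
except AttributeError as exc:
    print("AttributeError:", exc)
print(safe_request_id(info))
```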

cursor · 2026-05-05T11:57:28Z

+            name=request.name,
+            lifetime="detached",
+        )
+        return slice_pg.placement_group


SlicePG wrapper lost, head PGs never cleaned up

Medium Severity

The slice_pg wrapper created by slice_placement_group(...) is discarded after returning only slice_pg.placement_group. The wrapper holds references to internal head placement groups used to reserve the TPU slice. Since nothing stores the wrapper and SlicePlacementGroup has no __del__, the head PGs are never shut down when the deployment scales down — causing a resource leak that blocks future TPU slice reservations.

ryanaoleary · 2026-05-07T00:28:02Z

@jeffreywang-anyscale this PR is becoming quite large and I realized there are issues with how the Serve deployment works - so I'm splitting it into smaller, targeted PRs that I'll link here. I plan to close this PR after the changes are split.

jeffreywang-anyscale · 2026-05-07T00:33:16Z

thanks, lemme know when your PRs are ready for review :)

ryanaoleary · 2026-05-07T04:03:29Z

Great thanks again, here are the updated PRs:

[LLM] Fix for pydantic v2 - wrap request.request_id assignment in try/except #63169 - pydantic fix
[Core][TPU] Improve lifecycle handling of SlicePlacementGroup and support explicit bundle_label_selector #63171 - fix in the tpu util for the clean-up logic and an unrelated issue I found, I already asked Andrew to review this one
[LLM] Add per-host bundles default and fix fractional TPUs in default bundles for TPUAccelerator in Ray LLM #63177 - fix the Ray core validation error for the tpu resource in get_remote_options and default to per-host bundles, I think this one is the most urgent and could be merged standalone of the prior 2
[Serve][2/N] Implement AcceleratorConfig to enable custom scheduling logic for accelerators with Serve deployments #63179 - add an AcceleratorConfig class to Serve to enable creating the accelerator-specific placement group there rather than attempting to defer it to LLMConfig when the replica starts
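
As a rough illustration of the `bundle_label_selector` merging mentioned for #63171, the sketch below merges per-bundle user selectors with slice-specific ones. The function name, the one-selector-per-bundle shape, and the rule that slice-specific keys win on conflict are assumptions made for this example; the label values are placeholders.

```python
from typing import Dict, List, Optional


def merge_bundle_label_selectors(
    slice_selectors: List[Dict[str, str]],
    user_selectors: Optional[List[Dict[str, str]]] = None,
) -> List[Dict[str, str]]:
    """Merge per-bundle user selectors with slice-specific ones.

    Assumption for this sketch: slice-specific keys win on conflict, since
    they are what ties a bundle to the reserved slice.
    """
    if user_selectors is None:
        return [dict(selector) for selector in slice_selectors]
    if len(user_selectors) != len(slice_selectors):
        raise ValueError("expected one user selector per bundle")
    return [
        {**user, **slice_specific}
        for user, slice_specific in zip(user_selectors, slice_selectors)
    ]


merged = merge_bundle_label_selectors(
    slice_selectors=[{"ray.io/tpu-topology": "4x4"}],
    user_selectors=[{"example.com/pool": "serving"}],
)
print(merged)
```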

[LLM] Update TPU ray remote options to use PG

2585698

Signed-off-by: ryanaoleary <ryanaoleary@google.com>

ryanaoleary requested a review from a team as a code owner May 4, 2026 12:41

gemini-code-assist Bot reviewed May 4, 2026

cursor Bot reviewed May 4, 2026

Comment thread python/ray/llm/_internal/serve/core/configs/accelerators.py

Comment thread python/ray/llm/_internal/serve/core/server/llm_server.py

ray-gardener Bot added the serve, core, and community-contribution labels May 4, 2026

Remove deferred PG logic and ensure Serve replica actor runs on co-located TPU slice

86fa1cb

Signed-off-by: ryanaoleary <ryanaoleary@google.com>

ryanaoleary force-pushed the update-tpu-ray-remote-options branch from 9c63137 to 86fa1cb Compare May 5, 2026 01:46

ryanaoleary requested a review from a team as a code owner May 5, 2026 01:46

ryanaoleary and others added 2 commits May 4, 2026 19:09

Merge branch 'master' into update-tpu-ray-remote-options

58d9594

Fix tests and rename function

8b9284d

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>

cursor Bot reviewed May 5, 2026

Comment thread python/ray/serve/_private/utils.py Outdated

ryanaoleary changed the title ~~[LLM] Update TPU ray remote options to use PG~~ [LLM] Fix Multi-Host TPU gang-scheduling and replica lifecycle May 5, 2026

ryanaoleary and others added 3 commits May 5, 2026 11:05

Ensure TPU bundles do not copy CPU driver bundle

1bfbd1e

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>

Merge branch 'master' into update-tpu-ray-remote-options

b9d5088

Rename config functions to be more descriptive

6da9299

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>

cursor Bot reviewed May 5, 2026

Comment thread python/ray/serve/_private/default_impl.py

ryanaoleary added 2 commits May 5, 2026 11:19

fix comments and var names to be more accurate/descriptive

70d0635

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>

Add detached lifetime arg to slice pg creation

4d335a0

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>

cursor Bot reviewed May 5, 2026

jeffreywang-anyscale self-assigned this May 5, 2026

ryanaoleary closed this May 7, 2026

ryanaoleary wants to merge 9 commits into ray-project:master from ryanaoleary:update-tpu-ray-remote-options
