[LLM] Fix Multi-Host TPU gang-scheduling and replica lifecycle#63101

Closed
ryanaoleary wants to merge 9 commits into
ray-project:masterfrom
ryanaoleary:update-tpu-ray-remote-options

@ryanaoleary
Contributor

@ryanaoleary ryanaoleary commented May 4, 2026

Description

This PR is a follow-up to #62941. We don't need explicit resource requests in the Ray remote options for TPU: the PG is already provisioned, and the ray.remote call in _prepare_engine_config uses PlacementGroupSchedulingStrategy. The current code requests fractional TPUs, which triggers Ray Core validation errors and can cause a resource conflict or deadlock if the remote task doesn't release the TPU resource before the workers start to schedule. We now pass an empty resource dict and pin the task using a label_selector.

This PR also includes several other structural fixes I found necessary when testing the latest vllm-tpu image with Ray and Gemma 4. These are:

  • Removed the deferred placement group logic for TPUs in favor of appending the required bundle_label_selectors. Deferring the PG meant the replica actor could schedule on the head node; in some runs the Serve replica landed on the CPU head pod, which broke node-local model loading and the env var assumptions in tpu_inference. We now inject ray.io/tpu-topology and ray.io/accelerator-type labels into the deployment options, and _default_create_placement_group intercepts these to provision a SlicePlacementGroup upfront.

  • Fixed libtpu lockfile crashes via per-host bundles: TPUAccelerator.default_bundles() previously defaulted to 1 TPU per bundle. It now dynamically calculates and returns bundles grouped by chips_per_host, which, per the examples I see in tpu_inference, will be the common pattern. Users can still request chip-level bundles by overriding bundle_per_worker.

  • Fixed a Pydantic v2 incompatibility in llm_server: I was seeing ValueError: "ChatCompletionRequest" object has no field "request_id" when sending a request to the model server. The fix wraps the request.request_id = request_id injection in a try/except ValueError: pass block so that strict OpenAI API schemas are respected.

  • Removed __del__ from TPUAccelerator because it caused nondeterministic garbage collection of the Slice PG while it was still in use (e.g., during child download tasks). We should instead rely on shutdown being called when the replica scales down gracefully.

  • Updated SlicePlacementGroup to respect a user-provided bundle_label_selector and merge it with the dynamic, TPU-slice-specific selectors.
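
To make the per-host bundle and selector-merging items above concrete, here is a minimal, Ray-free sketch. The function names per_host_bundles and merge_selectors are hypothetical and are not the helpers in this PR; the topology string format ("4x4"), the label values, and the rule that slice labels win on a key conflict are all assumptions made for illustration.

```python
from math import prod
from typing import Dict, List, Optional


def per_host_bundles(topology: str, chips_per_host: int) -> List[Dict[str, int]]:
    """Return one bundle per TPU host, each requesting every chip on that host.

    Assumes the topology is written as "AxB" or "AxBxC" (hypothetical format).
    """
    total_chips = prod(int(dim) for dim in topology.lower().split("x"))
    if total_chips % chips_per_host != 0:
        raise ValueError(
            f"{total_chips} chips cannot be split into hosts of {chips_per_host}"
        )
    num_hosts = total_chips // chips_per_host
    return [{"TPU": chips_per_host} for _ in range(num_hosts)]


def merge_selectors(
    user_selectors: Optional[List[Dict[str, str]]],
    slice_selector: Dict[str, str],
    num_bundles: int,
) -> List[Dict[str, str]]:
    """Merge per-bundle user selectors with the slice-wide TPU selector.

    Assumption: on a key conflict the slice selector wins, so a user label
    cannot move a bundle off the reserved slice.
    """
    if user_selectors is None:
        user_selectors = [{} for _ in range(num_bundles)]
    if len(user_selectors) != num_bundles:
        raise ValueError("expected one user selector per bundle")
    return [{**user, **slice_selector} for user in user_selectors]


bundles = per_host_bundles("4x4", chips_per_host=4)
selectors = merge_selectors(
    [{"example.com/pool": "serving"}] * len(bundles),  # hypothetical user label
    {"ray.io/tpu-topology": "4x4", "ray.io/accelerator-type": "TPU-V6E"},
    num_bundles=len(bundles),
)
```

With a 16-chip slice and 4 chips per host this yields four bundles of {"TPU": 4}, each carrying both the user label and the two ray.io labels named in the description.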

Signed-off-by: ryanaoleary <ryanaoleary@google.com>
@ryanaoleary ryanaoleary requested a review from a team as a code owner May 4, 2026 12:41
@ryanaoleary
Contributor Author

@jeffreywang-anyscale some more bug fixes I came across in my manual testing with Ray LLM and vllm-tpu

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the TPU accelerator configuration to use label selectors, removes explicit TPU resource requests, and improves request ID handling for Pydantic v2 compatibility. It also ensures placement groups are ready before initializing the vLLM engine. Reviewers identified that the hasattr check in the request ID logic may be redundant and could prevent necessary updates. Additionally, the removal of the accelerator_type key from the TPU options results in test failures, and the use of label_selector may cause issues in non-Kubernetes environments.
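
The try/except pattern mentioned in the description can be sketched without Pydantic installed. StrictRequest below is a hypothetical stand-in that only mimics how a Pydantic v2 model rejects assignment to an undeclared field; it is not ChatCompletionRequest, and try_attach_request_id is an illustrative name rather than the function in llm_server.

```python
from typing import Any, List


class StrictRequest:
    """Hypothetical stand-in for a strict schema model.

    Assigning to a name outside _fields raises ValueError, mirroring the
    error message quoted in the PR description.
    """

    _fields = ("model", "messages")

    def __init__(self, model: str, messages: List[Any]) -> None:
        self.model = model
        self.messages = messages

    def __setattr__(self, name: str, value: Any) -> None:
        if name not in self._fields:
            raise ValueError(
                f'"{type(self).__name__}" object has no field "{name}"'
            )
        object.__setattr__(self, name, value)


def try_attach_request_id(request: Any, request_id: str) -> bool:
    """Attach request_id when the schema allows it; report whether it stuck."""
    try:
        request.request_id = request_id
    except ValueError:
        # Strict schema: leave the request untouched and carry the ID separately.
        return False
    return True
```

Returning a bool instead of a bare pass lets the caller know it must thread the ID through another channel when the schema refuses the field; that return value is a design choice of this sketch, not something the PR states.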

Comment thread python/ray/llm/_internal/serve/core/server/llm_server.py
Comment thread python/ray/llm/tests/serve/cpu/configs/test_models.py
Comment thread python/ray/llm/tests/serve/cpu/deployments/llm/test_llm_engine_tpu.py Outdated
Comment thread python/ray/llm/_internal/serve/core/configs/accelerators.py
Comment thread python/ray/llm/_internal/serve/core/configs/accelerators.py
Comment thread python/ray/llm/_internal/serve/core/server/llm_server.py
@ray-gardener ray-gardener Bot added serve Ray Serve Related Issue core Issues that should be addressed in Ray Core community-contribution Contributed by the community labels May 4, 2026
…cated TPU slice

Signed-off-by: ryanaoleary <ryanaoleary@google.com>
@ryanaoleary ryanaoleary force-pushed the update-tpu-ray-remote-options branch from 9c63137 to 86fa1cb Compare May 5, 2026 01:46
@ryanaoleary ryanaoleary requested a review from a team as a code owner May 5, 2026 01:46
ryanaoleary and others added 2 commits May 4, 2026 19:09
Comment thread python/ray/serve/_private/utils.py Outdated
@ryanaoleary ryanaoleary changed the title [LLM] Update TPU ray remote options to use PG [LLM] Fix Multi-Host TPU gang-scheduling and replica lifecycle May 5, 2026
ryanaoleary and others added 3 commits May 5, 2026 11:05
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Comment thread python/ray/serve/_private/default_impl.py
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Reviewed by Cursor Bugbot for commit 4d335a0. Configure here.


request_id = request.request_id or f"chatcmpl-{random.randint(1000, 9999)}"
request_id = (
(raw_request_info.request_id if raw_request_info else None)

Mock accesses nonexistent request_id on RawRequestInfo

Medium Severity

RawRequestInfo is a dataclass with only a headers field — it does not have a request_id attribute. Accessing raw_request_info.request_id will raise AttributeError whenever raw_request_info is not None. This occurs in the HTTP ingress path where RawRequestInfo.from_starlette_request() creates an instance and passes it to the engine's chat method.
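
One defensive way to avoid the AttributeError described above is to read the attribute with getattr and a default. This is a sketch only: the RawRequestInfo class here is a local stand-in with the single headers field the comment describes, resolve_request_id is a hypothetical helper, and the uuid-based fallback is swapped in for the random.randint suffix in the diff.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class RawRequestInfo:
    """Local stand-in: a headers-only dataclass, as the review describes."""

    headers: Dict[str, str] = field(default_factory=dict)


def resolve_request_id(
    request: Any, raw_request_info: Optional[RawRequestInfo]
) -> str:
    """Pick a request ID without assuming either object defines one."""
    # getattr with a default returns None for a missing attribute and
    # also for raw_request_info being None, so no AttributeError is raised.
    from_raw = getattr(raw_request_info, "request_id", None)
    from_body = getattr(request, "request_id", None)
    return from_raw or from_body or f"chatcmpl-{uuid.uuid4().hex}"
```

The precedence (transport-level ID first, then the request body, then a generated ID) follows the order in the quoted diff lines; whether that order is intended is for the author to confirm.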


name=request.name,
lifetime="detached",
)
return slice_pg.placement_group

SlicePG wrapper lost, head PGs never cleaned up

Medium Severity

The slice_pg wrapper created by slice_placement_group(...) is discarded after returning only slice_pg.placement_group. The wrapper holds references to internal head placement groups used to reserve the TPU slice. Since nothing stores the wrapper and SlicePlacementGroup has no __del__, the head PGs are never shut down when the deployment scales down — causing a resource leak that blocks future TPU slice reservations.
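
One possible shape for a fix is to keep the wrapper in a registry keyed by placement group name and release it explicitly on scale-down. The sketch below is hypothetical throughout: SlicePGRegistry is not part of Ray, and it assumes only that the wrapper exposes a shutdown() method that tears down both the worker and head placement groups.

```python
from typing import Dict, Protocol


class SliceHandle(Protocol):
    """Anything with a shutdown() method; assumed to cover the slice wrapper."""

    def shutdown(self) -> None: ...


class SlicePGRegistry:
    """Holds slice wrappers so their head PGs can be shut down later."""

    def __init__(self) -> None:
        self._handles: Dict[str, SliceHandle] = {}

    def register(self, name: str, handle: SliceHandle) -> None:
        # Keeping a reference here is the point: the wrapper stays reachable
        # after only its inner placement group is handed to the caller.
        self._handles[name] = handle

    def release(self, name: str) -> bool:
        """Shut down and forget the wrapper; return False if it was unknown."""
        handle = self._handles.pop(name, None)
        if handle is None:
            return False
        handle.shutdown()
        return True
```

In this sketch the creation path would call register(request.name, slice_pg) before returning slice_pg.placement_group, and the deployment teardown path would call release with the same name.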


@jeffreywang-anyscale jeffreywang-anyscale self-assigned this May 5, 2026
@ryanaoleary
Contributor Author

@jeffreywang-anyscale this PR is becoming quite large and I realized there are issues with how the Serve deployment works - so I'm splitting it into smaller, targeted PRs that I'll link here. I plan to close this PR after the changes are split.

@jeffreywang-anyscale
Contributor

> @jeffreywang-anyscale this PR is becoming quite large and I realized there are issues with how the Serve deployment works - so I'm splitting it into smaller, targeted PRs that I'll link here. I plan to close this PR after the changes are split.

thanks, lemme know when your PRs are ready for review :)

@ryanaoleary
Contributor Author

> > @jeffreywang-anyscale this PR is becoming quite large and I realized there are issues with how the Serve deployment works - so I'm splitting it into smaller, targeted PRs that I'll link here. I plan to close this PR after the changes are split.
>
> thanks, lemme know when your PRs are ready for review :)

Great thanks again, here are the updated PRs:

@ryanaoleary ryanaoleary closed this May 7, 2026
Labels

community-contribution Contributed by the community core Issues that should be addressed in Ray Core serve Ray Serve Related Issue

2 participants