[LLM] Fix Multi-Host TPU gang-scheduling and replica lifecycle#63101
[LLM] Fix Multi-Host TPU gang-scheduling and replica lifecycle#63101ryanaoleary wants to merge 9 commits into
Conversation
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
|
@jeffreywang-anyscale some more bug fixes I came across in my manual testing with Ray LLM and vllm-tpu |
There was a problem hiding this comment.
Code Review
This pull request updates the TPU accelerator configuration to use label selectors, removes explicit TPU resource requests, and improves request ID handling for Pydantic v2 compatibility. It also ensures placement groups are ready before initializing the vLLM engine. Reviewers identified that the hasattr check in the request ID logic may be redundant and could prevent necessary updates. Additionally, the removal of the accelerator_type key from the TPU options results in test failures, and the use of label_selector may cause issues in non-Kubernetes environments.
…cated TPU slice Signed-off-by: ryanaoleary <ryanaoleary@google.com>
9c63137 to
86fa1cb
Compare
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 4d335a0. Configure here.
|
|
||
| request_id = request.request_id or f"chatcmpl-{random.randint(1000, 9999)}" | ||
| request_id = ( | ||
| (raw_request_info.request_id if raw_request_info else None) |
There was a problem hiding this comment.
Mock accesses nonexistent request_id on RawRequestInfo
Medium Severity
RawRequestInfo is a dataclass with only a headers field — it does not have a request_id attribute. Accessing raw_request_info.request_id will raise AttributeError whenever raw_request_info is not None. This occurs in the HTTP ingress path where RawRequestInfo.from_starlette_request() creates an instance and passes it to the engine's chat method.
Reviewed by Cursor Bugbot for commit 4d335a0. Configure here.
| name=request.name, | ||
| lifetime="detached", | ||
| ) | ||
| return slice_pg.placement_group |
There was a problem hiding this comment.
SlicePG wrapper lost, head PGs never cleaned up
Medium Severity
The slice_pg wrapper created by slice_placement_group(...) is discarded after returning only slice_pg.placement_group. The wrapper holds references to internal head placement groups used to reserve the TPU slice. Since nothing stores the wrapper and SlicePlacementGroup has no __del__, the head PGs are never shut down when the deployment scales down — causing a resource leak that blocks future TPU slice reservations.
Reviewed by Cursor Bugbot for commit 4d335a0. Configure here.
|
@jeffreywang-anyscale this PR is becoming quite large and I realized there are issues with how the Serve deployment works - so I'm splitting it into smaller, targeted PRs that I'll link here. I plan to close this PR after the changes are split. |
thanks, lemme know when your PRs are ready for review :) |
Great thanks again, here are the updated PRs:
|


Description
This PR is a follow-up to #62941. I realized we don't need explicit resource requests in the ray remote options for TPU because the PG is already provisioned, and the ray.remote call in
_prepare_engine_configusesPlacementGroupSchedulingStrategy. The current code requests fractional TPUs which triggers Ray Core validation errors, and it can cause a resource conflict/deadlock if the remote task doesn't release the TPU resource before the workers start to schedule. We now pass an empty resource dict and pin the task using alabel_selector.This PR also includes several other structural fixes I found necessary when testing the latest vllm-tpu image with Ray and Gemma 4. These are:
Removed deferred placement groups logic for TPUs in favor of appending required
bundle_label_selectorsfor TPU. Deferring the PG meant the replica actor could schedule on the head node, which broke node-local model loading and env var assumptions in tpu_inference. In some runs, the Serve replica would schedule to the CPU head pod - causing errors. We now instead injectray.io/tpu-topologyandray.io/accelerator-typelabels into the deployment options, and_default_create_placement_groupintercepts these to provision aSlicePlacementGroupupfront.Fixed libtpu lockfile crashes via per-host bundles -
TPUAccelerator.default_bundles()was previously defaulting to 1 TPU per bundle. It now dynamically calculates and returns bundles grouped bychips_per_host- per the examples I see intpu_inferencethis will be the common pattern. Users can still request chip-level bundles by overridingbundle_per_worker.Pydantic v2 incompatibility in llm_server: I was seeing
ValueError: "ChatCompletionRequest" object has no field "request_id"when sending a request to the model server. It was fixed by wrapping the request.request_id = request_id injection in atry/except ValueError: passblock to respect strict OpenAI API schemas.Removed
__del__fromTPUAcceleratorbecause it resulted in nondeterministic garbage collection of the Slice PG when it was still being used (e.g., during child download tasks). We should just rely on shutdown being called when the replica scales down gracefully.Update
SlicePlacementGroupto respect user passed inbundle_label_selectorand merge it with the dynamic TPU slice specific selectors.Related issues
Additional information