Skip to content

[serve][3/3] Expose choose_replica/dispatch on deployment handles#63255

Merged
kouroshHakha merged 8 commits into
masterfrom
decouple-routing-primitives-3
May 12, 2026
Merged

[serve][3/3] Expose choose_replica/dispatch on deployment handles#63255
kouroshHakha merged 8 commits into
masterfrom
decouple-routing-primitives-3

Conversation

@jeffreywang-anyscale
Copy link
Copy Markdown
Contributor

@jeffreywang-anyscale jeffreywang-anyscale commented May 9, 2026

Description

Plumb choose_replica and dispatch through DeploymentHandle, not just the underlying AsyncioRouter which #63254 ends at.

Deployment handle

  • DeploymentHandle.choose_replica() / dispatch(selection, ...) validates that the selection belongs to this deployment.

SingletonThreadRouter / CurrentLoopRouter

  • SingletonThreadRouter.choose_replica: bridges the async context manager onto the router thread via run_coroutine_threadsafe, so all state mutations happen on the singleton router loop.
  • SingletonThreadRouter.dispatch and CurrentLoopRouter.dispatch: both call selection._mark_dispatched() synchronously before scheduling the dispatch coroutine, so the choose_replica __aexit__ cleanup can't race with the dispatch task. (The race would otherwise double-decrement the queue-length cache on CurrentLoopRouter, where __aexit__ runs synchronously without yielding.)
  • CurrentLoopRouter.choose_replica: thin async-with passthrough since it shares an event loop with the caller.

Test-only router

  • LocalRouter placeholder in local_testing_mode: choose_replica / dispatch raise NotImplementedError because the local-testing path invokes user callables directly and doesn't exercise router selection.

Benchmark

The overhead of choose_replica() + dispatch() over remote() is constant. This is consistent with the one extra run_coroutine_threadsafe round-trip: remote does one thread crossing, choose_dispatch does two (one for the __aenter__ bridge, one for the _dispatch_to_marked_selection task). Replica count doesn't change the delta, as the overhead is in the bridge, not routing.

Screenshot 2026-05-11 at 7 38 36 PM

Related issues

RFC: #59792
Original PR: #60865
Next PR: Metric changes #62356

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a two-step routing mechanism in Ray Serve, adding choose_replica and dispatch methods to the Router interface and DeploymentHandle. This allows for explicit replica selection and slot reservation prior to request execution, supporting both standard and streaming workflows. The implementation includes logic to bridge these calls across event loops in SingletonThreadRouter and provides validation to ensure selections are dispatched to the correct deployment. Feedback identifies a critical resource leak vulnerability in SingletonThreadRouter.choose_replica where cancellation during context entry could result in leaked request slots. It is recommended to wrap the context entry in a try...finally block to ensure proper cleanup even if the task is cancelled early.

Comment thread python/ray/serve/_private/router.py Outdated
@jeffreywang-anyscale jeffreywang-anyscale force-pushed the decouple-routing-primitives-2 branch from 82037f1 to f1a3114 Compare May 9, 2026 07:39
@jeffreywang-anyscale jeffreywang-anyscale force-pushed the decouple-routing-primitives-3 branch from e4b95cb to 905f7df Compare May 9, 2026 08:13
@jeffreywang-anyscale jeffreywang-anyscale added the go add ONLY when ready to merge, run all tests label May 9, 2026
@jeffreywang-anyscale jeffreywang-anyscale changed the title [Serve] Expose choose_replica/dispatch on deployment handles [serve][3/3] Expose choose_replica/dispatch on deployment handles May 9, 2026
@jeffreywang-anyscale jeffreywang-anyscale force-pushed the decouple-routing-primitives-3 branch from 905f7df to 14b4d44 Compare May 10, 2026 03:30
@jeffreywang-anyscale jeffreywang-anyscale marked this pull request as ready for review May 10, 2026 04:54
@jeffreywang-anyscale jeffreywang-anyscale requested a review from a team as a code owner May 10, 2026 04:54
Comment thread python/ray/serve/_private/router.py Outdated
@ray-gardener ray-gardener Bot added the serve Ray Serve Related Issue label May 10, 2026
@jeffreywang-anyscale jeffreywang-anyscale force-pushed the decouple-routing-primitives-2 branch from 8db5cab to 6afface Compare May 10, 2026 08:10
@jeffreywang-anyscale jeffreywang-anyscale force-pushed the decouple-routing-primitives-3 branch from 873ecc4 to 874103a Compare May 11, 2026 03:12
self._asyncio_router._dispatch_to_marked_selection(
selection, request_meta, *request_args, **request_kwargs
)
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Race causes double-fire of on_request_completed callback

Medium Severity

CurrentLoopRouter.dispatch and SingletonThreadRouter.dispatch schedule _dispatch_to_marked_selection as a task/future that hasn't executed yet when they return. If the choose_replica context exits before that task completes, the AsyncioRouter.choose_replica finally block sees _dispatched=True but _completion_callback_registered=False, and fires on_request_completed. Later, when the dispatch task actually runs and registers its done-callback, on_request_completed fires a second time upon request completion — violating the "exactly once" invariant stated in the comment.

Additional Locations (2)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 874103a. Configure here.

Base automatically changed from decouple-routing-primitives-2 to master May 11, 2026 06:12
jeffreywang-anyscale and others added 3 commits May 11, 2026 07:05
Co-Authored-By: machichima <nary12321@gmail.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
@jeffreywang-anyscale jeffreywang-anyscale force-pushed the decouple-routing-primitives-3 branch from 874103a to e95f4a6 Compare May 11, 2026 07:07
Copy link
Copy Markdown
Contributor

@kouroshHakha kouroshHakha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall the design is sound — the slot-reservation contract, the _mark_dispatched synchronous flag to prevent the choose_replica finally from racing the dispatch task, and the _completion_callback_registered exactly-once invariant are all well-thought-out. The CurrentLoopRouter wrapper is clean. One critical resource safety issue in the SingletonThreadRouter bridge needs fixing before merge; everything else is minor.

The open Gemini bot comment (on SingletonThreadRouter.choose_replica) is still unaddressed — I've investigated it and believe it's valid; details in the inline comment.

Note

This review was co-written with AI assistance (Claude Code).

Comment thread python/ray/serve/_private/router.py Outdated
Comment thread python/ray/serve/_private/router.py Outdated
Comment thread python/ray/serve/handle.py
Comment thread python/ray/serve/handle.py
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
…text() completes

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Comment thread python/ray/serve/handle.py
Comment thread python/ray/serve/handle.py
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Comment thread python/ray/serve/_private/router.py Outdated
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 3a4056c. Configure here.

Comment thread python/ray/serve/handle.py
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
@kouroshHakha
Copy link
Copy Markdown
Contributor

Before we merge, I'd like to understand the overhead of using choose_replica() + dispatch() as a drop-in for .remote(). Your other LLM benchmarks suggest parity, but it would be good to confirm that here with a small Serve microbenchmark.

The interesting case is SingletonThreadRouter (handle called from outside a deployment): .remote() does one run_coroutine_threadsafe round-trip, whereas choose_replica + dispatch does two — one for __aenter__ and one for _dispatch_to_marked_selection. If those thread crossings dominate routing latency you could see a meaningful delta.

Concretely, could you add a choose_dispatch mode to python/ray/serve/_private/benchmarks/handle_noop_latency.py — something like a second path in Benchmarker that does:

async with self._handle.choose_replica() as sel:
    await self._handle.dispatch(sel)

and run both modes (remote vs choose_dispatch) at e.g. num_replicas=1 and num_replicas=4? Even a single table of p50/p99 latency numbers pasted in the PR description would be enough — the goal is just to establish a documented baseline so that if someone later refactors assign_request to use these primitives, there's a clear regression target to compare against.

@jeffreywang-anyscale
Copy link
Copy Markdown
Contributor Author

Going to follow up nightly benchmarks with #63293.

Copy link
Copy Markdown
Contributor

@kouroshHakha kouroshHakha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok so we cannot just replace .remote() with choose + dispatch.

@kouroshHakha kouroshHakha merged commit cf7ae60 into master May 12, 2026
6 checks passed
@kouroshHakha kouroshHakha deleted the decouple-routing-primitives-3 branch May 12, 2026 03:16
dancingactor pushed a commit to dancingactor/ray that referenced this pull request May 13, 2026
…ray-project#63255)

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Co-authored-by: machichima <nary12321@gmail.com>
am-kinetica pushed a commit to kineticadb/ray that referenced this pull request May 14, 2026
…ray-project#63255)

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Co-authored-by: machichima <nary12321@gmail.com>
Signed-off-by: anindyam1969 <amukherjee@kinetica.com>
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
…ray-project#63255)

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Co-authored-by: machichima <nary12321@gmail.com>
alexandrplashchinsky pushed a commit to alexandrplashchinsky/ray-alex that referenced this pull request May 29, 2026
…ray-project#63255)

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Co-authored-by: machichima <nary12321@gmail.com>
Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>
abrarsheikh pushed a commit that referenced this pull request Jun 1, 2026
)

## Description

Following up on
#63255 (comment),
we'd like to show the delta between `choose_replica + dispatch` and
`remote` in our DB dashboards.

**Release test results** -- latencies in ms
(https://buildkite.com/ray-project/release/builds/92517/canvas?sid=019e1a3c-642f-4a42-867b-2818577272b6&tab=output)
| Percentile | `remote` | `choose_replica + dispatch` | Delta |
|---|---|---|---|
| p50 | 1.040 | 1.784 | +0.74 |
| p90 | 1.149 | 1.922 | +0.77 |
| p95 | 1.199 | 1.970 | +0.77 |
| p99 | 1.437 | 2.137 | +0.70 |

## Related issues
> Link related issues: "Fixes #1234", "Closes #1234", or "Related to
#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests serve Ray Serve Related Issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants