[Data][LLM] Fix max_pending_requests default to track vLLM's GPU-dependent max_num_seqs #62918
Merged
kouroshHakha merged 2 commits on May 14, 2026
…ndent max_num_seqs
`vLLMEngineStageUDF` computed the default `max_pending_requests` as
`ceil(1.1 * engine_kwargs.get("max_num_seqs", 128) * pp_size)`. The
hardcoded `128` fallback does not match vLLM's actual default, which is
GPU-dependent via `AsyncEngineArgs.get_batch_defaults`:
- A10G (<70 GiB) / A100: 256
- H100 / MI300x (>=70 GiB, non-A100): 1024
- CPU: 256 * world_size
When users don't set `max_num_seqs` explicitly (the common case), the
semaphore silently caps inflight requests far below vLLM's real capacity
(e.g. ~141 vs 1024 on H100, ~14% utilization).
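The arithmetic behind those numbers can be checked with a small sketch. It reproduces the old formula quoted above with `pp_size=1`; the helper name `legacy_default_max_pending` and the lookup table are illustrative only and are not part of Ray or vLLM.

```python
import math

# GPU-dependent defaults as listed above (CPU case omitted for brevity).
# This table is an illustration, not a vLLM data structure.
VLLM_DEFAULT_MAX_NUM_SEQS = {"A10G": 256, "A100": 256, "H100": 1024, "MI300x": 1024}


def legacy_default_max_pending(engine_kwargs: dict, pp_size: int = 1) -> int:
    """Old UDF formula, including the hardcoded 128 fallback."""
    return math.ceil(1.1 * engine_kwargs.get("max_num_seqs", 128) * pp_size)


# The common case: the user did not set max_num_seqs.
cap = legacy_default_max_pending({})

for gpu, max_num_seqs in VLLM_DEFAULT_MAX_NUM_SEQS.items():
    # Ratio of the semaphore cap to what vLLM would actually schedule.
    print(f"{gpu}: cap={cap}, vLLM max_num_seqs={max_num_seqs}, "
          f"ratio={cap / max_num_seqs:.0%}")
```

With an empty `engine_kwargs` the cap is 141 regardless of GPU, which is where the roughly 14% figure for H100 comes from.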
Move the default resolution into `vLLMEngineWrapper`, which already calls
`AsyncEngineArgs.create_engine_config()` and has access to the resolved
`scheduler_config.max_num_seqs` and `parallel_config.pipeline_parallel_size`.
The UDF passes `max_pending_requests=None` through as-is and reads the
resolved value back from the wrapper.
Behavior:
- `max_pending_requests=None` (default): auto-resolve from vLLM config
- positive int: explicit limit (unchanged)
- non-positive (e.g. -1): disable semaphore (unchanged)
This aligns with the `ProcessorConfig.max_pending_requests` field's
stated intent: "If not specified, will use the default value from the
backend engine."
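A minimal sketch of the three cases listed above, assuming the wrapper keeps the 1.1x headroom factor from the old formula (the message does not state this explicitly). The function name `resolve_max_pending_requests` is hypothetical, and the `SimpleNamespace` object only stands in for the config returned by `create_engine_config()`; this is not the code merged in the PR.

```python
import math
from types import SimpleNamespace
from typing import Optional


def resolve_max_pending_requests(
    max_pending_requests: Optional[int], engine_config
) -> Optional[int]:
    """Return the in-flight request cap, or None when the semaphore is disabled."""
    if max_pending_requests is None:
        # Auto-resolve from the values vLLM has already resolved.
        max_num_seqs = engine_config.scheduler_config.max_num_seqs
        pp_size = engine_config.parallel_config.pipeline_parallel_size
        # Assumption: same 1.1x headroom as the previous UDF default.
        return math.ceil(1.1 * max_num_seqs * pp_size)
    if max_pending_requests > 0:
        # Explicit positive limit is used as-is.
        return max_pending_requests
    # Non-positive values (e.g. -1) disable the semaphore.
    return None


# Stand-in for a resolved engine config on an H100-class GPU.
h100_like_config = SimpleNamespace(
    scheduler_config=SimpleNamespace(max_num_seqs=1024),
    parallel_config=SimpleNamespace(pipeline_parallel_size=1),
)

auto_cap = resolve_max_pending_requests(None, h100_like_config)
explicit_cap = resolve_max_pending_requests(2, h100_like_config)
disabled_cap = resolve_max_pending_requests(-1, h100_like_config)
```

Returning `None` for the disabled case is a design choice of this sketch; it lets the caller skip creating a semaphore altogether.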
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Code Review
This pull request refactors the calculation of max_pending_requests within the vLLMEngineStage. The logic for resolving the default value has been moved from the UDF to the vLLMEngineWrapper, where it now dynamically calculates the limit based on vLLM's resolved engine configuration (specifically max_num_seqs and pipeline_parallel_size). This change ensures that the request concurrency limit correctly tracks GPU-dependent capacities rather than relying on hardcoded defaults. I have no feedback to provide as there were no review comments.
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
TruongQuangPhat pushed a commit to cyhapun/ray-fix-issue that referenced this pull request on May 27, 2026:
…ndent max_num_seqs (ray-project#62918)
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: phattruong <23120318@student.hcmus.edu.vn>
alexandrplashchinsky pushed a commit to alexandrplashchinsky/ray-alex that referenced this pull request on May 29, 2026:
…ndent max_num_seqs (ray-project#62918)
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>
Summary
`vLLMEngineStageUDF` computed its default `max_pending_requests` from `engine_kwargs.get("max_num_seqs", 128) * pp_size * 1.1`. The hardcoded `128` fallback is stale: vLLM's actual default for `max_num_seqs` is GPU-dependent via `AsyncEngineArgs.get_batch_defaults` (`vllm/engine/arg_utils.py`):
- A10G (<70 GiB) / A100: 256
- H100 / MI300x (>=70 GiB, non-A100): 1024
- CPU: 256 * world_size
When users don't set `max_num_seqs` explicitly (the common case), the semaphore silently caps inflight requests far below vLLM's real capacity (e.g. ~141 vs 1024 on H100, ~14% utilization).
Fix
Move the default resolution into `vLLMEngineWrapper`, which already calls `AsyncEngineArgs.create_engine_config()` and has access to the resolved `scheduler_config.max_num_seqs` and `parallel_config.pipeline_parallel_size`. The UDF passes `max_pending_requests=None` through and reads the resolved value back.
Semantics:
- `max_pending_requests=None` (default): auto-resolve from vLLM's resolved engine config
- positive int: explicit limit (unchanged)
- non-positive (e.g. `-1`): disable semaphore (unchanged)
This aligns with `ProcessorConfig.max_pending_requests`'s stated intent: "If not specified, will use the default value from the backend engine."
Test plan
- `test_vllm_engine_udf_basic` updated to reflect that the UDF now reads `max_pending_requests` from the (mocked) wrapper and passes `None` through when the caller didn't supply a value.
- … `max_pending_requests=10` are unaffected (positive int path unchanged).
- `test_vllm_wrapper_semaphore` exercises `max_pending_requests=2` (positive int) and is unaffected.
🤖 Generated with Claude Code