fix(vllm): use max prompt length for batch context-length check #1209
Closed
JKDasondee wants to merge 1 commit into huggingface:main from
Conversation
`context_size` was computed as `len(inputs[0])`, checking only the first prompt in the batch. Any prompt longer than the first would bypass truncation, causing vLLM to receive sequences exceeding `max_model_len`. Fixes huggingface#1204.
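To make the failure mode concrete, here is a minimal sketch (the batch contents are invented for illustration; only the two `context_size` expressions come from the change itself):

```python
# inputs: a batch of tokenized prompts (lists of token ids)
inputs = [[1] * 10, [1] * 5000]  # the second prompt is far longer than the first

# Before: measures only the first prompt, so the 5000-token prompt
# silently escapes the truncation check.
context_size = len(inputs[0])  # -> 10

# After: measures the longest prompt, so the check covers the whole batch.
context_size = max(len(inp) for inp in inputs)  # -> 5000
```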
Author
Closing: #1205 addresses the same issue with broader improvements. Sorry for the duplicate.
In `VLLMModel._greedy_until`, the context-length check before truncation used `len(inputs[0])` (the length of only the first prompt in the batch) instead of the maximum length across all prompts. For batches with variable-length prompts, any prompt longer than the first would silently bypass truncation and be passed to vLLM with a token count exceeding `max_model_len`, causing runtime errors or silent truncation inside the engine.

The fix replaces `len(inputs[0])` with `max(len(inp) for inp in inputs)` so the check is conservative over the entire batch, and updates the related warning messages to reflect that the reported size is the batch maximum.

Fixes #1204.
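For context, here is a minimal sketch of how the corrected check might sit inside the truncation path. The function name, logger, and left-truncation logic are assumptions for illustration, not quoted from the diff; only the before/after `context_size` expressions are taken from this change:

```python
import logging

logger = logging.getLogger(__name__)


def truncate_batch(
    inputs: list[list[int]], max_model_len: int, max_new_tokens: int
) -> list[list[int]]:
    """Truncate tokenized prompts so prompt + generation fits within max_model_len."""
    # Old check: context_size = len(inputs[0]) measured only the first prompt.
    # New check: measure the longest prompt so no sequence in the batch
    # can exceed the engine's limit.
    context_size = max(len(inp) for inp in inputs)

    if context_size + max_new_tokens > max_model_len:
        logger.warning(
            f"Longest context in batch ({context_size} tokens) plus {max_new_tokens} "
            f"new tokens exceeds max_model_len ({max_model_len}); truncating prompts."
        )
        keep = max_model_len - max_new_tokens
        inputs = [inp[-keep:] for inp in inputs]  # keep the most recent tokens
    return inputs
```

Because the check uses the batch maximum, no over-long prompt can slip past it, and slicing with `inp[-keep:]` leaves prompts already within budget untouched, since slicing a shorter list from the left returns it whole.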