feat: allow controlling startup warmup tokens #884
Open
malaiwah wants to merge 1 commit into
CUDA validation added on an RTX 5090 host.

Environment: […]

I used a validation-only […]

Backend warmup tests with CUDA enabled:

    cargo test -p text-embeddings-backend \
      --no-default-features \
      --features candle,cuda,clap

Result: […]

Router CUDA build check, using the backend's transitive CUDA feature because […]:

    cargo check -p text-embeddings-router \
      --no-default-features \
      --features candle,http,text-embeddings-backend/cuda

Result: […]

CLI surface check:

    cargo run -p text-embeddings-router \
      --no-default-features \
      --features candle,http,text-embeddings-backend/cuda \
      -- --help

Result: help output includes the new option: […]

So the warmup-token tests and router CLI compile cleanly with CUDA enabled.
What does this PR do?
Adds a `--warmup-tokens` / `WARMUP_TOKENS` option to control the explicit startup warmup pass.

- […] `min(max_input_length, max_batch_tokens)` […] `max_batch_tokens`
- `--warmup-tokens 0` skips the explicit warmup pass.
- `--warmup-tokens N` runs a smaller synthetic warmup with `N` tokens.
- Values above `--max-batch-tokens` are rejected to avoid startup-only shapes larger than production batching limits.
- […] `0` still skips before that warmup is entered.

This addresses #819, where CPU/ORT Qwen3 ONNX startup was dominated by the warmup pass even after lowering `--max-batch-tokens`.

Why
Warmup is valuable on GPU because it exercises production batch sizes and can trigger kernel compilation / memory allocation. On CPU spot deployments, especially ORT/ONNX, the same synthetic forward pass can dominate cold start time.
This lets users choose a smaller smoke-test warmup or skip the explicit warmup pass entirely without also lowering `--max-batch-tokens` and changing serving limits.

Validation
Static/local checks:
Runtime timing on macOS/arm64 CPU, release-built ORT/http router, using the model from #819:
Note: current TEI could not start directly from that repo config because the config contains GPT-style alias fields (`n_embd`, `n_layer`, `n_head`) that parse as duplicate aliases for existing Qwen3 fields. I used a temporary local copy with only those alias keys removed; the ONNX weights and tokenizer were unchanged.

Measured process start until `GET /health` succeeded:

- Columns: `--warmup-tokens 0`, `--warmup-tokens 64`
- `--max-batch-tokens 1024`: 3.163s, 1.049s, 1.590s
- `16384`: 1.594s, 1.590s

Before submitting
[…] `insta` snapshots? Added unit tests for warmup token selection and validation; no snapshots needed.
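
For reviewers, the selection and validation rules listed under "What does this PR do?" can be sketched roughly as below. This is a minimal illustration, not the code in this PR: `WarmupPlan`, `resolve_warmup_tokens`, and the `default_tokens` parameter are hypothetical names, and the default warmup size is passed in by the caller because the description above does not fully specify how it is derived.

```rust
/// Outcome of resolving the startup warmup size.
/// Hypothetical type for illustration only.
#[derive(Debug, PartialEq, Eq)]
pub enum WarmupPlan {
    /// Skip the explicit warmup pass (the `--warmup-tokens 0` case).
    Skip,
    /// Run one synthetic warmup pass with this many tokens.
    Run(usize),
}

/// Resolve the warmup size from an optional user request.
///
/// Hypothetical helper: `requested` stands in for the parsed
/// `--warmup-tokens` / `WARMUP_TOKENS` value, and `default_tokens` for
/// whatever size the existing warmup would use when the option is unset.
pub fn resolve_warmup_tokens(
    requested: Option<usize>,
    default_tokens: usize,
    max_batch_tokens: usize,
) -> Result<WarmupPlan, String> {
    match requested {
        // Option not set: keep the existing default behaviour.
        None => Ok(WarmupPlan::Run(default_tokens)),
        // Explicit zero: skip the warmup pass entirely.
        Some(0) => Ok(WarmupPlan::Skip),
        // Reject startup-only shapes larger than the serving limit.
        Some(n) if n > max_batch_tokens => Err(format!(
            "`--warmup-tokens` ({n}) must not exceed `--max-batch-tokens` ({max_batch_tokens})"
        )),
        // Any other value: run a smaller synthetic warmup.
        Some(n) => Ok(WarmupPlan::Run(n)),
    }
}
```

With this shape, `resolve_warmup_tokens(Some(64), 512, 1024)` yields `Ok(WarmupPlan::Run(64))`, while a request of `2048` against a `--max-batch-tokens` of `1024` returns an error before any model work starts.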