feat: allow controlling startup warmup tokens #884

Open
malaiwah wants to merge 1 commit into huggingface:main from malaiwah:codex/add-warmup-tokens-control

@malaiwah

What does this PR do?

Adds a --warmup-tokens / WARMUP_TOKENS option to control the explicit startup warmup pass.

  • Default behavior is unchanged:
    • padded backends use min(max_input_length, max_batch_tokens)
    • non-padded backends use max_batch_tokens
  • --warmup-tokens 0 skips the explicit warmup pass.
  • --warmup-tokens N runs a smaller synthetic warmup with N tokens.
  • Positive values above --max-batch-tokens are rejected to avoid startup-only shapes larger than production batching limits.
  • HPU and ROCm keep their existing backend-specific bucket warmup for positive values; 0 still skips before that warmup is entered.
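The selection rules listed above can be sketched as a single function. This is a minimal illustration, not the code from the PR diff: `WarmupPlan`, `resolve_warmup`, and the `padded_backend` flag are hypothetical names, and the sketch assumes that an unset option falls back to the existing defaults and that only values above `max_batch_tokens` are rejected.

```rust
/// Hypothetical result type for this sketch; not taken from the PR diff.
#[derive(Debug, PartialEq)]
pub enum WarmupPlan {
    /// `--warmup-tokens 0`: skip the explicit warmup pass.
    Skip,
    /// Run a synthetic warmup batch with this many tokens.
    Tokens(usize),
}

/// Resolve the warmup size from the optional user setting and the serving limits.
/// `warmup_tokens` is `None` when neither the flag nor the env var is set (assumed).
pub fn resolve_warmup(
    warmup_tokens: Option<usize>,
    padded_backend: bool,
    max_input_length: usize,
    max_batch_tokens: usize,
) -> Result<WarmupPlan, String> {
    match warmup_tokens {
        // Unset on a padded backend: min(max_input_length, max_batch_tokens).
        None if padded_backend => {
            Ok(WarmupPlan::Tokens(max_input_length.min(max_batch_tokens)))
        }
        // Unset on a non-padded backend: max_batch_tokens.
        None => Ok(WarmupPlan::Tokens(max_batch_tokens)),
        // Zero skips the explicit warmup pass entirely.
        Some(0) => Ok(WarmupPlan::Skip),
        // Reject startup-only shapes larger than the production batching limit.
        Some(n) if n > max_batch_tokens => Err(format!(
            "warmup tokens ({n}) must not exceed max batch tokens ({max_batch_tokens})"
        )),
        // Any other positive value is used as the synthetic warmup size.
        Some(n) => Ok(WarmupPlan::Tokens(n)),
    }
}
```

Under these assumptions, a caller would check for `WarmupPlan::Skip` before entering any backend-specific warmup, which matches the note that `0` skips before the HPU and ROCm bucket warmup is reached.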

This addresses #819, where CPU/ORT Qwen3 ONNX startup was dominated by the warmup pass even after lowering --max-batch-tokens.

Why

Warmup is valuable on GPU because it exercises production batch sizes and can trigger kernel compilation / memory allocation. On CPU spot deployments, especially ORT/ONNX, the same synthetic forward pass can dominate cold start time.

This lets users choose a smaller smoke-test warmup or skip the explicit warmup pass entirely without also lowering --max-batch-tokens and changing serving limits.

Validation

Static/local checks:

cargo test -p text-embeddings-backend --no-default-features --features ort,clap
cargo test -p text-embeddings-backend --no-default-features --features candle,clap
cargo check -p text-embeddings-router --no-default-features --features ort,http
cargo check -p text-embeddings-router --no-default-features --features candle,http
cargo run -p text-embeddings-router --no-default-features --features ort,http -- --help
git diff --check

Runtime timing on macOS/arm64 CPU, release-built ORT/http router, using the model from #819:

Svenni551/Qwen3-Embedding-0.6B-ONNX-INT8

Note: current TEI could not start directly from that repo config because the config contains GPT-style alias fields (n_embd, n_layer, n_head) that parse as duplicate aliases for existing Qwen3 fields. I used a temporary local copy with only those alias keys removed; the ONNX weights and tokenizer were unchanged.

Measured process start until GET /health succeeded:

| Setup | Default warmup | --warmup-tokens 0 | --warmup-tokens 64 |
|---|---|---|---|
| --max-batch-tokens 1024 | 3.163s | 1.049s | 1.590s |
| default (16384) | still warming after ~4 min, stopped manually | 1.594s | 1.590s |
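For anyone who wants to reproduce this kind of measurement, here is a rough sketch of a readiness timer. It is not the harness used for the numbers above: `time_until_healthy` and `health_ok` are hypothetical helpers, and the sketch assumes the caller records `started` immediately before spawning the router process and that the router answers `GET /health` with HTTP 200 once it is ready.

```rust
use std::io::{Read, Write};
use std::net::{SocketAddr, TcpStream};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Poll `GET /health` at `addr` until it returns HTTP 200, or give up once
/// `timeout` has elapsed since `started`. Returns the elapsed time on success.
pub fn time_until_healthy(
    started: Instant,
    addr: SocketAddr,
    timeout: Duration,
) -> Option<Duration> {
    while started.elapsed() < timeout {
        // Connection errors are expected while the server is still starting.
        if health_ok(addr).unwrap_or(false) {
            return Some(started.elapsed());
        }
        sleep(Duration::from_millis(10));
    }
    None
}

/// Send one minimal HTTP/1.1 request and check the status code.
fn health_ok(addr: SocketAddr) -> std::io::Result<bool> {
    let mut stream = TcpStream::connect_timeout(&addr, Duration::from_millis(200))?;
    stream.set_read_timeout(Some(Duration::from_secs(1)))?;
    stream.write_all(b"GET /health HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n")?;
    // The status line starts with "HTTP/1.1 200": 12 bytes, status code at 9..12.
    let mut head = [0u8; 12];
    stream.read_exact(&mut head)?;
    Ok(&head[9..12] == b"200")
}
```

The 10 ms polling interval bounds the resolution of the result, so differences of a few milliseconds between runs should not be read as meaningful.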

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline?
  • Was this discussed/approved via a GitHub issue or the forum? See "Decreasing warmup for use on spot instances" (#819).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests? If applicable, did you include or update the insta snapshots? Added unit tests for warmup token selection and validation; no snapshots needed.

@malaiwah malaiwah marked this pull request as ready for review June 25, 2026 14:14
@malaiwah

Author

CUDA validation added on an RTX 5090 host.

Environment:

Host: Ubuntu 24.04, NVIDIA GeForce RTX 5090
Driver: 595.58.03; nvidia-smi reports CUDA 13.2
Container: podman + docker.io/nvidia/cuda:12.8.0-devel-ubuntu24.04
CUDA_COMPUTE_CAP=120

I used a validation-only RUSTFLAGS='--cfg feature="dynamic-linking"' in the container so cudarc links dynamically against the CUDA runtime available there. No repository files were changed for this.

Backend warmup tests with CUDA enabled:

cargo test -p text-embeddings-backend \
  --no-default-features \
  --features candle,cuda,clap

Result:

test result: ok. 5 passed; 0 failed

Router CUDA build check, using the backend's transitive CUDA feature because text-embeddings-router does not expose a direct cuda feature:

cargo check -p text-embeddings-router \
  --no-default-features \
  --features candle,http,text-embeddings-backend/cuda

Result:

Finished `dev` profile [unoptimized + debuginfo] target(s) in 3m 11s

CLI surface check:

cargo run -p text-embeddings-router \
  --no-default-features \
  --features candle,http,text-embeddings-backend/cuda \
  -- --help

Result: help output includes the new option:

--warmup-tokens <WARMUP_TOKENS>

So the warmup-token tests and router CLI compile cleanly with CUDA enabled.
