feat: allow controlling startup warmup tokens by malaiwah · Pull Request #884 · huggingface/text-embeddings-inference

malaiwah · 2026-06-25T13:13:10Z

What does this PR do?

Adds a --warmup-tokens / WARMUP_TOKENS option to control the explicit startup warmup pass.

- Default behavior is unchanged:
  - padded backends use min(max_input_length, max_batch_tokens)
  - non-padded backends use max_batch_tokens
- --warmup-tokens 0 skips the explicit warmup pass.
- --warmup-tokens N runs a smaller synthetic warmup with N tokens.
- Positive values above --max-batch-tokens are rejected to avoid startup-only shapes larger than production batching limits.
- HPU and ROCm keep their existing backend-specific bucket warmup for positive values; 0 still skips before that warmup is entered.
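
The selection and validation rules listed above can be sketched as a small pure function. This is an illustration only, not the code from this PR: `WarmupPlan`, `resolve_warmup_plan`, and the parameter names are hypothetical, and the sketch assumes an unset option is represented as `None`. It does not model the HPU/ROCm bucket warmup, and it does not cap a user-supplied value at max_input_length for padded backends, because the description does not say whether that happens.

```rust
/// Hypothetical result type for the startup warmup decision.
#[derive(Debug, PartialEq, Eq)]
pub enum WarmupPlan {
    /// `--warmup-tokens 0`: skip the explicit warmup pass.
    Skip,
    /// Run one synthetic warmup pass with this many tokens.
    Run(usize),
}

/// Hypothetical helper mirroring the rules in the PR description.
/// `requested` is `None` when `--warmup-tokens` / `WARMUP_TOKENS` is not set.
pub fn resolve_warmup_plan(
    requested: Option<usize>,
    padded_backend: bool,
    max_input_length: usize,
    max_batch_tokens: usize,
) -> Result<WarmupPlan, String> {
    match requested {
        // Unset: keep the existing default behavior.
        None => {
            let tokens = if padded_backend {
                max_input_length.min(max_batch_tokens)
            } else {
                max_batch_tokens
            };
            Ok(WarmupPlan::Run(tokens))
        }
        // Zero: skip before any warmup is entered.
        Some(0) => Ok(WarmupPlan::Skip),
        // Larger than the serving limit: reject at startup.
        Some(n) if n > max_batch_tokens => Err(format!(
            "warmup tokens ({n}) must not exceed max batch tokens ({max_batch_tokens})"
        )),
        // Any other positive value: run a smaller synthetic warmup.
        Some(n) => Ok(WarmupPlan::Run(n)),
    }
}
```

Returning a plan rather than running the warmup directly keeps the rules testable without loading a model, which is in the spirit of the unit tests for warmup token selection and validation mentioned under "Before submitting".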

This addresses #819, where CPU/ORT Qwen3 ONNX startup was dominated by the warmup pass even after lowering --max-batch-tokens.

Why

Warmup is valuable on GPU because it exercises production batch sizes and can trigger kernel compilation / memory allocation. On CPU spot deployments, especially ORT/ONNX, the same synthetic forward pass can dominate cold start time.

This lets users choose a smaller smoke-test warmup or skip the explicit warmup pass entirely without also lowering --max-batch-tokens and changing serving limits.

Validation

Static/local checks:

cargo test -p text-embeddings-backend --no-default-features --features ort,clap
cargo test -p text-embeddings-backend --no-default-features --features candle,clap
cargo check -p text-embeddings-router --no-default-features --features ort,http
cargo check -p text-embeddings-router --no-default-features --features candle,http
cargo run -p text-embeddings-router --no-default-features --features ort,http -- --help
git diff --check

Runtime timing on macOS/arm64 CPU, release-built ORT/http router, using the model from #819:

Svenni551/Qwen3-Embedding-0.6B-ONNX-INT8

Note: current TEI could not start directly from that repo config because the config contains GPT-style alias fields (n_embd, n_layer, n_head) that parse as duplicate aliases for existing Qwen3 fields. I used a temporary local copy with only those alias keys removed; the ONNX weights and tokenizer were unchanged.

Measured process start until GET /health succeeded:

| Setup | Default warmup | `--warmup-tokens 0` | `--warmup-tokens 64` |
| --- | --- | --- | --- |
| `--max-batch-tokens 1024` | `3.163s` | `1.049s` | `1.590s` |
| default `16384` | still warming after ~4 min, stopped manually | `1.594s` | `1.590s` |
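
For anyone who wants to reproduce this kind of measurement, a "time until GET /health succeeds" loop might look like the sketch below. It is not the harness used for the numbers above: `time_until_ready` and `health_is_ok` are hypothetical helpers, the probe uses only the standard library, and it assumes the router answers a healthy check with a 200 status line. The timer should be started right after the router process is spawned.

```rust
use std::io::{Read, Write};
use std::net::TcpStream;
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Calls `probe` repeatedly until it returns true or `timeout` elapses.
/// Returns the elapsed time at the first success, or None on timeout.
pub fn time_until_ready<F: FnMut() -> bool>(
    mut probe: F,
    interval: Duration,
    timeout: Duration,
) -> Option<Duration> {
    let start = Instant::now();
    loop {
        if probe() {
            return Some(start.elapsed());
        }
        if start.elapsed() >= timeout {
            return None;
        }
        sleep(interval);
    }
}

/// Sends a bare `GET /health` to `addr` (for example "127.0.0.1:3000", an
/// assumed address) and reports whether the response starts with a 200 status.
/// Connection errors count as "not ready yet".
pub fn health_is_ok(addr: &str) -> bool {
    let mut stream = match TcpStream::connect(addr) {
        Ok(s) => s,
        Err(_) => return false,
    };
    let _ = stream.set_read_timeout(Some(Duration::from_secs(1)));
    let request = format!("GET /health HTTP/1.1\r\nHost: {addr}\r\nConnection: close\r\n\r\n");
    if stream.write_all(request.as_bytes()).is_err() {
        return false;
    }
    // Only the start of the status line is inspected; a single read is
    // normally enough for that, and a short read just counts as "not ready".
    let mut buf = [0u8; 64];
    match stream.read(&mut buf) {
        Ok(n) => {
            let head = String::from_utf8_lossy(&buf[..n]);
            head.starts_with("HTTP/1.1 200") || head.starts_with("HTTP/1.0 200")
        }
        Err(_) => false,
    }
}
```

A caller would combine the two as `time_until_ready(|| health_is_ok("127.0.0.1:3000"), Duration::from_millis(50), Duration::from_secs(300))`. The polling interval bounds the measurement resolution, so differences of a few milliseconds between runs should not be over-interpreted.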

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline?
Was this discussed/approved via a GitHub issue or the forum? See Decreasing warmup for use on spot instances #819.
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests? If applicable, did you include or update the insta snapshots? Added unit tests for warmup token selection and validation; no snapshots needed.

malaiwah · 2026-06-26T01:07:14Z

CUDA validation added on an RTX 5090 host.

Environment:

Host: Ubuntu 24.04, NVIDIA GeForce RTX 5090
Driver: 595.58.03; nvidia-smi reports CUDA 13.2
Container: podman + docker.io/nvidia/cuda:12.8.0-devel-ubuntu24.04
CUDA_COMPUTE_CAP=120

I used a validation-only RUSTFLAGS='--cfg feature="dynamic-linking"' in the container so cudarc links dynamically against the CUDA runtime available there. No repository files were changed for this.

Backend warmup tests with CUDA enabled:

cargo test -p text-embeddings-backend \
  --no-default-features \
  --features candle,cuda,clap

Result:

test result: ok. 5 passed; 0 failed

Router CUDA build check, using the backend's transitive CUDA feature because text-embeddings-router does not expose a direct cuda feature:

cargo check -p text-embeddings-router \
  --no-default-features \
  --features candle,http,text-embeddings-backend/cuda

Result:

Finished `dev` profile [unoptimized + debuginfo] target(s) in 3m 11s

CLI surface check:

cargo run -p text-embeddings-router \
  --no-default-features \
  --features candle,http,text-embeddings-backend/cuda \
  -- --help

Result: help output includes the new option:

--warmup-tokens <WARMUP_TOKENS>

So the warmup-token tests and router CLI compile cleanly with CUDA enabled.

feat: allow controlling startup warmup tokens

5806bdc

malaiwah mentioned this pull request Jun 25, 2026

Decreasing warmup for use on spot instances #819

Open

malaiwah marked this pull request as ready for review June 25, 2026 14:14


malaiwah wants to merge 1 commit into huggingface:main from malaiwah:codex/add-warmup-tokens-control
