
Model startup takes minutes with cuda-1.9, only seconds with 1.9 #837

@slavistan

System Info

  • os: Arch Linux x86_64 (Kernel: 6.18.9-arch1-2)
  • cpu: AMD Ryzen Threadripper 3960X
  • gpu: NVIDIA GeForce RTX 3090 (cuda 13.1, driver 590.48.01)
  • TEI docker image: cuda-1.9, 1.9

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Running the Docker example from the README.md with the cuda-1.9 image leaves the container stuck on Starting FlashQwen3 model on Cuda(CudaDevice(DeviceId(1))) for nearly three minutes before initialization continues:

model=Qwen/Qwen3-Embedding-0.6B
volume=$PWD/models

docker run                                                                   \
    --gpus all                                                               \
    -p 8080:80                                                               \
    -v $volume:/data                                                         \
    --pull always                                                            \
    ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 --model-id $model
# cuda-1.9: Pulling from huggingface/text-embeddings-inference
# Digest: sha256:64bfb8bdd79ec1a2ef40c8fd297102b19cdeac38adf770eb9d815f4f41e4df00
# Status: Image is up to date for ghcr.io/huggingface/text-embeddings-inference:cuda-1.9
# 2026-02-25T08:57:54.301715Z  INFO text_embeddings_router: router/src/main.rs:216: Args { model_id: "Qwe*/*****-*********-0.6B", revision: None, tokenization_workers: None, dtype: None, served_model_name: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: true, default_prompt_name: None, default_prompt: None, dense_path: None, hf_api_token: None, hf_token: None, hostname: "db07861e321e", port: 80, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: None, json_output: false, disable_spans: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", prometheus_port: 9000, cors_allow_origin: None }
# 2026-02-25T08:57:54.381693Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:42: Starting download
# 2026-02-25T08:57:54.381698Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `1_Pooling/config.json`
# 2026-02-25T08:57:54.381719Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_bert_config.json`
# 2026-02-25T08:57:54.541464Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_roberta_config.json`
# 2026-02-25T08:57:54.659995Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_distilbert_config.json`
# 2026-02-25T08:57:54.785122Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_camembert_config.json`
# 2026-02-25T08:57:54.901103Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_albert_config.json`
# 2026-02-25T08:57:55.018778Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_xlm-roberta_config.json`
# 2026-02-25T08:57:55.138770Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_xlnet_config.json`
# 2026-02-25T08:57:55.257801Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `config_sentence_transformers.json`
# 2026-02-25T08:57:55.257876Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `config.json`
# 2026-02-25T08:57:55.257902Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `tokenizer.json`
# 2026-02-25T08:57:55.257923Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:72: Model artifacts downloaded in 876.23071ms
# 2026-02-25T08:57:55.516160Z  WARN text_embeddings_router: router/src/lib.rs:203: Could not find a Sentence Transformers config
# 2026-02-25T08:57:55.516174Z  WARN text_embeddings_router: router/src/lib.rs:214: The maximum input length is `32768` which exceeds `--max-batch-tokens=16384`. Input sequences will be truncated to `16384` tokens, as `--auto-truncate` is either not provided (defaults to true) or provided as true. To avoid truncation, increase `--max-batch-tokens` to at least `32768` and set `--auto-truncate false`.
# 2026-02-25T08:57:55.516177Z  INFO text_embeddings_router: router/src/lib.rs:221: Maximum number of tokens per request: 16384
# 2026-02-25T08:57:55.516350Z  INFO text_embeddings_core::tokenization: core/src/tokenization.rs:38: Starting 47 tokenization workers
# 2026-02-25T08:57:55.516488Z  INFO text_embeddings_router: router/src/lib.rs:271: Starting model backend
# 2026-02-25T08:57:55.516497Z  INFO text_embeddings_backend: backends/src/lib.rs:595: Downloading `model.safetensors`
# 2026-02-25T08:57:55.516519Z  INFO text_embeddings_backend: backends/src/lib.rs:430: Model weights downloaded in 21.351µs
# 2026-02-25T08:57:55.516524Z  INFO download_dense_modules: text_embeddings_backend: backends/src/lib.rs:766: Downloading `modules.json`
# 2026-02-25T08:57:55.516549Z  INFO text_embeddings_backend: backends/src/lib.rs:442: Dense modules downloaded in 27.182µs
# 2026-02-25T08:57:55.892112Z  INFO text_embeddings_backend_candle: backends/candle/src/lib.rs:513: Starting FlashQwen3 model on Cuda(CudaDevice(DeviceId(1)))
# 2026-02-25T09:00:50.506020Z  INFO text_embeddings_router: router/src/lib.rs:289: Warming up model
# 2026-02-25T09:00:51.273949Z  WARN text_embeddings_router: router/src/lib.rs:354: Invalid hostname, defaulting to 0.0.0.0
# 2026-02-25T09:00:51.275466Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1880: Starting HTTP server: 0.0.0.0:80
# 2026-02-25T09:00:51.275471Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1881: Ready

Compare this with the 1.9 image, where the model startup sequence takes about seven seconds:

model=Qwen/Qwen3-Embedding-0.6B
volume=$PWD/models

docker run                                                              \
    --gpus all                                                          \
    -p 8080:80                                                          \
    -v $volume:/data                                                    \
    --pull always                                                       \
    ghcr.io/huggingface/text-embeddings-inference:1.9 --model-id $model
# 1.9: Pulling from huggingface/text-embeddings-inference
# Digest: sha256:070799f02adaa65d99d68b818c359b80a1a487c383055c475cb9b87ad7015b74
# Status: Image is up to date for ghcr.io/huggingface/text-embeddings-inference:1.9
# 2026-02-25T08:48:25.207052Z  INFO text_embeddings_router: router/src/main.rs:216: Args { model_id: "Qwe*/*****-*********-0.6B", revision: None, tokenization_workers: None, dtype: None, served_model_name: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: true, default_prompt_name: None, default_prompt: None, dense_path: None, hf_api_token: None, hf_token: None, hostname: "b617d7234a6a", port: 80, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: None, json_output: false, disable_spans: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", prometheus_port: 9000, cors_allow_origin: None }
# 2026-02-25T08:48:25.288128Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:42: Starting download
# 2026-02-25T08:48:25.288134Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `1_Pooling/config.json`
# 2026-02-25T08:48:25.288159Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_bert_config.json`
# 2026-02-25T08:48:25.434375Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_roberta_config.json`
# 2026-02-25T08:48:25.639723Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_distilbert_config.json`
# 2026-02-25T08:48:25.755877Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_camembert_config.json`
# 2026-02-25T08:48:25.870455Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_albert_config.json`
# 2026-02-25T08:48:25.987672Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_xlm-roberta_config.json`
# 2026-02-25T08:48:26.103154Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_xlnet_config.json`
# 2026-02-25T08:48:26.217940Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `config_sentence_transformers.json`
# 2026-02-25T08:48:26.218000Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `config.json`
# 2026-02-25T08:48:26.218014Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `tokenizer.json`
# 2026-02-25T08:48:26.218037Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:72: Model artifacts downloaded in 929.909845ms
# 2026-02-25T08:48:26.483177Z  WARN text_embeddings_router: router/src/lib.rs:203: Could not find a Sentence Transformers config
# 2026-02-25T08:48:26.483192Z  WARN text_embeddings_router: router/src/lib.rs:214: The maximum input length is `32768` which exceeds `--max-batch-tokens=16384`. Input sequences will be truncated to `16384` tokens, as `--auto-truncate` is either not provided (defaults to true) or provided as true. To avoid truncation, increase `--max-batch-tokens` to at least `32768` and set `--auto-truncate false`.
# 2026-02-25T08:48:26.483194Z  INFO text_embeddings_router: router/src/lib.rs:221: Maximum number of tokens per request: 16384
# 2026-02-25T08:48:26.483365Z  INFO text_embeddings_core::tokenization: core/src/tokenization.rs:38: Starting 47 tokenization workers
# 2026-02-25T08:48:26.483498Z  INFO text_embeddings_router: router/src/lib.rs:271: Starting model backend
# 2026-02-25T08:48:26.483503Z  INFO text_embeddings_backend: backends/src/lib.rs:595: Downloading `model.safetensors`
# 2026-02-25T08:48:26.483527Z  INFO text_embeddings_backend: backends/src/lib.rs:430: Model weights downloaded in 23.745µs
# 2026-02-25T08:48:26.483542Z  INFO download_dense_modules: text_embeddings_backend: backends/src/lib.rs:766: Downloading `modules.json`
# 2026-02-25T08:48:26.483570Z  INFO text_embeddings_backend: backends/src/lib.rs:442: Dense modules downloaded in 36.881µs
# 2026-02-25T08:48:26.908447Z  INFO text_embeddings_backend_candle: backends/candle/src/lib.rs:513: Starting FlashQwen3 model on Cuda(CudaDevice(DeviceId(1)))
# 2026-02-25T08:48:34.009408Z  INFO text_embeddings_router: router/src/lib.rs:289: Warming up model
# 2026-02-25T08:48:34.773056Z  WARN text_embeddings_router: router/src/lib.rs:354: Invalid hostname, defaulting to 0.0.0.0
# 2026-02-25T08:48:34.774555Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1880: Starting HTTP server: 0.0.0.0:80
# 2026-02-25T08:48:34.774560Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1881: Ready

What might be the cause of this? Is there any way to debug it or speed things up? I'd love to use the cuda- images to avoid having to maintain architecture-specific base image configurations. Unfortunately, a startup time of several minutes is too long for automated tests or for on-demand model loading.
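To show exactly where the time goes, the gaps between consecutive log timestamps can be computed from the container output. The following is a minimal Python sketch and not part of TEI: `largest_gaps` is a hypothetical helper name, and it assumes the timestamp format seen in the logs above (UTC, microsecond precision). The sample lines are copied from the cuda-1.9 run.

```python
import re
from datetime import datetime

# Matches timestamps as they appear in the logs above,
# e.g. 2026-02-25T08:57:55.892112Z (assumed format: UTC, up to 6 fractional digits).
TS_RE = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{1,6}Z")


def largest_gaps(log_text, top=3):
    """Return (gap_seconds, line_before_gap) pairs, largest gap first."""
    stamped = []
    for line in log_text.splitlines():
        match = TS_RE.search(line)
        if match:
            ts = datetime.strptime(match.group(0), "%Y-%m-%dT%H:%M:%S.%fZ")
            stamped.append((ts, line))
    # Pair each timestamped line with the next one and measure the delay.
    gaps = [
        ((nxt[0] - cur[0]).total_seconds(), cur[1])
        for cur, nxt in zip(stamped, stamped[1:])
    ]
    return sorted(gaps, key=lambda g: g[0], reverse=True)[:top]


# Three consecutive lines from the cuda-1.9 log above.
SAMPLE = """\
# 2026-02-25T08:57:55.516549Z  INFO text_embeddings_backend: backends/src/lib.rs:442: Dense modules downloaded in 27.182µs
# 2026-02-25T08:57:55.892112Z  INFO text_embeddings_backend_candle: backends/candle/src/lib.rs:513: Starting FlashQwen3 model on Cuda(CudaDevice(DeviceId(1)))
# 2026-02-25T09:00:50.506020Z  INFO text_embeddings_router: router/src/lib.rs:289: Warming up model
"""

for seconds, line in largest_gaps(SAMPLE):
    print(f"{seconds:10.3f}s after: {line[:110]}")
```

On the sample lines, the largest gap (roughly 174.6 seconds) follows the Starting FlashQwen3 model line, which matches the stall described above. The same function can be applied to the full output of either container to compare the two images line by line.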

Expected behavior

I would have expected the init sequence of both containers to take roughly the same time.
