System Info
- os: Arch Linux x86_64 (Kernel: 6.18.9-arch1-2)
- cpu: AMD Ryzen Threadripper 3960X
- gpu: NVIDIA GeForce RTX 3090 (cuda 13.1, driver 590.48.01)
- TEI docker image: cuda-1.9, 1.9
Information
Tasks
Reproduction
Running the Docker example from the README.md with the cuda-1.9 image leaves the container stuck on "Starting FlashQwen3 model on Cuda(CudaDevice(DeviceId(1)))" for almost three minutes (08:57:55 to 09:00:50 in the log below) before initialization continues:
model=Qwen/Qwen3-Embedding-0.6B
volume=$PWD/models
docker run \
--gpus all \
-p 8080:80 \
-v $volume:/data \
--pull always \
ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 --model-id $model
# cuda-1.9: Pulling from huggingface/text-embeddings-inference
# Digest: sha256:64bfb8bdd79ec1a2ef40c8fd297102b19cdeac38adf770eb9d815f4f41e4df00
# Status: Image is up to date for ghcr.io/huggingface/text-embeddings-inference:cuda-1.9
# 2026-02-25T08:57:54.301715Z INFO text_embeddings_router: router/src/main.rs:216: Args { model_id: "Qwe*/*****-*********-0.6B", revision: None, tokenization_workers: None, dtype: None, served_model_name: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: true, default_prompt_name: None, default_prompt: None, dense_path: None, hf_api_token: None, hf_token: None, hostname: "db07861e321e", port: 80, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: None, json_output: false, disable_spans: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", prometheus_port: 9000, cors_allow_origin: None }
# 2026-02-25T08:57:54.381693Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:42: Starting download
# 2026-02-25T08:57:54.381698Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `1_Pooling/config.json`
# 2026-02-25T08:57:54.381719Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_bert_config.json`
# 2026-02-25T08:57:54.541464Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_roberta_config.json`
# 2026-02-25T08:57:54.659995Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_distilbert_config.json`
# 2026-02-25T08:57:54.785122Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_camembert_config.json`
# 2026-02-25T08:57:54.901103Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_albert_config.json`
# 2026-02-25T08:57:55.018778Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_xlm-roberta_config.json`
# 2026-02-25T08:57:55.138770Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_xlnet_config.json`
# 2026-02-25T08:57:55.257801Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `config_sentence_transformers.json`
# 2026-02-25T08:57:55.257876Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `config.json`
# 2026-02-25T08:57:55.257902Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `tokenizer.json`
# 2026-02-25T08:57:55.257923Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:72: Model artifacts downloaded in 876.23071ms
# 2026-02-25T08:57:55.516160Z WARN text_embeddings_router: router/src/lib.rs:203: Could not find a Sentence Transformers config
# 2026-02-25T08:57:55.516174Z WARN text_embeddings_router: router/src/lib.rs:214: The maximum input length is `32768` which exceeds `--max-batch-tokens=16384`. Input sequences will be truncated to `16384` tokens, as `--auto-truncate` is either not provided (defaults to true) or provided as true. To avoid truncation, increase `--max-batch-tokens` to at least `32768` and set `--auto-truncate false`.
# 2026-02-25T08:57:55.516177Z INFO text_embeddings_router: router/src/lib.rs:221: Maximum number of tokens per request: 16384
# 2026-02-25T08:57:55.516350Z INFO text_embeddings_core::tokenization: core/src/tokenization.rs:38: Starting 47 tokenization workers
# 2026-02-25T08:57:55.516488Z INFO text_embeddings_router: router/src/lib.rs:271: Starting model backend
# 2026-02-25T08:57:55.516497Z INFO text_embeddings_backend: backends/src/lib.rs:595: Downloading `model.safetensors`
# 2026-02-25T08:57:55.516519Z INFO text_embeddings_backend: backends/src/lib.rs:430: Model weights downloaded in 21.351µs
# 2026-02-25T08:57:55.516524Z INFO download_dense_modules: text_embeddings_backend: backends/src/lib.rs:766: Downloading `modules.json`
# 2026-02-25T08:57:55.516549Z INFO text_embeddings_backend: backends/src/lib.rs:442: Dense modules downloaded in 27.182µs
# 2026-02-25T08:57:55.892112Z INFO text_embeddings_backend_candle: backends/candle/src/lib.rs:513: Starting FlashQwen3 model on Cuda(CudaDevice(DeviceId(1)))
# 2026-02-25T09:00:50.506020Z INFO text_embeddings_router: router/src/lib.rs:289: Warming up model
# 2026-02-25T09:00:51.273949Z WARN text_embeddings_router: router/src/lib.rs:354: Invalid hostname, defaulting to 0.0.0.0
# 2026-02-25T09:00:51.275466Z INFO text_embeddings_router::http::server: router/src/http/server.rs:1880: Starting HTTP server: 0.0.0.0:80
# 2026-02-25T09:00:51.275471Z INFO text_embeddings_router::http::server: router/src/http/server.rs:1881: Ready
Compare this with the 1.9 image, where the model startup sequence takes about seven seconds:
model=Qwen/Qwen3-Embedding-0.6B
volume=$PWD/models
docker run \
--gpus all \
-p 8080:80 \
-v $volume:/data \
--pull always \
ghcr.io/huggingface/text-embeddings-inference:1.9 --model-id $model
# 1.9: Pulling from huggingface/text-embeddings-inference
# Digest: sha256:070799f02adaa65d99d68b818c359b80a1a487c383055c475cb9b87ad7015b74
# Status: Image is up to date for ghcr.io/huggingface/text-embeddings-inference:1.9
# 2026-02-25T08:48:25.207052Z INFO text_embeddings_router: router/src/main.rs:216: Args { model_id: "Qwe*/*****-*********-0.6B", revision: None, tokenization_workers: None, dtype: None, served_model_name: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: true, default_prompt_name: None, default_prompt: None, dense_path: None, hf_api_token: None, hf_token: None, hostname: "b617d7234a6a", port: 80, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: None, json_output: false, disable_spans: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", prometheus_port: 9000, cors_allow_origin: None }
# 2026-02-25T08:48:25.288128Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:42: Starting download
# 2026-02-25T08:48:25.288134Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `1_Pooling/config.json`
# 2026-02-25T08:48:25.288159Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_bert_config.json`
# 2026-02-25T08:48:25.434375Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_roberta_config.json`
# 2026-02-25T08:48:25.639723Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_distilbert_config.json`
# 2026-02-25T08:48:25.755877Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_camembert_config.json`
# 2026-02-25T08:48:25.870455Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_albert_config.json`
# 2026-02-25T08:48:25.987672Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_xlm-roberta_config.json`
# 2026-02-25T08:48:26.103154Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_xlnet_config.json`
# 2026-02-25T08:48:26.217940Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `config_sentence_transformers.json`
# 2026-02-25T08:48:26.218000Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `config.json`
# 2026-02-25T08:48:26.218014Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `tokenizer.json`
# 2026-02-25T08:48:26.218037Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:72: Model artifacts downloaded in 929.909845ms
# 2026-02-25T08:48:26.483177Z WARN text_embeddings_router: router/src/lib.rs:203: Could not find a Sentence Transformers config
# 2026-02-25T08:48:26.483192Z WARN text_embeddings_router: router/src/lib.rs:214: The maximum input length is `32768` which exceeds `--max-batch-tokens=16384`. Input sequences will be truncated to `16384` tokens, as `--auto-truncate` is either not provided (defaults to true) or provided as true. To avoid truncation, increase `--max-batch-tokens` to at least `32768` and set `--auto-truncate false`.
# 2026-02-25T08:48:26.483194Z INFO text_embeddings_router: router/src/lib.rs:221: Maximum number of tokens per request: 16384
# 2026-02-25T08:48:26.483365Z INFO text_embeddings_core::tokenization: core/src/tokenization.rs:38: Starting 47 tokenization workers
# 2026-02-25T08:48:26.483498Z INFO text_embeddings_router: router/src/lib.rs:271: Starting model backend
# 2026-02-25T08:48:26.483503Z INFO text_embeddings_backend: backends/src/lib.rs:595: Downloading `model.safetensors`
# 2026-02-25T08:48:26.483527Z INFO text_embeddings_backend: backends/src/lib.rs:430: Model weights downloaded in 23.745µs
# 2026-02-25T08:48:26.483542Z INFO download_dense_modules: text_embeddings_backend: backends/src/lib.rs:766: Downloading `modules.json`
# 2026-02-25T08:48:26.483570Z INFO text_embeddings_backend: backends/src/lib.rs:442: Dense modules downloaded in 36.881µs
# 2026-02-25T08:48:26.908447Z INFO text_embeddings_backend_candle: backends/candle/src/lib.rs:513: Starting FlashQwen3 model on Cuda(CudaDevice(DeviceId(1)))
# 2026-02-25T08:48:34.009408Z INFO text_embeddings_router: router/src/lib.rs:289: Warming up model
# 2026-02-25T08:48:34.773056Z WARN text_embeddings_router: router/src/lib.rs:354: Invalid hostname, defaulting to 0.0.0.0
# 2026-02-25T08:48:34.774555Z INFO text_embeddings_router::http::server: router/src/http/server.rs:1880: Starting HTTP server: 0.0.0.0:80
# 2026-02-25T08:48:34.774560Z INFO text_embeddings_router::http::server: router/src/http/server.rs:1881: Ready
What might be the cause of this? Is there any way to debug it or speed things up? I'd love to use the cuda- images to avoid having to maintain architecture-specific base image configurations. Unfortunately, a three-minute startup is too slow for automated tests or for on-demand model loading.
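To make the comparison between the two images easier to repeat, here is a minimal sketch that finds the longest pause between consecutive timestamped log lines. It assumes the log format shown above (an RFC 3339 timestamp ending in `Z` as one whitespace-separated token per line). The helper names `parse_ts` and `largest_gap` are made up for illustration and are not part of TEI.

```python
from datetime import datetime


def parse_ts(line):
    """Return the timestamp found in a log line, or None if there is none."""
    for token in line.split():
        # Timestamps in the logs above look like 2026-02-25T08:57:54.301715Z
        if len(token) >= 20 and token[4] == "-" and token[10] == "T" and token.endswith("Z"):
            try:
                return datetime.strptime(token, "%Y-%m-%dT%H:%M:%S.%fZ")
            except ValueError:
                continue
    return None


def largest_gap(lines):
    """Return (seconds, line_before_gap) for the longest pause between log lines."""
    best_gap, best_line = 0.0, None
    prev_ts, prev_line = None, None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # skip lines without a timestamp (e.g. the docker pull output)
        if prev_ts is not None:
            gap = (ts - prev_ts).total_seconds()
            if gap > best_gap:
                best_gap, best_line = gap, prev_line
        prev_ts, prev_line = ts, line
    return best_gap, best_line


# Three lines copied from the cuda-1.9 log above.
sample = [
    "# 2026-02-25T08:57:55.516488Z INFO text_embeddings_router: router/src/lib.rs:271: Starting model backend",
    "# 2026-02-25T08:57:55.892112Z INFO text_embeddings_backend_candle: backends/candle/src/lib.rs:513: Starting FlashQwen3 model on Cuda(CudaDevice(DeviceId(1)))",
    "# 2026-02-25T09:00:50.506020Z INFO text_embeddings_router: router/src/lib.rs:289: Warming up model",
]

gap, line = largest_gap(sample)
print(f"{gap:.1f}s after: {line}")
```

On the sample lines this reports a pause of roughly 174.6 seconds after the "Starting FlashQwen3 model" line. Running the same function over the full output of each container shows the stall as a single number per image.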
Expected behavior
I would have expected the init sequence to take roughly the same time in both containers.