/opt/tei sudo docker compose up
[+] Running 1/1
✔ Container text-embeddings-server-pplx Recreated 0.0s
Attaching to text-embeddings-server-pplx
text-embeddings-server-pplx | 2026-04-05T17:24:18.674584Z INFO text_embeddings_router: router/src/main.rs:216: Args { model_id: "/roo*/******/****-*****-**-0.6b", revision: None, tokenization_workers: None, dtype: Some(Float32), served_model_name: Some("perplexity-emb"), pooling: None, max_concurrent_requests: 1, max_batch_tokens: 32768, max_batch_requests: Some(1), max_client_batch_size: 1, auto_truncate: true, default_prompt_name: None, default_prompt: None, dense_path: None, hf_api_token: None, hf_token: None, hostname: "f7b127b4b487", port: 80, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: Some("<>"), json_output: false, disable_spans: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", prometheus_port: 9000, cors_allow_origin: None }
text-embeddings-server-pplx | 2026-04-05T17:24:18.971589Z WARN text_embeddings_router: router/src/lib.rs:203: Could not find a Sentence Transformers config
text-embeddings-server-pplx | 2026-04-05T17:24:18.971607Z INFO text_embeddings_router: router/src/lib.rs:221: Maximum number of tokens per request: 32768
text-embeddings-server-pplx | 2026-04-05T17:24:18.971741Z INFO text_embeddings_core::tokenization: core/src/tokenization.rs:38: Starting 15 tokenization workers
text-embeddings-server-pplx | 2026-04-05T17:24:18.971876Z INFO text_embeddings_router: router/src/lib.rs:271: Starting model backend
text-embeddings-server-pplx | 2026-04-05T17:24:18.971916Z WARN text_embeddings_backend: backends/src/lib.rs:472: Failed to parse local modules.json: unknown variant `st_quantize.FlexibleQuantizer`, expected one of `sentence_transformers.models.Dense`, `sentence_transformers.models.Normalize`, `sentence_transformers.models.Pooling`, `sentence_transformers.models.Transformer` at line 18 column 43
text-embeddings-server-pplx | 2026-04-05T17:24:19.614281Z INFO text_embeddings_backend_candle: backends/candle/src/lib.rs:544: Starting Pplx1 model on Cuda(CudaDevice(DeviceId(1)))
text-embeddings-server-pplx | 2026-04-05T17:24:28.383923Z INFO text_embeddings_router: router/src/lib.rs:289: Warming up model
text-embeddings-server-pplx |
text-embeddings-server-pplx | thread '<unnamed>' (46) panicked at /root/.cargo/git/checkouts/cudarc-a338a6fe3117fe87/8b4f18b/src/driver/safe/core.rs:283:26:
text-embeddings-server-pplx | called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_OUT_OF_MEMORY, "out of memory")
text-embeddings-server-pplx | note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
text-embeddings-server-pplx exited with code 0
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          Off |   00000000:13:00.0 Off |                    0 |
| N/A   34C    P0             42W /  300W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          Off |   00000000:1C:00.0 Off |                    0 |
| N/A   37C    P0             66W /  300W |   80055MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    1   N/A  N/A         1002335      C   VLLM::EngineCore                      14892MiB |
|    1   N/A  N/A         1003190      C   sglang::scheduler                     65140MiB |
+-----------------------------------------------------------------------------------------+
services:
  text-embeddings-pplx:
    image: ghcr.io/huggingface/text-embeddings-inference:1.9
    shm_size: 1g
    container_name: text-embeddings-server-pplx
    restart: unless-stopped
    ports:
      - "1589:80"
    volumes:
      - /opt/new_models:/root/models
    command: >
      --model-id /root/models/pplx-embed-v1-0.6b
      --served-model-name perplexity-emb
      --dtype float32
      --max-batch-tokens 32768
      --max-concurrent-requests 1
      --max-client-batch-size 1
      --max-batch-requests 1
      --api-key <>
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]
    runtime: nvidia
System Info
Hello,
An otherwise idle A100 80GB (GPU 0) fails to load the PPLX 0.6B model (https://huggingface.co/perplexity-ai/pplx-embed-v1-0.6b) with a CUDA out-of-memory error. With --max-batch-tokens 16384 it mostly works, but I still observe some unexpected load.
With --max-batch-tokens 32768 (the docker-compose.yml is included in this report), the container runs out of memory while warming up the model and cannot run.
logs:
configuration:
Gemma embeddings work fine on the same setup, even though that model is almost the same size.
Thank you!
Reproduction
Expected behavior
I suspect that a 0.6B model should be able to fit on an A100 with full context.
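
A rough back-of-the-envelope estimate may help frame this expectation. The sketch below assumes that, with --dtype float32, attention runs without a fused (flash) kernel, so the full attention-score matrix for one sequence is materialized. It also assumes 16 attention heads, which is an illustrative value and not taken from the model's config. Neither assumption is confirmed by the logs above. Under those assumptions the weights are small, but the score matrix grows quadratically with the number of tokens:

```python
GIB = 1024 ** 3


def weights_gib(n_params: float, bytes_per_param: int = 4) -> float:
    """Memory for the model weights alone (float32 = 4 bytes per parameter)."""
    return n_params * bytes_per_param / GIB


def attention_scores_gib(tokens: int, heads: int, bytes_per_elem: int = 4) -> float:
    """Memory for one layer's heads x tokens x tokens score matrix.

    Assumes the scores are fully materialized (no fused/flash attention).
    """
    return heads * tokens * tokens * bytes_per_elem / GIB


if __name__ == "__main__":
    HEADS = 16  # assumption for illustration; check the model's config.json
    print(f"weights (0.6B params, float32): {weights_gib(0.6e9):.1f} GiB")
    for tokens in (16384, 32768):
        scores = attention_scores_gib(tokens, HEADS)
        print(f"{tokens} tokens: {scores:.0f} GiB of attention scores")
```

With these assumed values, a single 32768-token sequence would need about 64 GiB for one score matrix, against about 16 GiB at 16384 tokens. That would be consistent with 16384 loading while 32768 runs out of memory on an 80 GB card, but whether TEI actually materializes the matrix this way for this model is not verified here.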