The PPLX 0.6B model fails to fit within the memory capacity of an A100, resulting in out-of-memory errors.

### System Info

Hello,

An otherwise unused A100 fails to load the PPLX 0.6B model (https://huggingface.co/perplexity-ai/pplx-embed-v1-0.6b) and exits with an out-of-memory error. With `--max-batch-tokens 16384` it mostly works, but I still observe unexpectedly high load:

<img width="2000" height="862" alt="Image" src="https://github.com/user-attachments/assets/8d1313d0-7a55-4a6a-ac7c-df0def4d87b1" />

With `--max-batch-tokens 32768` (the docker-compose.yml is shown below), it looks like this:

<img width="2000" height="877" alt="Image" src="https://github.com/user-attachments/assets/75c85dc9-bc3f-40fd-9c79-bdc5dea895f2" />

and the server cannot run.

Logs:

```
/opt/tei sudo docker compose up
[+] Running 1/1
 ✔ Container text-embeddings-server-pplx  Recreated                      0.0s
Attaching to text-embeddings-server-pplx
text-embeddings-server-pplx  | 2026-04-05T17:24:18.674584Z  INFO text_embeddings_router: router/src/main.rs:216: Args { model_id: "/roo*/******/****-*****-**-0.6b", revision: None, tokenization_workers: None, dtype: Some(Float32), served_model_name: Some("perplexity-emb"), pooling: None, max_concurrent_requests: 1, max_batch_tokens: 32768, max_batch_requests: Some(1), max_client_batch_size: 1, auto_truncate: true, default_prompt_name: None, default_prompt: None, dense_path: None, hf_api_token: None, hf_token: None, hostname: "f7b127b4b487", port: 80, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: Some("<>"), json_output: false, disable_spans: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", prometheus_port: 9000, cors_allow_origin: None }
text-embeddings-server-pplx  | 2026-04-05T17:24:18.971589Z  WARN text_embeddings_router: router/src/lib.rs:203: Could not find a Sentence Transformers config
text-embeddings-server-pplx  | 2026-04-05T17:24:18.971607Z  INFO text_embeddings_router: router/src/lib.rs:221: Maximum number of tokens per request: 32768
text-embeddings-server-pplx  | 2026-04-05T17:24:18.971741Z  INFO text_embeddings_core::tokenization: core/src/tokenization.rs:38: Starting 15 tokenization workers
text-embeddings-server-pplx  | 2026-04-05T17:24:18.971876Z  INFO text_embeddings_router: router/src/lib.rs:271: Starting model backend
text-embeddings-server-pplx  | 2026-04-05T17:24:18.971916Z  WARN text_embeddings_backend: backends/src/lib.rs:472: Failed to parse local modules.json: unknown variant `st_quantize.FlexibleQuantizer`, expected one of `sentence_transformers.models.Dense`, `sentence_transformers.models.Normalize`, `sentence_transformers.models.Pooling`, `sentence_transformers.models.Transformer` at line 18 column 43
text-embeddings-server-pplx  | 2026-04-05T17:24:19.614281Z  INFO text_embeddings_backend_candle: backends/candle/src/lib.rs:544: Starting Pplx1 model on Cuda(CudaDevice(DeviceId(1)))
text-embeddings-server-pplx  | 2026-04-05T17:24:28.383923Z  INFO text_embeddings_router: router/src/lib.rs:289: Warming up model
text-embeddings-server-pplx  |
text-embeddings-server-pplx  | thread '<unnamed>' (46) panicked at /root/.cargo/git/checkouts/cudarc-a338a6fe3117fe87/8b4f18b/src/driver/safe/core.rs:283:26:
text-embeddings-server-pplx  | called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_OUT_OF_MEMORY, "out of memory")
text-embeddings-server-pplx  | note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
text-embeddings-server-pplx exited with code 0
```

Configuration (`nvidia-smi` output):
```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          Off |   00000000:13:00.0 Off |                    0 |
| N/A   34C    P0             42W /  300W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          Off |   00000000:1C:00.0 Off |                    0 |
| N/A   37C    P0             66W /  300W |   80055MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    1   N/A  N/A         1002335      C   VLLM::EngineCore                      14892MiB |
|    1   N/A  N/A         1003190      C   sglang::scheduler                     65140MiB |
+-----------------------------------------------------------------------------------------+
```

Gemma embeddings work fine, even though that model is almost the same size.

Thank you!

### Information

- [x] Docker
- [ ] The CLI directly

### Tasks

- [x] An officially supported command
- [ ] My own modifications

### Reproduction

```
services:
  text-embeddings-pplx:
    image: ghcr.io/huggingface/text-embeddings-inference:1.9
    shm_size: 1g
    container_name: text-embeddings-server-pplx
    restart: unless-stopped
    ports:
      - "1589:80"
    volumes:
      - /opt/new_models:/root/models
    command: >
      --model-id /root/models/pplx-embed-v1-0.6b
      --served-model-name perplexity-emb
      --dtype float32
      --max-batch-tokens 32768
      --max-concurrent-requests 1
      --max-client-batch-size 1
      --max-batch-requests 1
      --api-key <>
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]
    runtime: nvidia
```

### Expected behavior

I would expect a 0.6B model to fit on an A100 with the full context length.
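For reference, here is a rough back-of-the-envelope estimate of where the memory might go. It is only a sketch under stated assumptions, not a diagnosis: it assumes the attention scores are materialized as a full `seq_len x seq_len` matrix per head in float32 (i.e. no fused or flash attention kernel on this code path), and the head count of 16 is a hypothetical value that is not taken from this issue or confirmed against the model config. The helper names are made up for illustration.

```python
GIB = 1024 ** 3


def weights_bytes(n_params: int, bytes_per_param: int) -> int:
    """Memory needed to hold the model weights alone."""
    return n_params * bytes_per_param


def attention_scores_bytes(seq_len: int, n_heads: int,
                           bytes_per_elem: int, batch: int = 1) -> int:
    """Memory for one layer's full attention score matrix.

    Assumes a dense (batch, n_heads, seq_len, seq_len) tensor is allocated,
    which is what happens without a fused/flash attention kernel.
    """
    return batch * n_heads * seq_len * seq_len * bytes_per_elem


N_PARAMS = 600_000_000   # "0.6B" from the model name
N_HEADS = 16             # ASSUMPTION: hypothetical head count
FLOAT32 = 4              # bytes per element, matches --dtype float32

print(f"weights: {weights_bytes(N_PARAMS, FLOAT32) / GIB:.2f} GiB")
for tokens in (16384, 32768):
    scores = attention_scores_bytes(tokens, N_HEADS, FLOAT32)
    print(f"attention scores at {tokens} tokens: {scores / GIB:.1f} GiB")
```

Under these assumptions the weights take roughly 2.2 GiB, while a single dense score matrix takes 16 GiB at 16384 tokens and 64 GiB at 32768 tokens, before any other activations. If that is what happens during warmup, the quadratic growth in sequence length, rather than the parameter count, would be what exhausts the 80 GB card.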
