The PPLX 0.6B model fails to fit within the memory capacity of an A100, resulting in out-of-memory errors. #861

@vmarchenkoff

System Info

Hello,

An idle A100 fails to load the PPLX 0.6B model (https://huggingface.co/perplexity-ai/pplx-embed-v1-0.6b) because of out-of-memory errors. With --max-batch-tokens 16384 it mostly works, but I still observe some unexpected load.

Image

With --max-batch-tokens 32768 (docker-compose.yml is shown below), it looks like this:

Image

and the server cannot start.

logs:

/opt/tei sudo docker compose up
[+] Running 1/1
 ✔ Container text-embeddings-server-pplx  Recreated                      0.0s
Attaching to text-embeddings-server-pplx
text-embeddings-server-pplx  | 2026-04-05T17:24:18.674584Z  INFO text_embeddings_router: router/src/main.rs:216: Args { model_id: "/roo*/******/****-*****-**-0.6b", revision: None, tokenization_workers: None, dtype: Some(Float32), served_model_name: Some("perplexity-emb"), pooling: None, max_concurrent_requests: 1, max_batch_tokens: 32768, max_batch_requests: Some(1), max_client_batch_size: 1, auto_truncate: true, default_prompt_name: None, default_prompt: None, dense_path: None, hf_api_token: None, hf_token: None, hostname: "f7b127b4b487", port: 80, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: Some("<>"), json_output: false, disable_spans: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", prometheus_port: 9000, cors_allow_origin: None }
text-embeddings-server-pplx  | 2026-04-05T17:24:18.971589Z  WARN text_embeddings_router: router/src/lib.rs:203: Could not find a Sentence Transformers config
text-embeddings-server-pplx  | 2026-04-05T17:24:18.971607Z  INFO text_embeddings_router: router/src/lib.rs:221: Maximum number of tokens per request: 32768
text-embeddings-server-pplx  | 2026-04-05T17:24:18.971741Z  INFO text_embeddings_core::tokenization: core/src/tokenization.rs:38: Starting 15 tokenization workers
text-embeddings-server-pplx  | 2026-04-05T17:24:18.971876Z  INFO text_embeddings_router: router/src/lib.rs:271: Starting model backend
text-embeddings-server-pplx  | 2026-04-05T17:24:18.971916Z  WARN text_embeddings_backend: backends/src/lib.rs:472: Failed to parse local modules.json: unknown variant `st_quantize.FlexibleQuantizer`, expected one of `sentence_transformers.models.Dense`, `sentence_transformers.models.Normalize`, `sentence_transformers.models.Pooling`, `sentence_transformers.models.Transformer` at line 18 column 43
text-embeddings-server-pplx  | 2026-04-05T17:24:19.614281Z  INFO text_embeddings_backend_candle: backends/candle/src/lib.rs:544: Starting Pplx1 model on Cuda(CudaDevice(DeviceId(1)))
text-embeddings-server-pplx  | 2026-04-05T17:24:28.383923Z  INFO text_embeddings_router: router/src/lib.rs:289: Warming up model
text-embeddings-server-pplx  |
text-embeddings-server-pplx  | thread '<unnamed>' (46) panicked at /root/.cargo/git/checkouts/cudarc-a338a6fe3117fe87/8b4f18b/src/driver/safe/core.rs:283:26:
text-embeddings-server-pplx  | called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_OUT_OF_MEMORY, "out of memory")
text-embeddings-server-pplx  | note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
text-embeddings-server-pplx exited with code 0

configuration:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          Off |   00000000:13:00.0 Off |                    0 |
| N/A   34C    P0             42W /  300W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          Off |   00000000:1C:00.0 Off |                    0 |
| N/A   37C    P0             66W /  300W |   80055MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    1   N/A  N/A         1002335      C   VLLM::EngineCore                      14892MiB |
|    1   N/A  N/A         1003190      C   sglang::scheduler                     65140MiB |
+-----------------------------------------------------------------------------------------+
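To double-check which GPU has free memory at the moment the container starts, the per-GPU numbers can be read in machine-friendly form. The following is only a minimal sketch: it assumes `nvidia-smi` is on the PATH wherever it is run, and the helper names `parse_gpu_memory` and `query_gpu_memory` are made up for illustration.

```python
import subprocess

# Standard nvidia-smi query flags: one CSV row per GPU, values in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' rows (MiB) into a list of dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field.strip()) for field in line.split(","))
        gpus.append(
            {"index": index, "used_mib": used, "free_mib": total - used}
        )
    return gpus


def query_gpu_memory():
    """Run nvidia-smi and return the parsed per-GPU memory usage."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


if __name__ == "__main__":
    # Sample rows matching the nvidia-smi table above; call
    # query_gpu_memory() instead to read the live values.
    sample = "0, 0, 81920\n1, 80055, 81920\n"
    for gpu in parse_gpu_memory(sample):
        print(gpu)
```

Running `query_gpu_memory()` both on the host and inside the container would show whether the device the server picks up is really the empty one.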

Gemma embeddings work fine despite being almost the same size.

Thank you!

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

services:
  text-embeddings-pplx:
    image: ghcr.io/huggingface/text-embeddings-inference:1.9
    shm_size: 1g
    container_name: text-embeddings-server-pplx
    restart: unless-stopped
    ports:
      - "1589:80"
    volumes:
      - /opt/new_models:/root/models
    command: >
      --model-id /root/models/pplx-embed-v1-0.6b
      --served-model-name perplexity-emb
      --dtype float32
      --max-batch-tokens 32768
      --max-concurrent-requests 1
      --max-client-batch-size 1
      --max-batch-requests 1
      --api-key <>
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]
    runtime: nvidia

Expected behavior

I would expect a 0.6B model to fit on an A100 with full context.
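A rough back-of-the-envelope estimate behind this expectation, and one possible explanation for the failure, is sketched below. It rests on assumptions that are not confirmed anywhere in this report: that the model has 16 attention heads, and that with `--dtype float32` the backend materializes the full attention score matrix for the warm-up batch instead of using a fused attention kernel. The function names are hypothetical.

```python
GIB = 2**30


def weight_bytes(num_params, bytes_per_param=4):
    """Memory needed for the weights alone (float32 = 4 bytes per parameter)."""
    return num_params * bytes_per_param


def naive_attention_score_bytes(seq_len, num_heads, bytes_per_elem=4):
    """Size of one layer's seq_len x seq_len score matrix across all heads,
    assuming it is fully materialized (no fused / flash attention)."""
    return num_heads * seq_len * seq_len * bytes_per_elem


NUM_PARAMS = 600_000_000  # "0.6B", taken from the model name
NUM_HEADS = 16            # ASSUMPTION: not verified against the model config

if __name__ == "__main__":
    print(f"weights: {weight_bytes(NUM_PARAMS) / GIB:.2f} GiB")
    for tokens in (16384, 32768):
        scores = naive_attention_score_bytes(tokens, NUM_HEADS)
        print(f"{tokens} tokens: {scores / GIB:.0f} GiB of attention scores")
```

Under these assumptions the weights take a little over 2 GiB, while a single layer's score matrix takes 16 GiB at 16384 tokens and 64 GiB at 32768 tokens, before any softmax intermediates. If that is what happens during warm-up, it would match the behavior above, and trying `--dtype float16` or a smaller `--max-batch-tokens` might be a way to test the hypothesis.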
