Add optional raw binary embeddings output to reduce serialization overhead #864

@uasan

Description

Feature request

Hi! First of all — TEI is awesome. It's one of the fastest and most practical embedding servers out there 👍

I want to suggest a relatively simple but impactful improvement:
support returning embeddings as raw binary (float32 bytes), not just JSON / protobuf.


Why this matters

In many real-world setups embeddings are used like this:

inputs → embeddings → vector DB (FAISS / Milvus / etc.)

In this pipeline:

  • embeddings are dense float32 vectors
  • all vectors have the same dimension
  • number of vectors = len(inputs)

So the response is basically just a matrix [N, D] of float32.

Because of that:

  • JSON is very inefficient (string conversion, large payload)
  • protobuf (gRPC) is better, but still adds serialization overhead

At scale, serialization rather than inference can become the bottleneck.
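To make the payload-size point concrete, here is a rough sketch that encodes the same N x D float32 matrix both ways and compares byte counts. The helper name `payload_sizes` and the chosen shape (32 x 384) are illustrative assumptions, not anything taken from TEI, and the values are random rather than real embeddings.

```python
import json
import random
import struct


def payload_sizes(n: int, d: int, seed: int = 0) -> tuple[int, int]:
    """Return (json_bytes, raw_bytes) for an n x d matrix of float32 values."""
    rng = random.Random(seed)
    # Round every value through float32 so both encodings carry the same numbers.
    matrix = [
        [struct.unpack("<f", struct.pack("<f", rng.uniform(-1.0, 1.0)))[0] for _ in range(d)]
        for _ in range(n)
    ]
    json_body = json.dumps(matrix).encode("utf-8")
    # Raw form: rows packed back to back as little-endian float32.
    raw_body = b"".join(struct.pack(f"<{d}f", *row) for row in matrix)
    return len(json_body), len(raw_body)


json_size, raw_size = payload_sizes(n=32, d=384)
print(f"JSON: {json_size} bytes, raw float32: {raw_size} bytes")
```

The raw body is always exactly N * D * 4 bytes, while the JSON size depends on how many digits each float prints with. This only measures size on the wire; it says nothing about encode or decode time, which would need a separate benchmark.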


Real-world precedent (vLLM)

This approach is already used in practice.

Some high-performance embedding deployments based on vLLM expose embeddings as raw float32 bytes instead of JSON or protobuf.

From practical experience:

  • the API is extremely simple (fixed dtype + known shape)
  • no real parsing is needed — just reinterpret the buffer
  • it integrates very well with vector DB pipelines (FAISS, etc.)
  • noticeably reduces latency and CPU usage under load

In other words, this is not just a theoretical optimization — it’s a proven pattern in production systems.


What I’m proposing

Add an optional response format:

Content-Type: application/octet-stream

Response body:

float32[N * D]  (contiguous, little-endian)

Client-side usage is trivial:

import numpy as np
emb = np.frombuffer(response.content, dtype="<f4").reshape(N, D)  # "<f4" = little-endian float32

No real “parsing” needed — just reinterpret bytes.
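For illustration, the proposed wire format can be sketched end to end with only the standard library. The function names `encode_embeddings` and `decode_embeddings` are hypothetical and this is not TEI code. Because the body carries no shape information, the sketch assumes the client already knows N (the number of inputs it sent) and D (the model's embedding dimension) and checks the body length before reinterpreting it.

```python
import struct


def encode_embeddings(rows: list[list[float]]) -> bytes:
    """Pack an N x D matrix row by row as contiguous little-endian float32."""
    if not rows:
        return b""
    d = len(rows[0])
    if any(len(row) != d for row in rows):
        raise ValueError("all embeddings must have the same dimension")
    return b"".join(struct.pack(f"<{d}f", *row) for row in rows)


def decode_embeddings(body: bytes, n: int, d: int) -> list[list[float]]:
    """Reinterpret a raw body as n rows of d float32 values."""
    expected = n * d * 4  # 4 bytes per float32
    if len(body) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(body)}")
    flat = struct.unpack(f"<{n * d}f", body)
    return [list(flat[i * d:(i + 1) * d]) for i in range(n)]


body = encode_embeddings([[0.5, -1.0, 2.0], [0.25, 0.0, 8.0]])
print(decode_embeddings(body, n=2, d=3))
```

The length check is a cheap guard against a truncated response or a wrong assumption about D. Unlike `np.frombuffer`, this stdlib version copies the data into Python lists, so it shows the format rather than the zero-copy behavior.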


Summary

For fixed-size dense embeddings, raw binary transport is about as efficient as a response format can get: the payload is exactly N * D * 4 bytes, with no per-value encoding.

Adding this would let TEI:

  • remove serialization as a bottleneck
  • better support high-throughput pipelines
  • align with patterns already used in optimized embedding systems (e.g. vLLM-based setups)

Happy to help with benchmarks or testing if this is interesting 👍

Motivation

Why not just use gRPC?

gRPC definitely helps compared to JSON, but for this specific case:

  • data is homogeneous (float32[])
  • shape is known (N, D)
  • no extra metadata needed

So protobuf doesn’t give much benefit, but still:

  • does encoding/decoding
  • allocates memory
  • adds CPU overhead

In contrast, raw bytes are basically zero-overhead transport.


Expected benefits

  • lower latency (especially for large batches)
  • higher throughput
  • less CPU usage on both sides
  • easier zero-copy pipelines into vector DBs

This is particularly useful for:

  • high-QPS embedding services
  • batch ingestion pipelines
  • latency-sensitive systems

API ideas (any of these would work)

Option 1 (content negotiation):

POST /embed
Accept: application/octet-stream

Option 2 (query param):

POST /embed?format=binary

Option 3 (separate endpoint):

POST /embed_binary
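As a rough sketch of how the three options could coexist while keeping JSON as the default, the selection logic might look like the function below. `choose_response_type` is a hypothetical helper written in Python for illustration only; it is not TEI's routing code, and the Accept handling is deliberately simplified (it ignores q-values and wildcards).

```python
from urllib.parse import parse_qs

BINARY = "application/octet-stream"
JSON = "application/json"


def choose_response_type(path: str, query: str = "", accept: str = "") -> str:
    """Pick the response media type under the three proposed opt-in mechanisms."""
    # Option 3: a dedicated endpoint always answers in binary.
    if path == "/embed_binary":
        return BINARY
    # Option 2: an explicit query parameter, e.g. /embed?format=binary.
    if parse_qs(query).get("format", [""])[-1] == "binary":
        return BINARY
    # Option 1: content negotiation through the Accept header.
    accepted = [part.split(";")[0].strip().lower() for part in accept.split(",")]
    if BINARY in accepted:
        return BINARY
    # Anything else keeps the existing JSON behavior.
    return JSON


print(choose_response_type("/embed", accept="application/octet-stream"))
```

Requests that do not opt in, including ones sending `Accept: */*`, fall through to JSON, which is what keeps the change backward compatible.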

Compatibility

  • fully backward compatible
  • JSON remains default
  • binary mode is opt-in

Your contribution

I can help with testing
