Feature request
Hi! First of all — TEI is awesome. It's one of the fastest and most practical embedding servers out there 👍
I want to suggest a relatively simple but impactful improvement:
support returning embeddings as raw binary (float32 bytes), not just JSON / protobuf.
Why this matters
In many real-world setups embeddings are used like this:
inputs → embeddings → vector DB (FAISS / Milvus / etc.)
In this pipeline:
- embeddings are dense float32 vectors
- all vectors have the same dimension
- number of vectors = len(inputs)
So the response is basically just a matrix [N, D] of float32.
Because of that:
- JSON is very inefficient (string conversion, large payload)
- protobuf (gRPC) is better, but still adds serialization overhead
At scale, serialization can become the bottleneck rather than inference.
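To make the payload difference concrete, here is a rough sketch comparing the two encodings for the same batch. The batch size and dimension are made-up values, and the JSON side uses Python's default float formatting, so the exact ratio will differ from what a real server emits. It only illustrates the order of magnitude.

```python
import json
import random
import struct

def json_payload(vectors):
    """Serialize embeddings the way a JSON API would: a list of lists of floats."""
    return json.dumps(vectors).encode("utf-8")

def raw_payload(vectors):
    """Pack the same embeddings as contiguous little-endian float32 values."""
    flat = [x for row in vectors for x in row]
    return struct.pack(f"<{len(flat)}f", *flat)

random.seed(0)
n, d = 8, 384  # hypothetical batch size and embedding dimension
vectors = [[random.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]

json_bytes = json_payload(vectors)
raw_bytes = raw_payload(vectors)

# The raw payload is always exactly n * d * 4 bytes; the JSON size depends on
# how many digits each float is printed with.
print(f"raw:  {len(raw_bytes)} bytes")
print(f"json: {len(json_bytes)} bytes")
```

On top of the size difference, the JSON path also has to format and re-parse every number as text, which is where most of the CPU cost comes from.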
Real-world precedent (vLLM)
This approach is already used in practice.
Some high-performance embedding deployments based on vLLM expose embeddings as raw float32 bytes instead of JSON or protobuf.
From practical experience:
- the API is extremely simple (fixed dtype + known shape)
- no real parsing is needed — just reinterpret the buffer
- it integrates very well with vector DB pipelines (FAISS, etc.)
- noticeably reduces latency and CPU usage under load
In other words, this is not just a theoretical optimization — it’s a proven pattern in production systems.
What I’m proposing
Add an optional response format:
Content-Type: application/octet-stream
Response body:
float32[N * D] (contiguous, row-major, little-endian)
Client-side usage is trivial:
emb = np.frombuffer(response.content, dtype=np.float32).reshape(N, D)
No real “parsing” needed — just reinterpret bytes.
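A slightly more defensive version of the client side could look like the sketch below. It assumes the body contains nothing but the N * D float32 values, so D is inferred from the body length and the number of inputs sent. `decode_embeddings` is a hypothetical helper name, not part of any existing client.

```python
import numpy as np

def decode_embeddings(body: bytes, n_inputs: int) -> np.ndarray:
    """Reinterpret a raw float32 response body as an [N, D] matrix.

    Assumes the body is N * D little-endian float32 values in row-major
    order, with no header or trailer.
    """
    itemsize = 4  # bytes per float32
    if n_inputs <= 0:
        raise ValueError("n_inputs must be positive")
    if len(body) % (itemsize * n_inputs) != 0:
        raise ValueError(
            f"body of {len(body)} bytes does not split into {n_inputs} float32 rows"
        )
    dim = len(body) // (itemsize * n_inputs)
    # "<f4" pins the byte order to little-endian regardless of the host.
    return np.frombuffer(body, dtype="<f4").reshape(n_inputs, dim)

# Stand-in for response.content: two vectors of dimension three.
fake_body = np.arange(6, dtype="<f4").tobytes()
emb = decode_embeddings(fake_body, n_inputs=2)
print(emb.shape)
```

The array returned by `np.frombuffer` is a read-only view over the response bytes, so no copy is made until the caller asks for one.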
Summary
For fixed-size dense embeddings, raw binary transport is about as efficient as a response format can get: the body is exactly N * D * 4 bytes and needs no encoding or decoding step.
Adding this would let TEI:
- remove serialization as a bottleneck
- better support high-throughput pipelines
- align with patterns already used in optimized embedding systems (e.g. vLLM-based setups)
Happy to help with benchmarks or testing if this is interesting 👍
Motivation
Why not just use gRPC?
gRPC definitely helps compared to JSON, but for this specific case:
- data is homogeneous (float32[])
- shape is known (N, D)
- no extra metadata needed
So protobuf doesn’t give much benefit, but still:
- does encoding/decoding
- allocates memory
- adds CPU overhead
In contrast, raw bytes are basically zero-overhead transport.
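As a small illustration of what "zero-overhead" means here, the standard library alone can expose the received bytes as floats without copying them. This is only a sketch: `view_float32` is a hypothetical helper, and the zero-copy path applies on little-endian hosts, where the wire format matches the native layout.

```python
import struct
import sys

def view_float32(body: bytes):
    """Expose little-endian float32 bytes as a sequence of floats.

    On little-endian hosts this is a zero-copy view over `body`;
    on big-endian hosts the bytes must be swapped, which copies.
    """
    if sys.byteorder == "little":
        return memoryview(body).cast("f")
    return struct.unpack(f"<{len(body) // 4}f", body)

# Stand-in for a response body holding four float32 values.
body = struct.pack("<4f", 0.5, -1.0, 2.0, 0.25)
values = view_float32(body)
print(list(values))
```

A protobuf or JSON decoder, by contrast, has to walk the message and allocate a new container for the values before the caller can touch them.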
Expected benefits
- lower latency (especially for large batches)
- higher throughput
- less CPU usage on both sides
- easier zero-copy pipelines into vector DBs
This is particularly useful for:
- high-QPS embedding services
- batch ingestion pipelines
- latency-sensitive systems
API ideas (any of these would work)
Option 1 (content negotiation):
POST /embed
Accept: application/octet-stream
Option 2 (query param):
POST /embed?format=binary
Option 3 (separate endpoint):
Compatibility
- fully backward compatible
- JSON remains default
- binary mode is opt-in
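The opt-in behaviour for the content negotiation option could be sketched roughly as follows. This is illustrative Python, not TEI code. `render_embeddings` is a hypothetical name, and the `Accept` handling is deliberately simplified (it ignores q-values and wildcards).

```python
import json
import struct

BINARY = "application/octet-stream"
JSON = "application/json"

def render_embeddings(vectors, accept_header=None):
    """Return (content_type, body) for a batch of embeddings.

    Binary output is used only when the client explicitly lists
    application/octet-stream; everything else falls back to JSON.
    """
    accepted = {
        part.split(";")[0].strip().lower()
        for part in (accept_header or "").split(",")
    }
    if BINARY in accepted:
        flat = [x for row in vectors for x in row]
        # Contiguous, row-major, little-endian float32.
        return BINARY, struct.pack(f"<{len(flat)}f", *flat)
    return JSON, json.dumps(vectors).encode("utf-8")

vectors = [[0.5, 1.0], [2.0, -0.25]]
default_type, default_body = render_embeddings(vectors)
binary_type, binary_body = render_embeddings(vectors, "application/octet-stream")
print(default_type, len(default_body))
print(binary_type, len(binary_body))
```

Existing clients that send no `Accept` header, or `*/*`, keep getting JSON, so nothing changes for them.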
Your contribution
I can help with testing