Add optional raw binary embeddings output to reduce serialization overhead

### Feature request

Hi! First of all — TEI is awesome. It's one of the fastest and most practical embedding servers out there 👍

I want to suggest a relatively simple but impactful improvement:  
**support returning embeddings as raw binary (float32 bytes), not just JSON / protobuf.**

---

## Why this matters

In many real-world setups embeddings are used like this:

```
inputs → embeddings → vector DB (FAISS / Milvus / etc.)
```

In this pipeline:
- embeddings are **dense float32 vectors**
- all vectors have the **same dimension**
- number of vectors = `len(inputs)`

So the response is basically just a matrix `[N, D]` of float32.

Because of that:
- JSON is very inefficient (string conversion, large payload)
- protobuf (gRPC) is better, but still adds serialization overhead
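
To put rough numbers on the payload difference, here is a small sketch that compares a compact JSON encoding with raw little-endian float32 bytes for the same matrix. The function name `payload_sizes` and the random test data are illustrative assumptions, and real sizes depend on how a given server formats its floats.

```python
import json

import numpy as np


def payload_sizes(n: int, d: int, seed: int = 0) -> tuple[int, int]:
    """Return (json_bytes, raw_bytes) for an n x d float32 matrix."""
    rng = np.random.default_rng(seed)
    emb = rng.standard_normal((n, d), dtype=np.float32)

    # Use the shortest decimal text that round-trips each float32 value,
    # serialized as a compact JSON array of arrays.
    rows = [[float(str(v)) for v in row] for row in emb]
    json_bytes = len(json.dumps(rows, separators=(",", ":")).encode("utf-8"))

    # Raw transport: the matrix itself as little-endian float32.
    raw_bytes = len(emb.astype("<f4").tobytes())
    return json_bytes, raw_bytes


json_size, raw_size = payload_sizes(32, 768)
print(f"JSON: {json_size} bytes, raw: {raw_size} bytes")
```

The raw body is always exactly `N * D * 4` bytes, while the JSON size varies with the digits of each value.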

At scale, **serialization can become the bottleneck instead of inference**.

---

## Real-world precedent (vLLM)

This approach is already used in practice.

Some high-performance embedding deployments based on vLLM expose embeddings as **raw float32 bytes** instead of JSON or protobuf.

From practical experience:
- the API is extremely simple (fixed dtype + known shape)
- no real parsing is needed — just reinterpret the buffer
- it integrates very well with vector DB pipelines (FAISS, etc.)
- noticeably reduces latency and CPU usage under load

In other words, this is not just a theoretical optimization — it’s a **proven pattern in production systems**.

---

## What I’m proposing

Add an optional response format:

```
Content-Type: application/octet-stream
```

Response body:

```
float32[N * D]  (row-major, contiguous, little-endian)
```

Client-side usage is trivial:

```python
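import numpy as np  # N = len(inputs), D = the model's embedding dimension
emb = np.frombuffer(response.content, dtype="<f4").reshape(N, D)  # "<f4" = little-endian float32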
```

No real “parsing” needed — just reinterpret bytes.
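
Since the body carries no header, the client has to know the shape. A minimal sketch, assuming the client only knows `N = len(inputs)` and infers `D` from the body length; `decode_embeddings` is a hypothetical helper, not part of TEI:

```python
import numpy as np

FLOAT32_SIZE = 4  # bytes per float32 value


def decode_embeddings(body: bytes, n_inputs: int) -> np.ndarray:
    """Reinterpret a raw little-endian float32 body as an [N, D] matrix."""
    if n_inputs <= 0:
        raise ValueError("n_inputs must be positive")
    if len(body) == 0 or len(body) % (FLOAT32_SIZE * n_inputs) != 0:
        raise ValueError(
            f"body of {len(body)} bytes does not hold {n_inputs} float32 rows"
        )
    dim = len(body) // (FLOAT32_SIZE * n_inputs)
    # frombuffer does not copy: the result is a read-only view over `body`.
    return np.frombuffer(body, dtype="<f4").reshape(n_inputs, dim)
```

The length check is the only validation a headerless format allows, so a truncated response is caught here rather than by a parser.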

---



## Summary

For fixed-size dense embeddings, raw binary is about as cheap as a transport format can get: the payload is just the float32 matrix itself.

Adding this would let TEI:
- remove serialization as a bottleneck
- better support high-throughput pipelines
- align with patterns already used in optimized embedding systems (e.g. vLLM-based setups)

---

Happy to help with benchmarks or testing if this is interesting 👍

### Motivation

## Why not just use gRPC?

gRPC definitely helps compared to JSON, but for this specific case:

- data is homogeneous (`float32[]`)
- shape is known (`N`, `D`)
- no extra metadata needed

So protobuf doesn’t give much benefit, but still:
- does encoding/decoding
- allocates memory
- adds CPU overhead

In contrast, raw bytes need **almost no serialization work**: the server writes the buffer and the client reinterprets it.

---

## Expected benefits

- lower latency (especially for large batches)
- higher throughput
- less CPU usage on both sides
- easier zero-copy pipelines into vector DBs

This is particularly useful for:
- high-QPS embedding services
- batch ingestion pipelines
- latency-sensitive systems
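
To illustrate the zero-copy point, here is a small sketch showing that NumPy can wrap a received buffer without copying it. `as_matrix_view` is a hypothetical helper and the `bytearray` is a stand-in for a response body; whether a particular vector DB client then consumes the array without a copy of its own is an assumption to verify per library.

```python
import struct

import numpy as np


def as_matrix_view(buf, n: int, d: int) -> np.ndarray:
    """Wrap a buffer of n * d little-endian float32 values as an [n, d] view."""
    return np.frombuffer(buf, dtype="<f4", count=n * d).reshape(n, d)


buf = bytearray(2 * 3 * 4)            # stand-in for a received body: 2 x 3 zeros
view = as_matrix_view(buf, 2, 3)

# Write into the underlying buffer after the view was created.
struct.pack_into("<f", buf, 0, 1.5)

# The view reflects the change, so it shares memory with `buf`.
print(view[0, 0], view.flags.owndata)
```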

---

## API ideas (any of these would work)

**Option 1 (content negotiation):**
```
POST /embed
Accept: application/octet-stream
```

**Option 2 (query param):**
```
POST /embed?format=binary
```

**Option 3 (separate endpoint):**
```
POST /embed_binary
```
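
For Option 1, the client side could look roughly like the sketch below. It only builds the request and does not send it. The `Accept` handling is the behavior proposed in this issue, not something TEI does today, and `build_binary_embed_request` and the base URL are illustrative assumptions.

```python
import json
import urllib.request


def build_binary_embed_request(base_url: str, inputs: list[str]) -> urllib.request.Request:
    """Build a POST /embed request that asks for a raw binary response."""
    payload = json.dumps({"inputs": inputs}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/embed",
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Proposed opt-in: servers without binary support would keep
            # answering with JSON.
            "Accept": "application/octet-stream",
        },
    )


req = build_binary_embed_request("http://localhost:8080", ["hello", "world"])
```

With content negotiation the request body stays unchanged, so existing clients only need one extra header to opt in.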

---

## Compatibility

- fully backward compatible
- JSON remains default
- binary mode is opt-in

---

### Your contribution

I can help with testing

Add optional raw binary embeddings output to reduce serialization overhead #864
