Description
When Agent Zero stores or searches memory documents that exceed the embedding model's context window (e.g. BAAI/bge-m3: 8192 tokens), the embedding call crashes with a 400 error:
```
litellm.BadRequestError: Error code: 400 - You passed 8193 input tokens and requested 0 output tokens.
However, the model's context length is only 8192 tokens.
```
This is a hard crash: the agent errors out and memory operations fail completely.
Root Cause
Two separate issues in LiteLLMEmbeddingWrapper:
Issue 1: LiteLLM ≥1.80.11 sends encoding_format: null in embedding requests when the parameter is not set. Strict OpenAI-compatible validators (DeepInfra, vLLM, HuggingFace TEI) reject null with 422. (Upstream LiteLLM issue: BerriAI/litellm#19174)
Issue 2: There is no input truncation before embedding calls. If a memory document exceeds the model's context window, the API returns 400. Additionally, when truncation is applied using cl100k_base (GPT tokenizer), it can undercount tokens compared to the model's own tokenizer (e.g. bge-m3 SentencePiece), causing 400 errors even at the apparent limit.
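The tokenizer divergence in Issue 2 can be illustrated with a toy sketch. Both "tokenizers" below are hypothetical stand-ins (a whitespace splitter and a splitter that breaks long words in two), not cl100k_base or the bge-m3 tokenizer. They only show how a text that fits the limit under the counting tokenizer can still exceed it under the model's tokenizer.

```python
CTX_LENGTH = 8192  # context window quoted in the issue for BAAI/bge-m3


def counting_tokenizer_len(text: str) -> int:
    """Stand-in for the tokenizer used for counting: one token per word."""
    return len(text.split())


def model_tokenizer_len(text: str) -> int:
    """Stand-in for the model's tokenizer: words longer than 6 chars cost 2 tokens."""
    return sum(1 if len(word) <= 6 else 2 for word in text.split())


# A text that sits exactly at the limit according to the counting tokenizer.
text = " ".join(["memory", "embedding"] * (CTX_LENGTH // 2))

counted = counting_tokenizer_len(text)
actual = model_tokenizer_len(text)

print(f"counted={counted} fits={counted <= CTX_LENGTH}")
print(f"actual={actual} fits={actual <= CTX_LENGTH}")
```

The real divergence between cl100k_base and a SentencePiece tokenizer depends on the text and is not measured here. The point is only that a count from one tokenizer is not a guarantee for another.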
Steps to Reproduce
- Configure an OpenAI-compatible embedding provider via api_base (e.g. DeepInfra with BAAI/bge-m3)
- Ask the agent to memorize or search a long document (>8192 tokens)
- The embedding call raises BadRequestError: 400
Fix
In models.py, LiteLLMEmbeddingWrapper.embed_documents and embed_query:
- Default encoding_format to "float" before merging kwargs (prevents null being sent)
- Truncate input to ctx_length - 500 tokens before embedding (the 500-token margin accounts for tokenizer divergence between cl100k_base used for counting and the model's actual tokenizer)
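The two steps above could look roughly like the sketch below. It is not the code from the PR: the helper names `merge_embedding_kwargs` and `truncate_to_budget` are hypothetical, and the tokenizer is passed in as `encode`/`decode` callables so the example runs without extra dependencies. In the real wrapper these would presumably be the cl100k_base encode and decode functions.

```python
from typing import Any, Callable, Dict, List

SAFETY_MARGIN = 500  # margin named in the issue; absorbs tokenizer divergence


def merge_embedding_kwargs(user_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Start from encoding_format="float", then let caller kwargs override it."""
    merged: Dict[str, Any] = {"encoding_format": "float"}
    merged.update(user_kwargs)
    return merged


def truncate_to_budget(
    text: str,
    ctx_length: int,
    encode: Callable[[str], List[Any]],
    decode: Callable[[List[Any]], str],
    margin: int = SAFETY_MARGIN,
) -> str:
    """Cut text to at most ctx_length - margin tokens, as counted by encode."""
    budget = max(1, ctx_length - margin)
    tokens = encode(text)
    if len(tokens) <= budget:
        return text  # already within budget, leave untouched
    return decode(tokens[:budget])


# Usage with a whitespace tokenizer as a stand-in (assumption, not cl100k_base).
long_text = " ".join(["word"] * 9000)
safe_text = truncate_to_budget(long_text, 8192, str.split, " ".join)
call_kwargs = merge_embedding_kwargs({"api_base": "https://example.invalid/v1"})
print(len(safe_text.split()), call_kwargs["encoding_format"])
```

Note that with this merge order a caller who explicitly passes `encoding_format=None` would still send null; whether the wrapper should guard against that is a separate design choice.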
PR: #PLACEHOLDER