
fix: make embedding calls robust against context overflow and strict endpoint validation #1437

Open

PaoloC68 wants to merge 6 commits into agent0ai:development from PaoloC68:fix/embedding-robustness

Conversation


@PaoloC68 PaoloC68 commented Apr 4, 2026

Fixes #1436

Summary

Three layered fixes for embedding calls crashing with 400 context-length errors during memory recall.

Root Cause

When `memory_recall_query_prep: false` (the default), `_50_recall_memories.py` uses a raw fallback query of `user_message + history[-10000 chars]`. At ~1 char/token content density (code, CJK), this exceeds BAAI/bge-m3's 8,192-token limit. The retry logic in `models.py` was also silently bypassed, because the check `"input_tokens" in str(e)` did not reliably match the `litellm.BadRequestError` string representation.

Changes

| File | Change |
| --- | --- |
| `models.py` | Default `encoding_format="float"` — LiteLLM ≥1.80.11 sends `null`, rejected with 422 by strict endpoints |
| `models.py` | Truncate input to `ctx * 0.80` tokens before embedding (20% margin for tokenizer divergence) |
| `models.py` | Loop-halve on 400 errors using `status_code == 400` (reliable) instead of string matching (fragile) |
| `plugins/_memory/extensions/python/message_loop_prompts_after/_50_recall_memories.py` | Cap fallback query at 4,000 chars — sufficient for semantic retrieval, prevents token overflow |

Fix 1 — encoding_format: null (422)

LiteLLM ≥1.80.11 sends encoding_format: null when not set. Strict validators (DeepInfra, vLLM, HuggingFace TEI) reject null with 422. Defaulting to "float" before merging caller kwargs prevents this.

Upstream LiteLLM issue: BerriAI/litellm#19174
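The pattern is simply defaulting the parameter before merging caller kwargs. A minimal sketch (the function name is illustrative, not the actual `models.py` code):

```python
def build_embedding_kwargs(**caller_kwargs) -> dict:
    """Default encoding_format to "float" (the OpenAI spec default)
    so the client never serializes encoding_format: null, which
    strict validators reject with 422. Any explicitly configured
    value still takes precedence because caller kwargs merge last."""
    return {"encoding_format": "float", **caller_kwargs}
```

A caller who explicitly sets `encoding_format="base64"` is unaffected; only the unset case changes.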

Fix 2 — Input truncation with loop-halving fallback (400)

Embedding models have a fixed context window. trim_to_tokens uses cl100k_base (GPT tokenizer) for counting, but the model uses its own tokenizer (e.g. bge-m3 SentencePiece). These diverge by 6–25% on the same text — dense code can have 2× more bge-m3 tokens than cl100k tokens. A 20% pre-reduction handles most cases; the loop-halving fallback handles extreme divergence.

The retry condition uses getattr(e, "status_code", None) == 400 rather than string matching, which is reliable across litellm versions.
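A minimal sketch of the loop-halving retry (function and exception names here are illustrative stand-ins, not the actual `models.py` code — `BadRequestError` mimics the `status_code` attribute that litellm exceptions carry):

```python
class BadRequestError(Exception):
    """Stand-in for litellm.BadRequestError, which carries status_code."""
    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code

def embed_with_halving(call_embedding, text: str, max_retries: int = 4):
    """Call the embedding function; on an HTTP 400 (context-length
    error), halve the input and retry. Checking status_code == 400 is
    stable across litellm versions, unlike matching substrings of str(e)."""
    while True:
        try:
            return call_embedding(text)
        except Exception as e:
            if getattr(e, "status_code", None) != 400 or max_retries == 0:
                raise  # non-400 errors (and exhausted retries) propagate
            text = text[: len(text) // 2]  # halve and retry
            max_retries -= 1
```

Normal inputs pay no overhead: the first call succeeds and the loop never iterates.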

Fix 3 — Cap fallback memory query at source (400)

When memory_recall_query_prep: false, the fallback query is user_message + history[-10000 chars]. Capping at 4,000 chars (~1,000 tokens for any content type) eliminates the overflow at source. Semantic similarity search does not benefit from 10,000 characters of raw history.
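The cap itself is a one-line change. A sketch under stated assumptions (the constant and function names are illustrative; whether the head or tail of the combined string is kept is an assumption here — the head, so the user message survives):

```python
FALLBACK_QUERY_MAX_CHARS = 4000  # ~1,000 tokens even at 1 char/token

def build_fallback_query(user_message: str, history: str) -> str:
    """Fallback recall query when memory_recall_query_prep is false:
    user message plus recent history, capped so it cannot overflow an
    8,192-token embedding context regardless of content density."""
    query = user_message + history[-10000:]
    return query[:FALLBACK_QUERY_MAX_CHARS]
```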

Testing

Tested on helpa0.com with DeepInfra + openai/BAAI/bge-m3 (8,192-token context), LiteLLM 1.82.3, memory_recall_query_prep: false:

  • Memory recall on long conversations: no longer crashes
  • Dense code content in memory: no longer crashes
  • Short queries: no overhead (truncation only fires when needed)

PaoloC68 added 2 commits April 4, 2026 05:08
LiteLLM >=1.80.11 sends encoding_format=null in embedding requests when the
parameter is not explicitly set (BerriAI/litellm#19174). Strict validators
such as DeepInfra, vLLM, and HuggingFace TEI reject null with:

  422 Unprocessable Entity: Input should be 'float' or 'base64'

Default to 'float' (the OpenAI spec default) before merging caller kwargs,
so any explicitly configured value still takes precedence.
…kenizer divergence

cl100k_base and bge-m3's SentencePiece tokenizer diverge by ~2-3% on the same
text. A document at exactly ctx_length cl100k tokens can exceed ctx_length
model tokens, causing 400 errors. 500-token margin provides sufficient headroom.
PaoloC68 added 4 commits April 4, 2026 05:16
Fixed-500-token margin was insufficient: cl100k_base and bge-m3 SentencePiece
diverge by up to ~6.5% on the same content. A text with 7692 cl100k tokens
can have 8193 bge-m3 tokens, just over the 8192 limit.

20% reduction (ctx * 0.80) provides safe headroom for up to 25% tokenizer
divergence across any model size, without needing to know the exact divergence
for a given content type.
Token counting with cl100k_base is unreliable as a guard for models with
different tokenizers (bge-m3 SentencePiece). For dense code content, bge-m3
can use 2x+ more tokens than cl100k for the same text, so no static margin
is sufficient. On a 400 context-length error, retry once with 50% of the
text — guaranteed to succeed for any content type with no API overhead
on normal inputs.
…t-length errors

String matching on str(e) was unreliable — litellm exception str() representation
varies by version. HTTP 400 is the canonical signal for context-length errors
from OpenAI-compatible embedding endpoints.
When memory_recall_query_prep=false (default), the fallback query is
user_message + full_history (up to 10000 chars). At dense content densities
(~1 char/token) this exceeds the embedding model's 8192-token limit.
4000 chars is ~1000 tokens for any content type, more than sufficient
for semantic similarity search.
