fix: make embedding calls robust against context overflow and strict endpoint validation #1437
Open
PaoloC68 wants to merge 6 commits into agent0ai:development
Conversation
LiteLLM >=1.80.11 sends `encoding_format=null` in embedding requests when the parameter is not explicitly set (BerriAI/litellm#19174). Strict validators such as DeepInfra, vLLM, and HuggingFace TEI reject null with `422 Unprocessable Entity: Input should be 'float' or 'base64'`. Default to `'float'` (the OpenAI spec default) before merging caller kwargs, so any explicitly configured value still takes precedence.
Tokenizer divergence: cl100k_base and bge-m3's SentencePiece tokenizer diverge by ~2–3% on the same text. A document at exactly `ctx_length` cl100k tokens can exceed `ctx_length` model tokens, causing 400 errors. A 500-token margin provides sufficient headroom.
A fixed 500-token margin was insufficient: cl100k_base and bge-m3 SentencePiece diverge by up to ~6.5% on the same content. A text with 7,692 cl100k tokens can have 8,193 bge-m3 tokens, just over the 8,192 limit. A 20% reduction (`ctx * 0.80`) provides safe headroom for up to 25% tokenizer divergence across any model size, without needing to know the exact divergence for a given content type.
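A quick sanity check of the headroom claim (a standalone sketch, not project code):

```python
ctx = 8192                    # bge-m3 context window
cap = int(ctx * 0.80)         # cl100k tokens allowed after the 20% reduction
worst_case = int(cap * 1.25)  # up to 25% more tokens under the model tokenizer
assert worst_case <= ctx      # still fits the 8192-token window
```

At 8,192-token context, the cap is 6,553 cl100k tokens; even if the model tokenizer produces 25% more tokens, the result stays within the window.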
Token counting with cl100k_base is unreliable as a guard for models with different tokenizers (bge-m3 SentencePiece). For dense code content, bge-m3 can use over 2× as many tokens as cl100k for the same text, so no static margin is sufficient. On a 400 context-length error, retry once with 50% of the text — guaranteed to succeed for any content type, with no API overhead on normal inputs.
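The retry shape described above can be sketched as follows (`embed_fn` stands in for the litellm embedding call; the helper name is illustrative):

```python
def embed_with_retry(embed_fn, text: str):
    # Try the full text first; only an HTTP 400 (context-length error)
    # triggers the single retry with half the input.
    try:
        return embed_fn(text)
    except Exception as e:
        if getattr(e, "status_code", None) == 400:
            return embed_fn(text[: len(text) // 2])
        raise
```

Any other exception (network error, auth failure, 5xx) is re-raised unchanged, so the fallback only fires for the context-length case.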
Detect context-length errors by HTTP status: string matching on `str(e)` was unreliable — the litellm exception `str()` representation varies by version. HTTP 400 is the canonical signal for context-length errors from OpenAI-compatible embedding endpoints.
When `memory_recall_query_prep=false` (the default), the fallback query is `user_message` + full history (up to 10,000 chars). For dense content (~1 char/token) this exceeds the embedding model's 8,192-token limit. 4,000 chars is ~1,000 tokens for any content type, more than sufficient for semantic similarity search.
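A hypothetical sketch of capping at source (the constant and function name are illustrative; the real change lives in `_50_recall_memories.py`, and keeping the head of the combined query is an assumption of this sketch):

```python
FALLBACK_QUERY_MAX_CHARS = 4000  # ~1,000 tokens even at ~1 char/token density

def build_fallback_query(user_message: str, history: str) -> str:
    # Keep the most recent history tail, then cap the combined query so it
    # can never approach the embedding model's context window.
    query = user_message + "\n" + history[-FALLBACK_QUERY_MAX_CHARS:]
    return query[:FALLBACK_QUERY_MAX_CHARS]
```

Capping at the source means downstream truncation and retry logic become a safety net rather than the normal path.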
Fixes #1436
Summary
Three layered fixes for embedding calls crashing with 400 context-length errors during memory recall.
Root Cause
When `memory_recall_query_prep: false` (the default), `_50_recall_memories.py` uses a raw fallback query of `user_message + history[-10000 chars]`. At ~1 char/token content density (code, CJK), this exceeds BAAI/bge-m3's 8,192-token limit. The retry logic in `models.py` was also silently bypassed because `"input_tokens" in str(e)` did not match the `litellm.BadRequestError` string representation reliably.

Changes
- `models.py`: default `encoding_format="float"` — LiteLLM ≥1.80.11 sends null, rejected with 422 by strict endpoints
- `models.py`: truncate input to `ctx * 0.80` tokens before embedding (20% margin for tokenizer divergence)
- `models.py`: retry on `status_code == 400` (reliable) instead of string matching (fragile)
- `plugins/_memory/extensions/python/message_loop_prompts_after/_50_recall_memories.py`: cap the fallback memory query at 4,000 chars

Fix 1 — `encoding_format: null` (422)

LiteLLM ≥1.80.11 sends `encoding_format: null` when not set. Strict validators (DeepInfra, vLLM, HuggingFace TEI) reject null with 422. Defaulting to `"float"` before merging caller kwargs prevents this.

Upstream LiteLLM issue: BerriAI/litellm#19174
Fix 2 — Input truncation with loop-halving fallback (400)
Embedding models have a fixed context window. `trim_to_tokens` uses `cl100k_base` (GPT tokenizer) for counting, but the model uses its own tokenizer (e.g. bge-m3 SentencePiece). These diverge by 6–25% on the same text — dense code can have 2× more bge-m3 tokens than cl100k tokens. A 20% pre-reduction handles most cases; the loop-halving fallback handles extreme divergence.

The retry condition uses `getattr(e, "status_code", None) == 400` rather than string matching, which is reliable across litellm versions.

Fix 3 — Cap fallback memory query at source (400)
When `memory_recall_query_prep: false`, the fallback query is `user_message + history[-10000 chars]`. Capping at 4,000 chars (~1,000 tokens for any content type) eliminates the overflow at source. Semantic similarity search does not benefit from 10,000 characters of raw history.

Testing
Tested on helpa0.com with DeepInfra + `openai/BAAI/bge-m3` (8,192-token context), LiteLLM 1.82.3, `memory_recall_query_prep: false`: