
fix: make embedding calls robust against context overflow and strict endpoint validation #1437

Open

PaoloC68 wants to merge 6 commits into agent0ai:development from PaoloC68:fix/embedding-robustness

Conversation


@PaoloC68 PaoloC68 commented Apr 4, 2026

Fixes #1436

Summary

Three layered fixes for embedding calls crashing with 400 context-length errors during memory recall.

Root Cause

When `memory_recall_query_prep: false` (the default), `_50_recall_memories.py` uses a raw fallback query of `user_message + history[-10000 chars]`. At ~1 char/token content density (code, CJK), this exceeds BAAI/bge-m3's 8,192-token limit. The retry logic in `models.py` was also silently bypassed, because the check `"input_tokens" in str(e)` did not reliably match the `litellm.BadRequestError` string representation.

Changes

| File | Change |
| --- | --- |
| `models.py` | Default `encoding_format="float"` — LiteLLM ≥1.80.11 sends `null`, rejected with 422 by strict endpoints |
| `models.py` | Truncate input to `ctx * 0.80` tokens before embedding (20% margin for tokenizer divergence) |
| `models.py` | Loop-halve on 400 errors using `status_code == 400` (reliable) instead of string matching (fragile) |
| `plugins/_memory/extensions/python/message_loop_prompts_after/_50_recall_memories.py` | Cap fallback query at 4,000 chars — sufficient for semantic retrieval, prevents token overflow |

Fix 1 — encoding_format: null (422)

LiteLLM ≥1.80.11 sends encoding_format: null when not set. Strict validators (DeepInfra, vLLM, HuggingFace TEI) reject null with 422. Defaulting to "float" before merging caller kwargs prevents this.

Upstream LiteLLM issue: BerriAI/litellm#19174
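The pattern is simply defaulting the parameter before merging caller kwargs. A minimal sketch (the function name is illustrative, not the actual `models.py` code):

```python
def build_embedding_kwargs(**caller_kwargs) -> dict:
    """Default encoding_format to "float" (the OpenAI spec default)
    so the client never serializes encoding_format: null, which
    strict validators reject with 422. Any explicitly configured
    value still takes precedence because caller kwargs merge last."""
    return {"encoding_format": "float", **caller_kwargs}
```

A caller who explicitly sets `encoding_format="base64"` is unaffected; only the unset case changes.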

Fix 2 — Input truncation with loop-halving fallback (400)

Embedding models have a fixed context window. trim_to_tokens uses cl100k_base (GPT tokenizer) for counting, but the model uses its own tokenizer (e.g. bge-m3 SentencePiece). These diverge by 6–25% on the same text — dense code can have 2× more bge-m3 tokens than cl100k tokens. A 20% pre-reduction handles most cases; the loop-halving fallback handles extreme divergence.

The retry condition uses getattr(e, "status_code", None) == 400 rather than string matching, which is reliable across litellm versions.
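A minimal sketch of the loop-halving retry (function and exception names here are illustrative stand-ins, not the actual `models.py` code — `BadRequestError` mimics the `status_code` attribute that litellm exceptions carry):

```python
class BadRequestError(Exception):
    """Stand-in for litellm.BadRequestError, which carries status_code."""
    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code

def embed_with_halving(call_embedding, text: str, max_retries: int = 4):
    """Call the embedding function; on an HTTP 400 (context-length
    error), halve the input and retry. Checking status_code == 400 is
    stable across litellm versions, unlike matching substrings of str(e)."""
    while True:
        try:
            return call_embedding(text)
        except Exception as e:
            if getattr(e, "status_code", None) != 400 or max_retries == 0:
                raise  # non-400 errors (and exhausted retries) propagate
            text = text[: len(text) // 2]  # halve and retry
            max_retries -= 1
```

Normal inputs pay no overhead: the first call succeeds and the loop never iterates.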

Fix 3 — Cap fallback memory query at source (400)

When memory_recall_query_prep: false, the fallback query is user_message + history[-10000 chars]. Capping at 4,000 chars (~1,000 tokens for any content type) eliminates the overflow at source. Semantic similarity search does not benefit from 10,000 characters of raw history.
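The cap itself is a one-line change. A sketch under stated assumptions (the constant and function names are illustrative; whether the head or tail of the combined string is kept is an assumption here — the head, so the user message survives):

```python
FALLBACK_QUERY_MAX_CHARS = 4000  # ~1,000 tokens even at 1 char/token

def build_fallback_query(user_message: str, history: str) -> str:
    """Fallback recall query when memory_recall_query_prep is false:
    user message plus recent history, capped so it cannot overflow an
    8,192-token embedding context regardless of content density."""
    query = user_message + history[-10000:]
    return query[:FALLBACK_QUERY_MAX_CHARS]
```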

Testing

Tested on helpa0.com with DeepInfra + openai/BAAI/bge-m3 (8,192-token context), LiteLLM 1.82.3, memory_recall_query_prep: false:

  • Memory recall on long conversations: no longer crashes
  • Dense code content in memory: no longer crashes
  • Short queries: no overhead (truncation only fires when needed)

PaoloC68 added 2 commits April 4, 2026 05:08
LiteLLM >=1.80.11 sends encoding_format=null in embedding requests when the
parameter is not explicitly set (BerriAI/litellm#19174). Strict validators
such as DeepInfra, vLLM, and HuggingFace TEI reject null with:

  422 Unprocessable Entity: Input should be 'float' or 'base64'

Default to 'float' (the OpenAI spec default) before merging caller kwargs,
so any explicitly configured value still takes precedence.
…kenizer divergence

cl100k_base and bge-m3's SentencePiece tokenizer diverge by ~2-3% on the same
text. A document at exactly ctx_length cl100k tokens can exceed ctx_length
model tokens, causing 400 errors. 500-token margin provides sufficient headroom.
PaoloC68 added 4 commits April 4, 2026 05:16
Fixed-500-token margin was insufficient: cl100k_base and bge-m3 SentencePiece
diverge by up to ~6.5% on the same content. A text with 7692 cl100k tokens
can have 8193 bge-m3 tokens, just over the 8192 limit.

20% reduction (ctx * 0.80) provides safe headroom for up to 25% tokenizer
divergence across any model size, without needing to know the exact divergence
for a given content type.
Token counting with cl100k_base is unreliable as a guard for models with
different tokenizers (bge-m3 SentencePiece). For dense code content, bge-m3
can use 2x+ more tokens than cl100k for the same text, so no static margin
is sufficient. On a 400 context-length error, retry once with 50% of the
text — guaranteed to succeed for any content type with no API overhead
on normal inputs.
…t-length errors

String matching on str(e) was unreliable — litellm exception str() representation
varies by version. HTTP 400 is the canonical signal for context-length errors
from OpenAI-compatible embedding endpoints.
When memory_recall_query_prep=false (default), the fallback query is
user_message + full_history (up to 10000 chars). At dense content densities
(~1 char/token) this exceeds the embedding model's 8192-token limit.
4000 chars is ~1000 tokens for any content type, more than sufficient
for semantic similarity search.
