🛡️ fix: Strip null content parts on message load to prevent formatAgentMessages crash#64
Conversation
📝 WalkthroughWalkthroughThe ChangesMalformed content filtering
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
…ntMessages crash The streaming content aggregator builds message content by index and yields a sparse array; an interrupted/partial save persists a hole that serializes to null in MongoDB. On replay, @librechat/agents formatAgentMessages reads part.type with no null guard and crashes. Sanitize content holes at getMessages, the single DB read chokepoint for conversation history, so both already-corrupted and future rows are neutralized for every consumer (formatAgentMessages, token counting, edit path). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
8d62d48 to
6d37259
Compare
Summary
Fixes the recurring crash:
The crash fires on a follow-up turn in a conversation — most reliably after using tool calls or after attaching PDF/PNG files. One small change in
getMessages(the single DB read path for conversation history) sanitizes message content on load, which neutralizes both already-corrupted conversations and any future ones.What actually happens (root cause)
The error is thrown inside
@librechat/agents, but the bad data is created and persisted by our side. The chain:The streaming content aggregator builds
contentby index and leaves holes.createContentAggregator()(api/server/services/Endpoints/agents/initialize.js:129) produces thecontentPartsarray that the run fills by index. Our own code documents this atapi/server/controllers/agents/client.js:162:Tool-call parts in particular are placed at sparse offsets (see the comment at
client.js:1090). A sparse hole in a JS array serializes tonullwhen written to MongoDB.The happy path cleans it — so a completed response is safe.
On normal completion,
BaseClient.sendMessagesetsresponseMessage.content = completion(api/app/clients/BaseClient.js:602), wherecompletion = filterMalformedContentParts(this.contentParts)(client.js:752). Holes are stripped → nothing bad persists.An interrupted run persists the raw, un-filtered array.
When a stream is cut (Stop button, page refresh/navigation, network drop, or an upstream litellm/OpenAI-compatible proxy timeout closing the SSE connection), the resumable/partial save writes the live aggregator array without filtering:
The sparse hole →
null→ persisted to themessagescollection.The next turn replays it and crashes.
On the following message, history is loaded and
formatMessages.js:59passes the stored array through verbatim into the payload.formatAgentMessagesiteratescontentand readspart.typewith no null guard → 💥.Why files reproduce it deterministically, tool calls only sometimes
Attaching a PDF/PNG reliably triggers a tool call (extraction / OCR / file_search) and makes the response stream longer — a much bigger window for a disconnect/refresh/timeout to land mid-tool-call and hit the un-filtered partial save. A plain chat only produces a hole when a tool-call slot happens to never finalize, so it's sporadic. The deterministic ingredients are the sparse aggregator + the un-filtered partial save; the trigger is the interruption (which litellm timeouts make common).
Note on litellm
litellm does not create the null itself. As an OpenAI-compatible proxy it (a) routes through generic provider handling and (b) drops/early-closes streams on upstream timeouts — which is exactly what fires the interrupted-save path in step 3. So it acts as a frequent trigger of a latent bug, not the cause.
The fix
Sanitize content holes at the single universal read chokepoint —
getMessagesinpackages/data-schemas/src/methods/message.ts. Every history load goes through it (BaseClient.loadHistory, the Responses-API controllers), so cleaning here protects every downstream consumer (formatAgentMessages, thecontent.find(part => part.type)atclient.js:325, the edit path atBaseClient.js:447, token counting) in one place.Why read-side rather than the save path: filtering at save (
request.js:166) would stop new corruption but leave every already-poisoned conversation crashing. Cleaning on read fixes existing rows and future ones, and makes any null that slips through harmless because no consumer ever sees it. It's a single-pass mutation over already-lean()'d objects — negligible cost, no new types.Scope: intentionally minimal. No
client.js/request.jschanges. Strippingnull/undefinedarray elements is always safe — those entries are never meaningful — and nothing relies on content-part holes once a message is read back.How to reproduce
Deterministic (no timing luck):
mongosh, poison an existing assistant message:Natural (the real path):
unfinishedmessage with a sparse hole.Library-level sanity check (inside the container):
Find already-corrupted rows in your DB:
Optional follow-ups (not in this PR)
These are independent hardening items, none required to fix the reported crash:
filterMalformedContentPartstoaggregatedContentat the interrupted save (request.js:166) so the DB stops storing holes at all. The read-side fix already makes the crash impossible; this just keeps stored data clean.convertMessages(packages/api/src/agents/openai/service.ts:205) maps overmsg.contentfrom the HTTP request body and readspart.typewith no null guard. This is reachable via the Responses/OpenAI-compatible passthrough (openai.js:463) and crashes on malformed client-supplied input — not on persisted rows — so it's a separate input-validation concern. One-line guard if desired:msg.content.filter((part) => part != null).map(...).@librechat/agents; a fix there (or a version bump if upstream has patched it) is the only thing that stops thenullbeing created.🤖 Generated with Claude Code