Commit 268d046

Brooooooklyn and claude authored
fix: accept output_text, reasoning, and input_image in /v1/responses mapper (mlx-node#50)
<!-- CURSOR_SUMMARY -->
> [!NOTE]
> **Medium Risk**
> Medium risk because it changes request/history mapping and VLM session/cache handling (image placeholders, M-RoPE offsets, and delta continuations), which can affect multi-turn correctness and replayed history integrity.
>
> **Overview**
> **Expands the `/v1/responses` request mapper and stored-history codec to properly handle client replay and images.** `mapRequest` now accepts replayed assistant `output_text`, top-level `reasoning` items, and `refusal/summary_text` parts; it also decodes `input_image` base64 data URLs into `ChatMessage.images` and coalesces interleaved assistant `reasoning`/`message`/`function_call` items into a single assistant turn.
>
> **Adds strict text/image ordering guards** in both the OpenAI and Anthropic mappers to reject any text-after-image interleaving within a single user turn (previously the pipeline could silently reorder content).
>
> **Makes images survive persistence and VLM templating, and fixes Qwen3.5 VLM session correctness.** Stored `inputJson` is now written via `stringifyStoredInputMessages` (base64 sentinel) and revived during `reconstructMessagesFromChain` so `Uint8Array` images round-trip; tokenizer sanitization/serialization now preserves user images so Jinja templates can emit inline vision markers; Qwen3.5/Qwen3.5-MoE update image placeholder expansion (per-image counts), robust M-RoPE position indexing for multiple image runs, and session-delta handling so text-only deltas can continue on image-bearing sessions without clearing `cached_image_key`/rope deltas. Extensive new unit tests cover these behaviors.
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit 9c33817. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
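The commit message mentions two byte-handling steps: decoding `input_image` base64 data URLs into image bytes, and round-tripping `Uint8Array` images through stored JSON via a base64 sentinel. The following is a minimal sketch of how such a codec could look, not the code from this commit. The helper names (`decodeImageDataUrl`, `stringifyWithBytes`, `parseWithBytes`) and the sentinel key `__bytes_b64__` are assumptions made for illustration; the real `stringifyStoredInputMessages` format is not visible in this view.

```typescript
// Hypothetical sentinel key; the actual stored format is not shown in this commit view.
const BYTES_KEY = '__bytes_b64__';

// Encode raw bytes as base64 using the global btoa (browsers, Node 16+).
function bytesToBase64(bytes: Uint8Array): string {
  let binary = '';
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return btoa(binary);
}

// Decode base64 back into raw bytes using the global atob.
function base64ToBytes(b64: string): Uint8Array {
  const binary = atob(b64);
  const out = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) out[i] = binary.charCodeAt(i);
  return out;
}

// Extract the payload of a `data:image/...;base64,...` URL; reject anything else.
function decodeImageDataUrl(url: string): Uint8Array {
  const match = /^data:image\/[a-z0-9.+-]+;base64,([A-Za-z0-9+/]*={0,2})$/i.exec(url);
  if (match === null) {
    throw new Error('expected a base64 image data URL');
  }
  return base64ToBytes(match[1]);
}

// Serialize a value, replacing every Uint8Array with a tagged base64 object,
// because plain JSON.stringify would turn typed arrays into index-keyed objects.
function stringifyWithBytes(value: unknown): string {
  return JSON.stringify(value, (_key, v) =>
    v instanceof Uint8Array ? { [BYTES_KEY]: bytesToBase64(v) } : v,
  );
}

// Parse JSON and revive every tagged base64 object back into a Uint8Array.
function parseWithBytes(json: string): unknown {
  return JSON.parse(json, (_key, v) => {
    if (v !== null && typeof v === 'object' && typeof v[BYTES_KEY] === 'string') {
      return base64ToBytes(v[BYTES_KEY]);
    }
    return v;
  });
}
```

Tagging the bytes with a sentinel object, rather than storing a bare base64 string, lets the reviver tell image payloads apart from ordinary string fields without knowing the message schema.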
1 parent bfdeef5 commit 268d046

10 files changed

Lines changed: 2160 additions & 230 deletions

__test__/server/anthropic-request-mapper.test.ts

Lines changed: 113 additions & 1 deletion
@@ -263,6 +263,10 @@ describe('mapAnthropicRequest', () => {
   });

   it('rejects trailing image followed by text after a tool_result prefix', () => {
+    // The text-after-image guard fires during the main loop (before the
+    // trailing-mixed check) because the serializer cannot preserve text
+    // that follows an image in any turn, not just after a tool_result
+    // prefix. Either rejection is equivalent for the caller.
     expect(() =>
       mapAnthropicRequest({
         model: 'claude-3-5-sonnet-20241022',
@@ -278,7 +282,7 @@ describe('mapAnthropicRequest', () => {
           },
         ],
       }),
-    ).toThrow(/mixing trailing text and image blocks after a tool_result prefix/i);
+    ).toThrow(/text block after an image block in the same user turn/i);
   });

   it('rejects multiple trailing image blocks after a tool_result prefix', () => {
@@ -857,4 +861,112 @@ describe('mapAnthropicRequest', () => {

     expect(messages).toEqual([{ role: 'tool', content: 'Result: success', toolCallId: 'call_789' }]);
   });
+
+  describe('text/image ordering in pure user turns', () => {
+    // Existing tool_result-prefix rejection at line ~160 catches mixed
+    // trailing content after tool_result, but a PURE user turn (no
+    // tool_result blocks) was previously silently concatenating all text
+    // and stacking all images, reordering the caller's content. These
+    // tests pin the uniform rejection for both call patterns.
+    const png = 'iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mNk+M9QDwADhgGAWjR9awAAAABJRU5ErkJggg==';
+
+    it('rejects [image, text] in a pure user turn', () => {
+      // Image-first-then-text gets reordered to text-first-then-image by
+      // the flat ChatMessage + Jinja serializer pipeline. Reject rather
+      // than silently rewrite the caller's intent.
+      expect(() =>
+        mapAnthropicRequest({
+          model: 'claude-3-5-sonnet-20241022',
+          max_tokens: 1024,
+          messages: [
+            {
+              role: 'user',
+              content: [
+                { type: 'image', source: { type: 'base64', media_type: 'image/png', data: png } },
+                { type: 'text', text: 'describe this' },
+              ],
+            },
+          ],
+        }),
+      ).toThrow(/text block after an image block in the same user turn/i);
+    });
+
+    it('rejects [text, image, text] in a pure user turn', () => {
+      expect(() =>
+        mapAnthropicRequest({
+          model: 'claude-3-5-sonnet-20241022',
+          max_tokens: 1024,
+          messages: [
+            {
+              role: 'user',
+              content: [
+                { type: 'text', text: 'before' },
+                { type: 'image', source: { type: 'base64', media_type: 'image/png', data: png } },
+                { type: 'text', text: 'after' },
+              ],
+            },
+          ],
+        }),
+      ).toThrow(/text block after an image block/i);
+    });
+
+    it('accepts [text, image] in a pure user turn (representable by the flat ChatMessage)', () => {
+      const { messages } = mapAnthropicRequest({
+        model: 'claude-3-5-sonnet-20241022',
+        max_tokens: 1024,
+        messages: [
+          {
+            role: 'user',
+            content: [
+              { type: 'text', text: 'what colour is this?' },
+              { type: 'image', source: { type: 'base64', media_type: 'image/png', data: png } },
+            ],
+          },
+        ],
+      });
+      expect(messages).toHaveLength(1);
+      expect(messages[0].role).toBe('user');
+      expect(messages[0].content).toBe('what colour is this?');
+      expect(messages[0].images).toHaveLength(1);
+    });
+
+    it('accepts [text, text, image] (all text parts before the image)', () => {
+      const { messages } = mapAnthropicRequest({
+        model: 'claude-3-5-sonnet-20241022',
+        max_tokens: 1024,
+        messages: [
+          {
+            role: 'user',
+            content: [
+              { type: 'text', text: 'part one. ' },
+              { type: 'text', text: 'part two.' },
+              { type: 'image', source: { type: 'base64', media_type: 'image/png', data: png } },
+            ],
+          },
+        ],
+      });
+      expect(messages).toHaveLength(1);
+      expect(messages[0].content).toBe('part one. part two.');
+      expect(messages[0].images).toHaveLength(1);
+    });
+
+    it('accepts multiple images with no text (no ordering ambiguity)', () => {
+      const { messages } = mapAnthropicRequest({
+        model: 'claude-3-5-sonnet-20241022',
+        max_tokens: 1024,
+        messages: [
+          {
+            role: 'user',
+            content: [
+              { type: 'image', source: { type: 'base64', media_type: 'image/png', data: png } },
+              { type: 'image', source: { type: 'base64', media_type: 'image/png', data: png } },
+            ],
+          },
+        ],
+      });
+      expect(messages).toHaveLength(1);
+      expect(messages[0].content).toBe('');
+      expect(messages[0].images).toHaveLength(2);
+    });
+  });
 });
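
The ordering rule pinned by these tests can be illustrated with a single pass over a user turn's content blocks: text is concatenated, images are collected, and any text block seen after an image is rejected. This is a minimal sketch under stated assumptions, not the body of `mapAnthropicRequest`. The `flattenUserBlocks` function, the `Block` and `FlatUserMessage` types, and the choice to keep image payloads as base64 strings are all hypothetical; only the error wording and the expected `content`/`images` shapes come from the tests above.

```typescript
// Hypothetical block shape mirroring the Anthropic-style content parts used in the tests.
type Block =
  | { type: 'text'; text: string }
  | { type: 'image'; source: { type: 'base64'; media_type: string; data: string } };

// Hypothetical flat message: one text string plus a list of images.
// Images stay as base64 strings here to keep the sketch dependency-free.
interface FlatUserMessage {
  role: 'user';
  content: string;
  images: string[];
}

// Flatten one user turn. A flat message cannot record where text sat relative
// to images, so text that follows an image is rejected instead of reordered.
function flattenUserBlocks(blocks: Block[]): FlatUserMessage {
  let content = '';
  const images: string[] = [];
  for (const block of blocks) {
    if (block.type === 'image') {
      images.push(block.source.data);
    } else {
      if (images.length > 0) {
        throw new Error('text block after an image block in the same user turn');
      }
      content += block.text;
    }
  }
  return { role: 'user', content, images };
}
```

Rejecting at mapping time keeps the failure visible to the caller; the alternative of silently moving text ahead of images would change what the model is asked without any signal.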
