Commit f237607

ericdallo and eca-agent committed
Render and replay MCP tool image content end-to-end
MCP tools that return `image` content blocks (e.g. an MCP image-generation/edit server) previously surfaced as the placeholder text `[Image: <media-type>]` and were never replayed back to the LLM, so iterative edits ("now make it blue") were impossible.

Client UI:
- `chat/tool_calls.clj :send-toolCalled` now partitions tool outputs into text vs image, keeps text in the toolCalled :outputs (which the protocol declares text-only) and emits one `ChatImageContent` per image so clients render inline.
- `chat.clj` `tool_call_output` history replay mirrors the same partitioning so reopened chats render images at the original spot.

Provider replay:
- `openai.clj` (Responses API) `normalize-messages` switches `keep` -> `mapcat` and, when the model supports image input, appends a synthetic user-role `input_image` after the `function_call_output` (the Responses-API tool output is text-only, mirroring the existing `image_generation_call` replay).
- `anthropic.clj` carries images natively in `tool_result.content` as mixed `[{text}, {image base64}]` blocks when the model is multimodal, falling back to the legacy text-only string otherwise.
- `openai_chat.clj` and `ollama.clj` get gating comments only; image round-trip is not implemented for those APIs yet.

Docs:
- `docs/protocol.md`: `ChatImageContent` doc broadened to cover MCP image tools alongside the existing `image_generation` tool.

Tests: 481 / 2639 / 0.

🤖 Generated with [eca](https://eca.dev)

Co-Authored-By: eca-agent <git@eca.dev>
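As a rough illustration of the partitioning described in the commit message, the following standalone sketch separates image content maps from the other outputs. The function names `image-content?`, `partition-outputs`, and `->image-events` are hypothetical and not taken from the ECA codebase; only the `:type`, `:media-type`, and `:base64` keys come from the diff.

```clojure
;; Sketch only: helper names are illustrative assumptions, not ECA's API.

(defn image-content?
  "True when `x` is a content map tagged as an image."
  [x]
  (and (map? x) (= :image (:type x))))

(defn partition-outputs
  "Splits tool outputs into text vs image content.
  Non-sequential outputs (nil or a plain string) pass through unchanged
  as :text-outputs, and :image-outputs is then empty."
  [outputs]
  (let [image-outputs (when (sequential? outputs)
                        (filter image-content? outputs))]
    {:text-outputs  (if (seq image-outputs)
                      (vec (remove image-content? outputs))
                      outputs)
     :image-outputs (vec image-outputs)}))

(defn ->image-events
  "Builds one image content map per image output, in the
  {:type :image :media-type ... :base64 ...} shape used by the diff."
  [image-outputs]
  (mapv (fn [img]
          {:type       :image
           :media-type (:media-type img)
           :base64     (:base64 img)})
        image-outputs))

;; Usage: one text output and one image output.
(partition-outputs [{:type :text :text "done"}
                    {:type :image :media-type "image/png" :base64 "AAAA"}])
;; => {:text-outputs  [{:type :text, :text "done"}]
;;     :image-outputs [{:type :image, :media-type "image/png", :base64 "AAAA"}]}
```

Passing non-sequential outputs through untouched keeps callers that supply `nil` or a plain string working, which matches the pass-through behavior the commit describes.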
1 parent 008d2da commit f237607

11 files changed

Lines changed: 335 additions & 100 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -2,6 +2,7 @@

 ## Unreleased

+- MCP tools that return image content blocks (e.g. an MCP image-generation/edit server) now render those images in the chat UI as `ChatImageContent` and replay them back to the LLM as image inputs on follow-up turns when the model supports vision. Implemented for `openai-responses` (synthetic user-role `input_image` after the `function_call_output`) and `anthropic` (mixed text + image blocks inside `tool_result.content`). `openai-chat` and `ollama` continue to receive a text placeholder until a parallel pattern is implemented there.
 - Bugfix: MCP tools without a `description` (which the MCP spec marks optional) no longer break Anthropic chat requests with `tools.<n>.custom.description: Input should be a valid string`. Missing/empty descriptions now fall back to the tool's `title`, then to a synthesized `MCP tool: <name>` string at the MCP boundary so all providers receive a non-null string.

 ## 0.131.0

docs/protocol.md

Lines changed: 6 additions & 2 deletions
@@ -793,8 +793,12 @@ interface ChatURLContent {
 }

 /**
- * Image content from the assistant, produced by a server-side image
- * generation tool (e.g. OpenAI's `image_generation` Responses-API tool).
+ * Image content from the assistant. Produced either by a server-side
+ * image generation tool (e.g. OpenAI's `image_generation` Responses-API
+ * tool) or by an MCP tool whose result includes an `image` content block
+ * (e.g. an MCP image-generation/edit server). In both cases, ECA emits
+ * one `ChatImageContent` per image so clients can render without
+ * inspecting tool-call outputs.
  *
  * The image bytes are delivered inline as base64 so that web/remote ECA
  * clients (e.g. web.eca.dev) can render without filesystem access.

src/eca/features/chat.clj

Lines changed: 51 additions & 32 deletions
@@ -151,38 +151,57 @@
                               :details (:details message-content)
                               :arguments-text ""
                               :id (:id message-content)}}]
-      "tool_call_output" [{:role :assistant
-                           :content (assoc-some
-                                     {:type :toolCallRun
-                                      :id (:id message-content)
-                                      :name (:name message-content)
-                                      :server (:server message-content)
-                                      :origin (:origin message-content)
-                                      :arguments (:arguments message-content)}
-                                     :details (:details message-content)
-                                     :summary (:summary message-content))}
-                          {:role :assistant
-                           :content (assoc-some
-                                     {:type :toolCallRunning
-                                      :id (:id message-content)
-                                      :name (:name message-content)
-                                      :server (:server message-content)
-                                      :origin (:origin message-content)
-                                      :arguments (:arguments message-content)}
-                                     :details (:details message-content)
-                                     :summary (:summary message-content))}
-                          {:role :assistant
-                           :content {:type :toolCalled
-                                     :origin (:origin message-content)
-                                     :name (:name message-content)
-                                     :server (:server message-content)
-                                     :arguments (:arguments message-content)
-                                     :total-time-ms (:total-time-ms message-content)
-                                     :summary (:summary message-content)
-                                     :details (:details message-content)
-                                     :error (:error message-content)
-                                     :id (:id message-content)
-                                     :outputs (:contents (:output message-content))}}]
+      ;; Mirror the live path in tool-calls.clj :send-toolCalled: split image
+      ;; outputs out of the toolCalled :outputs (which is text-only per
+      ;; protocol) and re-emit them as standalone ChatImageContent entries so
+      ;; reopened/resumed chats render MCP-produced images at the same point
+      ;; they appeared live.
+      "tool_call_output" (let [contents (:contents (:output message-content))
+                               image? #(and (map? %) (= :image (:type %)))
+                               ;; Only partition when contents is a sequence of content
+                               ;; maps that includes images; otherwise pass through.
+                               image-outputs (when (sequential? contents) (filter image? contents))
+                               text-outputs (if (seq image-outputs)
+                                                (vec (remove image? contents))
+                                                contents)]
+                           (into [{:role :assistant
+                                   :content (assoc-some
+                                             {:type :toolCallRun
+                                              :id (:id message-content)
+                                              :name (:name message-content)
+                                              :server (:server message-content)
+                                              :origin (:origin message-content)
+                                              :arguments (:arguments message-content)}
+                                             :details (:details message-content)
+                                             :summary (:summary message-content))}
+                                  {:role :assistant
+                                   :content (assoc-some
+                                             {:type :toolCallRunning
+                                              :id (:id message-content)
+                                              :name (:name message-content)
+                                              :server (:server message-content)
+                                              :origin (:origin message-content)
+                                              :arguments (:arguments message-content)}
+                                             :details (:details message-content)
+                                             :summary (:summary message-content))}
+                                  {:role :assistant
+                                   :content {:type :toolCalled
+                                             :origin (:origin message-content)
+                                             :name (:name message-content)
+                                             :server (:server message-content)
+                                             :arguments (:arguments message-content)
+                                             :total-time-ms (:total-time-ms message-content)
+                                             :summary (:summary message-content)
+                                             :details (:details message-content)
+                                             :error (:error message-content)
+                                             :id (:id message-content)
+                                             :outputs text-outputs}}]
+                                 (map (fn [img]
+                                        {:role :assistant
+                                         :content {:type :image
+                                                   :media-type (:media-type img)
+                                                   :base64 (:base64 img)}}))
+                                 image-outputs))
       "image_generation_call" [{:role :assistant
                                 :content {:type :image
                                           :media-type (:media-type message-content)

src/eca/features/chat/tool_calls.clj

Lines changed: 35 additions & 13 deletions
@@ -307,19 +307,41 @@
                                 :summary (:summary event-data)))

       :send-toolCalled
-      (lifecycle/send-content! chat-ctx :assistant
-                               (assoc-some
-                                {:type :toolCalled
-                                 :id tool-call-id
-                                 :origin (:origin event-data)
-                                 :name (:name event-data)
-                                 :server (:server event-data)
-                                 :arguments (:arguments event-data)
-                                 :error (:error event-data)
-                                 :total-time-ms (:total-time-ms event-data)
-                                 :outputs (:outputs event-data)}
-                                :details (:details event-data)
-                                :summary (:summary event-data)))
+      ;; Tool results may include image content blocks (e.g. from MCP image
+      ;; tools). The protocol declares toolCalled :outputs as text-only, so we
+      ;; partition: text outputs go into the toolCalled event, images are
+      ;; emitted as separate ChatImageContent events tagged :assistant. This
+      ;; reuses the same wire shape produced by the OpenAI Responses-API
+      ;; built-in image_generation tool (see chat.clj on-message-received
+      ;; :image branch).
+      ;; Only partition when outputs is a sequence containing image content
+      ;; maps; otherwise pass it through unchanged. This preserves
+      ;; pre-existing behaviors where :outputs can be nil or a plain string
+      ;; (used by some non-MCP code paths and tests).
+      (let [outputs (:outputs event-data)
+            image? #(and (map? %) (= :image (:type %)))
+            image-outputs (when (sequential? outputs) (filter image? outputs))
+            text-outputs (if (seq image-outputs)
+                             (vec (remove image? outputs))
+                             outputs)]
+        (lifecycle/send-content! chat-ctx :assistant
+                                 (assoc-some
+                                  {:type :toolCalled
+                                   :id tool-call-id
+                                   :origin (:origin event-data)
+                                   :name (:name event-data)
+                                   :server (:server event-data)
+                                   :arguments (:arguments event-data)
+                                   :error (:error event-data)
+                                   :total-time-ms (:total-time-ms event-data)
+                                   :outputs text-outputs}
+                                  :details (:details event-data)
+                                  :summary (:summary event-data)))
+        (doseq [img image-outputs]
+          (lifecycle/send-content! chat-ctx :assistant
+                                   {:type :image
+                                    :media-type (:media-type img)
+                                    :base64 (:base64 img)})))

       :send-toolCallRejected
       (let [tool-call-state (get-tool-call-state @db* (:chat-id chat-ctx) tool-call-id)

src/eca/llm_providers/anthropic.clj

Lines changed: 23 additions & 4 deletions
@@ -183,10 +183,29 @@
                   :input (or (:arguments content) {})}]}

       "tool_call_output"
-      {:role "user"
-       :content [{:type "tool_result"
-                  :tool_use_id (:id content)
-                  :content (llm-util/stringfy-tool-result content)}]}
+      ;; Anthropic's tool_result `content` field accepts a list of
+      ;; mixed text + image blocks natively, so when the tool returned
+      ;; image content and the model supports image input, emit a
+      ;; single tool_result whose content is the textual portion plus
+      ;; one image block per image. Falls back to the legacy text-only
+      ;; string shape when there are no images or the model is not
+      ;; multimodal (preserves prior behavior).
+      (let [contents (-> content :output :contents)
+            image-contents (when supports-image?
+                             (seq (filter #(= :image (:type %)) contents)))
+            text (llm-util/stringfy-tool-result content)]
+        {:role "user"
+         :content [{:type "tool_result"
+                    :tool_use_id (:id content)
+                    :content (if image-contents
+                                 (into [{:type "text" :text text}]
+                                       (map (fn [img]
+                                              {:type "image"
+                                               :source {:type "base64"
+                                                        :media_type (or (:media-type img) "image/png")
+                                                        :data (:base64 img)}}))
+                                       image-contents)
+                                 text)}]})

       ;; OpenAI-emitted image_generation_call history entries are
       ;; replayed for Anthropic as user-role image blocks (Anthropic

src/eca/llm_providers/ollama.clj

Lines changed: 4 additions & 0 deletions
@@ -107,6 +107,10 @@
                               :function (-> content
                                             (assoc :name (:full-name content))
                                             (dissoc :full-name))}]}
+      ;; NOTE: Image content from MCP tool results is flattened to
+      ;; placeholder text via `stringfy-tool-result`. Image round-trip
+      ;; is implemented for openai-responses and anthropic; see those
+      ;; providers' `tool_call_output` branches for the pattern.
       "tool_call_output" {:role "tool" :content (llm-util/stringfy-tool-result content)}
       "reason" {:role "assistant" :content (:text content)}
       {:role (:role msg)
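
The `openai.clj` change named in the commit message is not among the hunks shown here. The sketch below only illustrates why a `keep` -> `mapcat` switch is needed when one history entry may expand into two Responses-API input items. The entry shape (`:id`, `:text`, `:images`), both function names, and the choice to group all images into a single user message are assumptions made for illustration, not ECA's actual code.

```clojure
;; Sketch only: function names and the flattened entry shape are hypothetical.

(defn tool-output->items
  "Expands one tool output entry into a vector of Responses-API input items:
  always a text-only function_call_output, plus a synthetic user message
  carrying input_image parts when the model accepts image input."
  [{:keys [id text images]} supports-image?]
  (let [output-item {:type    "function_call_output"
                     :call_id id
                     :output  text}]
    (if (and supports-image? (seq images))
      [output-item
       {:role    "user"
        :content (mapv (fn [{:keys [media-type base64]}]
                         {:type      "input_image"
                          ;; Images are passed as a data URL; defaulting to
                          ;; image/png mirrors the Anthropic hunk's fallback.
                          :image_url (str "data:" (or media-type "image/png")
                                          ";base64," base64)})
                       images)}]
      [output-item])))

(defn normalize-tool-outputs
  "mapcat flattens the per-entry vectors, so an entry can yield one or two
  items; keep could only yield at most one item per entry."
  [entries supports-image?]
  (vec (mapcat #(tool-output->items % supports-image?) entries)))

;; Usage: the same entry yields two items for a vision model, one otherwise.
(def example-entry
  {:id "call_1" :text "edited" :images [{:media-type "image/png" :base64 "AAAA"}]})

(count (normalize-tool-outputs [example-entry] true))  ;; => 2
(count (normalize-tool-outputs [example-entry] false)) ;; => 1
```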
