
Commit 67f99b8

docs: highlight image support in Agents (#11291)
1 parent 6a9f89a commit 67f99b8

4 files changed

Lines changed: 212 additions & 0 deletions


docs-website/docs/concepts/agents.mdx

Lines changed: 1 addition & 0 deletions
@@ -45,6 +45,7 @@ Key capabilities include:
 - **Human-in-the-loop**: Intercept tool calls for human review before execution. See [Human in the Loop](../pipeline-components/agents-1/human-in-the-loop.mdx).
 - **Multi-agent systems**: Wrap an `Agent` as a `ComponentTool` to build coordinator/specialist architectures. See [Multi-Agent Systems](./agents/multi-agent-systems.mdx).
 - **MCP server exposure**: Expose your agent as an MCP server using [Hayhooks](../development/hayhooks.mdx), making it callable from any MCP-compatible client such as Claude Desktop or Cursor.
+- **Multimodal inputs**: Pass images alongside text using `ImageContent` in `ChatMessage` content parts, or return `ImageContent` from tools for dynamic image analysis. Requires a vision-capable model such as `gpt-5` or `gemini-2.5-flash`. See [Multimodal Inputs](../pipeline-components/agents-1/agent.mdx#multimodal-inputs).
 
 Check out the [Agent](../pipeline-components/agents-1/agent.mdx) documentation, or the [example](#tool-calling-agent) below to get started.
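As a rough illustration of what the added bullet means by an image content part: an image reaches the model as base64 data plus a MIME type, which Haystack's `ImageContent` keeps in its `base64_image` and `mime_type` fields. The sketch below uses only the standard library. The `encode_image` helper and the placeholder bytes are hypothetical stand-ins for illustration, not Haystack code.

```python
import base64
import mimetypes
import tempfile
from pathlib import Path


def encode_image(path: Path) -> dict:
    """Hypothetical helper: read a file and return a base64 payload plus MIME type."""
    mime_type, _ = mimetypes.guess_type(path.name)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"{path.name} does not look like an image file")
    payload = base64.b64encode(path.read_bytes()).decode("ascii")
    return {"base64_image": payload, "mime_type": mime_type}


# Placeholder bytes (the 8-byte PNG signature), not a complete image.
with tempfile.TemporaryDirectory() as tmp:
    chart = Path(tmp) / "chart.png"
    chart.write_bytes(b"\x89PNG\r\n\x1a\n")
    part = encode_image(chart)

print(part["mime_type"])  # image/png
```

In real code, `ImageContent.from_file_path` and `ImageContent.from_url` from the diff take care of this step.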

docs-website/docs/pipeline-components/agents-1/agent.mdx

Lines changed: 105 additions & 0 deletions
@@ -334,6 +334,109 @@ See our [Streaming Support](../generators/guides-to-generators/choosing-the-righ
 Give preference to `print_streaming_chunk` by default.
 Write a custom callback only if you need a specific transport (for example, SSE/WebSocket) or custom UI formatting.
 
+## Multimodal Inputs
+
+Agents support multimodal inputs when paired with a vision-capable model such as `gpt-5` (OpenAI) or `gemini-2.5-flash` (Google).
+Pass images alongside text by including `ImageContent` objects in the `content_parts` of a `ChatMessage`:
+
+```python
+from haystack.dataclasses import ChatMessage, ImageContent
+
+image = ImageContent.from_url("https://example.com/chart.png")
+result = agent.run(
+    messages=[
+        ChatMessage.from_user(content_parts=["What does this chart show?", image]),
+    ],
+)
+```
+
+Tools can also return `ImageContent` directly, letting the agent fetch and reason about images dynamically during its loop.
+Two things are required: set `outputs_to_string={"raw_result": True}` so the `ToolInvoker` skips string conversion, and return a `list[ImageContent]` (the tool result type is `str | Sequence[TextContent | ImageContent]`).
+
+The standard Chat Completions API doesn't support images in tool results, so use `OpenAIResponsesChatGenerator` (OpenAI's Responses API) instead:
+
+```python
+from typing import Annotated
+from haystack.components.agents import Agent
+from haystack.components.generators.chat import OpenAIResponsesChatGenerator
+from haystack.dataclasses import ChatMessage, ImageContent
+from haystack.tools import tool
+
+
+@tool(outputs_to_string={"raw_result": True})
+def fetch_image(
+    url: Annotated[str, "URL of the image to fetch and analyze"],
+) -> list[ImageContent]:
+    """Fetch an image from a URL so the agent can analyze its contents."""
+    return [ImageContent.from_url(url)]
+
+
+agent = Agent(
+    chat_generator=OpenAIResponsesChatGenerator(model="gpt-5"),
+    tools=[fetch_image],
+    system_prompt="You are a helpful assistant that can fetch and analyze images from URLs.",
+)
+
+result = agent.run(
+    messages=[
+        ChatMessage.from_user(
+            "Fetch the image at https://picsum.photos/seed/haystack/640/480 and describe what you see.",
+        ),
+    ],
+)
+print(result["last_message"].text)
+```
+
+`ImageContent` can be created from a URL, a local file path, or a PDF page using the `PDFToImageContent` converter.
+
+### In a pipeline
+
+When an `Agent` sits inside a pipeline, use `ChatPromptBuilder` with its string template format and the `| templatize_part` filter to pass images as structured content parts:
+
+```python
+from haystack import Pipeline
+from haystack.components.agents import Agent
+from haystack.components.builders import ChatPromptBuilder
+from haystack.components.generators.chat import OpenAIChatGenerator
+from haystack.dataclasses import ImageContent
+
+template = """
+{% message role="user" %}
+{{ question }}
+{{ image | templatize_part }}
+{% endmessage %}
+"""
+
+agent = Agent(
+    chat_generator=OpenAIChatGenerator(model="gpt-5"),
+    system_prompt="You are a helpful assistant that can analyze images.",
+)
+prompt_builder = ChatPromptBuilder(
+    template=template,
+    required_variables=["question", "image"],
+)
+
+pipeline = Pipeline()
+pipeline.add_component("prompt_builder", prompt_builder)
+pipeline.add_component("agent", agent)
+pipeline.connect("prompt_builder.prompt", "agent.messages")
+
+# Download or provide your own chart image as "chart.png"
+image = ImageContent.from_file_path("chart.png")
+result = pipeline.run(
+    {
+        "prompt_builder": {"question": "What does this chart show?", "image": image},
+    },
+)
+print(result["agent"]["last_message"].text)
+```
+
+:::tip
+See these cookbooks for complete multimodal agent examples:
+- [Multimodal Agents](https://haystack.deepset.ai/cookbook/multimodal_intro#multimodal-agent): image inputs and tool use with agents
+- [Gemma Chat RAG](https://haystack.deepset.ai/cookbook/gemma_chat_rag): vision model in a RAG pipeline
+:::
+
 ## Multi-Agent Systems
 
 You can wrap an `Agent` as a tool to build multi-agent systems where specialist agents handle focused subtasks and a coordinator agent plans and delegates.
@@ -363,3 +466,5 @@ Agents work with MCP in two directions:
 🧑‍🍳 Cookbook:
 
 - [Build a GitHub Issue Resolver Agent](https://haystack.deepset.ai/cookbook/github_issue_resolver_agent)
+- [Multimodal Agents](https://haystack.deepset.ai/cookbook/multimodal_intro#multimodal-agent)
+- [Gemma Chat RAG](https://haystack.deepset.ai/cookbook/gemma_chat_rag)
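
The tool-result rule in the Multimodal Inputs section of this file (set `raw_result` so string conversion is skipped, and return a sequence of content parts) can be pictured with a small stand-in. This is a sketch under stated assumptions, not Haystack's `ToolInvoker`: `TextPart`, `ImagePart`, and `prepare_tool_result` are hypothetical names that only mirror the behavior the docs describe.

```python
from collections.abc import Sequence
from dataclasses import dataclass
from typing import Union


@dataclass
class TextPart:
    """Hypothetical stand-in for TextContent."""
    text: str


@dataclass
class ImagePart:
    """Hypothetical stand-in for ImageContent."""
    base64_image: str
    mime_type: str = "image/png"


# Mirrors the documented result type: str | Sequence[TextContent | ImageContent]
ToolResult = Union[str, Sequence[Union[TextPart, ImagePart]]]


def prepare_tool_result(output: object, raw_result: bool = False) -> ToolResult:
    """Return what would be handed back to the model as the tool result."""
    if not raw_result:
        # Without the flag the output is flattened to text, so an image
        # would only reach the model as the string form of the object.
        return str(output)
    if isinstance(output, str):
        return output
    if isinstance(output, Sequence) and all(
        isinstance(part, (TextPart, ImagePart)) for part in output
    ):
        # With the flag, structured parts are passed through untouched.
        return list(output)
    raise TypeError("raw tool results must be a str or a sequence of text/image parts")


image = ImagePart(base64_image="aGVsbG8=")
as_text = prepare_tool_result([image])
as_parts = prepare_tool_result([image], raw_result=True)
print(type(as_text).__name__)  # str
print(as_parts[0].mime_type)   # image/png
```

The type check on the raw path is a design choice of this sketch: it fails early when a tool returns something the model could not consume as content parts.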

docs-website/versioned_docs/version-2.28/concepts/agents.mdx

Lines changed: 1 addition & 0 deletions
@@ -45,6 +45,7 @@ Key capabilities include:
 - **Human-in-the-loop**: Intercept tool calls for human review before execution. See [Human in the Loop](../pipeline-components/agents-1/human-in-the-loop.mdx).
 - **Multi-agent systems**: Wrap an `Agent` as a `ComponentTool` to build coordinator/specialist architectures. See [Multi-Agent Systems](./agents/multi-agent-systems.mdx).
 - **MCP server exposure**: Expose your agent as an MCP server using [Hayhooks](../development/hayhooks.mdx), making it callable from any MCP-compatible client such as Claude Desktop or Cursor.
+- **Multimodal inputs**: Pass images alongside text using `ImageContent` in `ChatMessage` content parts, or return `ImageContent` from tools for dynamic image analysis. Requires a vision-capable model such as `gpt-5` or `gemini-2.5-flash`. See [Multimodal Inputs](../pipeline-components/agents-1/agent.mdx#multimodal-inputs).
 
 Check out the [Agent](../pipeline-components/agents-1/agent.mdx) documentation, or the [example](#tool-calling-agent) below to get started.

docs-website/versioned_docs/version-2.28/pipeline-components/agents-1/agent.mdx

Lines changed: 105 additions & 0 deletions
@@ -334,6 +334,109 @@ See our [Streaming Support](../generators/guides-to-generators/choosing-the-righ
 Give preference to `print_streaming_chunk` by default.
 Write a custom callback only if you need a specific transport (for example, SSE/WebSocket) or custom UI formatting.
 
+## Multimodal Inputs
+
+Agents support multimodal inputs when paired with a vision-capable model such as `gpt-5` (OpenAI) or `gemini-2.5-flash` (Google).
+Pass images alongside text by including `ImageContent` objects in the `content_parts` of a `ChatMessage`:
+
+```python
+from haystack.dataclasses import ChatMessage, ImageContent
+
+image = ImageContent.from_url("https://example.com/chart.png")
+result = agent.run(
+    messages=[
+        ChatMessage.from_user(content_parts=["What does this chart show?", image]),
+    ],
+)
+```
+
+Tools can also return `ImageContent` directly, letting the agent fetch and reason about images dynamically during its loop.
+Two things are required: set `outputs_to_string={"raw_result": True}` so the `ToolInvoker` skips string conversion, and return a `list[ImageContent]` (the tool result type is `str | Sequence[TextContent | ImageContent]`).
+
+The standard Chat Completions API doesn't support images in tool results, so use `OpenAIResponsesChatGenerator` (OpenAI's Responses API) instead:
+
+```python
+from typing import Annotated
+from haystack.components.agents import Agent
+from haystack.components.generators.chat import OpenAIResponsesChatGenerator
+from haystack.dataclasses import ChatMessage, ImageContent
+from haystack.tools import tool
+
+
+@tool(outputs_to_string={"raw_result": True})
+def fetch_image(
+    url: Annotated[str, "URL of the image to fetch and analyze"],
+) -> list[ImageContent]:
+    """Fetch an image from a URL so the agent can analyze its contents."""
+    return [ImageContent.from_url(url)]
+
+
+agent = Agent(
+    chat_generator=OpenAIResponsesChatGenerator(model="gpt-5"),
+    tools=[fetch_image],
+    system_prompt="You are a helpful assistant that can fetch and analyze images from URLs.",
+)
+
+result = agent.run(
+    messages=[
+        ChatMessage.from_user(
+            "Fetch the image at https://picsum.photos/seed/haystack/640/480 and describe what you see.",
+        ),
+    ],
+)
+print(result["last_message"].text)
+```
+
+`ImageContent` can be created from a URL, a local file path, or a PDF page using the `PDFToImageContent` converter.
+
+### In a pipeline
+
+When an `Agent` sits inside a pipeline, use `ChatPromptBuilder` with its string template format and the `| templatize_part` filter to pass images as structured content parts:
+
+```python
+from haystack import Pipeline
+from haystack.components.agents import Agent
+from haystack.components.builders import ChatPromptBuilder
+from haystack.components.generators.chat import OpenAIChatGenerator
+from haystack.dataclasses import ImageContent
+
+template = """
+{% message role="user" %}
+{{ question }}
+{{ image | templatize_part }}
+{% endmessage %}
+"""
+
+agent = Agent(
+    chat_generator=OpenAIChatGenerator(model="gpt-5"),
+    system_prompt="You are a helpful assistant that can analyze images.",
+)
+prompt_builder = ChatPromptBuilder(
+    template=template,
+    required_variables=["question", "image"],
+)
+
+pipeline = Pipeline()
+pipeline.add_component("prompt_builder", prompt_builder)
+pipeline.add_component("agent", agent)
+pipeline.connect("prompt_builder.prompt", "agent.messages")
+
+# Download or provide your own chart image as "chart.png"
+image = ImageContent.from_file_path("chart.png")
+result = pipeline.run(
+    {
+        "prompt_builder": {"question": "What does this chart show?", "image": image},
+    },
+)
+print(result["agent"]["last_message"].text)
+```
+
+:::tip
+See these cookbooks for complete multimodal agent examples:
+- [Multimodal Agents](https://haystack.deepset.ai/cookbook/multimodal_intro#multimodal-agent): image inputs and tool use with agents
+- [Gemma Chat RAG](https://haystack.deepset.ai/cookbook/gemma_chat_rag): vision model in a RAG pipeline
+:::
+
 ## Multi-Agent Systems
 
 You can wrap an `Agent` as a tool to build multi-agent systems where specialist agents handle focused subtasks and a coordinator agent plans and delegates.
@@ -363,3 +466,5 @@ Agents work with MCP in two directions:
 🧑‍🍳 Cookbook:
 
 - [Build a GitHub Issue Resolver Agent](https://haystack.deepset.ai/cookbook/github_issue_resolver_agent)
+- [Multimodal Agents](https://haystack.deepset.ai/cookbook/multimodal_intro#multimodal-agent)
+- [Gemma Chat RAG](https://haystack.deepset.ai/cookbook/gemma_chat_rag)
