Add multimodal support #1138

joaomdmoura · 2024-01-12T01:31:00Z

joaomdmoura
Jan 12, 2024
Maintainer

Add the ability to handle video and audio, through both local and closed models

m-da-costa · 2024-01-21T20:45:18Z

m-da-costa
Jan 21, 2024

Sorry, but what about llava and bakllava?


joaomdmoura · 2024-01-21T20:46:21Z

joaomdmoura
Jan 21, 2024
Maintainer Author

Yup! I haven't tried them yet, but those should work. I just need a better DSL to support them for real


babycommando · 2024-02-13T15:36:19Z

babycommando
Feb 13, 2024

can't wait for multimodal support!


jingchang0623-crypto · 2026-05-15T06:04:00Z

jingchang0623-crypto
May 15, 2026

Multimodal support would be a game-changer for CrewAI. We have been running a multi-agent setup on OpenClaw for a few months now, and here are some patterns we have found useful for adding vision/audio capabilities:

Vision Patterns We Use

Screenshot Analysis Agent — One agent takes screenshots of web pages, another analyzes the visual content. This is especially useful for UI testing and design QA.
Document Understanding — We pipe PDFs and images through a vision-capable LLM before passing structured data to the CrewAI agent.

Audio Use Cases

Transcribing meeting recordings → structuring with an agent
Analyzing customer support calls for sentiment
Generating audio summaries with TTS
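
To make the first audio use case concrete, here is a minimal sketch of the "transcribe, then structure" step. It is not tied to any real speech-to-text API: `transcribe` is a hypothetical callable you would supply (for example a wrapper around whatever local or hosted model you use), and `Utterance`, `structure_transcript`, and `transcribe_and_structure` are illustrative names, not part of CrewAI. It also assumes the transcript comes back as "Speaker: text" lines, which your model may not produce.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Utterance:
    speaker: str
    text: str


def structure_transcript(raw: str) -> List[Utterance]:
    """Turn 'Speaker: text' lines into utterances.

    Lines without a colon are treated as a continuation of the
    previous utterance (or attributed to 'unknown' if there is none).
    """
    utterances: List[Utterance] = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        speaker, sep, text = line.partition(":")
        if sep and speaker.strip():
            utterances.append(Utterance(speaker.strip(), text.strip()))
        elif utterances:
            utterances[-1].text += " " + line
        else:
            utterances.append(Utterance("unknown", line))
    return utterances


def transcribe_and_structure(
    audio_path: str, transcribe: Callable[[str], str]
) -> List[Utterance]:
    # `transcribe` is an assumption: any function mapping a file path
    # to transcript text. The agent only ever sees the structured result.
    return structure_transcript(transcribe(audio_path))
```

The colon-based split is deliberately naive (a line such as "we met at 10:30" would be misread as a speaker label), so treat it as a placeholder for whatever diarization format your transcriber actually emits.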

Architecture Suggestion

Rather than bolting multimodal support onto CrewAI directly, consider a tool-based approach:

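from crewai.tools import BaseTool

class VisionAnalysisTool(BaseTool):
    # BaseTool is a Pydantic model, so overridden fields need type annotations.
    name: str = "vision_analysis"
    description: str = "Analyze images and screenshots"

    def _run(self, image_path: str) -> str:
        # Call a vision-capable LLM (GPT-4V, Claude, etc.) and
        # return a structured description of the image.
        raise NotImplementedError("plug in a vision model call here")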

This keeps CrewAI focused on orchestration while letting specialized models handle the multimodal heavy lifting. Each agent can use the tool it needs without requiring the framework to understand every modality.
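
One way to keep that separation testable is to put the actual work in a small framework-agnostic class that a tool's `_run` could delegate to. The sketch below is an assumption about how you might structure it, not CrewAI's API: `VisionAnalysis`, `VisionBackend`, and the `max_bytes` limit are hypothetical names, and the backend is any callable that takes image bytes and returns text.

```python
import json
from pathlib import Path
from typing import Callable

# Hypothetical backend signature: raw image bytes in, text description out.
VisionBackend = Callable[[bytes], str]


class VisionAnalysis:
    """Framework-agnostic core that a tool's _run method could call."""

    def __init__(self, backend: VisionBackend, max_bytes: int = 5_000_000):
        self.backend = backend
        self.max_bytes = max_bytes

    def run(self, image_path: str) -> str:
        path = Path(image_path)
        # Report problems as JSON strings instead of raising, so the
        # calling agent can read the error and decide what to do next.
        if not path.is_file():
            return json.dumps({"ok": False, "error": f"file not found: {image_path}"})
        data = path.read_bytes()
        if len(data) > self.max_bytes:
            return json.dumps({"ok": False, "error": "image too large"})
        return json.dumps({"ok": True, "description": self.backend(data)})
```

Because the backend is injected, you can swap a hosted model for a local one (or a stub in tests) without touching the agent or the tool wrapper.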

We wrote about some of these patterns at miaoquai.com — happy to share more details!

