[Feature Request] Native support for multi-modal RAG (text + images) #11037

@rehan243

Description
Feature Description

With the rise of vision-language models (GPT-4V, LLaVA, CogVLM), it would be valuable to have native multi-modal document support in Haystack pipelines.

Current Limitation

Currently, image content in PDFs and other documents is lost during ingestion, so users must build custom extractors to handle images alongside text.

Proposed Enhancement

  1. Multi-modal document parser that extracts text AND images
  2. Multi-modal embeddings (CLIP-style) for image chunks
  3. Multi-modal retriever that searches across text and image content
  4. VLM integration for answer generation from mixed context
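To make the retrieval piece (items 2 and 3) concrete, here is a minimal sketch of what a shared-index multi-modal retriever could look like. This is illustrative only, not Haystack API: the `embed()` stub stands in for a real CLIP-style encoder that maps text and images into one embedding space, and all class and function names are hypothetical.

```python
# Hypothetical sketch: text and image chunks share one embedding space
# (as a CLIP-style model would provide), so a single similarity search
# covers both modalities. embed() is a deterministic placeholder for a
# real vision-language encoder; none of these names are Haystack API.
import math
from dataclasses import dataclass, field


@dataclass
class Chunk:
    content: str                     # text, or a path/reference to an image
    modality: str                    # "text" or "image"
    embedding: list[float] = field(default_factory=list)


def embed(content: str, dim: int = 8) -> list[float]:
    # Placeholder pseudo-embedding derived from a hash; a real system
    # would call a CLIP-style model here for both text and images.
    vec = [float((hash(content) >> (4 * i)) & 0xF) for i in range(dim)]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are unit-normalized, so the dot product is cosine similarity.
    return sum(x * y for x, y in zip(a, b))


class MultiModalStore:
    """Single index over text and image chunks."""

    def __init__(self) -> None:
        self.chunks: list[Chunk] = []

    def write(self, chunk: Chunk) -> None:
        chunk.embedding = embed(chunk.content)
        self.chunks.append(chunk)

    def retrieve(self, query: str, top_k: int = 3) -> list[Chunk]:
        # One query searches across both modalities at once.
        q = embed(query)
        ranked = sorted(self.chunks,
                        key=lambda c: cosine(q, c.embedding),
                        reverse=True)
        return ranked[:top_k]


store = MultiModalStore()
store.write(Chunk("Figure 3 shows the network topology", "text"))
store.write(Chunk("diagrams/topology.png", "image"))
store.write(Chunk("Quarterly revenue table", "text"))
results = store.retrieve("network diagram", top_k=2)
```

The retrieved `results` can contain a mix of text and image chunks, which a VLM-backed generator (item 4) would then consume as mixed context.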

Use Case

Technical documentation with diagrams, medical records with scans, financial reports with charts - all common enterprise use cases where image understanding is critical for accurate retrieval.

Would love to hear the team's thoughts on this direction!

Metadata

Assignees

No one assigned

    Labels

    P3 (Low priority, leave it in the backlog)

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests