Feature Proposal: Add Image Position References to Document Chunks #136

@ibrahimGoumrane

Summary

Enhance the chunking pipeline to include image-to-chunk mapping metadata, so that the position of each image within a document can be tracked. Currently, chunks give no indication that an image is present, so there is no way to know exactly where an image sits in the document.


Current Limitation

In the existing implementation:

  • Images are not present in the chunks part of the output; they appear only in documents.content.type_content.pictures.

There is no precise information about:

  • Where exactly the image sits in the document.
  • Which chunks come before and after the image.

This limits how images can be handled in a RAG pipeline.

Impact

This lack of granularity prevents:

  • Accurate image-to-chunk association
  • Reconstruction of documents with correct image placement
  • Effective use in RAG pipelines where contextual alignment matters
  • Downstream processing requiring image provenance or positioning

Proposed Enhancement

  • Images are serialized using a static placeholder at the image's position in the text:

    chunk_i-1 ![Picture] chunk_i+1
  • Chunk metadata includes:

    {
      "has_image": true
    }
  • The resulting chunk output looks like this:

    {
      "text": "text \n![Picture]\n text",
      "headings": [
        "heading"
      ],
      "page_numbers": [2],
      "metadata": {
        "origin": {
          "mimetype": "application/pdf",
          "binary_hash": 16467438883613526983,
          "filename": "file.pdf",
          "uri": null
        },
        "has_image": true
      }
    }
  • This gives each chunk enough information to tell whether it contains an image and where the image sits in its text.
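
As a rough illustration of the proposal above, here is a minimal Python sketch of how a chunk could be serialized with the placeholder and the has_image flag. It is not the project's actual implementation: `Item`, `build_chunk`, and `PICTURE_PLACEHOLDER` are hypothetical names, and the sketch assumes the items of a chunk are already available in reading order.

```python
from dataclasses import dataclass

# Static placeholder from the proposal.
PICTURE_PLACEHOLDER = "![Picture]"


@dataclass
class Item:
    """Hypothetical document item: a text block or a picture."""
    kind: str       # "text" or "picture"
    text: str = ""  # only used when kind == "text"


def build_chunk(items, headings, page_numbers, origin):
    """Serialize items in reading order, emitting the placeholder for pictures."""
    parts = []
    has_image = False
    for item in items:
        if item.kind == "picture":
            parts.append(PICTURE_PLACEHOLDER)
            has_image = True
        else:
            parts.append(item.text)
    return {
        "text": "\n".join(parts),
        "headings": list(headings),
        "page_numbers": list(page_numbers),
        "metadata": {"origin": origin, "has_image": has_image},
    }


chunk = build_chunk(
    [Item("text", "text"), Item("picture"), Item("text", "text")],
    headings=["heading"],
    page_numbers=[2],
    origin={"mimetype": "application/pdf", "filename": "file.pdf", "uri": None},
)
```

Chunks without pictures would get "has_image": false under this sketch, so consumers can filter on the flag without scanning the text.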

Benefits

  • Enables precise multimodal alignment in RAG systems
  • Supports document reconstruction with layout fidelity
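
On the consuming side, a RAG pipeline could use the placeholder to recover the text surrounding each image. The following is a sketch under the assumption that chunks follow the output format proposed above; `image_contexts` is a hypothetical helper, not part of any existing API. Note that a static placeholder alone does not say which picture it stands for, so mapping placeholders back to entries in documents.content.type_content.pictures is left out here.

```python
import re

# Matches the static placeholder from the proposal.
PLACEHOLDER_RE = re.compile(re.escape("![Picture]"))


def image_contexts(chunk):
    """Return a (text_before, text_after) pair for each placeholder in a chunk."""
    if not chunk.get("metadata", {}).get("has_image"):
        return []  # cheap exit: the flag says there is nothing to look for
    text = chunk["text"]
    return [
        (text[:m.start()].strip(), text[m.end():].strip())
        for m in PLACEHOLDER_RE.finditer(text)
    ]


example = {
    "text": "text \n![Picture]\n text",
    "metadata": {"has_image": True},
}
contexts = image_contexts(example)
```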
