Summary
Enhance the chunking pipeline to include precise image-to-chunk mapping metadata, enabling accurate tracking of image positions within documents. This addresses the current limitation that chunks do not indicate image presence, so there is no way to know exactly where an image sits in the document.
Current Limitation
In the existing implementation:
- Images are not present in the chunks part of the output; they appear only in documents.content.type_content.pictures.
There is no precise information about:
- Where exactly the image lives in the document.
- Which chunks come before and after the image.
This limits how images can be handled in a RAG pipeline.
Impact
This lack of granularity prevents:
- Accurate image-to-chunk association
- Reconstruction of documents with correct image placement
- Effective use in RAG pipelines where contextual alignment matters
- Downstream processing requiring image provenance or positioning
Proposed Enhancement
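- Images are serialized in the chunk text using a static placeholder, positioned between the text that precedes and follows the image:
chunk_i-1 ![Picture] chunk_i+1
- Chunk metadata includes a has_image flag, so the output looks like this: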
{
  "text": "text \n![Picture]\n text",
  "headings": [
    "heading"
  ],
  "page_numbers": [2],
  "metadata": {
    "origin": {
      "mimetype": "application/pdf",
      "binary_hash": 16467438883613526983,
      "filename": "file.pdf",
      "uri": null
    },
    "has_image": true
  }
}
- This way, each chunk carries enough information to tell whether it contains an image and where that image sits relative to the surrounding text.
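The serialization described above could be sketched roughly as follows. This is only an illustration, assuming a chunk is assembled from an ordered list of text and picture items. The names `PLACEHOLDER`, `build_chunk`, and the `(kind, value)` item tuples are hypothetical and are not taken from the existing pipeline; the `origin` metadata is left out for brevity.

```python
from typing import Any

PLACEHOLDER = "![Picture]"  # static placeholder from the proposal


def build_chunk(
    items: list[tuple[str, str]],
    headings: list[str],
    page_numbers: list[int],
) -> dict[str, Any]:
    """Serialize ordered (kind, value) items into one chunk dict.

    kind is "text" or "picture" (a hypothetical item model). Each picture
    is replaced by the static placeholder so its position is preserved.
    """
    parts: list[str] = []
    has_image = False
    for kind, value in items:
        if kind == "picture":
            parts.append(PLACEHOLDER)
            has_image = True
        else:
            parts.append(value)
    return {
        "text": "\n".join(parts),
        "headings": headings,
        "page_numbers": page_numbers,
        # "origin" omitted here; only the proposed flag is shown.
        "metadata": {"has_image": has_image},
    }


chunk = build_chunk(
    [("text", "text before"), ("picture", ""), ("text", "text after")],
    headings=["heading"],
    page_numbers=[2],
)
print(chunk["text"])
```

Keeping the placeholder static means consumers can find images with a plain string search, at the cost of not identifying which picture is which.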
Benefits
- Enables precise multimodal alignment in RAG systems
- Supports document reconstruction with layout fidelity
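On the consumer side, a RAG pipeline could use the flag and placeholder to recover the text around each image. The sketch below assumes the chunk shape shown in the proposal; `image_contexts` is a hypothetical helper, not an existing API.

```python
PLACEHOLDER = "![Picture]"  # static placeholder from the proposal


def image_contexts(chunk: dict) -> list[tuple[str, str]]:
    """Return (text_before, text_after) for each placeholder in a chunk.

    Chunks whose metadata lacks a truthy has_image flag are skipped
    without scanning their text.
    """
    if not chunk.get("metadata", {}).get("has_image"):
        return []
    segments = chunk["text"].split(PLACEHOLDER)
    # N placeholders yield N + 1 segments; pair each with its neighbor.
    return [
        (segments[i].strip(), segments[i + 1].strip())
        for i in range(len(segments) - 1)
    ]


example = {
    "text": "intro \n![Picture]\n outro",
    "metadata": {"has_image": True},
}
contexts = image_contexts(example)
print(contexts)
```

The surrounding text can then be embedded or passed to a captioning step alongside the image itself.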