fix(docling)!: change default export type to MARKDOWN and add page_number to chunk metadata #3276
Merged
bogdankostic merged 2 commits on May 11, 2026
…ber to chunk metadata
- ExportType.MARKDOWN is now the default (was DOC_CHUNKS), aligning with Haystack convention of separating conversion from chunking
- MetaExtractor.extract_chunk_meta now extracts page_number from chunk provenance, making metadata consistent with other Haystack splitters
Hey @bogdankostic, this resolves #3256. Happy to get feedback on this!
bogdankostic requested changes on May 6, 2026
Thank you @SyedShahmeerAli12! I added a comment about reverting the additions to the changelog as these are added automatically.
Also, I was wondering if we could add more metadata as pointed out in the issue like split_id and split_start_idx.
Please revert these changes - the changelog will be populated automatically when a new release is triggered.
@bogdankostic Both points addressed ......
bogdankostic approved these changes on May 11, 2026
Thanks @SyedShahmeerAli12, looking good to me! :)
Related Issues
Proposed Changes:
- `ExportType.MARKDOWN` is now the default export type (previously `DOC_CHUNKS`), aligning `DoclingConverter` with Haystack's convention of separating conversion from chunking. Users who want chunked output should pass `export_type=ExportType.DOC_CHUNKS` explicitly.
- `MetaExtractor.extract_chunk_meta()` now extracts `page_number` from chunk provenance info, making chunk metadata consistent with other Haystack splitters like `DocumentSplitter`.

How did you test it?
- `test_extract_chunk_meta_includes_page_number` and `test_extract_chunk_meta_page_number_uses_minimum`

Notes for the reviewer
- The default `export_type` has changed from `DOC_CHUNKS` to `MARKDOWN`. Existing pipelines that relied on the default without setting it explicitly will need to add `export_type=ExportType.DOC_CHUNKS`.
- `dl_meta` is preserved in chunk metadata for backward compatibility alongside the new `page_number` field.

Checklist
fix:
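To make the `page_number` change concrete, here is a minimal sketch of how a page number could be derived from chunk provenance. It is not the code from this PR, since the diff is not shown here. The helper name `min_page_number` is hypothetical, and the dictionary layout (`doc_items`, `prov`, `page_no`) is an assumption about what the exported `dl_meta` looks like. The "smallest page wins" rule is inferred only from the test name `test_extract_chunk_meta_page_number_uses_minimum`.

```python
from typing import Any, Optional


def min_page_number(dl_meta: dict[str, Any]) -> Optional[int]:
    """Return the smallest page number found in a chunk's provenance, or None.

    Hypothetical helper: the keys "doc_items", "prov" and "page_no" are
    assumptions about the chunk metadata layout, not taken from the PR diff.
    """
    pages = [
        prov["page_no"]
        for item in dl_meta.get("doc_items", [])
        for prov in item.get("prov", [])
        if prov.get("page_no") is not None
    ]
    return min(pages) if pages else None


# Illustrative chunk whose items come from pages 3 and 2.
chunk_meta = {
    "doc_items": [
        {"prov": [{"page_no": 3}]},
        {"prov": [{"page_no": 2}, {"page_no": 3}]},
    ]
}

# Keep dl_meta for backward compatibility and add page_number next to it.
meta: dict[str, Any] = {"dl_meta": chunk_meta}
page = min_page_number(chunk_meta)
if page is not None:
    meta["page_number"] = page
```

Taking the minimum means a chunk that spans several pages is attributed to the page where it starts. The field is left out when no provenance is available.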