feat(cli): add --page-break-placeholder option for Markdown and Text exports #3184
…exports Expose the existing `page_break_placeholder` parameter from the Python API (`save_as_markdown`) as a CLI option. When set, the specified string is inserted between pages in Markdown and Text outputs. Closes docling-project#3175
Merge Protections: Your pull request matches the following merge protections and will not be merged until they are valid. 🟢 Enforce conventional commit: Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
|
Related Documentation
1 document(s) may need updating based on files changed in this PR: Docling Content Layers
@@ -109,6 +109,15 @@
Currently, the ability to include furniture in exports is only available via the Python API. The `docling-serve` API and CLI exports do not support specifying content layers and will always export with the default (BODY only).
+However, the CLI does support the `page_break_placeholder` parameter for Markdown and Text exports. You can specify a custom page break placeholder string when using the `docling convert` command with the `--page-break-placeholder` option:
+
+```bash
+docling my_document.pdf --to md --page-break-placeholder "---"
+docling my_document.pdf --to txt --page-break-placeholder "<!-- page-break -->"
+```
+
+When set, the specified string is inserted between pages in the output, allowing CLI users to control page break formatting in both Markdown and Text exports.
+
## Customization and Post-processing
Headers and footers are detected automatically by Docling’s layout model for `.docx` files. There is currently no rule-based mechanism to customize their detection during processing. However, you can manually remove or further process these elements after extraction if needed.
|
✅ DCO Check Passed Thanks @Krishnachaitanyakc, all your commits are properly signed off. 🎉 |
@Krishnachaitanyakc Please follow the steps in #3184 (comment) to make sure your contribution is signed-off. |
…il.com> I, Krishna Chaitanya Balusu <krishnabkc15@gmail.com>, hereby add my Signed-off-by to this commit: 667a168 Signed-off-by: Krishna Chaitanya Balusu <krishnabkc15@gmail.com>
|
Great feature! One enhancement that could be valuable: supporting dynamic page numbers in the placeholder string. Currently, the placeholder is static (the same string is inserted at every page break), but a common use case is annotating pages with their actual number, e.g.:

    docling --to md --page-break-placeholder '---\n*[Page {next_page}]*\n---' input.pdf

which would produce:

    ... content from page 1 ...
    ---
    *[Page 2]*
    ---
    ... content from page 2 ...

The docling-core serializer internally tracks page numbers in its markers. I propose the attached patch file. |
Add {prev_page} and {next_page} format variables to the
--page-break-placeholder option. When these variables are present,
each page break in the output is replaced with the placeholder
formatted with the actual page numbers for that specific break.
Example usage:
docling --to md --page-break-placeholder '--- Page {next_page} ---' input.pdf
Which produces:
... content from page 1 ...
--- Page 2 ---
... content from page 2 ...
--- Page 3 ---
... content from page 3 ...
Uses a sentinel-based approach: a unique sentinel is passed to
docling-core during serialization, then post-processed to replace
each sentinel occurrence with the formatted placeholder using
sequential page numbers. Static placeholders (without format
variables) continue to work unchanged.
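
The sentinel-based approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `make_sentinel` and `fill_page_breaks_sequential` are hypothetical names, the serializer output is simulated by joining page strings with the sentinel, and the counter assumes every page break separates page N from page N + 1.

```python
import uuid


def make_sentinel() -> str:
    """Return a marker that is very unlikely to occur in document text."""
    return f"<!--page-break-{uuid.uuid4().hex}-->"


def fill_page_breaks_sequential(text: str, sentinel: str, template: str) -> str:
    """Replace each sentinel with the template, numbering pages 1, 2, 3, ...

    Hypothetical helper: assumes consecutive page numbers starting at 1.
    """
    parts = text.split(sentinel)
    out = [parts[0]]
    for prev_page, chunk in enumerate(parts[1:], start=1):
        # Plain string replacement, so other braces in the template are left alone.
        filled = template.replace("{prev_page}", str(prev_page)).replace(
            "{next_page}", str(prev_page + 1)
        )
        out.append(filled)
        out.append(chunk)
    return "".join(out)


# Simulated serializer output: three pages joined by the sentinel.
sentinel = make_sentinel()
serialized = f"\n{sentinel}\n".join(["page 1 text", "page 2 text", "page 3 text"])
result = fill_page_breaks_sequential(serialized, sentinel, "--- Page {next_page} ---")
print(result)
```

A template without format variables passes through unchanged, which matches the static-placeholder behavior.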
|
I see you've made the modifications, that's really nice. One comment on my side: you use sequential counting instead of doc.pages. The problem is that a blank page will give you wrong numbers. In a 5-page document with a blank page in the middle, the serializer only inserts a page break sentinel when there is content, so the page numbers after the blank page come out wrong. I have proposed a patch. Best regards, and thanks again for the responsiveness. |
Use document item provenance to determine real page numbers instead of sequential counting. This fixes incorrect numbering when documents contain blank pages — e.g. a 5-page doc with a blank page 4 now correctly produces page numbers 1, 2, 3, 5 instead of 1, 2, 3, 4. Signed-off-by: Krishna Chaitanya Balusu <krishnabkc15@gmail.com>
|
@smorand Thanks for the review and the great catch on blank page numbering. One note: rather than using doc.pages (which includes all pages the backend parsed, including blank ones), I'm using doc.iterate_items() to extract page numbers from item provenance. |
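
A rough sketch of the provenance idea, for readers following along. It is not the code from this PR: `Prov` and `Item` are stand-in classes that only mimic the shape assumed here (items carrying a `prov` list whose entries have a `page_no`), and `content_pages`, `page_break_pairs`, and `fill_page_breaks` are hypothetical helper names.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple


@dataclass
class Prov:
    """Stand-in for a provenance entry (assumed to carry a page number)."""
    page_no: int


@dataclass
class Item:
    """Stand-in for a document item with zero or more provenance entries."""
    prov: List[Prov] = field(default_factory=list)


def content_pages(items: Iterable[Item]) -> List[int]:
    """Sorted page numbers that actually hold content; blank pages never appear."""
    return sorted({p.page_no for item in items for p in item.prov})


def page_break_pairs(pages: List[int]) -> List[Tuple[int, int]]:
    """One (prev_page, next_page) pair per break between content pages."""
    return list(zip(pages, pages[1:]))


def fill_page_breaks(text: str, sentinel: str, template: str,
                     pairs: List[Tuple[int, int]]) -> str:
    """Replace the i-th sentinel with the template filled from the i-th pair."""
    parts = text.split(sentinel)
    if len(parts) - 1 != len(pairs):
        raise ValueError("number of sentinels does not match number of page breaks")
    out = [parts[0]]
    for (prev_page, next_page), chunk in zip(pairs, parts[1:]):
        out.append(template.replace("{prev_page}", str(prev_page))
                           .replace("{next_page}", str(next_page)))
        out.append(chunk)
    return "".join(out)


# A 5-page document whose page 4 is blank: no item has provenance on page 4.
items = [Item([Prov(1)]), Item([Prov(2)]), Item([Prov(2)]), Item([Prov(3)]), Item([Prov(5)])]
pages = content_pages(items)
pairs = page_break_pairs(pages)
text = "p1<PB>p2<PB>p3<PB>p5"
print(fill_page_breaks(text, "<PB>", "[Page {next_page}]", pairs))
```

Because the pairs come from pages that have content, the break after page 3 is labelled with page 5 rather than 4.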
|
Pulled, reviewed, and tested; looks good to me. Thank you @Krishnachaitanyakc: I was going to work on and propose this feature myself, so you saved me time! |
Add assert for page_break_placeholder narrowing before calls to _apply_dynamic_page_breaks, and fix ruff-format string in test. Signed-off-by: Krishna Chaitanya Balusu <krishnabkc15@gmail.com>
…il.com> I, Krishna Chaitanya Balusu <krishnabkc15@gmail.com>, hereby add my Signed-off-by to this commit: 0914f4d Signed-off-by: Krishna Chaitanya Balusu <krishnabkc15@gmail.com>
…il.com> I, Krishna Chaitanya Balusu <krishnabkc15@gmail.com>, hereby add my Signed-off-by to this commit: 9bf673d Signed-off-by: Krishna Chaitanya Balusu <krishnabkc15@gmail.com>
cau-git left a comment
@Krishnachaitanyakc From a functional perspective this looks useful; however, we cannot accept this approach. It does more than thread an option through to the CLI: it patches the exported Markdown after it has been produced in order to fill in the dynamic markers. If we want to support such behaviour, it must be done with an actual change to the MarkdownSerializer in docling-core. Please let us know if you want to make a companion PR on docling-core, on which this PR could then depend after the appropriate changes.
|
Closing because of inactivity, please re-open if you intend to address the feedback. |
Summary
- Add a `--page-break-placeholder` CLI option to the `convert` command, exposing the existing `page_break_placeholder` parameter from `DoclingDocument.save_as_markdown()` to CLI users.
- When set, the specified string (e.g. `---`, `<!-- page-break -->`) is inserted between pages in Markdown and Text exports.

Closes #3175
Details
The Python API already supports `page_break_placeholder` in `save_as_markdown()` / `export_to_markdown()`, but the CLI did not expose this parameter. This change threads the option through:
- the `--page-break-placeholder` typer option on the `convert` command (default: `None`, preserving current behavior)
- the `export_documents` helper function
- the `save_as_markdown` call sites (Markdown export and Text export)

Usage
Test plan
- `test_cli_page_break_placeholder` test that verifies the CLI accepts the option and produces output
- `test_cli_convert` continues to pass (no regression without the flag)
- Behavior without the flag (no `--page-break-placeholder`) is unchanged, since the default is `None`