Commit c68c696

fix(hybrid): match bare "Page N" in failed-page error parser (#467)
* Fix: detect failed pages via error messages and gap detection union

  Docling may include failed pages as empty entries in the pages dict, making gap-only detection miss them. Add error message parsing ("Page N: <error>") and union both strategies so neither failure mode is lost.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
  Signed-off-by: i-Veni-Vidi-Vici <biuld1234@gmail.com>

* Fix: correct misleading docstring in missing pages key test

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
  Signed-off-by: i-Veni-Vidi-Vici <biuld1234@gmail.com>

* Add: overlap dedup test for gap and error parsing union

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
  Signed-off-by: i-Veni-Vidi-Vici <biuld1234@gmail.com>

* fix(hybrid): match bare "Page N" in failed-page error parser

  Docling's StandardPdfPipeline emits "Page N: <error>" when an error description is available, or a bare "Page N" with no colon when the error description is empty (standard_pdf_pipeline.py: error_msg falsy branch). The previous regex r"^Page\s+(\d+):" required a colon and silently dropped the bare form. Relax to r"^Page\s+(\d+)(?::|$)" so both forms extract the page number.

  Defensive only: in the wrapper's current pipeline this branch is rare because failed_pages always carry a non-empty RuntimeError message, but the change costs nothing and protects against future docling/wrapper changes (e.g. switch to PaginatedPipeline / VlmPipeline, or new error paths that omit the description).

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
  Signed-off-by: i-Veni-Vidi-Vici <biuld1234@gmail.com>

---------

Signed-off-by: i-Veni-Vidi-Vici <biuld1234@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
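
The union of the two detection strategies described in the first bullet could be sketched roughly as follows. Only the regex is taken from this commit; `pages_from_errors`, `pages_from_gaps`, `failed_pages`, and their parameters are hypothetical names chosen for illustration, and the wrapper's actual gap detection is not shown on this page.

```python
import re

# Pattern introduced by this commit: accepts "Page N: <error>" and bare "Page N".
PAGE_PATTERN = re.compile(r"^Page\s+(\d+)(?::|$)")


def pages_from_errors(errors: list[str]) -> set[int]:
    """Collect page numbers named at the start of each error message."""
    found = set()
    for msg in errors:
        m = PAGE_PATTERN.match(msg)
        if m:
            found.add(int(m.group(1)))
    return found


def pages_from_gaps(returned_pages: set[int], total_pages: int) -> set[int]:
    """Hypothetical gap detection: expected pages absent from the result."""
    return set(range(1, total_pages + 1)) - returned_pages


def failed_pages(errors: list[str], returned_pages: set[int], total_pages: int) -> list[int]:
    """Union both strategies; the set removes pages reported by both."""
    return sorted(pages_from_errors(errors) | pages_from_gaps(returned_pages, total_pages))
```

In this sketch a page that is both missing from the result and named in an error message appears only once, which is the overlap case the third bullet's dedup test targets.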
1 parent 60101c5 commit c68c696

2 files changed

Lines changed: 7 additions & 2 deletions

python/opendataloader-pdf/src/opendataloader_pdf/hybrid_server.py

Lines changed: 3 additions & 2 deletions
@@ -118,7 +118,8 @@ def _extract_failed_pages_from_errors(errors: list[str]) -> list[int]:
     """Extract failed page numbers from error messages.
 
     Docling error messages follow the pattern "Page N: <error>" (e.g.,
-    "Page 26: std::bad_alloc"). Even when docling includes failed pages
+    "Page 26: std::bad_alloc") or, when no error description is available,
+    a bare "Page N" with no colon. Even when docling includes failed pages
     in the pages dict as empty entries, the error messages reliably
     indicate which pages actually failed.
 
@@ -129,7 +130,7 @@ def _extract_failed_pages_from_errors(errors: list[str]) -> list[int]:
         Sorted list of 1-indexed page numbers that failed.
     """
     failed = set()
-    page_pattern = re.compile(r"^Page\s+(\d+):")
+    page_pattern = re.compile(r"^Page\s+(\d+)(?::|$)")
     for msg in errors:
         m = page_pattern.match(msg)
         if m:
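
To make the effect of the regex change concrete, the following sketch runs the old and new patterns over a few sample messages. The two patterns are copied from the diff; the `page_number` helper and the sample strings other than "Page 26: std::bad_alloc" and "Page 26" are illustrative assumptions, not part of the commit.

```python
import re

old_pattern = re.compile(r"^Page\s+(\d+):")         # before this commit
new_pattern = re.compile(r"^Page\s+(\d+)(?::|$)")   # after this commit

samples = [
    "Page 26: std::bad_alloc",  # description present
    "Page 26",                  # bare form, no colon
    "Page 26 failed",           # neither a colon nor end of string follows
    "Pages 26",                 # does not start with "Page" plus whitespace
]


def page_number(pattern, msg):
    """Return the captured page number, or None if the message does not match."""
    m = pattern.match(msg)
    return int(m.group(1)) if m else None


old_results = [page_number(old_pattern, s) for s in samples]
new_results = [page_number(new_pattern, s) for s in samples]
# old_results == [26, None, None, None]
# new_results == [26, 26, None, None]
```

The alternation `(?::|$)` still rejects messages where the number is followed by other text, so the relaxed pattern only adds the bare form. Note that Python's `$` also matches just before a trailing newline, so "Page 26\n" is accepted as well.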

python/opendataloader-pdf/tests/test_hybrid_server_partial_success.py

Lines changed: 4 additions & 0 deletions
@@ -207,6 +207,10 @@ def test_no_page_errors(self):
         ]
         assert _extract_failed_pages_from_errors(errors) == []
 
+    def test_bare_page_no_colon(self):
+        """Bare 'Page N' (no colon) should be matched when error_msg is empty."""
+        assert _extract_failed_pages_from_errors(["Page 26"]) == [26]
+
 
 class TestBuildConversionResponseErrorParsing:
     """Tests for failed page detection via error message parsing.
