Commit c68c696
fix(hybrid): match bare "Page N" in failed-page error parser (#467)
* Fix: detect failed pages via error messages and gap detection union
Docling may include failed pages as empty entries in the pages dict,
making gap-only detection miss them. Add error message parsing
("Page N: <error>") and union both strategies so neither failure mode
is lost.
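
A minimal sketch of the union strategy described above, not the wrapper's actual code. The function name `find_failed_pages`, its parameters, and the shape of the `pages` dict are assumptions made for illustration; only the "Page N: <error>" message format comes from this commit message.

```python
import re

# Pattern for the "Page N: <error>" form named in the commit message.
_PAGE_ERROR = re.compile(r"^Page\s+(\d+):")


def find_failed_pages(requested, pages, errors):
    """Return the sorted page numbers that failed, using both strategies.

    requested: iterable of page numbers that were sent for conversion (assumed)
    pages:     dict keyed by page number, as returned by the backend (assumed)
    errors:    list of error message strings (assumed)
    """
    # Gap detection: requested pages with no entry at all in the pages dict.
    gaps = {n for n in requested if n not in pages}

    # Error parsing: pages named in an error message, which also catches
    # failed pages that are present in the dict as empty entries.
    from_errors = set()
    for message in errors:
        match = _PAGE_ERROR.match(message)
        if match:
            from_errors.add(int(match.group(1)))

    # Set union removes duplicates when both strategies report the same page.
    return sorted(gaps | from_errors)
```

Using sets for both strategies means a page reported by a gap and by an error message appears only once in the result.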
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: i-Veni-Vidi-Vici <biuld1234@gmail.com>
* Fix: correct misleading docstring in missing pages key test
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: i-Veni-Vidi-Vici <biuld1234@gmail.com>
* Add: overlap dedup test for gap and error parsing union
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: i-Veni-Vidi-Vici <biuld1234@gmail.com>
* fix(hybrid): match bare "Page N" in failed-page error parser
Docling's StandardPdfPipeline emits "Page N: <error>" when an error
description is available, or a bare "Page N" with no colon when the
error description is empty (standard_pdf_pipeline.py: error_msg falsy
branch). The previous regex r"^Page\s+(\d+):" required a colon and
silently dropped the bare form.
Relax to r"^Page\s+(\d+)(?::|$)" so both forms extract the page number.
Defensive only: in the wrapper's current pipeline this branch is rare
because failed_pages always carry a non-empty RuntimeError message, but
the change costs nothing and protects against future docling/wrapper
changes (e.g. switch to PaginatedPipeline / VlmPipeline, or new error
paths that omit the description).
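
The difference between the two patterns can be checked directly. This is an illustrative sketch: the two regexes are quoted from this commit message, while the names `OLD_PATTERN`, `NEW_PATTERN`, and `page_number` are hypothetical helpers, not identifiers from the repository.

```python
import re

# Both patterns are quoted from the commit message above.
OLD_PATTERN = re.compile(r"^Page\s+(\d+):")
NEW_PATTERN = re.compile(r"^Page\s+(\d+)(?::|$)")


def page_number(pattern, message):
    """Return the page number extracted from an error message, or None."""
    match = pattern.match(message)
    return int(match.group(1)) if match else None


# The old pattern requires a colon, so it drops the bare form.
print(page_number(OLD_PATTERN, "Page 7: boom"))   # 7
print(page_number(OLD_PATTERN, "Page 7"))         # None

# The relaxed pattern accepts a colon or end of string after the digits.
print(page_number(NEW_PATTERN, "Page 7: boom"))   # 7
print(page_number(NEW_PATTERN, "Page 7"))         # 7

# Other text after the number still does not match.
print(page_number(NEW_PATTERN, "Page 7 failed"))  # None
```

The `(?::|$)` group is non-capturing, so the page number stays in group 1 for both message forms.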
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: i-Veni-Vidi-Vici <biuld1234@gmail.com>
---------
Signed-off-by: i-Veni-Vidi-Vici <biuld1234@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 60101c5 commit c68c696
2 files changed
Lines changed: 7 additions & 2 deletions
File tree
- python/opendataloader-pdf
- src/opendataloader_pdf
- tests
Lines changed: 3 additions & 2 deletions
(Diff content not captured. Original line 121 was replaced by new lines 121-122; original line 132 was replaced by new line 133.)
Lines changed: 4 additions & 0 deletions
(Diff content not captured. New lines 210-213 were added after original line 209.)