Context
Parent issue: #354 — reported by @jsmount on v2.0.2 with CERAGEM BALANCE user manual (52 pages).
Problem
Footer detection over-extends on pages where body text is spatially close to the footer region.
Pages 19–20: The footer bounding box extends up to y=135 (h≈100) instead of the normal y=43 (h≈9). This absorbs the body note ※ 출수 중 출수 버튼을 터치하면 출수가 정지됩니다. (English: "Touching the dispense button while water is dispensing stops the flow."; located at y=116) into the footer element.
Page 21: Same document, same footer pattern, but here the footer is correctly detected with a compact bbox (h=9.2).
Pages 6, 20: Footer text (06 CERAGEM BALANCE USER MANUAL, 20 CERAGEM BALANCE USER MANUAL) leaks into the Markdown output instead of being filtered out.
Data
| Page | Footer bbox height | Body text absorbed? | Footer text in MD? |
|------|--------------------|---------------------|--------------------|
| 18 | 9.2 | No | No |
| 19 | 100.4 | Yes (※ note) | No |
| 20 | 100.4 | Yes (※ note) | Yes |
| 21 | 9.2 | No | No |
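The heights in the table suggest a simple regression check: treat the smallest observed footer height as the compact baseline and flag any page whose footer is much taller. The sketch below is illustrative only. The helper name `flag_overextended` and the 2x tolerance are assumptions made for this example, not part of opendataloader-pdf.

```python
# Footer bbox heights per page, copied from the Data table in this issue.
FOOTER_HEIGHTS = {18: 9.2, 19: 100.4, 20: 100.4, 21: 9.2}


def flag_overextended(heights, tolerance=2.0):
    """Return sorted page numbers whose footer height exceeds tolerance x the smallest height.

    Hypothetical helper: both the name and the 2x tolerance are illustrative choices.
    """
    if not heights:
        return []
    # Assumption: the most compact footer in the document is the correct one.
    baseline = min(heights.values())
    return sorted(page for page, h in heights.items() if h > tolerance * baseline)


if __name__ == "__main__":
    print(flag_overextended(FOOTER_HEIGHTS))
```

With the values reported here, this flags pages 19 and 20 and leaves pages 18 and 21 alone.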
Reproduction
pip install opendataloader-pdf==2.0.2
# PDF: CERAGEM_BALANCE_사용설명서.pdf (in odl-test-fixtures/inbox/354-text-ordering-footer-detection/)
opendataloader-pdf convert --format json --pages 19,20,21 CERAGEM_BALANCE_사용설명서.pdf
# Check footer bounding box heights in JSON output
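To avoid reading the JSON by hand, a small script can walk the output and report each footer's bbox height. This is a sketch under stated assumptions: the key names (`type`, `page number`, `bounding box`), the `footer` type value, and the `[left, bottom, right, top]` coordinate order are guesses about the output schema and may need adjusting to match the actual file. `footer_heights` and `print_footer_heights` are hypothetical helpers.

```python
import json

# Assumed schema details; adjust these to match the real JSON output.
TYPE_KEY = "type"
PAGE_KEY = "page number"
BBOX_KEY = "bounding box"  # assumed order: [left, bottom, right, top]


def footer_heights(node):
    """Recursively collect (page, height) for every element typed as a footer."""
    found = []
    if isinstance(node, dict):
        bbox = node.get(BBOX_KEY)
        is_footer = str(node.get(TYPE_KEY, "")).lower() == "footer"
        if is_footer and isinstance(bbox, list) and len(bbox) == 4:
            # Height = top - bottom under the assumed coordinate order.
            found.append((node.get(PAGE_KEY), round(bbox[3] - bbox[1], 1)))
        for value in node.values():
            found.extend(footer_heights(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(footer_heights(item))
    return found


def print_footer_heights(json_path):
    """Load a converter JSON file and print one line per detected footer."""
    with open(json_path, encoding="utf-8") as fh:
        for page, height in footer_heights(json.load(fh)):
            print(f"page {page}: footer height {height}")
```

Calling `print_footer_heights` with the path to the JSON produced by the convert command should make the over-extended footers on pages 19 and 20 stand out against page 21, provided the assumed keys match.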
Labels
bug