Footer detection absorbs body text on pages with nearby notes #385

@bundolee

Context

Parent issue: #354, reported by @jsmount on v2.0.2 with the CERAGEM BALANCE user manual (52 pages).

Problem

Footer detection over-extends on pages where body text is spatially close to the footer region.

Pages 19–20: The footer bounding box extends to y=135 (h≈100) instead of the normal y=43 (h≈9). This absorbs the body note ※ 출수 중 출수 버튼을 터치하면 출수가 정지됩니다. (English: "Touching the dispense button while water is dispensing stops the flow."; located at y=116) into the footer element.

Page 21: Same document, same footer pattern, but here the footer is correctly detected with a compact bbox (h=9.2).

Pages 6, 20: Footer text (06 CERAGEM BALANCE USER MANUAL, 20 CERAGEM BALANCE USER MANUAL) leaks into Markdown output instead of being filtered.

Data

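| Page | Footer bbox height | Body text absorbed? | Footer text in MD? |
|------|--------------------|---------------------|--------------------|
| 18 | 9.2 | No | No |
| 19 | 100.4 | Yes (※ note) | No |
| 20 | 100.4 | Yes (※ note) | Yes |
| 21 | 9.2 | No | No |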
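
The oversized footers in the table above can be flagged mechanically. The following is a minimal sketch, not part of opendataloader-pdf: the function name `flag_oversized_footers` and the threshold factor of 3 are illustrative assumptions, and it assumes at least one page in the sample has a correctly detected (compact) footer to serve as the baseline.

```python
# Footer bbox heights per page, copied from the table above.
FOOTER_HEIGHTS = {18: 9.2, 19: 100.4, 20: 100.4, 21: 9.2}


def flag_oversized_footers(heights, factor=3.0):
    """Return pages whose footer height exceeds `factor` x the smallest height.

    The smallest observed height is used as the baseline, which assumes at
    least one footer was detected correctly. `factor` is an arbitrary
    illustrative threshold, not a value used by opendataloader-pdf.
    """
    baseline = min(heights.values())
    return sorted(page for page, h in heights.items() if h > factor * baseline)


print(flag_oversized_footers(FOOTER_HEIGHTS))  # → [19, 20]
```

The minimum is used as the baseline rather than the median because half of the sampled pages are affected, so the median would sit between the good and bad values.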

Reproduction

pip install opendataloader-pdf==2.0.2
# PDF: CERAGEM_BALANCE_사용설명서.pdf (in odl-test-fixtures/inbox/354-text-ordering-footer-detection/)
opendataloader-pdf convert --format json --pages 19,20,21 CERAGEM_BALANCE_사용설명서.pdf
# Check footer bounding box heights in JSON output
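
For the last step, a rough sketch of how the footer heights might be pulled out of the JSON output. The JSON schema is not shown in this issue, so every key name here (`type`, `bounding box`, `page number`), the `footer` type value, and the `[left, bottom, right, top]` coordinate order are assumptions exposed as parameters; adjust them to match the actual output. The `SAMPLE` data is made up for illustration and is not real converter output.

```python
import json


def footer_heights(node, type_key="type", bbox_key="bounding box",
                   page_key="page number", footer_type="footer"):
    """Yield (page, height) for every footer-like element under `node`.

    All key names and the [left, bottom, right, top] bbox order are
    assumptions about the schema, not confirmed by the issue.
    """
    if isinstance(node, dict):
        bbox = node.get(bbox_key)
        is_footer = str(node.get(type_key, "")).lower() == footer_type
        if is_footer and isinstance(bbox, list) and len(bbox) == 4:
            yield node.get(page_key), round(bbox[3] - bbox[1], 1)
        # Recurse into all values so nested containers are covered.
        for value in node.values():
            yield from footer_heights(value, type_key, bbox_key,
                                      page_key, footer_type)
    elif isinstance(node, list):
        for item in node:
            yield from footer_heights(item, type_key, bbox_key,
                                      page_key, footer_type)


def footer_heights_from_file(path):
    """Load a JSON file and return its footer heights as a list."""
    with open(path, encoding="utf-8") as fh:
        return list(footer_heights(json.load(fh)))


# Made-up sample mimicking the assumed structure (not real output).
SAMPLE = {"kids": [
    {"type": "footer", "page number": 19, "bounding box": [50, 34.5, 500, 134.9]},
    {"type": "paragraph", "page number": 19, "bounding box": [50, 200, 500, 220]},
    {"type": "footer", "page number": 21, "bounding box": [50, 34.5, 500, 43.7]},
]}

for page, height in footer_heights(SAMPLE):
    print(page, height)
```

On real output, `footer_heights_from_file` would be called with the path of the JSON file written by the convert command above.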

Labels

bug