fix: prevent footer detection from absorbing distant body text#386
fix: prevent footer detection from absorbing distant body text#386
Conversation
Body text that repeats across adjacent pages (e.g. a "※" note appearing on consecutive pages) was incorrectly classified as footer content by the cross-page pattern matcher. Add a spatial proximity check so that footer expansion stops when the next candidate element is more than 30pt away from the previously accepted footer element. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
WalkthroughIntroduced a 30.0-unit vertical adjacency constraint for header/footer candidate selection: candidates farther than the threshold from the previously accepted header/footer are rejected (inserted as null) to avoid absorbing nearby body text. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/HeaderFooterProcessor.java`:
- Around line 205-209: The gap calculation currently uses outer edges and
overstates whitespace; replace it to use the nearest adjacent edges between
previousElement and candidate by computing the minimum distance of the two
possible edge pairs. Specifically, in the block that assigns gap (referencing
isHeaderDetection, previousElement, candidate, getTopY(), getBottomY()), set gap
= min(abs(previousElement.getBottomY() - candidate.getTopY()),
abs(previousElement.getTopY() - candidate.getBottomY())) so the nearest edges
determine the whitespace.
In
`@java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/HeaderFooterProcessorTest.java`:
- Around line 84-158: The test testRepeatedBodyTextNotAbsorbedIntoFooter
currently only exercises the "too far" gap case and misses verifying that
closely spaced footer lines are grouped; update this test to add a
positive-control scenario where a footer has two lines close together (e.g., one
TextLine with bounding box y=35–44 and another with y=55–67) so
HeaderFooterProcessor.processHeadersAndFooters is forced to group them as a
single footer; modify the constructed page contents for at least one page to
include both footer lines (keep using TextLine/TextChunk and BoundingBox as in
the test) and then assert that the resulting SemanticHeaderOrFooter for that
page contains both lines (size == 2) and still has SemanticType.FOOTER, while
leaving other assertions intact.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 0f4c01c7-0863-41ce-906b-5bcf32ad51bb
📒 Files selected for processing (2)
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/HeaderFooterProcessor.javajava/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/HeaderFooterProcessorTest.java
Address CodeRabbit review feedback: - Gap calculation now uses nearest edges (bottomY-topY) instead of outer edges, preventing multi-line footers with <30pt actual spacing from being incorrectly rejected. - Add testCloseFooterLinesAreGrouped to verify that two footer lines 11pt apart are grouped into a single footer element. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/HeaderFooterProcessorTest.java`:
- Around line 35-40: Move the manual test setup in initContainers() into a JUnit
`@BeforeEach` method so it runs automatically before each test in
HeaderFooterProcessorTest; replace calls to initContainers() from individual
test methods by annotating a new or the existing initContainers() with
`@BeforeEach` (import org.junit.jupiter.api.BeforeEach) and keep the same body
that sets StaticContainers.setIsDataLoader(...),
StaticContainers.setIsIgnoreCharactersWithoutUnicode(...),
StaticResources.setDocument(null), and StaticLayoutContainers.clearContainers().
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: d6f34404-b3cd-442d-84da-c16357d8b71c
📒 Files selected for processing (2)
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/HeaderFooterProcessor.javajava/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/HeaderFooterProcessorTest.java
Objective
Body text that repeats on adjacent pages — such as
※ 출수 중 출수 버튼을 터치하면 출수가 정지됩니다.on pages 19–20 of the reporter's PDF — is incorrectly classified as footer content, causing it to disappear from Markdown output (#385, parent #354).Approach
Add a spatial proximity check in
HeaderFooterProcessor.getNumberOfHeaderOrFooterContentsForEachPage(). When expanding the footer region upward (or header region downward), the next candidate element must be within 30pt of the previously accepted element. If the gap exceeds this threshold, expansion stops for that page.This prevents the cross-page pattern matcher from pulling distant body text into the footer just because it happens to repeat across pages.
Evidence
Converted the reporter's 52-page CERAGEM BALANCE PDF with the fixed JAR:
Fixes #385
🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Tests