
fix: prevent footer detection from absorbing distant body text #386

Merged
bundolee merged 2 commits into main from fix/385-footer-absorbs-body-text on Apr 6, 2026
@bundolee bundolee commented Apr 3, 2026

Objective

Body text that repeats on adjacent pages is incorrectly classified as footer content, causing it to disappear from Markdown output (#385, parent #354). One example is the note ※ 출수 중 출수 버튼을 터치하면 출수가 정지됩니다. ("Touching the dispense button during dispensing stops the dispensing.") on pages 19–20 of the reporter's PDF.

Approach

Add a spatial proximity check in HeaderFooterProcessor.getNumberOfHeaderOrFooterContentsForEachPage(). When expanding the footer region upward (or header region downward), the next candidate element must be within 30pt of the previously accepted element. If the gap exceeds this threshold, expansion stops for that page.

This prevents the cross-page pattern matcher from pulling distant body text into the footer just because it happens to repeat across pages.
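
The expansion rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in `HeaderFooterProcessor`: the class `FooterExpansionSketch`, the `Box` type, and `countFooterElements` are hypothetical names, the candidates are assumed to have already passed the cross-page repetition check, and the y-coordinates are chosen for illustration (PDF coordinates, y growing upward).

```java
import java.util.Arrays;
import java.util.List;

public class FooterExpansionSketch {
    // The PR describes a 30pt limit between consecutive footer elements.
    static final double MAX_GAP = 30.0;

    /** Hypothetical stand-in for the vertical extent of a text line. */
    static final class Box {
        final double bottomY;
        final double topY;

        Box(double bottomY, double topY) {
            this.bottomY = bottomY;
            this.topY = topY;
        }
    }

    /**
     * Counts how many candidates, ordered from the bottom of the page upward,
     * are accepted into the footer before a gap larger than MAX_GAP stops expansion.
     */
    static int countFooterElements(List<Box> candidatesBottomUp) {
        Box previous = null;
        int accepted = 0;
        for (Box candidate : candidatesBottomUp) {
            if (previous != null) {
                // Whitespace between the previous element's top and the candidate's bottom.
                double gap = candidate.bottomY - previous.topY;
                if (gap > MAX_GAP) {
                    break; // too far away: stop expanding on this page
                }
            }
            previous = candidate;
            accepted++;
        }
        return accepted;
    }

    public static void main(String[] args) {
        Box firstFooterLine = new Box(35, 44);
        Box secondFooterLine = new Box(55, 67);  // 11pt above the first line
        Box repeatedNote = new Box(200, 212);    // illustrative distant body text

        // The two close lines are grouped; the distant note is left in the body.
        System.out.println(countFooterElements(
                Arrays.asList(firstFooterLine, secondFooterLine, repeatedNote)));
        // With only one footer line, the distant note is still not absorbed.
        System.out.println(countFooterElements(
                Arrays.asList(firstFooterLine, repeatedNote)));
    }
}
```

Header detection would mirror this with candidates ordered from the top of the page downward and the gap measured in the opposite direction.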

Evidence

Converted the reporter's 52-page CERAGEM BALANCE PDF with the fixed JAR:

| Scenario | Before | After |
|---|---|---|
| Page 19 footer height | 100.4pt (2 children) | 9.2pt (1 child) |
| Page 20 footer height | 100.4pt (2 children) | 9.2pt (1 child) |
| Page 21 footer (control) | 9.2pt (1 child) | 9.2pt (unchanged) ✅ |
| ※ note on page 19 | Missing from MD (absorbed into footer) | Present as body paragraph |
| ※ note on page 20 | Missing from MD (absorbed into footer) | Present as body paragraph |
| ※ note count in full MD | 5 occurrences | 7 occurrences (2 restored) ✅ |
| Footer text leak in MD | 0 | 0 (no regression) ✅ |
| All other page footers (1–52) | Correctly detected | Correctly detected (no regression) ✅ |
| Unit tests | 34 pass | 35 pass (1 new) ✅ |

Fixes #385

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Improved header/footer detection with stricter vertical adjacency checks to avoid merging non-adjacent lines.
    • Prevented repeated body text from being misclassified as footer content on multi-page documents.
  • Tests

    • Added tests validating that close footer lines are grouped and repeated body notes remain as body content.

Body text that repeats across adjacent pages (e.g. a "※" note appearing
on consecutive pages) was incorrectly classified as footer content by the
cross-page pattern matcher. Add a spatial proximity check so that footer
expansion stops when the next candidate element is more than 30pt away
from the previously accepted footer element.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

codecov Bot commented Apr 3, 2026

Codecov Report

❌ Patch coverage is 83.33333% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...taloader/pdf/processors/HeaderFooterProcessor.java | 83.33% | 1 Missing and 1 partial ⚠️ |



coderabbitai Bot commented Apr 3, 2026

Walkthrough

Introduced a 30.0-unit vertical adjacency constraint for header/footer candidate selection: candidates farther than the threshold from the previously accepted header/footer are rejected (inserted as null) to avoid absorbing distant body text.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Header/Footer Detection Logic**<br>`java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/HeaderFooterProcessor.java` | Added `MAX_HEADER_FOOTER_GAP` constant and a private `isAdjacentToExistingHeaderOrFooter(...)` helper. During per-page candidate construction, candidates are gated by vertical adjacency to the previously accepted header/footer; excessive gaps yield null slots instead of the candidate. |
| **Header/Footer Processor Tests**<br>`java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/HeaderFooterProcessorTest.java` | Refactored setup into `initContainers()` and updated existing test. Added `testRepeatedBodyTextNotAbsorbedIntoFooter()` and `testCloseFooterLinesAreGrouped()` to validate that nearby repeated notes are not absorbed and that closely spaced footer lines are grouped. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • MaximPlusov
  • LonelyMidoriya
  • hnc-jglee
  • hyunhee-jo

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: preventing footer detection from absorbing distant body text, which directly addresses the core issue. |
| Linked Issues check | ✅ Passed | The code changes implement the spatial proximity check (30 pt gap threshold) for header/footer expansion as required by issue #385, with test coverage for the fix. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to HeaderFooterProcessor and its tests; they directly address the footer over-extension problem without introducing unrelated modifications. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/HeaderFooterProcessor.java`:
- Around line 205-209: The gap calculation currently uses outer edges and
overstates whitespace; replace it to use the nearest adjacent edges between
previousElement and candidate by computing the minimum distance of the two
possible edge pairs. Specifically, in the block that assigns gap (referencing
isHeaderDetection, previousElement, candidate, getTopY(), getBottomY()), set gap
= min(abs(previousElement.getBottomY() - candidate.getTopY()),
abs(previousElement.getTopY() - candidate.getBottomY())) so the nearest edges
determine the whitespace.

In
`@java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/HeaderFooterProcessorTest.java`:
- Around line 84-158: The test testRepeatedBodyTextNotAbsorbedIntoFooter
currently only exercises the "too far" gap case and misses verifying that
closely spaced footer lines are grouped; update this test to add a
positive-control scenario where a footer has two lines close together (e.g., one
TextLine with bounding box y=35–44 and another with y=55–67) so
HeaderFooterProcessor.processHeadersAndFooters is forced to group them as a
single footer; modify the constructed page contents for at least one page to
include both footer lines (keep using TextLine/TextChunk and BoundingBox as in
the test) and then assert that the resulting SemanticHeaderOrFooter for that
page contains both lines (size == 2) and still has SemanticType.FOOTER, while
leaving other assertions intact.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0f4c01c7-0863-41ce-906b-5bcf32ad51bb

📥 Commits

Reviewing files that changed from the base of the PR and between cb0c5b5 and 1a855ca.

📒 Files selected for processing (2)
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/HeaderFooterProcessor.java
  • java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/HeaderFooterProcessorTest.java

Address CodeRabbit review feedback:
- Gap calculation now uses nearest edges (bottomY-topY) instead of outer
  edges, preventing multi-line footers with <30pt actual spacing from
  being incorrectly rejected.
- Add testCloseFooterLinesAreGrouped to verify that two footer lines
  11pt apart are grouped into a single footer element.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
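
To see why the choice of edges matters, here is a small worked comparison using the two footer lines suggested in the review (y=35–44 and y=55–67). The sketch is illustrative only: `GapSketch` and both method names are hypothetical, and `outerEdgeGap` is one plausible reading of "outer edges", since the pre-fix formula is not shown in this conversation. `nearestEdgeGap` follows the min-of-two-edge-pairs formula from the review comment.

```java
public class GapSketch {
    // Constant name taken from the walkthrough; the 30pt value comes from the PR description.
    static final double MAX_HEADER_FOOTER_GAP = 30.0;

    /** Distance spanning both boxes (assumed interpretation of the "outer edges" gap). */
    static double outerEdgeGap(double prevBottom, double prevTop,
                               double candBottom, double candTop) {
        return Math.max(prevTop, candTop) - Math.min(prevBottom, candBottom);
    }

    /** Whitespace between the closest pair of edges, as proposed in the review. */
    static double nearestEdgeGap(double prevBottom, double prevTop,
                                 double candBottom, double candTop) {
        return Math.min(Math.abs(prevBottom - candTop),
                        Math.abs(prevTop - candBottom));
    }

    public static void main(String[] args) {
        double outer = outerEdgeGap(35, 44, 55, 67);     // 67 - 35 = 32
        double nearest = nearestEdgeGap(35, 44, 55, 67); // min(32, 11) = 11

        // With outer edges the second line exceeds the threshold and would be rejected.
        System.out.println(outer + " within threshold: " + (outer <= MAX_HEADER_FOOTER_GAP));
        // With nearest edges the two lines are 11pt apart and are grouped.
        System.out.println(nearest + " within threshold: " + (nearest <= MAX_HEADER_FOOTER_GAP));
    }
}
```

Under these assumptions the outer-edge measure includes the heights of both lines, so a two-line footer with only 11pt of real whitespace would measure 32pt and fail the 30pt check.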

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/HeaderFooterProcessorTest.java`:
- Around line 35-40: Move the manual test setup in initContainers() into a JUnit
`@BeforeEach` method so it runs automatically before each test in
HeaderFooterProcessorTest; replace calls to initContainers() from individual
test methods by annotating a new or the existing initContainers() with
`@BeforeEach` (import org.junit.jupiter.api.BeforeEach) and keep the same body
that sets StaticContainers.setIsDataLoader(...),
StaticContainers.setIsIgnoreCharactersWithoutUnicode(...),
StaticResources.setDocument(null), and StaticLayoutContainers.clearContainers().

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d6f34404-b3cd-442d-84da-c16357d8b71c

📥 Commits

Reviewing files that changed from the base of the PR and between 1a855ca and cab6fdd.

📒 Files selected for processing (2)
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/HeaderFooterProcessor.java
  • java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/HeaderFooterProcessorTest.java

@bundolee bundolee merged commit 45912a5 into main Apr 6, 2026
11 of 12 checks passed
@bundolee bundolee deleted the fix/385-footer-absorbs-body-text branch April 6, 2026 01:57

Development

Successfully merging this pull request may close these issues.

Footer detection absorbs body text on pages with nearby notes

2 participants