fix: promote implicit header rows into struct-tree tables #59
Open
abimaelmartell wants to merge 1 commit into
Some tagged PDFs cover only the data rows under <Table> / <TR> /
<TD>, leaving the visible column headers above the grid untagged.
Those headers then slip through to the heuristic fallback and
reassemble into a spurious mini-table next to the real one.
After detecting a struct-tree table, scan unclaimed items within
60pt above its top. Items that start at one of the table's column
left-edges (within half a column width) are grouped by Y-row,
coalesced per column across wrapped lines, and prepended as a
single header row to the table's cells.
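The scan described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Item` type, the function names, the 2pt row tolerance, and the assumption that Y grows downward are all hypothetical. Only the 60pt window and the half-column-width alignment rule come from the description.

```rust
/// A positioned text fragment (hypothetical type for this sketch).
#[derive(Debug, Clone)]
struct Item {
    text: String,
    x: f32, // left edge, in points
    y: f32, // top edge, in points; assumed to grow downward
}

/// Scan window above the table top, taken from the description.
const SCAN_WINDOW_PT: f32 = 60.0;
/// Assumed tolerance for treating two items as the same visual row.
const ROW_TOLERANCE_PT: f32 = 2.0;

/// Nearest column whose left edge is within half a column width of `x`.
fn column_for(x: f32, col_lefts: &[f32], col_widths: &[f32]) -> Option<usize> {
    col_lefts
        .iter()
        .zip(col_widths)
        .enumerate()
        .map(|(i, (left, width))| (i, (x - *left).abs(), *width / 2.0))
        .filter(|(_, dist, tol)| dist <= tol)
        .min_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(i, _, _)| i)
}

/// Build one header row from unclaimed items just above the table.
fn collect_header_row(
    unclaimed: &[Item],
    table_top: f32,
    col_lefts: &[f32],
    col_widths: &[f32],
) -> Vec<String> {
    // Keep items inside the window above the table that align with a column.
    let mut hits: Vec<(usize, &Item)> = unclaimed
        .iter()
        .filter(|it| it.y < table_top && table_top - it.y <= SCAN_WINDOW_PT)
        .filter_map(|it| column_for(it.x, col_lefts, col_widths).map(|c| (c, it)))
        .collect();
    hits.sort_by(|a, b| a.1.y.total_cmp(&b.1.y));

    // Group by Y-row: a new row starts when Y jumps past the tolerance.
    let mut rows: Vec<Vec<(usize, &Item)>> = Vec::new();
    let mut last_y = f32::NEG_INFINITY;
    for hit in hits {
        if hit.1.y - last_y > ROW_TOLERANCE_PT {
            rows.push(Vec::new());
        }
        last_y = hit.1.y;
        rows.last_mut().unwrap().push(hit);
    }

    // Coalesce per column, top row first, so wrapped lines join in reading order.
    let mut cells = vec![String::new(); col_lefts.len()];
    for mut row in rows {
        row.sort_by(|a, b| a.1.x.total_cmp(&b.1.x));
        for (col, item) in row {
            if !cells[col].is_empty() {
                cells[col].push(' ');
            }
            cells[col].push_str(item.text.trim());
        }
    }
    cells
}

/// Prepend the header row to the table's cells.
fn prepend_header(table_cells: &mut Vec<Vec<String>>, header: Vec<String>) {
    table_cells.insert(0, header);
}
```

In this sketch a wrapped header such as "Score" over "(avg)" lands in one cell as "Score (avg)", because both fragments resolve to the same column and are appended in top-to-bottom order.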
Guards protect against common false positives:
- ≥75% of columns must be filled (rejects stray page headers or
event banners that only touch one or two columns)
- No cell may be a caption marker ("Table 1", "Appendix Table A2",
"Figure 2"), prose (starts lowercase, > 12 words), too short
(< 3 chars), or a duplicate of another filled cell
- Non-alphanumeric fragments (lone "." or "-") are skipped before
merging so a stray period can't lead a row
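The guards listed above might look something like the sketch below. All function names are hypothetical, and several details are assumptions rather than the PR's actual rules: the caption check only recognises the three prefixes given as examples, "prose" is read as "starts lowercase or has more than 12 words", and duplicates are compared as exact strings.

```rust
use std::collections::HashSet;

/// True for fragments with no letters or digits, such as a lone "." or "-".
/// Per the description, these are skipped before merging.
fn is_noise_fragment(text: &str) -> bool {
    !text.chars().any(|c| c.is_alphanumeric())
}

/// Assumed caption test: a known prefix followed by a label containing a digit,
/// which covers "Table 1", "Appendix Table A2", and "Figure 2".
fn is_caption_marker(cell: &str) -> bool {
    let lower = cell.trim().to_lowercase();
    let rest = ["appendix table ", "table ", "figure "]
        .iter()
        .find_map(|p| lower.strip_prefix(*p));
    matches!(
        rest,
        Some(label) if label
            .split_whitespace()
            .next()
            .map_or(false, |t| t.chars().any(|c| c.is_ascii_digit()))
    )
}

/// Assumed reading of the prose guard: either condition rejects the cell.
fn is_prose(cell: &str) -> bool {
    let starts_lower = cell.chars().next().map_or(false, |c| c.is_lowercase());
    starts_lower || cell.split_whitespace().count() > 12
}

/// Decide whether a candidate header row passes every guard.
fn accept_header_row(cells: &[String]) -> bool {
    let filled: Vec<&str> = cells
        .iter()
        .map(|c| c.trim())
        .filter(|c| !c.is_empty())
        .collect();
    // At least 75% of columns filled, written in integer form to avoid floats.
    if cells.is_empty() || filled.len() * 4 < cells.len() * 3 {
        return false;
    }
    let mut seen: HashSet<&str> = HashSet::new();
    filled.iter().all(|cell| {
        cell.chars().count() >= 3          // not too short
            && !is_caption_marker(cell)    // not "Table 1" and similar
            && !is_prose(cell)             // not running text
            && seen.insert(*cell)          // not a duplicate of an earlier cell
    })
}
```

Rejecting the whole row when any single cell fails keeps the check conservative: a candidate that is partly a caption or partly prose is left out of the table rather than promoted as a half-correct header.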
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Some tagged PDFs mark only the data rows under <Table>/<TR>/<TD>, leaving the visible column headers above the grid untagged. Those headers then slip through to the heuristic fallback and reassemble into a spurious mini-table next to the real one (e.g. the Anthropic system card page 241 summary table: struct-tree covered 3 data rows but the header row was untagged).
After detecting a struct-tree table, scan unclaimed items within 60pt above its top. Items that start at one of the table's column left-edges are grouped by Y-row, coalesced per column across wrapped lines, and prepended as a single header row to the table's cells.
Guards
Protect against common false positives from page headers, captions, and stray text above the grid:
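- ≥75% of columns must be filled (rejects stray page headers or event banners that only touch one or two columns)
- No cell may be a caption marker (Table 1, Appendix Table A2, Figure 2), prose (starts lowercase, > 12 words), too short (< 3 chars), or a duplicate of another filled cell
- Non-alphanumeric fragments (lone . or -) are skipped before merging so a stray period can't lead a row
Impact
char_accuracy 0.783 → 0.786, other metrics flat.
Test plan
- cargo fmt + cargo clippy --release -- -D warnings clean.
- cargo test --release: 384 unit + 107 integration all pass.
- bench.py score: all metrics up or flat.
🤖 Generated with Claude Code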