Skip to content

fix: promote implicit header rows into struct-tree tables#59

Open
abimaelmartell wants to merge 1 commit into
mainfrom
abimaelmartell/promote-implicit-header-rows
Open

fix: promote implicit header rows into struct-tree tables#59
abimaelmartell wants to merge 1 commit into
mainfrom
abimaelmartell/promote-implicit-header-rows

Conversation

@abimaelmartell
Copy link
Copy Markdown
Member

Summary

Some tagged PDFs mark only the data rows under <Table> / <TR> / <TD>, leaving the visible column headers above the grid untagged. Those headers then slip through to the heuristic fallback and reassemble into a spurious mini-table next to the real one (e.g. the Anthropic system card page 241 summary table: struct-tree covered 3 data rows but the header row was untagged).

After detecting a struct-tree table, scan unclaimed items within 60pt above its top. Items that start at one of the table's column left-edges are grouped by Y-row, coalesced per column across wrapped lines, and prepended as a single header row to the table's cells.

Guards

Protect against common false positives from page headers, captions, and stray text above the grid:

  • ≥75% of columns must be filled (rejects stray page headers or event banners that only touch one or two columns)
  • No cell may be a caption marker (Table 1, Appendix Table A2, Figure 2)
  • No cell may be prose (starts lowercase, > 12 words) or suspiciously short (< 3 chars)
  • No two filled cells may be identical (rejects banners that happen to align across columns)
  • Non-alphanumeric fragments (a lone . or -) are skipped before merging so a stray period can't lead a row

Impact

  • 1 pdf-evals snapshot updated — the Anthropic Mythos system card. Multiple "Summary of Claude's answers" tables now render with proper column headers instead of:
    1. an H1 mashing together the "Category" word with the title,
    2. a data table missing its header row,
    3. a stray 2-col mini-table below holding the wrapped header fragments.
  • Composite score 0.679 (flat +0.001), char_accuracy 0.783 → 0.786, other metrics flat.
  • Two new unit tests cover header merging and the false-positive rejection path.

Test plan

  • cargo fmt + cargo clippy --release -- -D warnings clean.
  • cargo test --release — 384 unit + 107 integration all pass.
  • pdf-evals bench.py score — all metrics up or flat.

🤖 Generated with Claude Code

Some tagged PDFs cover only the data rows under <Table> / <TR> /
<TD>, leaving the visible column headers above the grid untagged.
Those headers then slip through to the heuristic fallback and
reassemble into a spurious mini-table next to the real one.

After detecting a struct-tree table, scan unclaimed items within
60pt above its top. Items that start at one of the table's column
left-edges (within half a column width) are grouped by Y-row,
coalesced per column across wrapped lines, and prepended as a
single header row to the table's cells.

Guards protect against common false positives:
- ≥75% of columns must be filled (rejects stray page headers or
  event banners that only touch one or two columns)
- No cell may be a caption marker ("Table 1", "Appendix Table A2",
  "Figure 2"), prose (starts lowercase, > 12 words), too short
  (< 3 chars), or a duplicate of another filled cell
- Non-alphanumeric fragments (lone "." or "-") are skipped before
  merging so a stray period can't lead a row

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant