fix: first table chunk preserve col/row span #4343

Merged
badGarnet merged 1 commit into main from fix/chunker-drop-col-row-span-for-first-table-chunk
Apr 24, 2026

@badGarnet badGarnet commented Apr 24, 2026

Summary

When a Table element was split into multiple TableChunks, the first chunk lost colspan / rowspan on its header
cells while continuation chunks kept them — producing inconsistent merged-cell structure across a single split table.

Root cause: HtmlTable.from_html_text calls e.attrib.clear() on every descendant during compactification, intending to drop
cosmetic attributes (border, class, style, …). That also wiped colspan/rowspan. Continuation chunks escaped the bug
because their repeated headers come from source_row_htmls captured before the strip; the first chunk's rows flow through
the compactified tree directly.
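
The effect of clearing every attribute can be reproduced in isolation. The sketch below is not the unstructured implementation: it uses the standard library's xml.etree.ElementTree as a stand-in parser, and strip_all_attributes is a hypothetical name. It only illustrates how an unconditional attrib.clear() drops colspan along with the cosmetic attributes.

```python
import xml.etree.ElementTree as ET


def strip_all_attributes(html: str) -> str:
    """Hypothetical stand-in for the buggy step: clear every attribute."""
    table = ET.fromstring(html)
    # iter() visits the root and all of its descendants in document order.
    for e in table.iter():
        e.attrib.clear()
    return ET.tostring(table, encoding="unicode")


SOURCE = (
    '<table border="1" class="grid">'
    '<tr><th colspan="2" style="color:red">Totals</th></tr>'
    "<tr><td>a</td><td>b</td></tr>"
    "</table>"
)

stripped = strip_all_attributes(SOURCE)
print(stripped)
# The cosmetic attributes are gone, but so is colspan on the header cell.
```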

Fix

Preserve colspan and rowspan through the attrib.clear() step in unstructured/common/html_table.py. They're structural,
not cosmetic, and the rest of the chunking pipeline then carries them through unchanged.
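
One way to express this kind of fix is to save the structural attributes, clear the rest, and restore the saved ones. The following is a minimal sketch under stated assumptions, not the code from this PR: it uses xml.etree.ElementTree rather than the parser the library uses, and compactify and STRUCTURAL_ATTRS are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Hypothetical constant: the two attributes the PR treats as structural.
STRUCTURAL_ATTRS = ("colspan", "rowspan")


def compactify(html: str) -> str:
    """Strip cosmetic attributes but keep colspan/rowspan (illustrative only)."""
    table = ET.fromstring(html)
    for e in table.iter():
        # Remember the structural attributes before wiping everything.
        kept = {k: v for k, v in e.attrib.items() if k in STRUCTURAL_ATTRS}
        e.attrib.clear()
        e.attrib.update(kept)
    return ET.tostring(table, encoding="unicode")


SOURCE = (
    '<table class="grid" style="width:100%">'
    '<tr><th colspan="2" class="hdr">Totals</th>'
    '<th rowspan="2" id="x" data-id="7">Year</th></tr>'
    "<tr><td>a</td><td>b</td></tr>"
    "</table>"
)

compact = compactify(SOURCE)
print(compact)
```

Filtering on an allow-list rather than a deny-list keeps the stripping behavior unchanged for every other attribute, which matches the PR's description of a localized change.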

Tests

  • test_html_table.py::but_it_preserves_colspan_and_rowspan_as_structural_cell_attributes — unit guarantee that
    compactification keeps the two span attributes and still strips class/style/id/data-*.
  • test_base.py::and_it_preserves_colspan_and_rowspan_in_the_first_chunk_header_rows — end-to-end regression: the first
    TableChunk of a split table keeps colspan/rowspan on header cells, matching continuation chunks.
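
The end-to-end property, that header spans match across chunks, can be sketched as a comparison of the first row's span attributes in each chunk's HTML. This is an illustration and not the test added in this PR: header_spans is a hypothetical helper, the chunk HTML strings are made up, and xml.etree.ElementTree is assumed to be adequate because the sample markup is well-formed.

```python
import xml.etree.ElementTree as ET


def header_spans(chunk_html: str) -> list[tuple[str, str]]:
    """Return (colspan, rowspan) per cell of the first row; a missing span counts as "1"."""
    table = ET.fromstring(chunk_html)
    first_row = table.find(".//tr")  # first <tr> in document order
    if first_row is None:
        return []
    return [
        (cell.get("colspan", "1"), cell.get("rowspan", "1"))
        for cell in first_row
        if cell.tag in ("th", "td")
    ]


# Made-up chunk HTML: the buggy first chunk has lost colspan on its header cell.
buggy_first = "<table><tr><th>Totals</th></tr><tr><td>a</td><td>b</td></tr></table>"
fixed_first = '<table><tr><th colspan="2">Totals</th></tr><tr><td>a</td><td>b</td></tr></table>'
continuation = '<table><tr><th colspan="2">Totals</th></tr><tr><td>c</td><td>d</td></tr></table>'

print(header_spans(buggy_first) == header_spans(continuation))  # mismatch before the fix
print(header_spans(fixed_first) == header_spans(continuation))  # consistent after the fix
```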

Both tests fail on main and pass with this change.


Note

Low Risk
Low risk, localized change to table HTML normalization that only preserves two structural attributes; main impact is on downstream TableChunk HTML fidelity for merged headers.

Overview
Fixes a regression where HtmlTable.from_html_text() compactification stripped structural colspan/rowspan, causing the first chunk of a split table to lose merged-header layout while continuation chunks retained it.

Updates the attribute-stripping step to preserve colspan and rowspan, adds unit + end-to-end chunking regression tests to assert consistent header spans across chunks, and bumps the release to 0.22.23 with a changelog entry.

Reviewed by Cursor Bugbot for commit 5463e77.

@badGarnet badGarnet enabled auto-merge April 24, 2026 17:19
@badGarnet badGarnet added this pull request to the merge queue Apr 24, 2026
Merged via the queue into main with commit 879e126 Apr 24, 2026
53 of 54 checks passed
@badGarnet badGarnet deleted the fix/chunker-drop-col-row-span-for-first-table-chunk branch April 24, 2026 17:49