fix: first table chunk preserve col/row span #4343
Merged
leah1985 approved these changes on Apr 24, 2026
Summary
When a Table element was split into multiple TableChunks, the first chunk lost colspan/rowspan on its header cells while continuation chunks kept them, producing inconsistent merged-cell structure across a single split table.
Root cause: HtmlTable.from_html_text calls e.attrib.clear() on every descendant during compactification, intending to drop cosmetic attributes (border, class, style, …). That also wiped colspan/rowspan. Continuation chunks escaped the bug because their repeated headers come from source_row_htmls captured before the strip; the first chunk's rows flow through the compactified tree directly.
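The mismatch described above can be reproduced in miniature. The sketch below is not the library's code: it uses the standard-library xml.etree.ElementTree as a stand-in parser, and rows_before_strip is a hypothetical analogue of source_row_htmls. It only illustrates why markup captured before the strip keeps colspan while rows read from the stripped tree do not.

```python
import xml.etree.ElementTree as ET

SOURCE = (
    "<table border='1' class='grid'>"
    "<tr><th colspan='2' style='color:red'>Totals</th></tr>"
    "<tr><td>a</td><td>b</td></tr>"
    "<tr><td>c</td><td>d</td></tr>"
    "</table>"
)

table = ET.fromstring(SOURCE)

# Hypothetical analogue of source_row_htmls: row markup captured BEFORE the strip.
rows_before_strip = [ET.tostring(tr, encoding="unicode") for tr in table.iter("tr")]

# The problematic step: clear every attribute on every element in the tree
# (iter() also visits the root <table> itself).
for e in table.iter():
    e.attrib.clear()

# Rows read from the compactified tree, as the first chunk does.
rows_after_strip = [ET.tostring(tr, encoding="unicode") for tr in table.iter("tr")]

first_chunk_header = rows_after_strip[0]    # header as seen by the first chunk
continuation_header = rows_before_strip[0]  # header repeated in later chunks

print(first_chunk_header)
print(continuation_header)
```

In this toy version the two header strings differ only because one was serialized before the attributes were cleared, which is the same ordering effect the root-cause paragraph describes.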
Fix
Preserve colspan and rowspan through the attrib.clear() step in unstructured/common/html_table.py. They're structural, not cosmetic, and the rest of the chunking pipeline then carries them through unchanged.
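One way to keep the two span attributes through a clear step is to save them, clear, and restore. This is a minimal sketch under assumptions, not the actual change in html_table.py: strip_cosmetic_attributes and STRUCTURAL_ATTRS are hypothetical names, and the standard-library ElementTree again stands in for the real parser.

```python
import xml.etree.ElementTree as ET

# Assumed allow-list: the two attributes the PR calls structural.
STRUCTURAL_ATTRS = ("colspan", "rowspan")


def strip_cosmetic_attributes(root):
    """Clear every attribute except the structural span attributes (hypothetical helper)."""
    for e in root.iter():
        # Save the span attributes, wipe everything, then put the spans back.
        kept = {k: v for k, v in e.attrib.items() if k in STRUCTURAL_ATTRS}
        e.attrib.clear()
        e.attrib.update(kept)
    return root


table = ET.fromstring(
    '<table class="grid" border="1">'
    '<tr><th colspan="2" rowspan="1" class="h" style="x" id="t" data-k="v">Totals</th></tr>'
    "<tr><td>a</td><td>b</td></tr>"
    "</table>"
)
strip_cosmetic_attributes(table)
compact = ET.tostring(table, encoding="unicode")
print(compact)
```

An allow-list keeps the default behavior (strip everything) intact, so newly encountered cosmetic attributes are still dropped without needing to be enumerated.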
Tests
test_html_table.py::but_it_preserves_colspan_and_rowspan_as_structural_cell_attributes is a unit guarantee that compactification keeps the two span attributes and still strips class/style/id/data-*.
test_base.py::and_it_preserves_colspan_and_rowspan_in_the_first_chunk_header_rows is an end-to-end regression: the first TableChunk of a split table keeps colspan/rowspan on header cells, matching continuation chunks.
Both tests fail on main and pass with this change.
Note
Low Risk
Low risk, localized change to table HTML normalization that only preserves two structural attributes; main impact is on downstream TableChunk HTML fidelity for merged headers.
Overview
Fixes a regression where HtmlTable.from_html_text() compactification stripped structural colspan/rowspan, causing the first chunk of a split table to lose merged-header layout while continuation chunks retained it.
Updates the attribute-stripping step to preserve colspan and rowspan, adds unit + end-to-end chunking regression tests to assert consistent header spans across chunks, and bumps the release to 0.22.23 with a changelog entry.
Reviewed by Cursor Bugbot for commit 5463e77. Bugbot is set up for automated code reviews on this repo.