fix: first table chunk preserve col/row span (#4343)
### Summary
When a `Table` element was split into multiple `TableChunk`s, the
**first** chunk lost `colspan` / `rowspan` on its header
cells while continuation chunks kept them — producing inconsistent
merged-cell structure across a single split table.
Root cause: `HtmlTable.from_html_text` calls `e.attrib.clear()` on every
descendant during compactification, intending to drop
cosmetic attributes (`border`, `class`, `style`, …). That also wiped
`colspan`/`rowspan`. Continuation chunks escaped the bug
because their repeated headers come from `source_row_htmls` captured
*before* the strip; the first chunk's rows flow through
the compactified tree directly.
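The effect of a blanket clear can be reproduced in isolation. The sketch below is not the library's code: it uses the standard library's `xml.etree.ElementTree` as a stand-in parser, and the sample markup is hypothetical. It only illustrates how `attrib.clear()` on every descendant removes `colspan` along with the cosmetic attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one merged header cell plus cosmetic attributes.
html = (
    '<table border="1" class="grid">'
    '<tr><th colspan="2" style="color:red">Totals</th></tr>'
    '<tr><td>a</td><td>b</td></tr>'
    '</table>'
)

root = ET.fromstring(html)

# Blanket strip as described above: every attribute goes, structural or not.
for e in root.iter():
    e.attrib.clear()

stripped = ET.tostring(root, encoding="unicode")
print(stripped)
```

After the loop, the header cell no longer carries `colspan`, so anything rendered from this tree has lost the merged-cell layout.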
### Fix
Preserve `colspan` and `rowspan` through the `attrib.clear()` step in
`unstructured/common/html_table.py`. They're structural,
not cosmetic, and the rest of the chunking pipeline then carries them
through unchanged.
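One way to express this kind of fix is to save the structural attributes before clearing and restore them afterwards. This is a minimal sketch under assumptions, not the actual patch: `STRUCTURAL_ATTRS` and `strip_cosmetic_attrs` are hypothetical names, and the standard library's `xml.etree.ElementTree` stands in for whatever parser `html_table.py` uses.

```python
import xml.etree.ElementTree as ET

# Attributes treated as structural, per the description above (hypothetical name).
STRUCTURAL_ATTRS = ("colspan", "rowspan")


def strip_cosmetic_attrs(root: ET.Element) -> None:
    """Clear every attribute except colspan/rowspan, in place (hypothetical helper)."""
    for e in root.iter():
        # Remember the span attributes, wipe everything, then put them back.
        kept = {k: v for k, v in e.attrib.items() if k in STRUCTURAL_ATTRS}
        e.attrib.clear()
        e.attrib.update(kept)


root = ET.fromstring(
    '<table class="grid"><tr>'
    '<th colspan="2" rowspan="1" style="color:red" id="h">Totals</th>'
    '</tr></table>'
)
strip_cosmetic_attrs(root)
th = root.find(".//th")
print(th.attrib)
```

Using an allow-list keeps the default behavior (strip everything) unchanged for any attribute not explicitly named as structural.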
### Tests
- `test_html_table.py::but_it_preserves_colspan_and_rowspan_as_structural_cell_attributes`: unit guarantee that compactification keeps the two span attributes and still strips `class`/`style`/`id`/`data-*`.
- `test_base.py::and_it_preserves_colspan_and_rowspan_in_the_first_chunk_header_rows`: end-to-end regression check that the first `TableChunk` of a split table keeps `colspan`/`rowspan` on header cells, matching continuation chunks.

Both tests fail on `main` and pass with this change.
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Low Risk**
> Low risk, localized change to table HTML normalization that only
preserves two structural attributes; main impact is on downstream
`TableChunk` HTML fidelity for merged headers.
>
> **Overview**
> Fixes a regression where `HtmlTable.from_html_text()` compactification
stripped structural `colspan`/`rowspan`, causing the *first* chunk of a
split table to lose merged-header layout while continuation chunks
retained it.
>
> Updates the attribute-stripping step to preserve `colspan` and
`rowspan`, adds unit + end-to-end chunking regression tests to assert
consistent header spans across chunks, and bumps the release to
`0.22.23` with a changelog entry.
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
5463e77. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
CHANGELOG.md: 6 additions & 0 deletions
@@ -1,3 +1,9 @@
+## 0.22.23
+
+### Fixes
+
+- **Preserve `colspan`/`rowspan` in first table chunk headers**: `HtmlTable` compactification no longer strips `colspan` and `rowspan` attributes from table cells. Previously, the first `TableChunk` lost merged-cell structural information while continuation chunks retained it (via the source-HTML path used for repeated headers), yielding inconsistent header layout across a split table.