fix: gracefully handle invalid HTML string during chunking (#4243)
This PR fixes an issue where an invalid `text_as_html` value passed to the
HTML-based table chunking logic could cause chunking to fail, as the
following stack trace shows:
```
| File "/app/unstructured/unstructured/chunking/base.py", line 594, in iter_chunks
| yield from _TableChunker.iter_chunks(
| File "/app/unstructured/unstructured/chunking/base.py", line 837, in _iter_chunks
| html_size = measure(self._html) if self._html else 0
| ^^^^^^^^^^
| File "/app/unstructured/unstructured/utils.py", line 154, in __get__
| value = self._fget(obj)
| ^^^^^^^^^^^^^^^
| File "/app/unstructured/unstructured/chunking/base.py", line 866, in _html
| if not (html_table := self._html_table):
| ^^^^^^^^^^^^^^^^
| File "/app/unstructured/unstructured/utils.py", line 154, in __get__
| value = self._fget(obj)
| ^^^^^^^^^^^^^^^
| File "/app/unstructured/unstructured/chunking/base.py", line 884, in _html_table
| return HtmlTable.from_html_text(text_as_html)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/app/unstructured/unstructured/common/html_table.py", line 61, in from_html_text
| root = fragment_fromstring(html_text)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/app/.venv/lib/python3.12/site-packages/lxml/html/__init__.py", line 810, in fragment_fromstring
| elements = fragments_fromstring(
| ^^^^^^^^^^^^^^^^^^^^^
| File "/app/.venv/lib/python3.12/site-packages/lxml/html/__init__.py", line 780, in fragments_fromstring
| raise etree.ParserError(
| lxml.etree.ParserError: There is leading text: '```html\n'
```
The solution is to catch the parser error in `_html_table` in
`unstructured/chunking/base.py` and return `None` instead. This way we fall
back to text-based chunking for this element and emit a warning log.
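
The catch-and-fall-back pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: it swaps in the standard library's `xml.etree.ElementTree` for lxml so it runs without third-party packages, with `ET.ParseError` standing in for `lxml.etree.ParserError`. The names `parse_table_or_none` and `chunk_source` and the 50-character preview length are assumptions made for the example.

```python
import logging
import xml.etree.ElementTree as ET
from typing import Optional

logger = logging.getLogger(__name__)

PREVIEW_LEN = 50  # hypothetical truncation length for the log preview


def parse_table_or_none(text_as_html: Optional[str]) -> Optional[ET.Element]:
    """Return the parsed table element, or None when the markup cannot be parsed."""
    if not text_as_html:
        return None
    try:
        # Stand-in for HtmlTable.from_html_text(); the real code parses with lxml.
        return ET.fromstring(text_as_html)
    except ET.ParseError as exc:
        # Log a truncated preview of the offending value instead of raising.
        logger.warning(
            "Could not parse text_as_html (%s); falling back to text chunking: %r",
            exc,
            text_as_html[:PREVIEW_LEN],
        )
        return None


def chunk_source(text_as_html: Optional[str]) -> str:
    """Hypothetical caller: report which chunking path an element would take."""
    # Compare against None explicitly: an Element without children is falsy.
    return "html" if parse_table_or_none(text_as_html) is not None else "text"


if __name__ == "__main__":
    valid = "<table><tr><td>a</td></tr></table>"
    # Build the markdown fence programmatically to mimic the failing input.
    fenced = "`" * 3 + "html\n" + valid
    print(chunk_source(valid))
    print(chunk_source(fenced))
```

Returning `None` rather than re-raising keeps the decision local to the parsing step, so callers that already handle a missing table need no further changes.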
CHANGELOG.md (5 additions, 0 deletions)
@@ -1,3 +1,8 @@
+## 0.20.5
+
+### Fixes
+
+- **Gracefully handle invalid `text_as_html` during chunking**: `_TableChunker` now catches parse errors (e.g. `lxml.etree.ParserError` when `text_as_html` contains a markdown code-fence like `` ```html\n ``) and returns `None` instead of raising, allowing chunking to continue using plain-text fallback. A `WARNING` log is emitted with a truncated preview of the offending value.