Commit c1f819c

fix: gracefully handle invalid html string during chunking (#4243)
This PR fixes an issue where an invalid `text_as_html` input to the HTML-based table chunking logic can cause chunking to fail, as the following stack trace shows:

```
| File "/app/unstructured/unstructured/chunking/base.py", line 594, in iter_chunks
|   yield from _TableChunker.iter_chunks(
| File "/app/unstructured/unstructured/chunking/base.py", line 837, in _iter_chunks
|   html_size = measure(self._html) if self._html else 0
|               ^^^^^^^^^^
| File "/app/unstructured/unstructured/utils.py", line 154, in __get__
|   value = self._fget(obj)
|           ^^^^^^^^^^^^^^^
| File "/app/unstructured/unstructured/chunking/base.py", line 866, in _html
|   if not (html_table := self._html_table):
|           ^^^^^^^^^^^^^^^^
| File "/app/unstructured/unstructured/utils.py", line 154, in __get__
|   value = self._fget(obj)
|           ^^^^^^^^^^^^^^^
| File "/app/unstructured/unstructured/chunking/base.py", line 884, in _html_table
|   return HtmlTable.from_html_text(text_as_html)
|          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/app/unstructured/unstructured/common/html_table.py", line 61, in from_html_text
|   root = fragment_fromstring(html_text)
|          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/app/.venv/lib/python3.12/site-packages/lxml/html/__init__.py", line 810, in fragment_fromstring
|   elements = fragments_fromstring(
|              ^^^^^^^^^^^^^^^^^^^^^
| File "/app/.venv/lib/python3.12/site-packages/lxml/html/__init__.py", line 780, in fragments_fromstring
|   raise etree.ParserError(
| lxml.etree.ParserError: There is leading text: '```html\n'
```

The solution is to catch the parser error in `_html_table` (in `unstructured/chunking/base.py`) and return `None` instead. This way we fall back to text-based chunking for this element, with a warning log.
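The shape of the fix can be sketched as a small standalone helper. This sketch uses the stdlib XML parser as a stand-in for `lxml`, and the helper name `parse_html_table` is illustrative, not the library's actual code:

```python
# Hedged sketch of the fix: a parse failure yields None (plus a warning)
# instead of propagating, so callers can fall back to text-based chunking.
# Uses xml.etree as a stdlib stand-in for lxml's fragment_fromstring.
import logging
import xml.etree.ElementTree as ET
from typing import Optional

logger = logging.getLogger(__name__)


def parse_html_table(text_as_html: str) -> Optional[ET.Element]:
    """Return the parsed <table> element, or None when text_as_html is unusable."""
    try:
        root = ET.fromstring(text_as_html)
    except ET.ParseError:
        # e.g. a markdown code fence like "```html\n<table>..." has leading
        # text before any tag, which the parser rejects outright
        logger.warning("Could not parse text_as_html: %s", text_as_html[:120])
        return None
    if root.tag != "table":
        # valid markup but no <table> root; also fall back to text chunking
        logger.warning("Could not parse text_as_html: %s", text_as_html[:120])
        return None
    return root
```

A `None` return here means the caller simply sees no HTML variant of the table and proceeds with plain-text chunking, which is the graceful degradation this PR is after.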
1 parent e1f75a3 commit c1f819c

12 files changed: 94 additions & 24 deletions

.github/workflows/ci.yml (4 additions, 4 deletions)

```diff
@@ -191,10 +191,10 @@ jobs:
       - uses: ./.github/actions/base-cache
         with:
           python-version: ${{ matrix.python-version }}
-      - name: Setup docker-compose
-        uses: KengoTODA/actions-setup-docker-compose@v1
-        with:
-          version: '2.22.0'
+      - name: Set up Docker
+        uses: docker/setup-docker-action@v4
+      - name: Set up Docker Compose
+        uses: docker/setup-compose-action@v1
       - name: Test (end-to-end)
         env:
           AIRTABLE_PERSONAL_ACCESS_TOKEN: ${{ secrets.AIRTABLE_PERSONAL_ACCESS_TOKEN }}
```

.github/workflows/ingest-test-fixtures-update-pr.yml (2 additions, 4 deletions)

```diff
@@ -29,10 +29,8 @@ jobs:
       - uses: ./.github/actions/base-cache
         with:
           python-version: ${{ env.PYTHON_VERSION }}
-      - name: Setup docker-compose
-        uses: KengoTODA/actions-setup-docker-compose@v1
-        with:
-          version: "2.22.0"
+      - name: Set up Docker Compose
+        uses: docker/setup-compose-action@v1
       - name: Update test fixtures
         env:
           AIRTABLE_PERSONAL_ACCESS_TOKEN: ${{ secrets.AIRTABLE_PERSONAL_ACCESS_TOKEN }}
```

CHANGELOG.md (5 additions, 0 deletions)

```diff
@@ -1,3 +1,8 @@
+## 0.20.5
+
+### Fixes
+- **Gracefully handle invalid `text_as_html` during chunking**: `_TableChunker` now catches parse errors (e.g. `lxml.etree.ParserError` when `text_as_html` contains a markdown code-fence like `` ```html\n ``) and returns `None` instead of raising, allowing chunking to continue using plain-text fallback. A `WARNING` log is emitted with a truncated preview of the offending value.
+
 ## 0.20.4
 
 ### Enhancements
```

scripts/kafka-test-helpers/create-kafka-instance.sh (3 additions, 3 deletions)

```diff
@@ -5,8 +5,8 @@ set -e
 SCRIPT_DIR=$(dirname "$(realpath "$0")")
 
 # Create the Weaviate instance
-docker-compose version
-docker-compose -f "$SCRIPT_DIR"/docker-compose.yml up --wait
-docker-compose -f "$SCRIPT_DIR"/docker-compose.yml ps
+docker compose version
+docker compose -f "$SCRIPT_DIR"/docker-compose.yml up --wait
+docker compose -f "$SCRIPT_DIR"/docker-compose.yml ps
 
 echo "Instance is live."
```
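Since the v1 `docker-compose` binary and the v2 `docker compose` plugin coexist on many machines, a transitional script can probe for both. A hypothetical helper (not part of this PR; the function name `compose_cmd` is an illustrative assumption) might look like:

```shell
# Hypothetical compatibility shim: prefer the Compose v2 plugin, fall back to
# the legacy standalone binary if that's all the host has.
compose_cmd() {
  if docker compose version >/dev/null 2>&1; then
    echo "docker compose"
  elif command -v docker-compose >/dev/null 2>&1; then
    echo "docker-compose"
  else
    return 1
  fi
}

# Resolve once, then reuse, e.g.: $COMPOSE -f docker-compose.yml up --wait
COMPOSE=$(compose_cmd) || COMPOSE=""
echo "Using: ${COMPOSE:-none found}"
```

This PR instead migrates the scripts to `docker compose` outright, which is simpler given CI now installs the v2 plugin explicitly.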

test_unstructured/chunking/test_base.py (63 additions, 6 deletions)

```diff
@@ -4,6 +4,7 @@
 
 from __future__ import annotations
 
+import logging
 from typing import Any, Sequence
 
 import pytest
@@ -1322,6 +1323,56 @@ def it_handles_table_with_none_text_without_error(self):
 
         assert len(chunks) == 0
 
+    def it_handles_invalid_html_in_text_as_html_without_error(self, caplog):
+        """Regression test: gracefully skip HTML-based chunking when text_as_html is not valid HTML.
+
+        `lxml` raises `ParserError` for strings that are not valid HTML fragments (e.g. plain text
+        with no tags). The chunker should log a warning and fall back to text-only chunking rather
+        than raising.
+        """
+        table = Table(
+            "Header Col 1 Header Col 2\nLorem ipsum dolor sit amet",
+            metadata=ElementMetadata(text_as_html="not valid html"),
+        )
+
+        caplog.set_level(logging.WARNING)
+        # -- should not raise ParserError --
+        chunks = list(_TableChunker.iter_chunks(table, "", ChunkingOptions()))
+
+        # -- falls back to text-only: a single Table chunk with no .text_as_html --
+        assert len(chunks) == 1
+        chunk = chunks[0]
+        assert isinstance(chunk, Table)
+        assert chunk.metadata.text_as_html is None
+        assert len(caplog.records) == 1
+        assert caplog.records[0].message.startswith("Could not parse text_as_html")
+        assert caplog.records[0].message.endswith("not valid html")
+
+    def it_handles_html_without_table_element_in_text_as_html_without_error(self, caplog):
+        """Regression test: gracefully skip HTML-based chunking when text_as_html has no <table>.
+
+        `HtmlTable.from_html_text` raises `ValueError` when the HTML is valid but contains no
+        `<table>` element. The chunker should log a warning and fall back to text-only chunking
+        rather than raising.
+        """
+        table = Table(
+            "Header Col 1 Header Col 2\nLorem ipsum dolor sit amet",
+            metadata=ElementMetadata(text_as_html="<div>no table here</div>"),
+        )
+
+        caplog.set_level(logging.WARNING)
+        # -- should not raise ValueError --
+        chunks = list(_TableChunker.iter_chunks(table, "", ChunkingOptions()))
+
+        # -- falls back to text-only: a single Table chunk with no .text_as_html --
+        assert len(chunks) == 1
+        chunk = chunks[0]
+        assert isinstance(chunk, Table)
+        assert chunk.metadata.text_as_html is None
+        assert len(caplog.records) == 1
+        assert caplog.records[0].message.startswith("Could not parse text_as_html")
+        assert "<div>no table here</div>" in caplog.records[0].message
+
 
 # ================================================================================================
 # HTML SPLITTERS
@@ -1333,7 +1384,8 @@ class Describe_HtmlTableSplitter:
 
     def it_splits_an_HTML_table_on_whole_row_boundaries_when_possible(self):
         opts = ChunkingOptions(max_characters=(40))
-        html_table = HtmlTable.from_html_text("""
+        html_table = HtmlTable.from_html_text(
+            """
             <table border="1" class="dataframe">
               <tbody>
                 <tr>
@@ -1364,7 +1416,8 @@ def it_splits_an_HTML_table_on_whole_row_boundaries_when_possible(self):
                 </tr>
               </tbody>
             </table>
-        """)
+            """
+        )
 
         assert list(_HtmlTableSplitter.iter_subtables(html_table, opts)) == [
             (
@@ -1389,7 +1442,8 @@ def it_splits_an_HTML_table_on_whole_row_boundaries_when_possible(self):
 
     def and_it_splits_an_oversized_row_on_an_even_cell_boundary_when_possible(self):
         opts = ChunkingOptions(max_characters=(93))
-        html_table = HtmlTable.from_html_text("""
+        html_table = HtmlTable.from_html_text(
+            """
             <html><body><table>
               <tr>
                 <td>Lorem ipsum dolor sit amet.</td>
@@ -1408,7 +1462,8 @@ def and_it_splits_an_oversized_row_on_an_even_cell_boundary_when_possible(self):
                 <td>Cillum</td>
               </tr>
             </table></body></html>
-        """)
+            """
+        )
 
         assert list(_HtmlTableSplitter.iter_subtables(html_table, opts)) == [
             (
@@ -1433,7 +1488,8 @@ def and_it_splits_an_oversized_row_on_an_even_cell_boundary_when_possible(self):
 
     def and_it_splits_an_oversized_cell_on_an_even_word_boundary(self):
         opts = ChunkingOptions(max_characters=(100))
-        html_table = HtmlTable.from_html_text("""
+        html_table = HtmlTable.from_html_text(
+            """
             <table>
               <thead>
                 <tr>
@@ -1452,7 +1508,8 @@ def and_it_splits_an_oversized_cell_on_an_even_word_boundary(self):
                 <tr><td>In reprehenderit voluptate.</td></tr>
               </tbody>
             </table
-        """)
+            """
+        )
 
         assert list(_HtmlTableSplitter.iter_subtables(html_table, opts)) == [
             (
```

test_unstructured_ingest/src/elasticsearch.sh (1 addition, 1 deletion)

```diff
@@ -22,7 +22,7 @@ source scripts/elasticsearch-test-helpers/common/es-dest-ingest-test-creds.env
 function cleanup() {
   # Kill the container so the script can be repeatedly run using the same ports
   echo "Stopping Elasticsearch Docker container"
-  docker-compose -f scripts/elasticsearch-test-helpers/common/docker-compose.yaml down --remove-orphans -v
+  docker compose -f scripts/elasticsearch-test-helpers/common/docker-compose.yaml down --remove-orphans -v
 
   cleanup_dir "$OUTPUT_DIR"
   cleanup_dir "$WORK_DIR"
```

test_unstructured_ingest/src/kafka-local.sh (1 addition, 1 deletion)

```diff
@@ -33,7 +33,7 @@ function cleanup() {
   fi
 
   echo "Stopping local Kafka instance"
-  docker-compose -f scripts/kafka-test-helpers/docker-compose.yml down --remove-orphans -v
+  docker compose -f scripts/kafka-test-helpers/docker-compose.yml down --remove-orphans -v
 
 }
 trap cleanup EXIT
```

test_unstructured_ingest/src/opensearch.sh (1 addition, 1 deletion)

```diff
@@ -20,7 +20,7 @@ source "$SCRIPT_DIR"/cleanup.sh
 function cleanup() {
   # Kill the container so the script can be repeatedly run using the same ports
   echo "Stopping OpenSearch Docker container"
-  docker-compose -f scripts/opensearch-test-helpers/common/docker-compose.yaml down --remove-orphans -v
+  docker compose -f scripts/opensearch-test-helpers/common/docker-compose.yaml down --remove-orphans -v
 
   cleanup_dir "$OUTPUT_DIR"
   cleanup_dir "$WORK_DIR"
```

test_unstructured_ingest/src/s3-minio.sh (1 addition, 1 deletion)

```diff
@@ -22,7 +22,7 @@ source "$SCRIPT_DIR"/cleanup.sh
 function cleanup() {
   # Kill the container so the script can be repeatedly run using the same ports
   echo "Stopping Minio Docker container"
-  docker-compose -f scripts/minio-test-helpers/docker-compose.yaml down --remove-orphans -v
+  docker compose -f scripts/minio-test-helpers/docker-compose.yaml down --remove-orphans -v
 
   cleanup_dir "$OUTPUT_DIR"
   cleanup_dir "$WORK_DIR"
```

test_unstructured_ingest/src/sftp.sh (1 addition, 1 deletion)

```diff
@@ -19,7 +19,7 @@ source "$SCRIPT_DIR"/cleanup.sh
 function cleanup() {
   # Kill the container so the script can be repeatedly run using the same ports
   echo "Stopping Sftp Docker container"
-  docker-compose -f scripts/sftp-test-helpers/docker-compose.yaml down --remove-orphans -v
+  docker compose -f scripts/sftp-test-helpers/docker-compose.yaml down --remove-orphans -v
 
   cleanup_dir "$OUTPUT_DIR"
   cleanup_dir "$WORK_DIR"
```
