fix(html): enable huge_tree on HTMLParser so deeply nested HTML partitions by CrepuscularIRIS · Pull Request #4340 · Unstructured-IO/unstructured

CrepuscularIRIS · 2026-04-16T22:57:50Z

Summary

partition_html silently returned zero elements for HTML documents whose DOM depth exceeded lxml's default huge_tree=False depth limit (~256 levels). The module-level etree.HTMLParser in unstructured/partition/html/parser.py was constructed without huge_tree, so etree.fromstring dropped the subtree beyond the limit and the downstream flow saw an empty document.

This PR flips the kwarg to huge_tree=True on that single shared parser and adds a regression test.

Root Cause

html_parser = etree.HTMLParser(remove_comments=True) inherits lxml's default huge_tree=False, which caps tree depth / node text size. When etree.fromstring(html_text, html_parser) in partition.py:258 hits that cap, the subtree past the cap is dropped, producing an empty Flow root — so partition_html yields [] for large/deeply-nested HTML.
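The root cause described above can be explored with a small standalone script. This is only a sketch under assumptions: `nested_html` and `paragraph_texts` are hypothetical helper names, not code from the PR, and the exact depth at which lxml/libxml2 starts dropping content may vary by version, so no output is asserted here.

```python
def nested_html(depth: int, text: str = "deep text") -> str:
    """Wrap a single <p> in `depth` nested <div> elements (hypothetical helper)."""
    return "<div>" * depth + f"<p>{text}</p>" + "</div>" * depth


def paragraph_texts(html: str, *, huge_tree: bool) -> list:
    """Parse `html` with an HTMLParser and return the text of every <p> found."""
    # Imported lazily so nested_html() stays usable without lxml installed.
    from lxml import etree

    parser = etree.HTMLParser(remove_comments=True, huge_tree=huge_tree)
    root = etree.fromstring(html, parser)
    return [p.text for p in root.iter("p")]


if __name__ == "__main__":
    html = nested_html(260)  # same depth as the regression test in this PR
    # Per the PR description, the default parser is expected to lose the <p>,
    # while huge_tree=True is expected to recover it.
    print("huge_tree=False:", paragraph_texts(html, huge_tree=False))
    print("huge_tree=True: ", paragraph_texts(html, huge_tree=True))
```

Comparing the two printed lists on a given lxml build shows whether that build is affected by the depth cap.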

Changes

unstructured/partition/html/parser.py: pass huge_tree=True to the shared etree.HTMLParser (1 line).
test_unstructured/partition/html/test_partition.py: add test_partition_html_parses_deeply_nested_html, which builds a 260-level <div> wrapper around a <p> and asserts the inner text is recovered.
CHANGELOG.md: new 0.22.22 section under ### Fixes.
unstructured/__version__.py: bump 0.22.21 → 0.22.22.

Testing

New regression test passes with the fix and fails without it (verified by reverting the kwarg locally).
test_unstructured/partition/html/ — 312/312 passing.
Broader run test_unstructured/partition/html/ + partition/test_xml.py + partition/test_text.py — 390/390 passing.

pytest test_unstructured/partition/html/test_partition.py::test_partition_html_parses_deeply_nested_html -v

Notes

huge_tree=True on HTMLParser relaxes lxml's tree-size / depth safeguards. It does not enable entity expansion (that flag is resolve_entities, which remains off by default), so no new XXE surface is introduced.
Minimal fix: no other behavior or API is changed. The only production change is one kwarg; the parser is still shared across all callers.
Happy to switch to a per-call huge_tree opt-in if preferred, but the shared-parser approach matches what the issue and the prior self-closed PR #4306 ("fix: enable huge_tree for HTMLParser to handle large documents") proposed.

cragwolfe · 2026-04-18T19:49:52Z

ref: https://lxml.de/FAQ.html

There is a safety reason why this should not be the default. I think the default should remain False. @CrepuscularIRIS, can you amend this PR so that the behavior is instead overridable by an env var (with default False)?

CrepuscularIRIS · 2026-04-19T11:46:47Z

Thanks @cragwolfe — fully agree on the safety concern. I've pushed fe6216d which:

Restores huge_tree=False as the default
Adds opt-in via the UNSTRUCTURED_HTML_HUGE_TREE env var (accepts 1/true/yes)
Updates the changelog to document the env var and link the lxml security note
Adds a second test asserting the default-off behavior so the safety posture is enforced going forward

Both tests pass locally; full test_partition.py suite still passes (105/105).
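The opt-in described above could look roughly like the following. This is a sketch, not the code pushed in fe6216d: the helper names `huge_tree_enabled` and `build_html_parser` are hypothetical, and treating the value case-insensitively and stripping whitespace are assumptions beyond the stated 1/true/yes.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "UNSTRUCTURED_HTML_HUGE_TREE"
_TRUTHY = frozenset({"1", "true", "yes"})


def huge_tree_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when the env var is set to 1/true/yes; default is off."""
    env = os.environ if environ is None else environ
    # Assumption: compare case-insensitively and ignore surrounding whitespace.
    return env.get(ENV_VAR, "").strip().lower() in _TRUTHY


def build_html_parser():
    """Build the parser with huge_tree driven by the env var (hypothetical helper)."""
    # Imported lazily so the env-var helper has no third-party dependency.
    from lxml import etree

    return etree.HTMLParser(remove_comments=True, huge_tree=huge_tree_enabled())
```

Any unset, empty, or unrecognized value leaves `huge_tree` off, which keeps the safe default requested in review.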

…tions

Fixes Unstructured-IO#4289

`partition_html` returned an empty element list for HTML documents whose DOM depth exceeded lxml's default depth limit (~256) because the module-level `etree.HTMLParser` used the default `huge_tree=False`, which silently drops subtrees past the limit. Enabling `huge_tree=True` on the shared parser makes deep documents round-trip correctly.

A regression test builds a 260-level-deep `<div>` chain wrapping a `<p>`; the test fails without the fix (0 elements) and passes with it (1 element).

Signed-off-by: CrepuscularIRIS <serenitygp@qq.com>

@cragwolfe

Per @cragwolfe review: huge_tree=True disables libxml2's safety guards against malicious inputs (https://lxml.de/FAQ.html), so it must remain opt-in. Default stays huge_tree=False; set UNSTRUCTURED_HTML_HUGE_TREE to 1/true/yes to enable for trusted inputs.

- Add test confirming default behavior drops nodes silently (matches prior behavior, so no regression for existing users).
- Test for the opt-in path patches the parser since it's built at module import time.
- Updated changelog to document the env var and the security tradeoff.

Signed-off-by: CrepuscularIRIS <serenitygp@qq.com>
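The point about import-time construction can be illustrated with stdlib tools only. In this sketch, `load_parser_config` is a hypothetical stand-in for a module that reads the env var once at import; it is not the project's parser module, and the real test may patch a different object.

```python
import os
import types
from unittest import mock

ENV_VAR = "UNSTRUCTURED_HTML_HUGE_TREE"


def load_parser_config() -> types.SimpleNamespace:
    """Hypothetical stand-in for a module that reads the env var once, at import."""
    enabled = os.environ.get(ENV_VAR, "").lower() in {"1", "true", "yes"}
    return types.SimpleNamespace(huge_tree=enabled)


def demo() -> tuple:
    """Return (value after late env change, value while patched, value after patch)."""
    with mock.patch.dict(os.environ):      # environment is restored on exit
        os.environ.pop(ENV_VAR, None)
        config = load_parser_config()      # "import" happens with the variable unset
        os.environ[ENV_VAR] = "1"          # too late: the value was already computed
        after_env_change = config.huge_tree
        # A test for the opt-in path therefore replaces the built object directly.
        with mock.patch.object(config, "huge_tree", True):
            while_patched = config.huge_tree
        after_patch = config.huge_tree     # patch is undone when the block exits
    return after_env_change, while_patched, after_patch
```

Setting the variable after the value is computed changes nothing, which is why a test has to patch the already-built object (or reload the module) rather than only set the environment.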

CrepuscularIRIS · 2026-05-01T20:51:57Z

Rebased onto main to clear the merge conflict — head is now a8511544 (and 3f4909f5 for the parent). Both commits now carry Signed-off-by trailers; CI was green before the rebase, expect the same here. Ready for another look whenever you have a moment, @cragwolfe.

cragwolfe · 2026-05-05T04:59:59Z

code change looks good, just the CHANGELOG.md got botched with deletions against main. also need to bump version. thanks!

e.g.: CrepuscularIRIS/unstructured@fix/html-huge-tree...Unstructured-IO:unstructured:crag/pr-4340-do-not-merge
(also good to bump the version)

CrepuscularIRIS · 2026-05-05T08:54:26Z

Will fix the CHANGELOG merge artifact and bump the version, then push shortly.

CrepuscularIRIS added 2 commits May 1, 2026 16:51

CrepuscularIRIS force-pushed the fix/html-huge-tree branch from fe6216d to a851154 Compare May 1, 2026 20:51

cragwolfe approved these changes May 5, 2026

CrepuscularIRIS wants to merge 2 commits into Unstructured-IO:main from CrepuscularIRIS:fix/html-huge-tree
