fix(html): enable huge_tree on HTMLParser so deeply nested HTML partitions #4340

Open
CrepuscularIRIS wants to merge 2 commits into Unstructured-IO:main from CrepuscularIRIS:fix/html-huge-tree

Conversation

@CrepuscularIRIS

Summary

Fixes #4289.

partition_html silently returned zero elements for HTML documents whose DOM depth exceeded lxml's default huge_tree=False depth limit (~256 levels). The module-level etree.HTMLParser in unstructured/partition/html/parser.py was constructed without huge_tree, so etree.fromstring dropped the subtree beyond the limit and the downstream flow saw an empty document.

This PR flips the kwarg to huge_tree=True on that single shared parser and adds a regression test.

Root Cause

html_parser = etree.HTMLParser(remove_comments=True) inherits lxml's default huge_tree=False, which caps tree depth / node text size. When etree.fromstring(html_text, html_parser) in partition.py:258 hits that cap, the subtree past the cap is dropped, producing an empty Flow root — so partition_html yields [] for large/deeply-nested HTML.

Changes

  • unstructured/partition/html/parser.py: pass huge_tree=True to the shared etree.HTMLParser (1 line).
  • test_unstructured/partition/html/test_partition.py: add test_partition_html_parses_deeply_nested_html, which builds a 260-level <div> wrapper around a <p> and asserts the inner text is recovered.
  • CHANGELOG.md: new 0.22.22 section under ### Fixes.
  • unstructured/__version__.py: bump 0.22.21 → 0.22.22.
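
The input used by the regression test can be sketched as follows. This is a minimal illustration rather than the test's actual code: the helper name `build_nested_html` and the sample text are assumptions, and only the 260-level `<div>` chain around a `<p>` comes from the description above.

```python
def build_nested_html(depth: int, text: str) -> str:
    """Wrap a single <p> in `depth` nested <div> elements."""
    # Hypothetical helper: open `depth` divs, emit the paragraph, close them all.
    return "<div>" * depth + f"<p>{text}</p>" + "</div>" * depth


# 260 levels sits just past the ~256-level limit cited in the summary.
html = build_nested_html(260, "deep text")
```

Passing a string like this to `partition_html(text=html)` is the kind of call the regression test exercises; per the summary, it is expected to return no elements without the fix and to recover the inner text with it.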

Testing

  • New regression test passes with the fix and fails without it (verified by reverting the kwarg locally).
  • test_unstructured/partition/html/: 312/312 passing.
  • Broader run (test_unstructured/partition/html/ + partition/test_xml.py + partition/test_text.py): 390/390 passing.
pytest test_unstructured/partition/html/test_partition.py::test_partition_html_parses_deeply_nested_html -v

Notes

  • huge_tree=True on HTMLParser relaxes lxml's tree-size / depth safeguards. It does not change entity handling (entity resolution is a separate concern, controlled on XMLParser by resolve_entities, and is untouched here), so no new XXE surface is introduced.
  • Minimal fix: no other behavior or API is changed. The only production change is one kwarg; the parser is still shared across all callers.
  • Happy to switch to a per-call huge_tree opt-in if preferred, but the shared-parser approach matches what the issue and the prior self-closed PR #4306 ("fix: enable huge_tree for HTMLParser to handle large documents") proposed.

…tions

Fixes Unstructured-IO#4289

`partition_html` returned an empty element list for HTML documents whose
DOM depth exceeded lxml's default depth limit (~256) because the
module-level `etree.HTMLParser` used the default `huge_tree=False`, which
silently drops subtrees past the limit. Enabling `huge_tree=True` on the
shared parser makes deep documents round-trip correctly. A regression
test builds a 260-level-deep `<div>` chain wrapping a `<p>`; the test
fails without the fix (0 elements) and passes with it (1 element).
@cragwolfe
Contributor

ref: https://lxml.de/FAQ.html

There is a safety reason why this should not be the default. I think the default should remain False. @CrepuscularIRIS, can you amend this PR so that the behavior is instead overridable by an env var (with default False)?

Per @cragwolfe review: huge_tree=True disables libxml2's safety guards
against malicious inputs (https://lxml.de/FAQ.html), so it must remain
opt-in. Default stays huge_tree=False; set UNSTRUCTURED_HTML_HUGE_TREE
to 1/true/yes to enable for trusted inputs.

- Add test confirming default behavior drops nodes silently (matches
  prior behavior — no regression for existing users).
- Test for the opt-in path patches the parser since it's built at
  module import time.
- Updated changelog to document the env var and the security tradeoff.
@CrepuscularIRIS
Author

Thanks @cragwolfe — fully agree on the safety concern. I've pushed fe6216d which:

  • Restores huge_tree=False as the default
  • Adds opt-in via the UNSTRUCTURED_HTML_HUGE_TREE env var (accepts 1/true/yes)
  • Updates the changelog to document the env var and link the lxml security note
  • Adds a second test asserting the default-off behavior so the safety posture is enforced going forward

Both tests pass locally; full test_partition.py suite still passes (105/105).
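
For readers who want to see the shape of the opt-in, here is a minimal sketch of how such an env var could be read. It is not the code from the commit: the names `huge_tree_enabled` and `build_html_parser` are hypothetical, and the case-insensitive, whitespace-tolerant matching is an assumption. Only the variable name, the accepted values 1/true/yes, and the default of False come from this thread.

```python
import os

# Values treated as "on", per the 1/true/yes list in this thread.
_TRUTHY = {"1", "true", "yes"}


def huge_tree_enabled(env=None) -> bool:
    """Return True only when UNSTRUCTURED_HTML_HUGE_TREE is explicitly truthy."""
    # `env` is injectable for testing; it defaults to the process environment.
    env = os.environ if env is None else env
    raw = env.get("UNSTRUCTURED_HTML_HUGE_TREE", "")
    # Assumption: matching ignores case and surrounding whitespace.
    return raw.strip().lower() in _TRUTHY


def build_html_parser():
    """Build the parser with huge_tree off unless opted in (requires lxml)."""
    from lxml import etree  # lazy import keeps the flag helper dependency-free

    return etree.HTMLParser(remove_comments=True, huge_tree=huge_tree_enabled())
```

Because the shared parser is built at module import time, a flag read this way only takes effect if the variable is set before the module is imported, which is also why the opt-in test patches the parser.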

Development

Successfully merging this pull request may close these issues.

Large HTML documents cannot be partitioned using the partition_html function.

2 participants