bug/br tag tail text loss #3899

Description

@K-Oxon

Describe the bug
The HtmlTable.from_html_text() method drops text content that follows <br/> tags when normalizing HTML tables. This causes loss of important content in table cells that contain line breaks.

To Reproduce

from unstructured.common.html_table import HtmlTable
html_text = """
<table>
<tr>
<td>This is 1st line.<br/>2nd line.<br/>3rd line.</td>
</tr>
</table>
"""
table = HtmlTable.from_html_text(html_text)
print(table.html)

Output:

<table><tr><td>This is 1st line.<br/><br/></td></tr></table>

Expected Output:

<table><tr><td>This is 1st line.<br/>2nd line.<br/>3rd line.</td></tr></table>

Expected behavior

The text content following <br/> tags should be preserved during HTML normalization. Currently, the tail text of <br/> elements is being removed, which results in loss of content.

Screenshots
No screenshots.

Environment Info

  • unstructured version: 0.16.17
  • Python version: 3.11
  • OS: macOS

Additional context
The issue could likely be resolved by modifying from_html_text() so that it keeps the tail text of <br/> elements (normalizing its whitespace) and strips tails only from other elements.

class HtmlTable:
    ...
    @classmethod
    def from_html_text(cls, html_text: str) -> 'HtmlTable':
            ...
            # -- normalize br tag tail text
            if e.tag == "br":
                if e.tail:
                    e.tail = " ".join(e.tail.split())
            else:
                # -- remove tails for non-br elements
                if e.tail:
                    e.tail = None
            ...
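
To make the proposal above easier to check in isolation, here is a minimal, self-contained sketch of the same tail-handling rule. It is not the library's code: it uses the standard-library xml.etree.ElementTree as a stand-in parser, and normalize_tails is a hypothetical helper name introduced only for this illustration. In an element tree, text that follows <br/> is stored as that element's tail, which is why clearing every tail drops it.

```python
import xml.etree.ElementTree as ET


def normalize_tails(root):
    """Hypothetical helper: keep <br/> tails (whitespace-collapsed), drop all other tails."""
    for e in root.iter():
        if e.tag == "br":
            if e.tail:
                # Collapse runs of whitespace; a whitespace-only tail becomes None.
                e.tail = " ".join(e.tail.split()) or None
        elif e.tail:
            # Tails of non-br elements are only inter-tag whitespace here.
            e.tail = None
    return root


html_text = (
    "<table>\n<tr>\n"
    "<td>This is 1st line.<br/>\n  2nd   line.<br/>3rd line.</td>\n"
    "</tr>\n</table>"
)

root = normalize_tails(ET.fromstring(html_text))

# The text after each <br/> lives in that element's .tail and is preserved.
for br in root.iter("br"):
    print(repr(br.tail))
```

With this rule the second and third lines survive normalization, while the newline-only tails after </td> and </tr> are still removed.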
