Commit d4caedf

fix: Preserve Line Breaks in Code Blocks During Chunking (#4196)
## Problem

When using `chunk_elements()` on markdown files containing code blocks, line breaks within the code were being discarded, resulting in unreadable code:

```python
# Before fix - code becomes flattened:
"def hello(): print('Hello') return True"

# Expected - preserve formatting:
"def hello():\n print('Hello')\n return True"
```

Fixes #4095

## Root Cause

Two issues were identified:

1. **HTML Parser**: `<pre>` elements generated generic `Text` elements instead of `CodeSnippet` elements
2. **Chunking**: The `_iter_text_segments()` method normalized all whitespace to single spaces, destroying newlines

## Solution

### 1. HTML Parser Change (`unstructured/partition/html/parser.py`)

Made `<pre>` elements generate `CodeSnippet` elements:

```python
class Pre(BlockItem):
    """Custom element-class for `<pre>` element.

    Can only contain phrasing content. Generates CodeSnippet elements to preserve
    code formatting including whitespace and line breaks.
    """

    _ElementCls = CodeSnippet  # Added this line
```

### 2. Chunking Change (`unstructured/chunking/base.py`)

Modified `_iter_text_segments()` to preserve whitespace for `CodeSnippet` elements:

```python
def _iter_text_segments(self) -> Iterator[str]:
    """Generate overlap text and each element text segment in order.

    Empty text segments are not included. CodeSnippet elements preserve their
    original whitespace (including newlines) to maintain code formatting.
    """
    if self._overlap_prefix:
        yield self._overlap_prefix
    for e in self._elements:
        if e.text and len(e.text):
            # -- preserve all whitespace for code snippets to maintain formatting --
            if isinstance(e, CodeSnippet):
                yield e.text
            else:
                text = " ".join(e.text.strip().split())
                if text:
                    yield text
```

## Files Changed

| File | Change |
|------|--------|
| `unstructured/partition/html/parser.py` | Added `CodeSnippet` import, set `_ElementCls = CodeSnippet` in `Pre` class |
| `unstructured/chunking/base.py` | Added `CodeSnippet` import, special handling in `_iter_text_segments()` |
| `test_unstructured/partition/html/test_parser.py` | Added test for `CodeSnippet` generation, updated existing test |
| `test_unstructured/chunking/test_base.py` | Added 2 tests for whitespace preservation |

Contribution by Gittensor, see my contribution statistics at https://gittensor.io/miners/details?githubId=42954461
1 parent 8f32550 commit d4caedf
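
To make the root cause concrete, here is a minimal standalone sketch of the whitespace normalization that chunking applied to every element before this fix. It does not import `unstructured`; `normalize_whitespace` is a hypothetical helper name wrapping the `" ".join(text.strip().split())` expression from the chunking code, and the four-space indentation in the sample string is an assumption for illustration.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    # str.split() with no argument splits on any whitespace run, so newlines
    # and indentation are discarded before the words are re-joined.
    return " ".join(text.strip().split())


code = "def hello():\n    print('Hello')\n    return True"

flattened = normalize_whitespace(code)
print(flattened)          # def hello(): print('Hello') return True
print("\n" in flattened)  # False
```

This is harmless for prose, where line wrapping carries no meaning, but it removes exactly the information that makes code readable.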

10 files changed

Lines changed: 83 additions & 17 deletions

CHANGELOG.md

Lines changed: 2 additions & 1 deletion
@@ -1,4 +1,4 @@
-## 0.18.31-dev1
+## 0.18.31-dev2
 
 ### Enhancements
 - Changed default DPI to 350
@@ -7,6 +7,7 @@
 ### Fixes
 - **Fix Pandoc exitcode 97 during ODT conversion**: Try with sandbox=True first, fallback without sandbox only if `ALLOW_PANDOC_NO_SANDBOX=true` env var is set (fixes #3997)
 - **Fix `coordinates=True` causing TypeError in hi_res PDF processing**: Filter out `coordinates` and `coordinate_system` from kwargs before passing to `add_element_metadata()` to prevent conflict with explicit parameters (fixes #4126)
+- **Preserve line breaks in code blocks during chunking**: `<pre>` elements now generate `CodeSnippet` elements instead of `Text`, and chunking preserves internal whitespace for code snippets. (fixes #4095)
 
 ## 0.18.30

test_unstructured/chunking/test_base.py

Lines changed: 32 additions & 0 deletions
@@ -29,6 +29,7 @@
 from unstructured.common.html_table import HtmlCell, HtmlRow, HtmlTable
 from unstructured.documents.elements import (
     CheckBox,
+    CodeSnippet,
     CompositeElement,
     Element,
     ElementMetadata,
@@ -802,6 +803,37 @@ def it_knows_the_concatenated_text_of_the_pre_chunk_to_help(
         pre_chunk = PreChunk(elements, overlap_prefix=overlap_prefix, opts=ChunkingOptions())
         assert pre_chunk._text == expected_value
 
+    def it_preserves_whitespace_in_CodeSnippet_elements(self):
+        """CodeSnippet elements should preserve their internal whitespace including newlines.
+
+        This is important for code blocks where formatting (indentation, line breaks) is
+        semantically meaningful.
+        """
+        code_text = "def hello():\n print('Hello')\n return True"
+        pre_chunk = PreChunk([CodeSnippet(code_text)], overlap_prefix="", opts=ChunkingOptions())
+
+        # The text should preserve newlines, not collapse them to spaces
+        assert "\n" in pre_chunk._text
+        assert pre_chunk._text == code_text
+
+    def it_preserves_whitespace_in_CodeSnippet_when_mixed_with_other_elements(self):
+        """CodeSnippet whitespace is preserved even when mixed with regular Text elements."""
+        code_text = "for i in range(10):\n print(i)"
+        pre_chunk = PreChunk(
+            [
+                Text("Here is some code:"),
+                CodeSnippet(code_text),
+                Text("That was the code."),
+            ],
+            overlap_prefix="",
+            opts=ChunkingOptions(),
+        )
+
+        # The combined text should have the code with preserved newlines
+        assert "for i in range(10):\n print(i)" in pre_chunk._text
+        # Regular text elements are still joined with blank line separators
+        assert "Here is some code:\n\n" in pre_chunk._text
+
 
 # ================================================================================================
 # CHUNKING HELPER/SPLITTERS

test_unstructured/partition/html/test_parser.py

Lines changed: 23 additions & 2 deletions
@@ -10,7 +10,15 @@
 import pytest
 from lxml import etree
 
-from unstructured.documents.elements import Address, Element, ListItem, NarrativeText, Text, Title
+from unstructured.documents.elements import (
+    Address,
+    CodeSnippet,
+    Element,
+    ListItem,
+    NarrativeText,
+    Text,
+    Title,
+)
 from unstructured.partition.html.parser import (
     Annotation,
     DefaultElement,
@@ -536,7 +544,7 @@ def it_preserves_the_whitespace_of_its_phrasing_only_contents(self):
         elements = pre.iter_elements()
 
         e = next(elements)
-        assert e == Text(
+        assert e == CodeSnippet(
             " The Answer to the Great Question... Of Life, the Universe and Everything...\n"
             " Is... Forty-two, said Deep Thought, with infinite majesty and calm."
         )
@@ -585,6 +593,19 @@ def it_assigns_emphasis_and_link_metadata_when_contents_have_those_phrasing_elem
         assert e.metadata.link_texts == ["penguin"]
         assert e.metadata.link_urls == ["http://eie.io"]
 
+    def it_generates_CodeSnippet_elements_to_preserve_code_formatting(self):
+        """Pre elements should generate CodeSnippet elements, not generic Text elements.
+
+        This ensures code formatting (whitespace, line breaks) is preserved during chunking.
+        """
+        html_text = "<pre>def hello():\n print('Hello')\n return True</pre>"
+        pre = etree.fromstring(html_text, html_parser).xpath(".//pre")[0]
+
+        e = next(pre.iter_elements())
+
+        assert isinstance(e, CodeSnippet)
+        assert e.text == "def hello():\n print('Hello')\n return True"
+
 
 class DescribeRemovedBlock:
     """Isolated unit-test suite for `unstructured.partition.html.parser.RemovedBlock`.

test_unstructured/partition/html/test_partition.py

Lines changed: 5 additions & 4 deletions
@@ -25,6 +25,7 @@
 from unstructured.cleaners.core import clean_extra_whitespace
 from unstructured.documents.elements import (
     Address,
+    CodeSnippet,
     CompositeElement,
     ElementType,
     ListItem,
@@ -627,7 +628,7 @@ def test_partition_html_with_widely_encompassing_pre_tag():
     print(f"{len(elements)=}")
     assert len(elements) > 0
     assert clean_extra_whitespace(elements[0].text).startswith("[107th Congress Public Law 56]")
-    assert isinstance(elements[0], NarrativeText)
+    assert isinstance(elements[0], CodeSnippet)
     assert elements[0].metadata.filetype == "text/html"
     assert elements[0].metadata.filename == "fake-html-pre.htm"
 
@@ -641,9 +642,9 @@ def test_pre_tag_parsing_respects_order():
             "<div>The Big Blue Bear</div>\n"
         )
     ) == [
-        Text("The Big Brown Bear"),
+        CodeSnippet("The Big Brown Bear"),
         NarrativeText("The big brown bear is growling."),
-        NarrativeText("The big brown bear is sleeping."),
+        CodeSnippet("The big brown bear is sleeping."),
         Text("The Big Blue Bear"),
     ]
 
@@ -672,7 +673,7 @@ def test_partition_html_br_tag_parsing():
         Title("Header 1"),
         Text("Text"),
         Title("Header 2"),
-        Text(
+        CodeSnippet(
             " Param1 = Y\nParam2 = 1\nParam3 = 2\nParam4 = A\n \nParam5 = A,B,C,D,E\n"
             "Param6 = 7\nParam7 = Five\n\n "
         ),

test_unstructured/partition/test_auto.py

Lines changed: 2 additions & 1 deletion
@@ -32,6 +32,7 @@
 from unstructured.cleaners.core import clean_extra_whitespace
 from unstructured.documents.elements import (
     Address,
+    CodeSnippet,
     CompositeElement,
     Element,
     ElementMetadata,
@@ -272,7 +273,7 @@ def test_auto_partition_html_pre_from_file():
     assert len(elements) > 0
     assert "PageBreak" not in [elem.category for elem in elements]
     assert clean_extra_whitespace(elements[0].text).startswith("[107th Congress Public Law 56]")
-    assert isinstance(elements[0], NarrativeText)
+    assert isinstance(elements[0], CodeSnippet)
     assert all(e.metadata.filetype == "text/html" for e in elements)
     assert all(e.metadata.filename == "fake-html-pre.htm" for e in elements)

test_unstructured_ingest/expected-structured-output-html/local-single-file-with-encoding/fake-html-cp1252.html.html

Lines changed: 2 additions & 2 deletions
@@ -16,13 +16,13 @@ <h1 class="Title" id="a59f117741c76dca0bc8f5ee72e2010b">
 <p class="UncategorizedText" id="d536ba7636a9a4603a81b358d1fe2590">
 Some text with CP1252-specific characters:
 </p>
-<p class="NarrativeText" id="3b8ca5305e52587b8fbbfcd994de0667">
+<div class="CodeSnippet" id="3b8ca5305e52587b8fbbfcd994de0667">
 Die schöne Frau hat einen Kaffee mit Kuchen gegessen. Sie sagte: "Das war köstlich!" und lächelte dabei. Der Preis betrug 15,50 €.
 L'été était trčs chaud cette année. J'ai acheté un café au lait pour 3,50 €. C'était délicieux ! L'homme a dit : "C'est parfait !"
 El nińo comió paella con ńoquis. La seńora dijo: "ˇQué rico!" y pagó 25,75 €. El restaurante tenía un menú del día.
 Kvinnan ĺt köttbullar med lingonsylt. Hon sa: "Det var fantastiskt!" och betalade 45,90 €. Mannen frĺgade: "Vill du ha mer?"
 O Joăo comprou um café por 2,50 €. Ele disse: "Está ótimo!" e sorriu. A mulher perguntou: "Quer mais alguma coisa?"
 De vrouw dronk koffie met koekjes. Ze zei: "Het was heerlijk!" en betaalde 4,25 €. Het kind vroeg: "Mag ik ook wat?"
-</p>
+</div>
 </body>
 </html>
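
The ingest fixtures show the externally visible effect of this commit: content that came from a `<pre>` block is now labeled `CodeSnippet` rather than `NarrativeText`, as an HTML `class` here and as the `"type"` field in the JSON expected output. A consumer that filters serialized elements by type may need to account for that. The following is a minimal sketch, assuming records with `"type"` and `"text"` keys as in the JSON fixture; the sample records and the `texts_of_type` helper are hypothetical.

```python
import json

# Sample records shaped like the JSON fixture (a "type" and a "text" key);
# the contents are made up for illustration.
serialized = json.dumps(
    [
        {"type": "Title", "text": "Example"},
        {"type": "CodeSnippet", "text": "line one\nline two"},
    ]
)


def texts_of_type(records: list[dict], wanted: set[str]) -> list[str]:
    """Return the text of every record whose "type" is in `wanted` (hypothetical helper)."""
    return [r["text"] for r in records if r.get("type") in wanted]


records = json.loads(serialized)

# A filter written against the old labeling no longer matches the <pre> content...
old_filter = texts_of_type(records, {"NarrativeText"})
# ...while one that also accepts the new label does.
new_filter = texts_of_type(records, {"NarrativeText", "CodeSnippet"})
```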

test_unstructured_ingest/expected-structured-output/local-single-file-with-encoding/fake-html-cp1252.html.json

Lines changed: 1 addition & 1 deletion
@@ -63,7 +63,7 @@
     }
   },
   {
-    "type": "NarrativeText",
+    "type": "CodeSnippet",
     "element_id": "3b8ca5305e52587b8fbbfcd994de0667",
     "text": "Die schöne Frau hat einen Kaffee mit Kuchen gegessen. Sie sagte: \"Das war köstlich!\" und lächelte dabei. Der Preis betrug 15,50 €.\nL'été était trčs chaud cette année. J'ai acheté un café au lait pour 3,50 €. C'était délicieux ! L'homme a dit : \"C'est parfait !\"\nEl nińo comió paella con ńoquis. La seńora dijo: \"ˇQué rico!\" y pagó 25,75 €. El restaurante tenía un menú del día.\nKvinnan ĺt köttbullar med lingonsylt. Hon sa: \"Det var fantastiskt!\" och betalade 45,90 €. Mannen frĺgade: \"Vill du ha mer?\"\nO Joăo comprou um café por 2,50 €. Ele disse: \"Está ótimo!\" e sorriu. A mulher perguntou: \"Quer mais alguma coisa?\"\nDe vrouw dronk koffie met koekjes. Ze zei: \"Het was heerlijk!\" en betaalde 4,25 €. Het kind vroeg: \"Mag ik ook wat?\"",
     "metadata": {

unstructured/__version__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.18.31-dev1"  # pragma: no cover
+__version__ = "0.18.31-dev2"  # pragma: no cover

unstructured/chunking/base.py

Lines changed: 10 additions & 4 deletions
@@ -11,6 +11,7 @@
 
 from unstructured.common.html_table import HtmlCell, HtmlRow, HtmlTable
 from unstructured.documents.elements import (
+    CodeSnippet,
     CompositeElement,
     ConsolidationStrategy,
     Element,
@@ -610,15 +611,20 @@ def overlap_tail(self) -> str:
     def _iter_text_segments(self) -> Iterator[str]:
         """Generate overlap text and each element text segment in order.
 
-        Empty text segments are not included.
+        Empty text segments are not included. CodeSnippet elements preserve their
+        original whitespace (including newlines) to maintain code formatting.
         """
         if self._overlap_prefix:
             yield self._overlap_prefix
         for e in self._elements:
             if e.text and len(e.text):
-                text = " ".join(e.text.strip().split())
-                if text:
-                    yield text
+                # -- preserve all whitespace for code snippets to maintain formatting --
+                if isinstance(e, CodeSnippet):
+                    yield e.text
+                else:
+                    text = " ".join(e.text.strip().split())
+                    if text:
+                        yield text
 
     @lazyproperty
     def _text(self) -> str:
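
The type-based dispatch in this hunk can be tried in isolation. The sketch below is a standalone approximation, not the library code: `Text` and `CodeSnippet` are stand-in dataclasses, `iter_text_segments` is a free function instead of a method, and joining segments with a blank line is an assumption based on the `"Here is some code:\n\n"` assertion in the new test.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Text:
    """Stand-in for a generic text element (hypothetical, not the library class)."""

    text: str


@dataclass
class CodeSnippet(Text):
    """Stand-in for a code element whose whitespace is significant."""


def iter_text_segments(elements: Iterable[Text], overlap_prefix: str = "") -> Iterator[str]:
    """Yield the overlap prefix, then one text segment per non-empty element."""
    if overlap_prefix:
        yield overlap_prefix
    for e in elements:
        if not e.text:
            continue
        if isinstance(e, CodeSnippet):
            # Code keeps its text untouched, including newlines and indentation.
            yield e.text
        else:
            # Everything else has whitespace runs collapsed to single spaces.
            text = " ".join(e.text.strip().split())
            if text:
                yield text


elements = [
    Text("Here is   some\ncode:"),
    CodeSnippet("for i in range(10):\n    print(i)"),
    Text("That was the code."),
]

# Assumed separator: a blank line between segments.
chunk_text = "\n\n".join(iter_text_segments(elements))
print(chunk_text)
```

Note that the code branch yields `e.text` as-is, so leading and trailing whitespace of a snippet also survives, while a whitespace-only prose element still produces no segment.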

unstructured/partition/html/parser.py

Lines changed: 5 additions & 1 deletion
@@ -87,6 +87,7 @@
 from unstructured.common.html_table import htmlify_matrix_of_cell_texts
 from unstructured.documents.elements import (
     Address,
+    CodeSnippet,
     Element,
     ElementMetadata,
     EmailAddress,
@@ -470,9 +471,12 @@ class ListItemBlock(Flow):
 class Pre(BlockItem):
     """Custom element-class for `<pre>` element.
 
-    Can only contain phrasing content.
+    Can only contain phrasing content. Generates CodeSnippet elements to preserve
+    code formatting including whitespace and line breaks.
     """
 
+    _ElementCls = CodeSnippet
+
     @lazyproperty
     def _element_accum(self) -> _ElementAccumulator:
         """Text-segment accumulator suitable for this block-element."""

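
The parser change in `unstructured/partition/html/parser.py` is a one-line class-attribute override. As a rough illustration of that pattern only, the sketch below uses hypothetical stand-ins: the `BlockItem` base, its `Text` default, its constructor, and the `make_element` method are assumptions for the example and do not reflect the real parser internals, which the diff does not show.

```python
from dataclasses import dataclass


@dataclass
class Text:
    """Stand-in for a generic text element (hypothetical)."""

    text: str


@dataclass
class CodeSnippet(Text):
    """Stand-in for a code element (hypothetical)."""


class BlockItem:
    """Hypothetical base: builds an element of whatever class `_ElementCls` names."""

    _ElementCls: type = Text

    def __init__(self, raw_text: str) -> None:
        self._raw_text = raw_text

    def make_element(self) -> Text:
        # The subclass decides the element type purely by overriding _ElementCls.
        return self._ElementCls(self._raw_text)


class Pre(BlockItem):
    """Mirrors the diff: only the element class is overridden."""

    _ElementCls = CodeSnippet


element = Pre("def hello():\n    return True").make_element()
print(type(element).__name__)  # CodeSnippet
```

Because the element type is chosen by a class attribute rather than by branching logic, the chunking code can then tell code from prose with a plain `isinstance` check.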