Commit c2c4d42

Fix #1181: Preserve whitespace in code blocks during HTML scraping
The remove_empty_elements_fast() method was removing whitespace-only span elements inside <pre> and <code> tags, so import statements like "import torch" became "importtorch". The method now skips elements inside code blocks, where whitespace is significant.
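
The failure mode described above can be reproduced in a few lines. The following is a minimal sketch, not the crawl4ai implementation: it uses the standard library's xml.etree.ElementTree instead of lxml, and `naive_prune` and `SNIPPET` are hypothetical names standing in for a pruning pass that drops every element without visible text.

```python
import xml.etree.ElementTree as ET

# Highlighted code as a syntax highlighter might emit it: the space between
# "import" and "torch" lives in its own whitespace-only span.
SNIPPET = (
    "<pre><code>"
    '<span class="kn">import</span>'
    '<span class="w"> </span>'
    '<span class="nn">torch</span>'
    "</code></pre>"
)


def naive_prune(root):
    """Hypothetical simplification: remove every element with no visible text."""
    # ElementTree has no getparent(), so build a child -> parent map first.
    parents = {child: parent for parent in root.iter() for child in parent}
    for el in list(root.iter()):
        if el is root:
            continue
        if "".join(el.itertext()).strip() == "":
            parents[el].remove(el)
    return root


root = ET.fromstring(SNIPPET)
before = "".join(root.itertext())
after = "".join(naive_prune(root).itertext())

print(repr(before))  # 'import torch'
print(repr(after))   # 'importtorch'
```

The whitespace-only span is the only thing separating the two tokens, so removing it as "empty" silently changes the code.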
1 parent f68e753 commit c2c4d42

1 file changed

Lines changed: 13 additions & 0 deletions

crawl4ai/content_scraping_strategy.py

@@ -542,6 +542,19 @@ def remove_empty_elements_fast(self, root, word_count_threshold=5):
             if el.tag in bypass_tags:
                 continue
 
+            # Skip elements inside <pre> or <code> tags where whitespace is significant
+            # This preserves whitespace-only spans (e.g., <span class="w"> </span>) in code blocks
+            is_in_code_block = False
+            ancestor = el.getparent()
+            while ancestor is not None:
+                if ancestor.tag in ("pre", "code"):
+                    is_in_code_block = True
+                    break
+                ancestor = ancestor.getparent()
+
+            if is_in_code_block:
+                continue
+
             text_content = (el.text_content() or "").strip()
             if (
                 len(text_content.split()) < word_count_threshold
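
The ancestor walk added in this hunk can be exercised in isolation. The sketch below is an approximation under stated assumptions, not the project's code: it uses the standard library's xml.etree.ElementTree with a parent map in place of lxml's getparent(), and `CODE_TAGS`, `build_parent_map`, `is_in_code_block`, and `prune_empty` are hypothetical helper names chosen for illustration.

```python
import xml.etree.ElementTree as ET

CODE_TAGS = ("pre", "code")


def build_parent_map(root):
    """Map each element to its parent (stand-in for lxml's getparent())."""
    return {child: parent for parent in root.iter() for child in parent}


def is_in_code_block(el, parents):
    """Return True if any ancestor of el is a <pre> or <code> element."""
    ancestor = parents.get(el)
    while ancestor is not None:
        if ancestor.tag in CODE_TAGS:
            return True
        ancestor = parents.get(ancestor)
    return False


def prune_empty(root):
    """Remove text-less elements, except those nested inside code blocks."""
    parents = build_parent_map(root)
    for el in list(root.iter()):
        if el is root or is_in_code_block(el, parents):
            continue
        if "".join(el.itertext()).strip() == "":
            parents[el].remove(el)
    return root


HTML = (
    "<div>"
    "<p><span> </span></p>"
    "<pre><code><span>import</span><span> </span><span>torch</span></code></pre>"
    "</div>"
)

root = prune_empty(ET.fromstring(HTML))
code_text = "".join(root.find("pre").itertext())

print(root.find("p"))   # None: the empty paragraph was removed
print(repr(code_text))  # 'import torch': the whitespace span survived
```

As in the diff, the walk starts at the element's parent, so the check covers descendants of <pre> and <code> rather than those elements themselves. Each check costs time proportional to the element's depth in the tree.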
