
Commit a545c6a

MechaCritter and sjrl authored
fixed bug where MarkdownHeaderSplitter's split result missed the first direct parent's header in the metadata and added lark to pyproject.toml (#11042)
* fixed bug where MarkdownHeaderSplitter's split result missed the first direct parent's header in the metadata and added lark to pyproject.toml
* added release note
* reverted changes of adding "lark"
* collapsed release note to contain only "fixes"
* added test "test_keep_headers_with_secondary_split_preserves_parent_headers_for_first_child" to prove my concept.
* Update releasenotes/notes/fix-missing-parent-header-error-MarkdownHeaderSplitter-b5db96e19011b6b9.yaml
  written description in past tense
  Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
* Update releasenotes/notes/fix-missing-parent-header-error-MarkdownHeaderSplitter-b5db96e19011b6b9.yaml
  written description in past tense
  Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
* Update releasenotes/notes/fix-missing-parent-header-error-MarkdownHeaderSplitter-b5db96e19011b6b9.yaml
  written description in past tense
  Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
* Update releasenotes/notes/fix-missing-parent-header-error-MarkdownHeaderSplitter-b5db96e19011b6b9.yaml
  Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
* Update test/components/preprocessors/test_markdown_header_splitter.py
  removed the "secondary_split="word"" argument as error happens regardless of if secondary split is present
  Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
* Update test/components/preprocessors/test_markdown_header_splitter.py
  broke down text in unit test for readability
  Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
* Update test/components/preprocessors/test_markdown_header_splitter.py
  added sanity checking (reconstructed text is equal original text)
  Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
* Update test/components/preprocessors/test_markdown_header_splitter.py

---------

Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
1 parent 91e71ee commit a545c6a

3 files changed, +92 -8 lines changed


haystack/components/preprocessors/markdown_header_splitter.py

Lines changed: 3 additions & 8 deletions
@@ -147,7 +147,6 @@ def _split_text_by_markdown_headers(self, text: str, doc_id: str) -> list[dict]:
         # process headers and build chunks
         chunks: list[dict] = []
         header_stack: list[str | None] = [None] * 6
-        active_parents: list[str] = []  # track active parent headers
         pending_headers: list[str] = []  # store empty headers to prepend to next content
         has_content = False  # flag to track if any header has content
 
@@ -169,16 +168,15 @@ def _split_text_by_markdown_headers(self, text: str, doc_id: str) -> list[dict]:
 
             # skip splits w/o content
             if not content.strip():  # this strip is needed to avoid counting whitespace as content
-                # add as parent for subsequent headers
-                active_parents = [h for h in header_stack[: level - 1] if h is not None]
-                active_parents.append(header_text)
                 if self.keep_headers:
                     header_line = f"{header_prefix} {header_text}"
                     pending_headers.append(header_line)
                 continue
 
             has_content = True  # at least one header has content
-            parent_headers = list(active_parents)
+            # Build parent metadata from the current header stack so the first child of a
+            # contentful section still inherits its full ancestor chain.
+            parent_headers = [h for h in header_stack[: level - 1] if h is not None]
 
             logger.debug(
                 "Creating chunk for header '{header_text}' at level {level}", header_text=header_text, level=level
@@ -198,9 +196,6 @@ def _split_text_by_markdown_headers(self, text: str, doc_id: str) -> list[dict]:
             else:
                 chunks.append({"content": content, "meta": {"header": header_text, "parent_headers": parent_headers}})
 
-            # reset active parents
-            active_parents = [h for h in header_stack[: level - 1] if h is not None]
-
         # return doc unchunked if no headers have content
         if not has_content:
             logger.info(
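
The replacement line in the second hunk derives `parent_headers` directly from `header_stack`. A minimal standalone sketch of that idea follows. It is not the Haystack implementation: the regex, the function name `parent_headers_per_section`, and the way the stack is updated (set the current level, clear deeper levels) are assumptions made for illustration, since the code that maintains `header_stack` is outside the hunks shown.

```python
from __future__ import annotations

import re

# Hypothetical, simplified header pattern (1 to 6 '#' characters, a space, then the title).
HEADER_RE = re.compile(r"^(#{1,6}) (.+)$", re.MULTILINE)


def parent_headers_per_section(text: str) -> list[tuple[str, list[str]]]:
    """Return (header, parent_headers) pairs, reading parents from a header stack."""
    header_stack: list[str | None] = [None] * 6
    result: list[tuple[str, list[str]]] = []
    for match in HEADER_RE.finditer(text):
        level = len(match.group(1))
        header_text = match.group(2).strip()
        # Assumed stack update: record this header and forget deeper levels.
        header_stack[level - 1] = header_text
        for deeper in range(level, 6):
            header_stack[deeper] = None
        # Same expression as the added line in the diff: every non-empty ancestor slot.
        parents = [h for h in header_stack[: level - 1] if h is not None]
        result.append((header_text, parents))
    return result


sample = (
    "# Header 1\nIntro text\n\n"
    "## Header 1.1\nText 1\n\n"
    "## Header 1.2\nText 2\n\n"
    "### Header 1.2.1\nText 3\n"
)
pairs = parent_headers_per_section(sample)
```

Because the parents are read from the stack at the moment each chunk is created, no separate bookkeeping list has to be kept in sync with it.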
releasenotes/notes/fix-missing-parent-header-error-MarkdownHeaderSplitter-b5db96e19011b6b9.yaml

Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
+---
+fixes:
+  - |
+    When using the **MarkdownHeaderSplitter**, in the split chunks, the child header previously lost
+    its direct parent header in the metadata. Previously if one executed the code below:
+
+    .. code:: python
+      from haystack.components.preprocessors import MarkdownHeaderSplitter
+      from haystack import Document
+
+      text = """
+      # header 1
+      intro text
+
+      ## header 1.1
+      text 1
+
+      ## header 1.2
+      text 2
+
+      ### header 1.2.1
+      text 3
+
+      ### header 1.2.2
+      text 4
+      """
+
+      document = Document(content=text)
+
+      splitter = MarkdownHeaderSplitter(
+          keep_headers=True,
+          secondary_split="word"
+      )
+      result = splitter.run(documents=[document])["documents"]
+
+      for doc in result:
+          print(f"Header: {doc.meta['header']}, parent headers: {doc.meta['parent_headers']}")
+
+    We would have expected this output:
+
+    .. code:: text
+
+      Header: header 1, parent headers: []
+      Header: header 1.1, parent headers: ['header 1']
+      Header: header 1.2, parent headers: ['header 1']
+      Header: header 1.2.1, parent headers: ['header 1', 'header 1.2']
+      Header: header 1.2.2, parent headers: ['header 1', 'header 1.2']
+
+    But instead we actually got:
+
+    .. code:: text
+      Header: header 1, parent headers: []
+      Header: header 1.1, parent headers: []
+      Header: header 1.2, parent headers: ['header 1']
+      Header: header 1.2.1, parent headers: ['header 1']
+      Header: header 1.2.2, parent headers: ['header 1', 'header 1.2']
+
+    The error happened when a parent header had its own content chunk before the first
+    child header.
+
+    This has been fixed so even when a parent header has its own content chunk before the first child header all content is preserved.
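
To see why the release note's "actually got" output arises, the removed `active_parents` bookkeeping can be replayed on the same header sequence. The sketch below is a simplified model, not the original method: the helper name `parents_before_fix`, the `(level, text, has_content)` input tuples, and the stack update rule are assumptions. Only the `active_parents` assignments mirror the deleted lines of the diff.

```python
from __future__ import annotations


def parents_before_fix(headers: list[tuple[int, str, bool]]) -> list[tuple[str, list[str]]]:
    """Replay the pre-fix logic: parents come from a lagging active_parents list."""
    header_stack: list[str | None] = [None] * 6
    active_parents: list[str] = []
    chunks: list[tuple[str, list[str]]] = []
    for level, text, has_content in headers:
        # Assumed stack update: record this header and clear deeper levels.
        header_stack[level - 1] = text
        for deeper in range(level, 6):
            header_stack[deeper] = None
        if not has_content:
            # Removed lines: only an empty header registered itself as a parent.
            active_parents = [h for h in header_stack[: level - 1] if h is not None]
            active_parents.append(text)
            continue
        # Removed line: the chunk copies whatever active_parents currently holds.
        chunks.append((text, list(active_parents)))
        # Removed line: the reset uses this header's ancestors, never the header itself,
        # so a header with content is missing from its first child's parents.
        active_parents = [h for h in header_stack[: level - 1] if h is not None]
    return chunks


release_note_headers = [
    (1, "header 1", True),
    (2, "header 1.1", True),
    (2, "header 1.2", True),
    (3, "header 1.2.1", True),
    (3, "header 1.2.2", True),
]
buggy = parents_before_fix(release_note_headers)
```

In this model, a header that has its own content is added to `active_parents` only after its first child has already been emitted, which matches the one-step lag visible for `header 1.1` and `header 1.2.1` in the output above.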

test/components/preprocessors/test_markdown_header_splitter.py

Lines changed: 28 additions & 0 deletions
@@ -91,6 +91,34 @@ def test_basic_split(sample_text):
     assert reconstructed_doc == sample_text
 
 
+def test_keep_headers_preserves_parent_headers_for_first_child():
+    text = (
+        "# Header 1\n"
+        "Intro text\n\n"
+        "## Header 1.1\n"
+        "Text 1\n\n"
+        "## Header 1.2\n"
+        "Text 2\n\n"
+        "### Header 1.2.1\n"
+        "Text 3\n\n"
+        "### Header 1.2.2\n"
+        "Text 4\n"
+    )
+    splitter = MarkdownHeaderSplitter(keep_headers=True)
+    split_docs = splitter.run(documents=[Document(content=text)])["documents"]
+
+    assert [(doc.meta["header"], doc.meta["parent_headers"]) for doc in split_docs] == [
+        ("Header 1", []),
+        ("Header 1.1", ["Header 1"]),
+        ("Header 1.2", ["Header 1"]),
+        ("Header 1.2.1", ["Header 1", "Header 1.2"]),
+        ("Header 1.2.2", ["Header 1", "Header 1.2"]),
+    ]
+    # reconstruct original text
+    reconstructed_text = "".join(doc.content for doc in split_docs)
+    assert reconstructed_text == text
+
+
 def test_split_without_headers(sample_text):
     splitter = MarkdownHeaderSplitter(keep_headers=False)
     docs = [Document(content=sample_text)]
