Commit 7dbac5b

gulbaki and mpangrazzi authored
Fixes incorrect ID generation for identical chunks in RecursiveDocumentSplitter (#9517)
* fix(preprocessor): ensure RecursiveDocumentSplitter generates unique chunk IDs
* fix: update meta handling in RecursiveDocumentSplitter to ensure correct overlap information

---------

Co-authored-by: Michele Pangrazzi <xmikex83@gmail.com>
1 parent 7570f6b commit 7dbac5b

3 files changed

Lines changed: 28 additions & 4 deletions

haystack/components/preprocessors/recursive_splitter.py

Lines changed: 6 additions & 4 deletions
@@ -423,10 +423,12 @@ def _run_one(self, doc: Document) -> List[Document]:
         new_docs: List[Document] = []
 
         for split_nr, chunk in enumerate(chunks):
-            new_doc = Document(content=chunk, meta=deepcopy(doc.meta))
-            new_doc.meta["split_id"] = split_nr
-            new_doc.meta["split_idx_start"] = current_position
-            new_doc.meta["_split_overlap"] = [] if self.split_overlap > 0 else None
+            meta = deepcopy(doc.meta)
+            meta["parent_id"] = doc.id
+            meta["split_id"] = split_nr
+            meta["split_idx_start"] = current_position
+            meta["_split_overlap"] = [] if self.split_overlap > 0 else None
+            new_doc = Document(content=chunk, meta=meta)
 
             # add overlap information to the previous and current doc
             if split_nr > 0 and self.split_overlap > 0:

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+---
+fixes:
+  - |
+    **RecursiveDocumentSplitter** now generates a unique `Document.id` for every chunk. The meta fields (`split_id`, `parent_id`, etc.) are populated _before_ `Document` creation, so the hash used for `id` generation is always unique.
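
The ordering issue described in the release note can be illustrated with a small, dependency-free sketch. `MiniDocument` is a hypothetical stand-in, not Haystack's `Document`; it only assumes what the note states, namely that the `id` is a hash computed once at construction from content and meta. `split_old` and `split_new` are likewise illustrative names that mirror the removed and added lines of the diff.

```python
import hashlib
from copy import deepcopy
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class MiniDocument:
    """Hypothetical stand-in: the id is hashed once, when the object is built."""

    content: str
    meta: Dict[str, Any] = field(default_factory=dict)
    id: str = field(init=False)

    def __post_init__(self) -> None:
        # The id reflects content and meta as they are at construction time;
        # mutating meta afterwards does not recompute it.
        payload = f"{self.content}{sorted(self.meta.items())!r}"
        self.id = hashlib.sha256(payload.encode("utf-8")).hexdigest()


def split_old(parent: MiniDocument, chunks: List[str]) -> List[MiniDocument]:
    """Old order: construct first, then add split_id (too late for the hash)."""
    docs = []
    for split_nr, chunk in enumerate(chunks):
        new_doc = MiniDocument(content=chunk, meta=deepcopy(parent.meta))
        new_doc.meta["split_id"] = split_nr
        docs.append(new_doc)
    return docs


def split_new(parent: MiniDocument, chunks: List[str]) -> List[MiniDocument]:
    """New order: build the full meta first, then construct the document."""
    docs = []
    for split_nr, chunk in enumerate(chunks):
        meta = deepcopy(parent.meta)
        meta["parent_id"] = parent.id
        meta["split_id"] = split_nr
        docs.append(MiniDocument(content=chunk, meta=meta))
    return docs


parent = MiniDocument(content="Haystack is awesome. " * 3)
chunks = ["Haystack is awesome. "] * 3  # three identical chunks

old_ids = {d.id for d in split_old(parent, chunks)}
new_ids = {d.id for d in split_new(parent, chunks)}
print(len(old_ids), len(new_ids))  # prints "1 3"
```

In this sketch the old order collapses three identical chunks onto one id, while the new order yields three distinct ids because `split_id` differs per chunk before hashing.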

test/components/preprocessors/test_recursive_splitter.py

Lines changed: 18 additions & 0 deletions
@@ -990,3 +990,21 @@ def test_run_complex_text_with_multiple_separators():
     assert len(chunks[3].content) == 152
     assert chunks[3].content.startswith("C")
     assert chunks[3].content.endswith("D" * 50)
+
+
+def test_recursive_splitter_generates_unique_ids_and_correct_meta():
+    text = "Haystack is awesome. " * 5
+    source_doc = Document(content=text)
+
+    splitter = RecursiveDocumentSplitter(split_length=3)
+    splitter.warm_up()
+
+    chunks = splitter.run([source_doc])["documents"]
+
+    # IDs must be unique
+    assert len({c.id for c in chunks}) == len(chunks)
+
+    # parent_id and split_id checks
+    for idx, chunk in enumerate(chunks):
+        assert chunk.meta["parent_id"] == source_doc.id
+        assert chunk.meta["split_id"] == idx
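
Because the test above checks that every chunk carries `parent_id` and `split_id`, downstream code could use those two fields to regroup chunks by their source document. The following is a minimal sketch under that assumption; `group_chunks_by_parent` is a hypothetical helper, and plain dictionaries stand in for chunk objects so the example runs without Haystack installed.

```python
from collections import defaultdict
from typing import Any, Dict, List

Chunk = Dict[str, Any]  # stand-in shape: {"content": str, "meta": {...}}


def group_chunks_by_parent(chunks: List[Chunk]) -> Dict[str, List[Chunk]]:
    """Group chunks by meta["parent_id"], ordered by meta["split_id"]."""
    grouped: Dict[str, List[Chunk]] = defaultdict(list)
    for chunk in chunks:
        grouped[chunk["meta"]["parent_id"]].append(chunk)
    # Restore the original split order inside each group.
    for siblings in grouped.values():
        siblings.sort(key=lambda c: c["meta"]["split_id"])
    return dict(grouped)


# Illustrative data only; the ids are made up for this example.
sample = [
    {"content": "b", "meta": {"parent_id": "doc-1", "split_id": 1}},
    {"content": "x", "meta": {"parent_id": "doc-2", "split_id": 0}},
    {"content": "a", "meta": {"parent_id": "doc-1", "split_id": 0}},
]

grouped = group_chunks_by_parent(sample)
print([c["content"] for c in grouped["doc-1"]])  # prints "['a', 'b']"
```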
