Skip to content

Fixes incorrect ID generation for identical chunks in RecursiveDocumentSplitter#9517

Merged
mpangrazzi merged 3 commits intodeepset-ai:mainfrom
gulbaki:fix/recursive-splitter-unique-ids
Jun 16, 2025
Merged

Fixes incorrect ID generation for identical chunks in RecursiveDocumentSplitter#9517
mpangrazzi merged 3 commits intodeepset-ai:mainfrom
gulbaki:fix/recursive-splitter-unique-ids

Conversation

@gulbaki
Copy link
Copy Markdown
Contributor

@gulbaki gulbaki commented Jun 14, 2025

Related Issues

Proposed Changes:

The issue occurred because Document.id is generated based on the content and meta at creation time.

However, meta fields like split_id and parent_id were added after the Document was instantiated, causing chunks with identical content and meta to produce identical ids.

  • Includes a unit test that verifies uniqueness of IDs and correct metadata assignment

How did you test it?

  • Added a unit test: test_recursive_splitter_generates_unique_ids_and_correct_meta

Notes for the reviewer

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

@gulbaki gulbaki requested review from a team as code owners June 14, 2025 13:23
@gulbaki gulbaki requested review from dfokina and mpangrazzi and removed request for a team June 14, 2025 13:23
@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Jun 14, 2025

CLA assistant check
All committers have signed the CLA.

Copy link
Copy Markdown
Contributor

@mpangrazzi mpangrazzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but I've left some comments below.

meta = deepcopy(doc.meta)
meta["parent_id"] = doc.id
meta["split_id"] = split_nr
meta["split_idx_start"] = current_position
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you've added this, you need to remove:

new_doc.meta["split_idx_start"] = current_position

line below

meta["split_idx_start"] = current_position
new_doc = Document(content=chunk, meta=meta)
new_doc.meta["split_idx_start"] = current_position
new_doc.meta["_split_overlap"] = [] if self.split_overlap > 0 else None
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency with the fix's intent, I would set all metadata should before Document creation, not after. WDYT?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you’re right. I’ve moved split_idx_start and _split_overlap.

• `split_id` values must be 0, 1, 2, …
"""
from haystack import Document
from haystack.components.preprocessors.recursive_splitter import RecursiveDocumentSplitter
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those imports are already present at the top of the file. Please remove them.

def test_recursive_splitter_generates_unique_ids_and_correct_meta():
"""
• Every produced chunk must have a unique `id`.
• Each chunk must carry `parent_id` = original doc’s ID.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this docstring is not strictly needed (the test name and the comments are already descriptive enough).

@gulbaki gulbaki requested a review from mpangrazzi June 16, 2025 13:54
@mpangrazzi mpangrazzi enabled auto-merge (squash) June 16, 2025 19:00
@mpangrazzi mpangrazzi disabled auto-merge June 16, 2025 19:01
@mpangrazzi mpangrazzi enabled auto-merge (squash) June 16, 2025 19:01
@coveralls
Copy link
Copy Markdown
Collaborator

Pull Request Test Coverage Report for Build 15689509354

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.001%) to 90.144%

Totals Coverage Status
Change from base Build 15683643191: 0.001%
Covered Lines: 11543
Relevant Lines: 12805

💛 - Coveralls

Copy link
Copy Markdown
Contributor

@mpangrazzi mpangrazzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mpangrazzi mpangrazzi merged commit 7dbac5b into deepset-ai:main Jun 16, 2025
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

RecursiveDocumentSplitter updates Document's meta field after initializing it

4 participants