
feature: Chinese DocumentSplitter #9453

Closed

mc112611 wants to merge 24 commits into deepset-ai:main from mc112611:feature/chinese-document-splitter

Conversation

@mc112611

@mc112611 mc112611 commented May 29, 2025

Related Issues

No related issue. This PR originated from a community discussion about improving Chinese document splitting support.


Proposed Changes

This PR introduces a ChineseDocumentSplitter that supports accurate sentence and paragraph splitting for Chinese documents. It leverages the HanLP library for Chinese linguistic analysis, including sentence segmentation and tokenization.

In addition, a full end-to-end example notebook (examples/chinese_RAG_test_haystack_chinese.ipynb) is included, demonstrating how to use the splitter in a RAG pipeline for Chinese documents.
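
For illustration, here is a minimal usage sketch. The class name follows this PR, but the parameter names and the warm_up step are assumptions and may differ from the final API:

```python
# Illustrative only: parameter names and the warm_up step are assumptions.
from haystack import Document
from haystack.components.preprocessors import ChineseDocumentSplitter

splitter = ChineseDocumentSplitter(split_by="sentence", split_length=2, split_overlap=1)
splitter.warm_up()  # load the HanLP models if the component defers loading

docs = [Document(content="月光洒在小径上。晚风吹过树梢。远处传来虫鸣。")]
result = splitter.run(documents=docs)
for doc in result["documents"]:
    print(doc.content)
```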


How did you test it?

  • Manual testing with multiple Chinese texts
  • Verified the RAG pipeline using the provided Jupyter Notebook
  • Ran hatch run test:unit to ensure unit tests pass

Notes for the reviewer

  • This feature is non-breaking and fully compatible with existing RAG pipelines
  • The implementation is isolated in a custom splitter class, leaving the default behavior untouched

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related discussion with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used a conventional commit type for the PR title (feat: in this case)
  • I documented my code
  • I ran pre-commit hooks
  • I added a release note using hatch run release-note chinese-document-splitter

@mc112611 mc112611 requested review from a team as code owners May 29, 2025 09:56
@mc112611 mc112611 requested review from davidsbatista and dfokina and removed request for a team May 29, 2025 09:56
@review-notebook-app

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@CLAassistant

CLAassistant commented May 29, 2025

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added the type:documentation (Improvements on the docs) label on May 29, 2025
Comment thread haystack/components/preprocessors/test_chinese_document_spliter.py Outdated
Comment thread haystack/components/preprocessors/test_chinese_document_spliter.py Outdated
Comment thread haystack/components/preprocessors/test_chinese_document_spliter.py Outdated
@davidsbatista
Contributor

Hi @mc112611!

Thanks for your contribution, appreciated, still a few things need to be improved:

  • Please remove the notebook; if you wish, we can later integrate it into our tutorials repository
  • Please remove the file original_pipeline.png

Don't make any changes to the config.yml in the release notes; instead, add the release notes to the newly created yml file in the notes directory.

Just add to the highlights:

highlights: >
  Added support for Chinese DocumentSplitter, enabling precise splitting of Chinese documents.

Thanks!

mc112611 and others added 2 commits June 3, 2025 18:45
- Removed notebook and original_pipeline.png
- Added release note YAML file in notes/
- Reverted config.yaml
- Implemented lazy import for hanlp
- Removed main guard block from module
@mc112611
Author

mc112611 commented Jun 3, 2025


"Thanks for the feedback! I've addressed all the issues you mentioned, including:

Removed if name == "main" block.

Replaced HanLP import with LazyImport.

Moved release notes to notes/ directory and updated highlights.

Removed notebook and original_pipeline.png.

Please let me know if any further changes are needed. 😊"
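
For reference, a minimal sketch of the lazy-import change, assuming Haystack's LazyImport helper; the exact message and where the check is called are illustrative:

```python
from haystack.components.preprocessors import DocumentSplitter
from haystack.lazy_imports import LazyImport

# If hanlp is missing, the import error is deferred until the splitter is actually used.
with LazyImport(message="Run 'pip install hanlp' to use the ChineseDocumentSplitter.") as hanlp_import:
    import hanlp  # noqa: F401


class ChineseDocumentSplitter(DocumentSplitter):
    def __init__(self, *args, **kwargs) -> None:
        hanlp_import.check()  # raises with the message above if hanlp is not installed
        super().__init__(*args, **kwargs)
```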

@mc112611
Author

mc112611 commented Jun 3, 2025

Hi Haystack team,

Thank you for the review and feedback. After the PR is merged, I plan to submit the chinese-haystack-RAG.ipynb notebook to the haystack-tutorials repository. This notebook will serve as a tutorial supporting Chinese language use cases.

Looking forward to your approval.

Best regards,
mc112611

@davidsbatista
Contributor

@mc112611 thanks for the changes. I still have more changes and a detailed review to make; I will attend to it as soon as possible.

@davidsbatista
Contributor

@mc112611 I've cleaned the release notes - I will start a more detailed review now.

Can you please also add tests?

@davidsbatista
Contributor

@mc112611 make sure you run the pre-commit hooks (see here), since there are lots of issues with unused imports and code formatting.

Comment thread haystack/components/preprocessors/chinese_document_spliter.py Outdated
Comment thread haystack/components/preprocessors/chinese_document_spliter.py Outdated
@davidsbatista davidsbatista changed the title from "Feature/chinese document splitter" to "feature: Chinese DocumentSplitter" on Jun 4, 2025
@coveralls
Collaborator

coveralls commented Jun 4, 2025

Pull Request Test Coverage Report for Build 15447031649

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-1.2%) to 89.228%

Totals
  • Change from base Build 15446010796: -1.2%
  • Covered Lines: 11481
  • Relevant Lines: 12867

💛 - Coveralls

@mc112611
Author

mc112611 commented Jun 5, 2025

Hello everyone,

I have completed the following work and committed the changes:

  • Added a test script located at test/components/preprocessors/test_chinese_document_splitter.py
  • Removed Chinese comments from haystack/components/preprocessors/chinese_document_splitter.py
  • Tested ChineseDocumentSplitter thoroughly using the command:
    hatch run pytest test/components/preprocessors/test_chinese_document_splitter.py
    The tests cover (an illustrative sketch follows this list):
    • Splitting by word
    • Splitting by sentence
    • Respecting sentence boundaries
    • Overlapping window splitting

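For illustration, one of these cases might look roughly like this; the class name, parameters, and expected counts are assumptions mirroring the list above, not the actual test file:

```python
import pytest

from haystack import Document
from haystack.components.preprocessors import ChineseDocumentSplitter


@pytest.mark.integration  # downloads HanLP models on first run
def test_split_by_sentence():
    splitter = ChineseDocumentSplitter(split_by="sentence", split_length=1, split_overlap=0)
    splitter.warm_up()
    docs = [Document(content="今天天气很好。我们去公园散步。")]
    result = splitter.run(documents=docs)
    # One split per sentence, each keeping its Chinese full stop.
    assert len(result["documents"]) == 2
    assert all(d.content.endswith("。") for d in result["documents"])
```
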
Additionally, I followed the instructions in CONTRIBUTING.md and performed:

  • Ran pre-commit install
  • Encountered an error when running hatch run test:unit: ModuleNotFoundError: No module named 'hanlp'. Installing hanlp via pip install hanlp did not resolve this issue.
  • Encountered an error when running hatch run test:types due to a nested haystack folder inside the haystack directory: haystack\core\component\component.py: error: Source file found twice under different module names: "CUDA_Preject.haystack.haystack.core.component.component" and "haystack.core.component.component"

Furthermore, I have completed the following formatting and checks:

  • hatch run format-check completed
  • hatch run format completed
  • hatch run check --fix completed

Please let me know if there are any further suggestions or if I can assist in resolving the above environment issues. Thank you!


Best regards,
mc112611

@davidsbatista
Contributor

Hi @mc112611 - thanks for taking those into account.

I have improved some aspects of your PR and pushed directly to this branch, so we are working on this together. Be aware of that, and before committing any changes, ensure you have the latest version.

Thank you for adding the tests - I will run the complete CI and do another code review. I will let you know if any doubts or questions arise.

Thank you once again!

@mc112611
Author

mc112611 commented Jun 5, 2025

I initially attempted to add self to certain functions, but some methods were marked with @staticmethod, so they cannot use self. While the core functionality I tested in the script worked as expected, features like pagination were not actively maintained, so there might be room for improvement there.

@davidsbatista
Contributor

Hi @mc112611 thanks for the insights.

I need to work a bit on this PR to make it pass all the tests and CI. Please don't sync it with main; we can do that later, as syncing now raises some issues.

@julian-risch
Member

julian-risch commented Jun 5, 2025

Hello @mc112611, thank you so much for opening this PR after the initial GitHub discussion! I just talked to @davidsbatista and we agreed that we can take over from here and make the final changes ourselves before the PR can be merged. 🙂
For that purpose, David and I will continue working directly in the Haystack repository instead of your fork. Your commits are preserved in that new PR, of course, giving you credit for your great work! #9494
I will close the old PR with this comment.

David and I know what changes the two of us still want to apply before releasing the new feature. It's mainly about tests and the loading of the hanlp models.
We have only one question: are you using your ChineseDocumentSplitter with English input too, or only with Chinese? Currently, the ChineseDocumentSplitter also handles English input, for example here. We believe it's better to only handle Chinese input in the ChineseDocumentSplitter, and I am planning to update the code. Does that idea make sense to you? Or do you use your ChineseDocumentSplitter for English and Chinese input at the same time?
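
One possible shape for that restriction, purely as an illustrative sketch (the actual change may end up looking different):

```python
# Hypothetical guard: reject non-Chinese configurations instead of handling them.
SUPPORTED_LANGUAGES = {"zh"}


def validate_language(language: str) -> None:
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"ChineseDocumentSplitter only supports Chinese input (language='zh'), got '{language}'. "
            "Use DocumentSplitter for other languages."
        )
```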

Again, thank you so much and stay tuned for the next Haystack release in ~3 weeks, where we will include your ChineseDocumentSplitter! We might be able to release it even faster if we add the ChineseDocumentSplitter to haystack-core-integrations instead.

@mc112611
Author

mc112611 commented Jun 5, 2025

Hi, thank you so much for your kind message, and I really appreciate the opportunity to contribute!

In fact, I implemented the ChineseDocumentSplitter by subclassing the existing DocumentSplitter because I wanted to stay as aligned as possible with Haystack's data flow and design patterns.

As for the language handling: I agree with your suggestion that ChineseDocumentSplitter should only process Chinese input. Currently, the best practice is to use language='zh' to explicitly control this. I haven't specifically tested how it behaves with language='en', but I will look into it. It’s certainly expected that if the input is in English while language='zh' is set, the splitter would not apply any segmentation.

Regarding hanlp: it's a powerful tool for Chinese sentence and word tokenization. However, I haven't rigorously tested it on mixed-language (Chinese-English) documents. In practical applications, routing documents to the appropriate splitter based on language detection — perhaps using Haystack’s built-in language classifier — seems like a good long-term approach.
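
As a rough sketch of that routing idea, something along these lines could work; the rule syntax and socket names here are assumptions on my side:

```python
from haystack import Pipeline
from haystack.components.classifiers import DocumentLanguageClassifier
from haystack.components.preprocessors import ChineseDocumentSplitter, DocumentSplitter
from haystack.components.routers import MetadataRouter

pipeline = Pipeline()
pipeline.add_component("language_classifier", DocumentLanguageClassifier(languages=["en", "zh"]))
pipeline.add_component("router", MetadataRouter(rules={
    "en": {"field": "meta.language", "operator": "==", "value": "en"},
    "zh": {"field": "meta.language", "operator": "==", "value": "zh"},
}))
pipeline.add_component("en_splitter", DocumentSplitter(split_by="sentence", split_length=3))
pipeline.add_component("zh_splitter", ChineseDocumentSplitter(split_by="sentence", split_length=3))

pipeline.connect("language_classifier.documents", "router.documents")
pipeline.connect("router.en", "en_splitter.documents")
pipeline.connect("router.zh", "zh_splitter.documents")
```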

On a side note, I’m currently working on a Chinese RAG demo based on Haystack, which builds on the notebook I submitted earlier. Once it's cleaned up, I plan to contribute it to the Haystack tutorial repository.

Thank you again — I've learned a great deal from reading and working with the Haystack codebase. The structure has really helped me grow as a developer. If there's anything I can help with further, feel free to let me know!

@julian-risch
Member

That's very helpful, thank you. By the way, you don't need to test how the ChineseDocumentSplitter behaves with language='en'. David and I will reduce the code to the part that handles language='zh' then.
Looking forward to the Chinese RAG demo!

