
feature: Chinese DocumentSplitter #9453

Closed

mc112611 wants to merge 24 commits into deepset-ai:main from mc112611:feature/chinese-document-splitter

Conversation

@mc112611

@mc112611 mc112611 commented May 29, 2025

Related Issues

No related issue. This PR originated from a community discussion about improving Chinese document splitting support.


Proposed Changes

This PR introduces a ChineseDocumentSplitter that supports accurate sentence and paragraph splitting for Chinese documents. It leverages the HanLP library for Chinese linguistic analysis, including sentence segmentation and tokenization.

In addition, a full end-to-end example notebook (examples/chinese_RAG_test_haystack_chinese.ipynb) is included, demonstrating how to use the splitter in a RAG pipeline for Chinese documents.
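
For illustration, here is a minimal usage sketch. The class name follows this PR, but the parameter names and the warm_up step are assumptions and may differ from the final API:

```python
# Illustrative only: parameter names and the warm_up step are assumptions.
from haystack import Document
from haystack.components.preprocessors import ChineseDocumentSplitter

splitter = ChineseDocumentSplitter(split_by="sentence", split_length=2, split_overlap=1)
splitter.warm_up()  # load the HanLP models if the component defers loading

docs = [Document(content="月光洒在小径上。晚风吹过树梢。远处传来虫鸣。")]
result = splitter.run(documents=docs)
for doc in result["documents"]:
    print(doc.content)
```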


How did you test it?

  • Manual testing with multiple Chinese texts
  • Verified the RAG pipeline using the provided Jupyter Notebook
  • Ran hatch run test:unit to ensure unit tests pass

Notes for the reviewer

  • This feature is non-breaking and fully compatible with existing RAG pipelines
  • The implementation is isolated in a custom splitter class, leaving the default behavior untouched

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related discussion with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used a conventional commit type for the PR title (feat: in this case)
  • I documented my code
  • I ran pre-commit hooks
  • I added a release note using hatch run release-note chinese-document-splitter

@mc112611 mc112611 requested review from a team as code owners May 29, 2025 09:56
@mc112611 mc112611 requested review from davidsbatista and dfokina and removed request for a team May 29, 2025 09:56
@review-notebook-app

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@CLAassistant

CLAassistant commented May 29, 2025

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added the type:documentation (Improvements on the docs) label on May 29, 2025
Comment thread haystack/components/preprocessors/test_chinese_document_spliter.py Outdated
Comment thread haystack/components/preprocessors/test_chinese_document_spliter.py Outdated
Comment thread haystack/components/preprocessors/test_chinese_document_spliter.py Outdated
@davidsbatista
Contributor

Hi @mc112611!

Thanks for your contribution, appreciated, still a few things need to be improved:

  • Please remove the notebook; if you wish, we can later integrate it into our tutorials repository
  • Please remove the file original_pipeline.png

Don't make any changes to the config.yml in the release notes; instead, add the release notes to the newly created yml file in the notes directory.

Just add to the highlights:

highlights: >
  Added support for Chinese DocumentSplitter, enabling precise splitting of Chinese documents.

Thanks!

mc112611 and others added 2 commits June 3, 2025 18:45
- Removed notebook and original_pipeline.png
- Added release note YAML file in notes/
- Reverted config.yaml
- Implemented lazy import for hanlp
- Removed main guard block from module
@mc112611
Author

mc112611 commented Jun 3, 2025


"Thanks for the feedback! I've addressed all the issues you mentioned, including:

Removed if name == "main" block.

Replaced HanLP import with LazyImport.

Moved release notes to notes/ directory and updated highlights.

Removed notebook and original_pipeline.png.

Please let me know if any further changes are needed. 😊"
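
For reference, a minimal sketch of the lazy-import change, assuming Haystack's LazyImport helper; the exact message and where the check is called are illustrative:

```python
from haystack.components.preprocessors import DocumentSplitter
from haystack.lazy_imports import LazyImport

# If hanlp is missing, the import error is deferred until the splitter is actually used.
with LazyImport(message="Run 'pip install hanlp' to use the ChineseDocumentSplitter.") as hanlp_import:
    import hanlp  # noqa: F401


class ChineseDocumentSplitter(DocumentSplitter):
    def __init__(self, *args, **kwargs) -> None:
        hanlp_import.check()  # raises with the message above if hanlp is not installed
        super().__init__(*args, **kwargs)
```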

@mc112611
Author

mc112611 commented Jun 3, 2025

Hi Haystack team,

Thank you for the review and feedback. After the PR is merged, I plan to submit the chinese-haystack-RAG.ipynb notebook to the haystack-tutorials repository. This notebook will serve as a tutorial supporting Chinese language use cases.

Looking forward to your approval.

Best regards,
mc112611

@davidsbatista
Contributor

@mc112611 thanks for the changes. I still have more changes and a detailed review to make; I will attend to it as soon as possible.

@davidsbatista
Contributor

@mc112611 I've cleaned the release notes - I will start a more detailed review now.

Can you please also add tests?

@davidsbatista
Contributor

@mc112611 make sure you run the pre-commit hooks (see here), since there are lots of issues with unused imports and code formatting.

Comment thread haystack/components/preprocessors/chinese_document_spliter.py Outdated
Comment thread haystack/components/preprocessors/chinese_document_spliter.py Outdated
@davidsbatista davidsbatista changed the title from "Feature/chinese document splitter" to "feature: Chinese DocumentSplitter" on Jun 4, 2025
@coveralls
Collaborator

coveralls commented Jun 4, 2025

Pull Request Test Coverage Report for Build 15447031649

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-1.2%) to 89.228%

Totals
  • Change from base Build 15446010796: -1.2%
  • Covered Lines: 11481
  • Relevant Lines: 12867

💛 - Coveralls

@mc112611
Author

mc112611 commented Jun 5, 2025

Hello everyone,

I have completed the following work and committed the changes:

  • Added a test script located at test/components/preprocessors/test_chinese_document_splitter.py
  • Removed Chinese comments from haystack/components/preprocessors/chinese_document_splitter.py
  • Tested ChineseDocumentSplitter thoroughly using the command:
    hatch run pytest test/components/preprocessors/test_chinese_document_splitter.py
    The tests cover (an illustrative sketch follows this list):
    • Splitting by word
    • Splitting by sentence
    • Respecting sentence boundaries
    • Overlapping window splitting

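For illustration, one of these cases might look roughly like this; the class name, parameters, and expected counts are assumptions mirroring the list above, not the actual test file:

```python
import pytest

from haystack import Document
from haystack.components.preprocessors import ChineseDocumentSplitter


@pytest.mark.integration  # downloads HanLP models on first run
def test_split_by_sentence():
    splitter = ChineseDocumentSplitter(split_by="sentence", split_length=1, split_overlap=0)
    splitter.warm_up()
    docs = [Document(content="今天天气很好。我们去公园散步。")]
    result = splitter.run(documents=docs)
    # One split per sentence, each keeping its Chinese full stop.
    assert len(result["documents"]) == 2
    assert all(d.content.endswith("。") for d in result["documents"])
```
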
Additionally, I followed the instructions in CONTRIBUTING.md and performed:

  • Ran pre-commit install
  • Encountered an error when running hatch run test:unit: ModuleNotFoundError: No module named 'hanlp'. Installing hanlp via pip install hanlp did not resolve this issue.
  • Encountered an error when running hatch run test:types due to a nested haystack folder inside the haystack directory: haystack\core\component\component.py: error: Source file found twice under different module names: "CUDA_Preject.haystack.haystack.core.component.component" and "haystack.core.component.component"

Furthermore, I have completed the following formatting and checks:

  • hatch run format-check completed
  • hatch run format completed
  • hatch run check --fix completed

Please let me know if there are any further suggestions or if I can assist in resolving the above environment issues. Thank you!


Best regards,
mc112611

@davidsbatista
Contributor

Hi @mc112611 - thanks for taking those into account.

I have improved some aspects of your PR and pushed directly to this branch, so we are working on this together. Be aware of that, and before committing any changes, ensure you have the latest version.

Thank you for adding the tests - I will run the complete CI and do another code review. I will let you know if any doubts or questions arise.

Thank you once again!

@mc112611
Author

mc112611 commented Jun 5, 2025

I initially attempted to add self to certain functions, but some methods were marked with @staticmethod, so they cannot use self. While the core functionality I tested in the script worked as expected, features like pagination were not actively maintained, so there might be room for improvement there.

@davidsbatista
Contributor

Hi @mc112611 thanks for the insights.

I need to work a bit on this PR to make it pass all the tests and CI. Please don't sync it with main; we can do that later, as syncing now raises some issues.

@julian-risch
Member

julian-risch commented Jun 5, 2025

Hello @mc112611, thank you so much for opening this PR after the initial GitHub discussion! I just talked to @davidsbatista and we agreed that we can take over from here and make the final changes ourselves before the PR can be merged. 🙂
For that purpose, David and I will continue working directly in the Haystack repository instead of your fork. Your commits are preserved in that new PR, of course, giving you credit for your great work! #9494
I will close the old PR with this comment.

David and I know what changes the two of us still want to apply before releasing the new feature. It's mainly about tests and the loading of the hanlp models.
We have only one question: are you using your ChineseDocumentSplitter with English input too, or only with Chinese? Currently, the ChineseDocumentSplitter also handles English input, for example here. We believe it's better to only handle Chinese input in the ChineseDocumentSplitter, and I am planning to update the code. Does that idea make sense to you? Or do you use your ChineseDocumentSplitter for English and Chinese input at the same time?
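
One possible shape for that restriction, purely as an illustrative sketch (the actual change may end up looking different):

```python
# Hypothetical guard: reject non-Chinese configurations instead of handling them.
SUPPORTED_LANGUAGES = {"zh"}


def validate_language(language: str) -> None:
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"ChineseDocumentSplitter only supports Chinese input (language='zh'), got '{language}'. "
            "Use DocumentSplitter for other languages."
        )
```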

Again, thank you so much and stay tuned for the next Haystack release in ~3 weeks, where we will include your ChineseDocumentSplitter! We might be able to release it even faster if we add the ChineseDocumentSplitter to haystack-core-integrations instead.

@mc112611
Author

mc112611 commented Jun 5, 2025

Hi, thank you so much for your kind message, and I really appreciate the opportunity to contribute!

In fact, I implemented the ChineseDocumentSplitter by subclassing the existing DocumentSplitter because I wanted to stay as aligned as possible with Haystack's data flow and design patterns.

As for the language handling: I agree with your suggestion that ChineseDocumentSplitter should only process Chinese input. Currently, the best practice is to use language='zh' to explicitly control this. I haven't specifically tested how it behaves with language='en', but I will look into it. It’s certainly expected that if the input is in English while language='zh' is set, the splitter would not apply any segmentation.

Regarding hanlp: it's a powerful tool for Chinese sentence and word tokenization. However, I haven't rigorously tested it on mixed-language (Chinese-English) documents. In practical applications, routing documents to the appropriate splitter based on language detection — perhaps using Haystack’s built-in language classifier — seems like a good long-term approach.
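
As a rough sketch of that routing idea, something along these lines could work; the rule syntax and socket names here are assumptions on my side:

```python
from haystack import Pipeline
from haystack.components.classifiers import DocumentLanguageClassifier
from haystack.components.preprocessors import ChineseDocumentSplitter, DocumentSplitter
from haystack.components.routers import MetadataRouter

pipeline = Pipeline()
pipeline.add_component("language_classifier", DocumentLanguageClassifier(languages=["en", "zh"]))
pipeline.add_component("router", MetadataRouter(rules={
    "en": {"field": "meta.language", "operator": "==", "value": "en"},
    "zh": {"field": "meta.language", "operator": "==", "value": "zh"},
}))
pipeline.add_component("en_splitter", DocumentSplitter(split_by="sentence", split_length=3))
pipeline.add_component("zh_splitter", ChineseDocumentSplitter(split_by="sentence", split_length=3))

pipeline.connect("language_classifier.documents", "router.documents")
pipeline.connect("router.en", "en_splitter.documents")
pipeline.connect("router.zh", "zh_splitter.documents")
```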

On a side note, I’m currently working on a Chinese RAG demo based on Haystack, which builds on the notebook I submitted earlier. Once it's cleaned up, I plan to contribute it to the Haystack tutorial repository.

Thank you again — I've learned a great deal from reading and working with the Haystack codebase. The structure has really helped me grow as a developer. If there's anything I can help with further, feel free to let me know!

@julian-risch
Member

That's very helpful, thank you. By the way, you don't need to test how the ChineseDocumentSplitter behaves with language='en'. David and I will reduce the code to the part that handles language='zh' then.
Looking forward to the Chinese RAG demo!

