feature: Chinese DocumentSplitter#9453
Conversation
|
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
|
Hi @mc112611! Thanks for your contribution, appreciated, still a few things need to be improved:
Don't do any changes on the config.yml in the release notes, instead add the release notes to the newly created yml file in the Just add to the highlights: Thanks! |
- Removed notebook and original_pipeline.png - Added release note YAML file in notes/ - Reverted config.yaml - Implemented lazy import for hanlp - Removed main guard block from module
"Thanks for the feedback! I've addressed all the issues you mentioned, including: Removed if name == "main" block. Replaced HanLP import with LazyImport. Moved release notes to notes/ directory and updated highlights. Removed notebook and original_pipeline.png. Please let me know if any further changes are needed. 😊" |
|
Hi Haystack team, Thank you for the review and feedback. After the PR is merged, I plan to submit the chinese-haystack-RAG.ipynb notebook to the haystack-tutorials repository. This notebook will serve as a tutorial supporting Chinese language use cases. Looking forward to your approval. Best regards, |
|
@mc112611 thanks for the changes, I still have more changes and detailed review to make, I will attend it as soon as possible. |
|
@mc112611 I've cleaned the release notes - I will start a more detailed review now. Can you please also add tests? |
Pull Request Test Coverage Report for Build 15447031649Details
💛 - Coveralls |
|
Hello everyone, I have completed the following work and committed the changes:
Additionally, I followed the instructions in
Furthermore, I have completed the following formatting and checks:
Please let me know if there are any further suggestions or if I can assist in resolving the above environment issues. Thank you! Best regards, |
|
Hi @mc112611 - thanks for taking those into account. I have improved some aspects of your PR and pushed directly to this branch, so we are working on this together. Be aware of that, and before committing any changes, ensure you have the latest version. Thank you for adding the tests - I will run the complete CI and do another code review. I will let you know if any doubts or questions arise. Thank you once again! |
|
I initially attempted to add self' to certain functions, but some methods were marked with '@staticmethod', making them inaccessible via 'self'. While the core functionalities I tested in the script worked as expected, features like pagination were not actively maintained, so there might be room for improvement there. |
|
Hi @mc112611 thanks for the insights. I need to work a bit on this PR to make it pass all the tests and CI - don't sync it with main, please, we can do it later, as this now raises some issues. |
|
Hello @mc112611 Thank you so much for opening this PR after the initial GitHub discussion! I just talked to @davidsbatista and we agreed that we can take over from here and make the final changes before the PR can be merged ourselves. 🙂 David and I know what changes the two of us still want to apply before releasing the new feature. It's mainly about tests and the loading of the hanlp models. Again, thank you so much and stay tuned for the next Haystack release in ~3 weeks, where we will include your ChineseDocumentSplitter! We might be able to release it even faster if we add the ChineseDocumentSplitter to haystack-core-integrations instead. |
|
Hi, thank you so much for your kind message, and I really appreciate the opportunity to contribute! In fact, I implemented the As for the language handling: I agree with your suggestion that Regarding On a side note, I’m currently working on a Chinese RAG demo based on Haystack, which builds on the notebook I submitted earlier. Once it's cleaned up, I plan to contribute it to the Haystack tutorial repository. Thank you again — I've learned a great deal from reading and working with the Haystack codebase. The structure has really helped me grow as a developer. If there's anything I can help with further, feel free to let me know! |
|
That's very helpful, thank you. By the way, you don't need to test how the |
Related Issues
No related issue. This PR originated from a community discussion about improving Chinese document splitting support.
Proposed Changes
This PR introduces a
ChineseDocumentSplitterthat supports accurate sentence and paragraph splitting for Chinese documents. It leverages the HanLP library for Chinese linguistic analysis, including sentence segmentation and tokenization.In addition, a full end-to-end example notebook (
examples/chinese_RAG_test_haystack_chinese.ipynb) is included, demonstrating how to use the splitter in a RAG pipeline for Chinese documents.How did you test it?
hatch run test:unitto ensure unit tests passNotes for the reviewer
Checklist
feat:in this case)hatch run release-note chinese-document-splitter