feat: implement chonkie integration with four chunkers #3223
sjrl merged 17 commits into deepset-ai:main.
Conversation
Hey @yugborana, thanks for opening the contribution! I left an in-depth review for one of the chunkers; if you could apply the same comments to the other chunkers where appropriate, that would be much appreciated!
@sjrl thanks for the review, that was fast! I'll take it from here.
Thanks @yugborana! Also, I discussed with the team, and we think that, to be more consistent with our existing component names, we should rename the Chonkie components to be like
@sjrl resolved all the issues. |
@sjrl now I think it looks fine |
@yugborana one more minor comment then ready to go! |
@sjrl done |
sjrl left a comment:
Thanks for the contribution!
Related Issues
Proposed Changes:
Implemented the Chonkie integration for Haystack by building four modular preprocessing components: ChonkieSemanticChunker, ChonkieRecursiveChunker, ChonkieTokenChunker, and ChonkieSentenceChunker.
Each component initializes its respective Chonkie engine with parameters such as thresholds, chunk_overlap, and custom delimiter rules. When executed in a pipeline, each component's run() method delegates the input text to the underlying Chonkie algorithm. Finally, the resulting chunks are packaged back into new Haystack Document objects, preserving the original metadata while adding chunk-level annotations such as start_index, end_index, and token_count.
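The wrapping step described above can be sketched as follows. This is a minimal, stdlib-only illustration, not the PR's actual code: the Chunk and Document dataclasses below are hypothetical stand-ins for chonkie's chunk objects and Haystack's Document, so the sketch runs without either dependency installed.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Chunk:
    # Stand-in for a Chonkie chunk: text span plus its annotations.
    text: str
    start_index: int
    end_index: int
    token_count: int

@dataclass
class Document:
    # Stand-in for haystack.Document: content plus a metadata dict.
    content: str
    meta: dict[str, Any] = field(default_factory=dict)

def chunks_to_documents(source: Document, chunks: list[Chunk]) -> list[Document]:
    """Wrap chunker output in new Documents, copying the source's
    metadata and adding the per-chunk annotations."""
    return [
        Document(
            content=c.text,
            meta={
                **source.meta,  # preserve original metadata
                "start_index": c.start_index,
                "end_index": c.end_index,
                "token_count": c.token_count,
            },
        )
        for c in chunks
    ]
```

In the real components, the chunk objects come from the configured Chonkie engine's output rather than being constructed by hand.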
How did you test it?
Unit Tests: Wrote 16 tests covering initialization, to_dict/from_dict serialization, and run() logic for all four chunkers. Ran hatch run test:unit with a 100% pass rate.
Integration Tests: N/A (Chonkie runs locally and doesn't rely on external databases or cloud APIs).
Manual Verification: Successfully executed live_test.py against a real model (minishlab/potion-base-32M) to confirm dynamic semantic splits, token overlaps, and custom delimiter extraction without using mocks.
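The to_dict/from_dict round trip that the serialization tests exercise can be sketched like this. The TokenChunkerStub class and its parameter names are hypothetical illustrations, not the PR's code; the real Haystack components typically build this dict via Haystack's serialization helpers.

```python
class TokenChunkerStub:
    """Hypothetical minimal chunker showing the serialization contract:
    to_dict captures the init parameters, from_dict reconstructs the
    component from them."""

    def __init__(self, chunk_size: int = 512, chunk_overlap: int = 0):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def to_dict(self) -> dict:
        return {
            "type": "TokenChunkerStub",
            "init_parameters": {
                "chunk_size": self.chunk_size,
                "chunk_overlap": self.chunk_overlap,
            },
        }

    @classmethod
    def from_dict(cls, data: dict) -> "TokenChunkerStub":
        # Rebuild the component purely from the serialized parameters.
        return cls(**data["init_parameters"])
```

A round-trip test then asserts that from_dict(to_dict(c)) reproduces the original configuration.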
Instructions for Manual Tests:
cd integrations/chonkie
Run unit tests: hatch run test:unit
Run the live demonstration: hatch run python live_test.py
Notes for the reviewer
Checklist
yes - I have read the contributors guidelines and the code of conduct
yes - I added unit tests and updated the docstrings
yes - I've used one of the conventional commit types for my PR title:
fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.