feat: implement chonkie integration with four chunkers #3223

Merged
sjrl merged 17 commits into deepset-ai:main from yugborana:feat/chonkie-integration on Apr 27, 2026

@yugborana
Contributor

Related Issues

Proposed Changes:

Implemented the Chonkie integration for Haystack by building four modular preprocessing components: ChonkieSemanticChunker, ChonkieRecursiveChunker, ChonkieTokenChunker, and ChonkieSentenceChunker.

Each component initializes its respective Chonkie engine with its parameters (such as thresholds, chunk_overlap, and custom delimiter rules). When run in a pipeline via run(), the component delegates the input text to the underlying Chonkie algorithm. It then packages the resulting chunks into new Haystack Document objects, preserving the original metadata and adding the annotations start_index, end_index, and token_count.
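
For illustration only, the delegate-then-repackage flow described above might look roughly like the sketch below. It deliberately avoids importing Haystack or Chonkie: `Document`, `Chunk`, `WhitespaceTokenChunker`, and `ToyDocumentSplitter` are hypothetical stand-ins invented for this example, not the classes added in this PR, and the whitespace tokenizer is a simplification of what a real chunker does.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Document:
    """Hypothetical stand-in for a Haystack Document (content + meta only)."""
    content: str
    meta: dict[str, Any] = field(default_factory=dict)


@dataclass
class Chunk:
    """Hypothetical stand-in for the chunk objects a chunking engine returns."""
    text: str
    start_index: int
    end_index: int
    token_count: int


class WhitespaceTokenChunker:
    """Toy engine: fixed-size windows of whitespace tokens with overlap."""

    def __init__(self, chunk_size: int = 4, chunk_overlap: int = 1) -> None:
        if not 0 <= chunk_overlap < chunk_size:
            raise ValueError("chunk_overlap must be in [0, chunk_size)")
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def chunk(self, text: str) -> list[Chunk]:
        # Character span (start, end) of every whitespace-delimited token.
        spans = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
        step = self.chunk_size - self.chunk_overlap
        chunks: list[Chunk] = []
        for i in range(0, len(spans), step):
            window = spans[i:i + self.chunk_size]
            start, end = window[0][0], window[-1][1]
            chunks.append(Chunk(text[start:end], start, end, len(window)))
            if i + self.chunk_size >= len(spans):
                break  # the last window already reached the end of the text
        return chunks


class ToyDocumentSplitter:
    """Delegates to the engine, then repackages chunks as new Documents."""

    def __init__(self, chunker: WhitespaceTokenChunker) -> None:
        self.chunker = chunker

    def run(self, documents: list[Document]) -> dict[str, list[Document]]:
        out: list[Document] = []
        for doc in documents:
            for chunk in self.chunker.chunk(doc.content):
                # Copy the original metadata, then add the chunk annotations.
                meta = {
                    **doc.meta,
                    "start_index": chunk.start_index,
                    "end_index": chunk.end_index,
                    "token_count": chunk.token_count,
                }
                out.append(Document(content=chunk.text, meta=meta))
        return {"documents": out}


splitter = ToyDocumentSplitter(WhitespaceTokenChunker(chunk_size=4, chunk_overlap=1))
result = splitter.run([Document("a b c d e f g", meta={"source": "demo.txt"})])
for d in result["documents"]:
    print(d.content, d.meta)
```

The point of the sketch is the metadata handling: each output document starts from a copy of the source document's `meta`, so the original fields survive and the source document is not mutated.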

How did you test it?

Unit Tests: Built 16 tests covering initialization, to_dict/from_dict serialization, and run() logic for all 4 chunkers. Ran hatch run test:unit and achieved a 100% pass rate.
Integration Tests: N/A (Chonkie runs locally and doesn't rely on external databases or cloud APIs).
Manual Verification: Successfully executed live_test.py on real models (minishlab/potion-base-32M) to confirm dynamic semantic splits, token overlaps, and custom delimiter extraction without using mocks.
Instructions for Manual Tests:
cd integrations/chonkie
Run unit tests: hatch run test:unit
Run the live demonstration: hatch run python live_test.py
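
As a rough idea of what a to_dict/from_dict round-trip test of the kind listed above can check, here is a minimal sketch. `ToyTokenSplitter` is a hypothetical class written for this example, and the `type` / `init_parameters` layout is assumed to mirror the usual Haystack serialization shape; the actual tests in this PR may be structured differently.

```python
from __future__ import annotations

from typing import Any


class ToyTokenSplitter:
    """Hypothetical component; only the serialization surface is shown."""

    def __init__(self, chunk_size: int = 512, chunk_overlap: int = 0) -> None:
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def to_dict(self) -> dict[str, Any]:
        cls = type(self)
        return {
            # Fully qualified class name, so a loader can find the class again.
            "type": f"{cls.__module__}.{cls.__name__}",
            "init_parameters": {
                "chunk_size": self.chunk_size,
                "chunk_overlap": self.chunk_overlap,
            },
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "ToyTokenSplitter":
        # Rebuild the component from its recorded constructor arguments.
        return cls(**data["init_parameters"])


def test_round_trip() -> None:
    original = ToyTokenSplitter(chunk_size=128, chunk_overlap=16)
    restored = ToyTokenSplitter.from_dict(original.to_dict())
    # Serializing the restored component must give the same dictionary.
    assert restored.to_dict() == original.to_dict()
    assert restored.chunk_size == 128
    assert restored.chunk_overlap == 16


test_round_trip()
```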

Notes for the reviewer

Checklist

  • yes - I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • yes - I added unit tests and updated the docstrings
  • yes - I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.

@yugborana yugborana requested a review from a team as a code owner April 24, 2026 04:15
@yugborana yugborana requested review from sjrl and removed request for a team April 24, 2026 04:15
@github-actions github-actions Bot added the topic:CI and type:documentation (Improvements or additions to documentation) labels Apr 24, 2026
@socket-security

socket-security Bot commented Apr 24, 2026

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
| --- | --- | --- | --- | --- | --- | --- |
| Added | chonkie@1.6.4 | 99 | 100 | 100 | 100 | 100 |

View full report

Comment thread .github/workflows/chonkie.yml
Comment thread integrations/chonkie/tests/test_integration.py Outdated
Comment thread integrations/chonkie/tests/test_recursive_splitter.py
Comment thread integrations/chonkie/tests/test_recursive_splitter.py
Comment thread integrations/chonkie/tests/test_recursive_chunker.py Outdated
Comment thread integrations/chonkie/tests/test_recursive_chunker.py Outdated
Comment thread integrations/chonkie/tests/test_recursive_splitter.py
@sjrl
Contributor

sjrl commented Apr 24, 2026

Hey @yugborana, thanks for opening the contribution! I left an in-depth review for one of the chunkers. If you could also apply the same comments to the other chunkers where appropriate, that would be much appreciated!

Comment thread integrations/chonkie/chonkie.md Outdated
@yugborana
Contributor Author

@sjrl thanks for the review. It was pretty fast. I will take it forward from here...

@sjrl
Contributor

sjrl commented Apr 24, 2026

Thanks @yugborana!

I also discussed this with the team, and to be more consistent with our existing component names we think the Chonkie components should be renamed, e.g. ChonkieRecursiveChunker --> ChonkieRecursiveDocumentSplitter. We tend to use Splitter rather than Chunker in our library (it is fine to call it chunking in the docs), and we like to be explicit when the input/output type is a Document by putting it in the component name. If you could update the names of all components, that would be great.

@yugborana
Contributor Author

@sjrl resolved all the issues.

Comment thread integrations/chonkie/tests/test_recursive_chunker.py Outdated
Comment thread integrations/chonkie/tests/test_recursive_chunker.py Outdated
@yugborana
Contributor Author

@sjrl now I think it looks fine

@yugborana
Contributor Author

@sjrl

@sjrl
Contributor

sjrl commented Apr 27, 2026

@yugborana one more minor comment then ready to go!

@yugborana
Contributor Author

@sjrl done

Contributor

@sjrl sjrl left a comment

Thanks for the contribution!

@sjrl sjrl merged commit 702983d into deepset-ai:main Apr 27, 2026
18 checks passed
@yugborana yugborana deleted the feat/chonkie-integration branch April 27, 2026 08:12

Labels

integration:chonkie, topic:CI, type:documentation (Improvements or additions to documentation)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add new Chonkie integration with RecursiveChunker, SemanticChunker and others

2 participants