deepset-ai · sjrl · Apr 27, 2026
@@ -12,6 +12,10 @@ Use the PreProcessors to prepare your data normalize white spaces, remove header
 | PreProcessor | Description |
 | --- | --- |
 | [ChineseDocumentSplitter](preprocessors/chinesedocumentsplitter.mdx) | Divides Chinese text documents into smaller chunks using advanced Chinese language processing capabilities, using HanLP for accurate Chinese word segmentation and sentence tokenization. |
+| [ChonkieRecursiveDocumentSplitter](preprocessors/chonkierecursivedocumentsplitter.mdx) | Splits documents recursively using a hierarchy of rules via Chonkie's `RecursiveChunker`, applying progressively finer splits until all chunks satisfy the size constraints. |
+| [ChonkieSemanticDocumentSplitter](preprocessors/chonkiesemanticdocumentsplitter.mdx) | Splits documents at semantic topic boundaries using embedding similarity via Chonkie's `SemanticChunker`, keeping related sentences together. |
+| [ChonkieSentenceDocumentSplitter](preprocessors/chonkiesentencedocumentsplitter.mdx) | Splits documents into chunks that respect sentence boundaries via Chonkie's `SentenceChunker`, avoiding mid-sentence cuts. |
+| [ChonkieTokenDocumentSplitter](preprocessors/chonkietokendocumentsplitter.mdx) | Splits documents into fixed-size token-based chunks via Chonkie's `TokenChunker`, supporting multiple tokenizers. |
 | [CSVDocumentCleaner](preprocessors/csvdocumentcleaner.mdx) | Cleans CSV documents by removing empty rows and columns while preserving specific ignored rows and columns. |
 | [CSVDocumentSplitter](preprocessors/csvdocumentsplitter.mdx) | Divides CSV documents into smaller sub-tables based on empty rows and columns. |
 | [DocumentCleaner](preprocessors/documentcleaner.mdx) | Removes extra whitespaces, empty lines, specified substrings, regexes, page headers, and footers from documents. |

@@ -0,0 +1,129 @@
+---
+title: "ChonkieRecursiveDocumentSplitter"
+id: chonkierecursivedocumentsplitter
+slug: "/chonkierecursivedocumentsplitter"
+description: "Use `ChonkieRecursiveDocumentSplitter` to split documents recursively using a hierarchy of rules, powered by the Chonkie library."
+---
+
+# ChonkieRecursiveDocumentSplitter
+
+`ChonkieRecursiveDocumentSplitter` splits documents using a hierarchy of splitting rules via [Chonkie](https://docs.chonkie.ai/)'s `RecursiveChunker`.
+It applies progressively finer-grained splits until all chunks satisfy the configured size constraints, making it effective for structured text like Markdown or code.
+
+<div className="key-value-table">
+
+|  |  |
+| --- | --- |
+| **Most common position in a pipeline** | In indexing pipelines after [Converters](../converters.mdx), before [Embedders](../embedders.mdx) |
+| **Mandatory run variables**            | `documents`: A list of documents |
+| **Output variables**                   | `documents`: A list of documents |
+| **API reference**                      | [Chonkie](/reference/integrations-chonkie) |
+| **GitHub link**                        | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/chonkie |
+
+</div>
+
+## Overview
+
+`ChonkieRecursiveDocumentSplitter` wraps Chonkie's `RecursiveChunker` to split documents by applying splitting rules level by level.
+If a chunk produced at one level still exceeds `chunk_size`, the next level's rules are applied to it.
+This continues recursively until all chunks are within the size limit.
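
The level-by-level behavior described above can be illustrated with a small standalone sketch. It is a simplified illustration in plain Python, not Chonkie's `RecursiveChunker` implementation: the helper `recursive_split` is hypothetical, sizes are counted in characters, delimiters are dropped from the output, and small pieces are not merged back together.

```python
def recursive_split(text, delimiters, chunk_size):
    """Split with the coarsest delimiter first; recurse with finer ones on oversized pieces."""
    # Stop when the piece fits, or when no finer delimiter is left to try.
    if len(text) <= chunk_size or not delimiters:
        return [text]
    first, rest = delimiters[0], delimiters[1:]
    chunks = []
    for piece in text.split(first):
        if not piece:
            continue  # skip empty pieces produced by consecutive delimiters
        chunks.extend(recursive_split(piece, rest, chunk_size))
    return chunks


text = "Intro paragraph.\n\nA much longer paragraph. It has two sentences."
# Level 1 splits on blank lines, level 2 on sentence ends.
chunks = recursive_split(text, ["\n\n", ". "], chunk_size=30)
print(chunks)
```

In this sketch the first paragraph already fits within 30 characters and is kept whole, while the second paragraph is passed to the next level and split at the sentence boundary.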
+
+You can customize the splitting behavior by providing `RecursiveRules` from Chonkie.
+See the [Chonkie documentation](https://docs.chonkie.ai/) for details on defining custom rules.
+
+Each output document includes the original document's metadata plus:
+- `source_id`: ID of the original document
+- `page_number`: Page number of the chunk within the original document
+- `split_id`: Index of the chunk within the document
+- `split_idx_start` / `split_idx_end`: Character offsets of the chunk in the original text
+- `token_count`: Number of tokens in the chunk
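
As a rough sketch of how these fields can be used downstream, the snippet below regroups chunks by `source_id` and orders them by `split_id`. The metadata dictionaries are hypothetical values shaped like the fields listed above, not output captured from the component.

```python
from collections import defaultdict

# Hypothetical metadata dicts, shaped like the fields listed above.
chunk_metas = [
    {"source_id": "doc-a", "split_id": 1, "token_count": 40},
    {"source_id": "doc-b", "split_id": 0, "token_count": 25},
    {"source_id": "doc-a", "split_id": 0, "token_count": 55},
]

# Group chunks by the document they came from.
by_source = defaultdict(list)
for meta in chunk_metas:
    by_source[meta["source_id"]].append(meta)

# Restore the original chunk order within each source document.
for metas in by_source.values():
    metas.sort(key=lambda m: m["split_id"])

order = {src: [m["split_id"] for m in metas] for src, metas in by_source.items()}
totals = {src: sum(m["token_count"] for m in metas) for src, metas in by_source.items()}
print(order)
print(totals)
```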
+
+## Installation
+
+```bash
+pip install chonkie-haystack
+```
+
+## Configuration
+
+| Parameter | Default | Description |
+| --- | --- | --- |
+| `tokenizer` | `"character"` | Tokenizer to use. Common options: `"character"`, `"gpt2"`, `"cl100k_base"`. See [Chonkie docs](https://docs.chonkie.ai/) for all options. |
+| `chunk_size` | `2048` | Maximum number of tokens per chunk, as counted by `tokenizer`. With the default `"character"` tokenizer, this is a character count. |
+| `min_characters_per_chunk` | `24` | Minimum number of characters a chunk must contain. |
+| `rules` | `None` | Custom `RecursiveRules` defining the splitting hierarchy. If `None`, Chonkie's default rules are used. |
+| `skip_empty_documents` | `True` | Whether to skip documents with empty content. |
+| `page_break_character` | `"\f"` | Character used to detect page breaks when tracking page numbers. |
+
+## Usage
+
+### On its own
+
+```python
+from haystack import Document
+from haystack_integrations.components.preprocessors.chonkie import (
+    ChonkieRecursiveDocumentSplitter,
+)
+
+chunker = ChonkieRecursiveDocumentSplitter(chunk_size=512)
+documents = [
+    Document(
+        content="# Introduction\n\nHaystack is a framework.\n\n## Features\n\nIt supports RAG pipelines.",
+    ),
+]
+result = chunker.run(documents=documents)
+print(result["documents"])
+```
+
+### With custom rules
+
+```python
+from chonkie.types.recursive import RecursiveLevel, RecursiveRules
+from haystack import Document
+from haystack_integrations.components.preprocessors.chonkie import (
+    ChonkieRecursiveDocumentSplitter,
+)
+
+rules = RecursiveRules(
+    levels=[
+        RecursiveLevel(delimiters=["\n\n"]),
+        RecursiveLevel(delimiters=["\n"]),
+        RecursiveLevel(delimiters=[". ", "! ", "? "]),
+    ],
+)
+
+chunker = ChonkieRecursiveDocumentSplitter(chunk_size=256, rules=rules)
+documents = [Document(content="First paragraph.\n\nSecond paragraph with more detail.")]
+result = chunker.run(documents=documents)
+print(result["documents"])
+```
+
+### In a pipeline
+
+```python
+from pathlib import Path
+
+from haystack import Pipeline
+from haystack.components.converters import TextFileToDocument
+from haystack.components.preprocessors import DocumentCleaner
+from haystack.components.writers import DocumentWriter
+from haystack.document_stores.in_memory import InMemoryDocumentStore
+from haystack_integrations.components.preprocessors.chonkie import (
+    ChonkieRecursiveDocumentSplitter,
+)
+
+document_store = InMemoryDocumentStore()
+
+p = Pipeline()
+p.add_component("converter", TextFileToDocument())
+p.add_component("cleaner", DocumentCleaner())
+p.add_component("splitter", ChonkieRecursiveDocumentSplitter(chunk_size=512))
+p.add_component("writer", DocumentWriter(document_store=document_store))
+
+p.connect("converter.documents", "cleaner.documents")
+p.connect("cleaner.documents", "splitter.documents")
+p.connect("splitter.documents", "writer.documents")
+
+files = list(Path("path/to/your/files").glob("*.md"))
+p.run({"converter": {"sources": files}})
+```
@@ -0,0 +1,119 @@
+---
+title: "ChonkieSemanticDocumentSplitter"
+id: chonkiesemanticdocumentsplitter
+slug: "/chonkiesemanticdocumentsplitter"
+description: "Use `ChonkieSemanticDocumentSplitter` to split documents at semantic topic boundaries using embedding similarity, powered by the Chonkie library."
+---
+
+# ChonkieSemanticDocumentSplitter
+
+`ChonkieSemanticDocumentSplitter` splits documents at semantically meaningful boundaries using [Chonkie](https://docs.chonkie.ai/)'s `SemanticChunker`.
+Rather than splitting by a fixed token count, it uses an embedding model to detect topic shifts and keeps related sentences together.
+
+<div className="key-value-table">
+
+|  |  |
+| --- | --- |
+| **Most common position in a pipeline** | In indexing pipelines after [Converters](../converters.mdx), before [Embedders](../embedders.mdx) |
+| **Mandatory run variables**            | `documents`: A list of documents |
+| **Output variables**                   | `documents`: A list of documents |
+| **API reference**                      | [Chonkie](/reference/integrations-chonkie) |
+| **GitHub link**                        | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/chonkie |
+
+</div>
+
+## Overview
+
+`ChonkieSemanticDocumentSplitter` wraps Chonkie's `SemanticChunker` to produce context-aware chunks by grouping sentences with similar semantic content.
+It computes embeddings for sentences and uses cosine similarity to find natural topic boundaries.
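
The idea of using cosine similarity to find topic boundaries can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Chonkie's `SemanticChunker`: the helpers `cosine` and `split_points` are hypothetical, the vectors are made-up 2-D stand-ins for sentence embeddings, and the sketch only compares adjacent sentences, ignoring the similarity window, smoothing filter, and chunk size limit.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def split_points(embeddings, threshold):
    """Return each index i where sentence i starts a new chunk."""
    return [
        i
        for i in range(1, len(embeddings))
        if cosine(embeddings[i - 1], embeddings[i]) < threshold
    ]


# Toy vectors: the first two sentences point one way, the last two another.
embeddings = [(1.0, 0.0), (0.9, 0.1), (0.1, 0.9), (0.0, 1.0)]
points = split_points(embeddings, threshold=0.5)
print(points)
```

With these toy vectors, only the similarity between the second and third sentence falls below the threshold, so the sketch places a single boundary there.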
+
+The embedding model is loaded lazily: `warm_up()` is called automatically the first time `run()` is invoked, whether inside a pipeline or standalone.
+
+Each output document includes the original document's metadata plus:
+- `source_id`: ID of the original document
+- `page_number`: Page number of the chunk within the original document
+- `split_id`: Index of the chunk within the document
+- `split_idx_start` / `split_idx_end`: Character offsets of the chunk in the original text
+- `token_count`: Number of tokens in the chunk
+
+## Installation
+
+```bash
+pip install chonkie-haystack
+```
+
+## Configuration
+
+| Parameter | Default | Description |
+| --- | --- | --- |
+| `embedding_model` | `"minishlab/potion-base-32M"` | The embedding model used to compute sentence similarity. See [Chonkie docs](https://docs.chonkie.ai/) for supported models. |
+| `threshold` | `0.8` | Cosine similarity threshold below which a sentence boundary becomes a split point. |
+| `chunk_size` | `2048` | Maximum number of tokens per chunk (based on the embedding model's tokenizer). |
+| `similarity_window` | `3` | Number of surrounding sentences to include when computing similarity. |
+| `min_sentences_per_chunk` | `1` | Minimum number of sentences that must be included in each chunk. |
+| `min_characters_per_sentence` | `24` | Minimum number of characters for a sentence to be considered valid. |
+| `delim` | `None` | Custom sentence delimiters. If `None`, Chonkie's default delimiters are used. |
+| `include_delim` | `"prev"` | Whether to attach the delimiter to the previous (`"prev"`) or next (`"next"`) chunk. |
+| `skip_window` | `0` | Number of sentences to skip when computing similarity scores. |
+| `filter_window` | `5` | Window size for the Savitzky-Golay smoothing filter applied to similarity scores. |
+| `filter_polyorder` | `3` | Polynomial order for the Savitzky-Golay filter. |
+| `filter_tolerance` | `0.2` | Tolerance used when filtering similarity scores. |
+| `skip_empty_documents` | `True` | Whether to skip documents with empty content. |
+| `page_break_character` | `"\f"` | Character used to detect page breaks when tracking page numbers. |
+
+## Usage
+
+### On its own
+
+```python
+from haystack import Document
+from haystack_integrations.components.preprocessors.chonkie import (
+    ChonkieSemanticDocumentSplitter,
+)
+
+chunker = ChonkieSemanticDocumentSplitter(chunk_size=512, threshold=0.5)
+
+documents = [
+    Document(
+        content="Haystack is an open-source framework for LLM applications. "
+        "It makes building RAG pipelines easy. "
+        "The Eiffel Tower is located in Paris. "
+        "Paris is the capital of France.",
+    ),
+]
+result = chunker.run(documents=documents)
+print(result["documents"])
+```
+
+### In a pipeline
+
+```python
+from pathlib import Path
+
+from haystack import Pipeline
+from haystack.components.converters import TextFileToDocument
+from haystack.components.preprocessors import DocumentCleaner
+from haystack.components.writers import DocumentWriter
+from haystack.document_stores.in_memory import InMemoryDocumentStore
+from haystack_integrations.components.preprocessors.chonkie import (
+    ChonkieSemanticDocumentSplitter,
+)
+
+document_store = InMemoryDocumentStore()
+
+p = Pipeline()
+p.add_component("converter", TextFileToDocument())
+p.add_component("cleaner", DocumentCleaner())
+p.add_component(
+    "splitter",
+    ChonkieSemanticDocumentSplitter(chunk_size=512, threshold=0.5),
+)
+p.add_component("writer", DocumentWriter(document_store=document_store))
+
+p.connect("converter.documents", "cleaner.documents")
+p.connect("cleaner.documents", "splitter.documents")
+p.connect("splitter.documents", "writer.documents")
+
+files = list(Path("path/to/your/files").glob("*.txt"))
+p.run({"converter": {"sources": files}})
+```
@@ -0,0 +1,113 @@
+---
+title: "ChonkieSentenceDocumentSplitter"
+id: chonkiesentencedocumentsplitter
+slug: "/chonkiesentencedocumentsplitter"
+description: "Use `ChonkieSentenceDocumentSplitter` to split documents into sentence-aware chunks using the Chonkie library."
+---
+
+# ChonkieSentenceDocumentSplitter
+
+`ChonkieSentenceDocumentSplitter` splits documents into chunks that respect sentence boundaries using [Chonkie](https://docs.chonkie.ai/)'s `SentenceChunker`.
+Unlike pure token splitting, it avoids cutting mid-sentence, producing more coherent chunks.
+
+<div className="key-value-table">
+
+|  |  |
+| --- | --- |
+| **Most common position in a pipeline** | In indexing pipelines after [Converters](../converters.mdx) and [`DocumentCleaner`](documentcleaner.mdx), before [Embedders](../embedders.mdx) |
+| **Mandatory run variables**            | `documents`: A list of documents |
+| **Output variables**                   | `documents`: A list of documents |
+| **API reference**                      | [Chonkie](/reference/integrations-chonkie) |
+| **GitHub link**                        | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/chonkie |
+
+</div>
+
+## Overview
+
+`ChonkieSentenceDocumentSplitter` wraps Chonkie's `SentenceChunker` to split each input document into chunks whose boundaries align with sentence endings.
+The chunker groups sentences together until the chunk size limit is reached.
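
The grouping step can be pictured with a short standalone sketch. It is an illustration only, not Chonkie's `SentenceChunker`: the helper `pack_sentences` is hypothetical, sentences are detected with a simple regular expression, sizes are counted in characters, and overlap is not handled.

```python
import re


def pack_sentences(text, chunk_size):
    """Greedily group whole sentences into chunks of at most chunk_size characters."""
    # Split after ., ! or ? followed by whitespace, keeping the punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > chunk_size:
            # Adding this sentence would exceed the limit: close the chunk.
            chunks.append(" ".join(current))
            current = [sentence]
        else:
            current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "Haystack is a framework. It builds pipelines. Chunks stay whole."
chunks = pack_sentences(text, chunk_size=50)
print(chunks)
```

Here the first two sentences fit together within 50 characters, and the third starts a new chunk instead of being cut in the middle.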
+
+Each output document includes the original document's metadata plus:
+- `source_id`: ID of the original document
+- `page_number`: Page number of the chunk within the original document
+- `split_id`: Index of the chunk within the document
+- `split_idx_start` / `split_idx_end`: Character offsets of the chunk in the original text
+- `token_count`: Number of tokens in the chunk
+
+## Installation
+
+```bash
+pip install chonkie-haystack
+```
+
+## Configuration
+
+| Parameter | Default | Description |
+| --- | --- | --- |
+| `tokenizer` | `"character"` | Tokenizer to use. Common options: `"character"`, `"gpt2"`, `"cl100k_base"`. See [Chonkie docs](https://docs.chonkie.ai/) for all options. |
+| `chunk_size` | `2048` | Maximum number of tokens per chunk, as counted by `tokenizer`. With the default `"character"` tokenizer, this is a character count. |
+| `chunk_overlap` | `0` | Number of overlapping tokens between consecutive chunks. |
+| `min_sentences_per_chunk` | `1` | Minimum number of sentences that must be included in each chunk. |
+| `min_characters_per_sentence` | `12` | Minimum number of characters for a sentence to be considered valid. |
+| `approximate` | `False` | Whether to use approximate chunking for faster processing. |
+| `delim` | `None` | Custom sentence delimiters. If `None`, Chonkie's default delimiters are used. |
+| `include_delim` | `"prev"` | Whether to attach the delimiter to the previous (`"prev"`) or next (`"next"`) chunk. |
+| `skip_empty_documents` | `True` | Whether to skip documents with empty content. |
+| `page_break_character` | `"\f"` | Character used to detect page breaks when tracking page numbers. |
+
+## Usage
+
+### On its own
+
+```python
+from haystack import Document
+from haystack_integrations.components.preprocessors.chonkie import (
+    ChonkieSentenceDocumentSplitter,
+)
+
+chunker = ChonkieSentenceDocumentSplitter(
+    tokenizer="gpt2",
+    chunk_size=512,
+    chunk_overlap=0,
+)
+documents = [
+    Document(
+        content="Haystack is an open-source framework. It helps you build LLM applications.",
+    ),
+]
+result = chunker.run(documents=documents)
+print(result["documents"])
+```
+
+### In a pipeline
+
+```python
+from pathlib import Path
+
+from haystack import Pipeline
+from haystack.components.converters import TextFileToDocument
+from haystack.components.preprocessors import DocumentCleaner
+from haystack.components.writers import DocumentWriter
+from haystack.document_stores.in_memory import InMemoryDocumentStore
+from haystack_integrations.components.preprocessors.chonkie import (
+    ChonkieSentenceDocumentSplitter,
+)
+
+document_store = InMemoryDocumentStore()
+
+p = Pipeline()
+p.add_component("converter", TextFileToDocument())
+p.add_component("cleaner", DocumentCleaner())
+p.add_component(
+    "splitter",
+    ChonkieSentenceDocumentSplitter(tokenizer="gpt2", chunk_size=512),
+)
+p.add_component("writer", DocumentWriter(document_store=document_store))
+
+p.connect("converter.documents", "cleaner.documents")
+p.connect("cleaner.documents", "splitter.documents")
+p.connect("splitter.documents", "writer.documents")
+
+files = list(Path("path/to/your/files").glob("*.txt"))
+p.run({"converter": {"sources": files}})
+```