4 changes: 4 additions & 0 deletions docs-website/docs/pipeline-components/preprocessors.mdx
@@ -12,6 +12,10 @@ Use the PreProcessors to prepare your data normalize white spaces, remove header
| PreProcessor | Description |
| --- | --- |
| [ChineseDocumentSplitter](preprocessors/chinesedocumentsplitter.mdx) | Divides Chinese text documents into smaller chunks using advanced Chinese language processing capabilities, using HanLP for accurate Chinese word segmentation and sentence tokenization. |
| [ChonkieRecursiveDocumentSplitter](preprocessors/chonkierecursivedocumentsplitter.mdx) | Splits documents recursively using a hierarchy of rules via Chonkie's `RecursiveChunker`, applying progressively finer splits until all chunks satisfy the size constraints. |
| [ChonkieSemanticDocumentSplitter](preprocessors/chonkiesemanticdocumentsplitter.mdx) | Splits documents at semantic topic boundaries using embedding similarity via Chonkie's `SemanticChunker`, keeping related sentences together. |
| [ChonkieSentenceDocumentSplitter](preprocessors/chonkiesentencedocumentsplitter.mdx) | Splits documents into chunks that respect sentence boundaries via Chonkie's `SentenceChunker`, avoiding mid-sentence cuts. |
| [ChonkieTokenDocumentSplitter](preprocessors/chonkietokendocumentsplitter.mdx) | Splits documents into fixed-size token-based chunks via Chonkie's `TokenChunker`, supporting multiple tokenizers. |
| [CSVDocumentCleaner](preprocessors/csvdocumentcleaner.mdx) | Cleans CSV documents by removing empty rows and columns while preserving specific ignored rows and columns. |
| [CSVDocumentSplitter](preprocessors/csvdocumentsplitter.mdx) | Divides CSV documents into smaller sub-tables based on empty rows and columns. |
| [DocumentCleaner](preprocessors/documentcleaner.mdx) | Removes extra whitespaces, empty lines, specified substrings, regexes, page headers, and footers from documents. |
@@ -0,0 +1,129 @@
---
title: "ChonkieRecursiveDocumentSplitter"
id: chonkierecursivedocumentsplitter
slug: "/chonkierecursivedocumentsplitter"
description: "Use `ChonkieRecursiveDocumentSplitter` to split documents recursively using a hierarchy of rules, powered by the Chonkie library."
---

# ChonkieRecursiveDocumentSplitter

`ChonkieRecursiveDocumentSplitter` splits documents using a hierarchy of splitting rules via [Chonkie](https://docs.chonkie.ai/)'s `RecursiveChunker`.
It applies progressively finer-grained splits until all chunks satisfy the configured size constraints, making it effective for structured text like Markdown or code.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | In indexing pipelines after [Converters](../converters.mdx), before [Embedders](../embedders.mdx) |
| **Mandatory run variables** | `documents`: A list of documents |
| **Output variables** | `documents`: A list of documents |
| **API reference** | [Chonkie](/reference/integrations-chonkie) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/chonkie |

</div>

## Overview

`ChonkieRecursiveDocumentSplitter` wraps Chonkie's `RecursiveChunker` to split documents by applying splitting rules level by level.
If a chunk produced at one level still exceeds `chunk_size`, the next level's rules are applied to it.
This continues recursively until all chunks are within the size limit.
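
The level-by-level idea can be illustrated with a minimal pure-Python sketch. It is not Chonkie's implementation: it measures size in characters, uses a hypothetical `recursive_split` helper, and skips the merging of small neighbouring pieces that a real chunker would typically perform.

```python
def recursive_split(text: str, delimiters_by_level: list[list[str]], chunk_size: int) -> list[str]:
    """Split text level by level until every piece has at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text]
    if not delimiters_by_level:
        # No rules left: fall back to fixed-size character windows.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    current, remaining = delimiters_by_level[0], delimiters_by_level[1:]
    pieces = [text]
    for delimiter in current:
        next_pieces = []
        for piece in pieces:
            parts = piece.split(delimiter)
            # Re-attach the delimiter to the preceding part so no characters are lost.
            next_pieces.extend(part + delimiter for part in parts[:-1])
            next_pieces.append(parts[-1])
        pieces = [piece for piece in next_pieces if piece]

    chunks = []
    for piece in pieces:
        # Pieces that are still too large are handed to the next, finer level.
        chunks.extend(recursive_split(piece, remaining, chunk_size))
    return chunks


text = "Alpha beta.\n\nGamma delta. Epsilon zeta."
chunks = recursive_split(text, [["\n\n"], [". "]], chunk_size=15)
print(chunks)
```

In this toy run, the first paragraph already fits after the paragraph-level split, while the second is passed down to the sentence-level delimiters.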

You can customize the splitting behavior by providing `RecursiveRules` from Chonkie.
See the [Chonkie documentation](https://docs.chonkie.ai/) for details on defining custom rules.

Each output document includes the original document's metadata plus:
- `source_id`: ID of the original document
- `page_number`: Page number of the chunk within the original document
- `split_id`: Index of the chunk within the document
- `split_idx_start` / `split_idx_end`: Character offsets of the chunk in the original text
- `token_count`: Number of tokens in the chunk
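
As a small illustration of how these fields can be used downstream, the sketch below groups chunks by `source_id` and restores their order with `split_id`. The `(content, meta)` tuples are hypothetical stand-ins for output documents, so that the example runs without the integration installed.

```python
from collections import defaultdict

# Hypothetical chunk records standing in for output documents: (content, meta).
chunks = [
    ("second part of A", {"source_id": "doc-a", "split_id": 1}),
    ("only part of B", {"source_id": "doc-b", "split_id": 0}),
    ("first part of A", {"source_id": "doc-a", "split_id": 0}),
]

# Collect (split_id, content) pairs per original document.
by_source = defaultdict(list)
for content, meta in chunks:
    by_source[meta["source_id"]].append((meta["split_id"], content))

# Sort each group by split_id to recover the original chunk order.
ordered = {
    source: [content for _, content in sorted(parts)]
    for source, parts in by_source.items()
}
print(ordered)
```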

## Installation

```bash
pip install chonkie-haystack
```

## Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| `tokenizer` | `"character"` | Tokenizer to use. Common options: `"character"`, `"gpt2"`, `"cl100k_base"`. See [Chonkie docs](https://docs.chonkie.ai/) for all options. |
| `chunk_size` | `2048` | Maximum number of tokens per chunk, as counted by the configured `tokenizer` (characters with the default `"character"` tokenizer). |
| `min_characters_per_chunk` | `24` | Minimum number of characters a chunk must contain. |
| `rules` | `None` | Custom `RecursiveRules` defining the splitting hierarchy. If `None`, Chonkie's default rules are used. |
| `skip_empty_documents` | `True` | Whether to skip documents with empty content. |
| `page_break_character` | `"\f"` | Character used to detect page breaks when tracking page numbers. |
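
To make the role of `page_break_character` concrete, here is a rough sketch of one way a page number could be derived from a chunk's start offset. It assumes 1-based page numbers obtained by counting page breaks before the offset; `page_number_at` is a hypothetical helper, and the integration's actual logic may differ.

```python
def page_number_at(text: str, offset: int, page_break_character: str = "\f") -> int:
    """Return the 1-based page that contains the character at `offset`."""
    # Every page break before the offset moves the chunk one page further.
    return 1 + text.count(page_break_character, 0, offset)


text = "Page one text.\fPage two text.\fPage three."
first = page_number_at(text, 0)
second = page_number_at(text, text.index("two"))
third = page_number_at(text, text.index("three"))
print(first, second, third)
```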

## Usage

### On its own

```python
from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import (
    ChonkieRecursiveDocumentSplitter,
)

chunker = ChonkieRecursiveDocumentSplitter(chunk_size=512)
documents = [
    Document(
        content="# Introduction\n\nHaystack is a framework.\n\n## Features\n\nIt supports RAG pipelines.",
    ),
]
result = chunker.run(documents=documents)
print(result["documents"])
```

### With custom rules

```python
from chonkie.types.recursive import RecursiveLevel, RecursiveRules
from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import (
    ChonkieRecursiveDocumentSplitter,
)

rules = RecursiveRules(
    levels=[
        RecursiveLevel(delimiters=["\n\n"]),
        RecursiveLevel(delimiters=["\n"]),
        RecursiveLevel(delimiters=[". ", "! ", "? "]),
    ],
)

chunker = ChonkieRecursiveDocumentSplitter(chunk_size=256, rules=rules)
documents = [Document(content="First paragraph.\n\nSecond paragraph with more detail.")]
result = chunker.run(documents=documents)
print(result["documents"])
```

### In a pipeline

```python
from pathlib import Path

from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.preprocessors.chonkie import (
    ChonkieRecursiveDocumentSplitter,
)

document_store = InMemoryDocumentStore()

p = Pipeline()
p.add_component("converter", TextFileToDocument())
p.add_component("cleaner", DocumentCleaner())
p.add_component("splitter", ChonkieRecursiveDocumentSplitter(chunk_size=512))
p.add_component("writer", DocumentWriter(document_store=document_store))

p.connect("converter.documents", "cleaner.documents")
p.connect("cleaner.documents", "splitter.documents")
p.connect("splitter.documents", "writer.documents")

files = list(Path("path/to/your/files").glob("*.md"))
p.run({"converter": {"sources": files}})
```
@@ -0,0 +1,119 @@
---
title: "ChonkieSemanticDocumentSplitter"
id: chonkiesemanticdocumentsplitter
slug: "/chonkiesemanticdocumentsplitter"
description: "Use `ChonkieSemanticDocumentSplitter` to split documents at semantic topic boundaries using embedding similarity, powered by the Chonkie library."
---

# ChonkieSemanticDocumentSplitter

`ChonkieSemanticDocumentSplitter` splits documents at semantically meaningful boundaries using [Chonkie](https://docs.chonkie.ai/)'s `SemanticChunker`.
Rather than splitting by a fixed token count, it uses an embedding model to detect topic shifts and keeps related sentences together.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | In indexing pipelines after [Converters](../converters.mdx), before [Embedders](../embedders.mdx) |
| **Mandatory run variables** | `documents`: A list of documents |
| **Output variables** | `documents`: A list of documents |
| **API reference** | [Chonkie](/reference/integrations-chonkie) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/chonkie |

</div>

## Overview

`ChonkieSemanticDocumentSplitter` wraps Chonkie's `SemanticChunker` to produce context-aware chunks by grouping sentences with similar semantic content.
It computes embeddings for sentences and uses cosine similarity to find natural topic boundaries.
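
The core idea of thresholded similarity can be sketched in a few lines of plain Python. The 2-D vectors below are made-up stand-ins for sentence embeddings, and `split_points` is a hypothetical helper; the sketch compares only adjacent sentences and leaves out the windowing and Savitzky-Golay smoothing that `SemanticChunker` exposes through its parameters.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def split_points(vectors, threshold):
    """Return indices i where sentence i starts a new chunk."""
    return [
        i
        for i in range(1, len(vectors))
        if cosine(vectors[i - 1], vectors[i]) < threshold
    ]


# Toy "embeddings": the first two sentences point one way, the last two another.
vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
boundaries = split_points(vectors, threshold=0.5)
print(boundaries)
```

With these toy vectors, only the jump between the second and third sentence falls below the threshold, so a single split point is reported there.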

The embedding model is loaded lazily — `warm_up()` is called automatically the first time `run()` is invoked, whether inside a pipeline or standalone.
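
The lazy-loading behavior described above follows a common pattern, sketched here with a hypothetical `LazySplitterSketch` class. It is not the integration's code; the string assigned to `_model` is only a placeholder for a real model load.

```python
class LazySplitterSketch:
    """Hypothetical stand-in that defers an expensive load until first use."""

    def __init__(self, embedding_model: str) -> None:
        self.embedding_model = embedding_model
        self._model = None  # nothing is loaded at construction time
        self.load_count = 0

    def warm_up(self) -> None:
        # Load only once; later calls are no-ops.
        if self._model is None:
            self._model = f"loaded:{self.embedding_model}"  # placeholder load
            self.load_count += 1

    def run(self, documents: list[str]) -> dict:
        self.warm_up()  # the first call triggers the load automatically
        return {"documents": documents}


splitter = LazySplitterSketch("minishlab/potion-base-32M")
loads_before_run = splitter.load_count
splitter.run(["a"])
splitter.run(["b"])
print(loads_before_run, splitter.load_count)
```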

Each output document includes the original document's metadata plus:
- `source_id`: ID of the original document
- `page_number`: Page number of the chunk within the original document
- `split_id`: Index of the chunk within the document
- `split_idx_start` / `split_idx_end`: Character offsets of the chunk in the original text
- `token_count`: Number of tokens in the chunk

## Installation

```bash
pip install chonkie-haystack
```

## Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| `embedding_model` | `"minishlab/potion-base-32M"` | The embedding model used to compute sentence similarity. See [Chonkie docs](https://docs.chonkie.ai/) for supported models. |
| `threshold` | `0.8` | Cosine similarity threshold below which a sentence boundary becomes a split point. |
| `chunk_size` | `2048` | Maximum number of tokens per chunk (based on the embedding model's tokenizer). |
| `similarity_window` | `3` | Number of surrounding sentences to include when computing similarity. |
| `min_sentences_per_chunk` | `1` | Minimum number of sentences that must be included in each chunk. |
| `min_characters_per_sentence` | `24` | Minimum number of characters for a sentence to be considered valid. |
| `delim` | `None` | Custom sentence delimiters. If `None`, Chonkie's default delimiters are used. |
| `include_delim` | `"prev"` | Whether to attach the delimiter to the previous (`"prev"`) or next (`"next"`) chunk. |
| `skip_window` | `0` | Number of sentences to skip when computing similarity scores. |
| `filter_window` | `5` | Window size for the Savitzky-Golay smoothing filter applied to similarity scores. |
| `filter_polyorder` | `3` | Polynomial order for the Savitzky-Golay filter. |
| `filter_tolerance` | `0.2` | Tolerance used when filtering similarity scores. |
| `skip_empty_documents` | `True` | Whether to skip documents with empty content. |
| `page_break_character` | `"\f"` | Character used to detect page breaks when tracking page numbers. |

## Usage

### On its own

```python
from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import (
    ChonkieSemanticDocumentSplitter,
)

chunker = ChonkieSemanticDocumentSplitter(chunk_size=512, threshold=0.5)

documents = [
    Document(
        content="Haystack is an open-source framework for LLM applications. "
        "It makes building RAG pipelines easy. "
        "The Eiffel Tower is located in Paris. "
        "Paris is the capital of France.",
    ),
]
result = chunker.run(documents=documents)
print(result["documents"])
```

### In a pipeline

```python
from pathlib import Path

from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.preprocessors.chonkie import (
    ChonkieSemanticDocumentSplitter,
)

document_store = InMemoryDocumentStore()

p = Pipeline()
p.add_component("converter", TextFileToDocument())
p.add_component("cleaner", DocumentCleaner())
p.add_component(
    "splitter",
    ChonkieSemanticDocumentSplitter(chunk_size=512, threshold=0.5),
)
p.add_component("writer", DocumentWriter(document_store=document_store))

p.connect("converter.documents", "cleaner.documents")
p.connect("cleaner.documents", "splitter.documents")
p.connect("splitter.documents", "writer.documents")

files = list(Path("path/to/your/files").glob("*.txt"))
p.run({"converter": {"sources": files}})
```
@@ -0,0 +1,113 @@
---
title: "ChonkieSentenceDocumentSplitter"
id: chonkiesentencedocumentsplitter
slug: "/chonkiesentencedocumentsplitter"
description: "Use `ChonkieSentenceDocumentSplitter` to split documents into sentence-aware chunks using the Chonkie library."
---

# ChonkieSentenceDocumentSplitter

`ChonkieSentenceDocumentSplitter` splits documents into chunks that respect sentence boundaries using [Chonkie](https://docs.chonkie.ai/)'s `SentenceChunker`.
Unlike pure token splitting, it avoids cutting mid-sentence, producing more coherent chunks.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | In indexing pipelines after [Converters](../converters.mdx) and [`DocumentCleaner`](documentcleaner.mdx), before [Embedders](../embedders.mdx) |
| **Mandatory run variables** | `documents`: A list of documents |
| **Output variables** | `documents`: A list of documents |
| **API reference** | [Chonkie](/reference/integrations-chonkie) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/chonkie |

</div>

## Overview

`ChonkieSentenceDocumentSplitter` wraps Chonkie's `SentenceChunker` to split each input document into chunks whose boundaries align with sentence endings.
The chunker groups sentences together until the chunk size limit is reached.
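
A simplified view of this grouping, assuming character counts and a naive punctuation-based sentence split, might look like the sketch below. `group_sentences` is a hypothetical helper, not part of Chonkie, and it ignores options such as `chunk_overlap` and `min_sentences_per_chunk`.

```python
import re


def group_sentences(text: str, chunk_size: int) -> list[str]:
    """Greedily pack whole sentences into chunks of at most chunk_size characters."""
    # Naive split after ". ", "! " or "? ", keeping the delimiter with the previous sentence.
    sentences = [s for s in re.split(r"(?<=[.!?] )", text) if s]
    chunks, current = [], ""
    for sentence in sentences:
        # Close the current chunk if adding this sentence would exceed the limit.
        if current and len(current) + len(sentence) > chunk_size:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks


text = (
    "Haystack is an open-source framework. "
    "It helps you build LLM applications. "
    "It is written in Python."
)
chunks = group_sentences(text, chunk_size=80)
print(chunks)
```

Note that in this sketch a single sentence longer than `chunk_size` is kept whole as an oversized chunk rather than cut mid-sentence.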

Each output document includes the original document's metadata plus:
- `source_id`: ID of the original document
- `page_number`: Page number of the chunk within the original document
- `split_id`: Index of the chunk within the document
- `split_idx_start` / `split_idx_end`: Character offsets of the chunk in the original text
- `token_count`: Number of tokens in the chunk
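
One way to sanity-check the offsets is to slice the original text and compare the result with the chunk content. The sketch below assumes `split_idx_end` is exclusive, as in Python slicing, and uses hypothetical `(content, meta)` tuples in place of real output documents.

```python
original = "First sentence. Second sentence."

# Hypothetical chunk records standing in for output documents: (content, meta).
chunks = [
    ("First sentence. ", {"split_idx_start": 0, "split_idx_end": 16}),
    ("Second sentence.", {"split_idx_start": 16, "split_idx_end": 32}),
]


def offsets_match(original_text: str, content: str, meta: dict) -> bool:
    """Check that the recorded offsets select exactly the chunk content."""
    start, end = meta["split_idx_start"], meta["split_idx_end"]
    return original_text[start:end] == content


all_match = all(offsets_match(original, content, meta) for content, meta in chunks)
print(all_match)
```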

## Installation

```bash
pip install chonkie-haystack
```

## Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| `tokenizer` | `"character"` | Tokenizer to use. Common options: `"character"`, `"gpt2"`, `"cl100k_base"`. See [Chonkie docs](https://docs.chonkie.ai/) for all options. |
| `chunk_size` | `2048` | Maximum number of tokens per chunk, as counted by the configured `tokenizer` (characters with the default `"character"` tokenizer). |
| `chunk_overlap` | `0` | Number of overlapping tokens between consecutive chunks. |
| `min_sentences_per_chunk` | `1` | Minimum number of sentences that must be included in each chunk. |
| `min_characters_per_sentence` | `12` | Minimum number of characters for a sentence to be considered valid. |
| `approximate` | `False` | Whether to use approximate chunking for faster processing. |
| `delim` | `None` | Custom sentence delimiters. If `None`, Chonkie's default delimiters are used. |
| `include_delim` | `"prev"` | Whether to attach the delimiter to the previous (`"prev"`) or next (`"next"`) chunk. |
| `skip_empty_documents` | `True` | Whether to skip documents with empty content. |
| `page_break_character` | `"\f"` | Character used to detect page breaks when tracking page numbers. |

## Usage

### On its own

```python
from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import (
    ChonkieSentenceDocumentSplitter,
)

chunker = ChonkieSentenceDocumentSplitter(
    tokenizer="gpt2",
    chunk_size=512,
    chunk_overlap=0,
)
documents = [
    Document(
        content="Haystack is an open-source framework. It helps you build LLM applications.",
    ),
]
result = chunker.run(documents=documents)
print(result["documents"])
```

### In a pipeline

```python
from pathlib import Path

from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.preprocessors.chonkie import (
    ChonkieSentenceDocumentSplitter,
)

document_store = InMemoryDocumentStore()

p = Pipeline()
p.add_component("converter", TextFileToDocument())
p.add_component("cleaner", DocumentCleaner())
p.add_component(
    "splitter",
    ChonkieSentenceDocumentSplitter(tokenizer="gpt2", chunk_size=512),
)
p.add_component("writer", DocumentWriter(document_store=document_store))

p.connect("converter.documents", "cleaner.documents")
p.connect("cleaner.documents", "splitter.documents")
p.connect("splitter.documents", "writer.documents")

files = list(Path("path/to/your/files").glob("*.txt"))
p.run({"converter": {"sources": files}})
```