Commit 8f04c70

committed
issues
1 parent a3f77e8 commit 8f04c70

3 files changed

Lines changed: 361 additions & 3 deletions

.github/workflows/chonkie.yml

Lines changed: 2 additions & 2 deletions
@@ -44,8 +44,8 @@ jobs:
     steps:
       - id: set
         run: |
-          echo 'os=${{ github.event_name == 'push' && '["ubuntu-latest"]' || env.TEST_MATRIX_OS }}' >> $GITHUB_OUTPUT
-          echo 'python-version=${{ github.event_name == 'push' && '["3.10"]' || env.TEST_MATRIX_PYTHON }}' >> $GITHUB_OUTPUT
+          echo 'os=${{ github.event_name == 'push' && '["ubuntu-latest"]' || env.TEST_MATRIX_OS }}' >> "$GITHUB_OUTPUT"
+          echo 'python-version=${{ github.event_name == 'push' && '["3.10"]' || env.TEST_MATRIX_PYTHON }}' >> "$GITHUB_OUTPUT"
 
   run:
     name: Python ${{ matrix.python-version }} on ${{ startsWith(matrix.os, 'macos-') && 'macOS' || startsWith(matrix.os, 'windows-') && 'Windows' || 'Linux' }}
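
The `set` step in this hunk picks a reduced test matrix on `push` events and falls back to the environment-provided matrix otherwise. A minimal Python sketch of that selection, for illustration only: the function name `matrix_outputs` and the sample environment values are hypothetical and are not part of the workflow.

```python
def matrix_outputs(event_name: str, env: dict[str, str]) -> list[str]:
    """Return the name=value lines that the step appends to $GITHUB_OUTPUT."""
    is_push = event_name == "push"
    # Mirrors `cond && a || b`: use the fixed value on push, else the env value.
    os_matrix = '["ubuntu-latest"]' if is_push else env["TEST_MATRIX_OS"]
    python_matrix = '["3.10"]' if is_push else env["TEST_MATRIX_PYTHON"]
    return [f"os={os_matrix}", f"python-version={python_matrix}"]


# Hypothetical environment values, chosen only to exercise the fallback branch.
sample_env = {
    "TEST_MATRIX_OS": '["ubuntu-latest", "windows-latest"]',
    "TEST_MATRIX_PYTHON": '["3.10", "3.13"]',
}
push_lines = matrix_outputs("push", sample_env)
pr_lines = matrix_outputs("pull_request", sample_env)
```

The diff itself only adds quotes around `$GITHUB_OUTPUT`, so the selected values are the same before and after the commit.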

integrations/chonkie/chonkie.md

Lines changed: 355 additions & 0 deletions
@@ -0,0 +1,355 @@
---
title: "Chonkie"
id: integrations-chonkie
description: "Chonkie integration for Haystack"
slug: "/integrations-chonkie"
---


## haystack_integrations.components.preprocessors.chonkie.recursive_chunker

### ChonkieRecursiveChunker

A Document Splitter that uses Chonkie's RecursiveChunker to split documents.

Usage:

```python
from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import ChonkieRecursiveChunker

chunker = ChonkieRecursiveChunker(chunk_size=512)
documents = [Document(content="Hello world. This is a test.")]
result = chunker.run(documents=documents)
print(result["documents"])
```

#### __init__

```python
__init__(
    tokenizer: str = "character",
    chunk_size: int = 2048,
    min_characters_per_chunk: int = 24,
    rules: Any = None,
) -> None
```

Initializes the ChonkieRecursiveChunker.

**Parameters:**

- **tokenizer** (<code>str</code>) – The tokenizer to use for chunking. Defaults to "character".
- **chunk_size** (<code>int</code>) – The maximum size of each chunk.
- **min_characters_per_chunk** (<code>int</code>) – The minimum number of characters per chunk.
- **rules** (<code>Any</code>) – Custom rules for recursive chunking. If None, default rules are used.
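
To make the idea of recursive chunking concrete, here is a minimal sketch in plain Python. It is not Chonkie's implementation: the function name `recursive_split` and the separator hierarchy are assumptions for illustration, sizes are counted in characters (in the spirit of the "character" tokenizer default), and `min_characters_per_chunk` is ignored.

```python
def recursive_split(
    text: str,
    chunk_size: int,
    separators: tuple[str, ...] = ("\n\n", ". ", " "),
) -> list[str]:
    """Split text into pieces of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    head, *rest = separators
    parts = text.split(head)
    # Re-attach the separator to every part except the last so no text is lost.
    parts = [p + head for p in parts[:-1]] + [parts[-1]]

    chunks: list[str] = []
    current = ""
    for part in parts:
        if len(current) + len(part) <= chunk_size:
            current += part  # still fits: keep growing the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # The part alone is too large: retry with the next, finer separator.
            chunks.extend(recursive_split(part, chunk_size, tuple(rest)))
    if current:
        chunks.append(current)
    return chunks


sample = "Hello world. This is a test. " * 5
pieces = recursive_split(sample, chunk_size=40)
```

Joining `pieces` reproduces `sample` exactly, and every piece is at most 40 characters long.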

#### run

```python
run(documents: list[Document]) -> dict[str, Any]
```

Splits a list of documents into smaller chunks.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – The list of documents to split.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the "documents" key containing the list of chunks.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> ChonkieRecursiveChunker
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>ChonkieRecursiveChunker</code> – Deserialized component.
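
The `to_dict` / `from_dict` pair follows the usual Haystack component convention of storing a type path plus the constructor arguments. The sketch below shows that round trip with a stand-in class and no dependencies. `SketchChunker` is hypothetical, and how the real chunkers build their dictionaries internally is an assumption here, not something this page states.

```python
from typing import Any


class SketchChunker:
    """Hypothetical stand-in for a chunker component; not part of the integration."""

    def __init__(
        self, tokenizer: str = "character", chunk_size: int = 2048, chunk_overlap: int = 0
    ) -> None:
        self.tokenizer = tokenizer
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def to_dict(self) -> dict[str, Any]:
        # Store the import path and the constructor arguments, nothing else.
        return {
            "type": f"{type(self).__module__}.{type(self).__name__}",
            "init_parameters": {
                "tokenizer": self.tokenizer,
                "chunk_size": self.chunk_size,
                "chunk_overlap": self.chunk_overlap,
            },
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "SketchChunker":
        # Rebuild the component by replaying the stored constructor arguments.
        return cls(**data.get("init_parameters", {}))


serialized = SketchChunker(chunk_size=512).to_dict()
restored = SketchChunker.from_dict(serialized)
```

Because only constructor arguments are stored, a restored component is configured identically to the original but shares no runtime state with it.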

## haystack_integrations.components.preprocessors.chonkie.semantic_chunker

### ChonkieSemanticChunker

A Document Splitter that uses Chonkie's SemanticChunker to split documents.

Usage:

```python
from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import ChonkieSemanticChunker

chunker = ChonkieSemanticChunker(chunk_size=512)
documents = [Document(content="Hello world. This is a test.")]
result = chunker.run(documents=documents)
print(result["documents"])
```

#### __init__

```python
__init__(
    embedding_model: Any = "minishlab/potion-base-32M",
    threshold: float = 0.8,
    chunk_size: int = 2048,
    similarity_window: int = 3,
    min_sentences_per_chunk: int = 1,
    min_characters_per_sentence: int = 24,
    delim: Any = None,
    include_delim: str = "prev",
    skip_window: int = 0,
    filter_window: int = 5,
    filter_polyorder: int = 3,
    filter_tolerance: float = 0.2,
) -> None
```

Initializes the ChonkieSemanticChunker.

**Parameters:**

- **embedding_model** (<code>Any</code>) – The embedding model to use for semantic similarity.
- **threshold** (<code>float</code>) – The semantic similarity threshold.
- **chunk_size** (<code>int</code>) – The maximum size of each chunk.
- **similarity_window** (<code>int</code>) – The window size for similarity calculations.
- **min_sentences_per_chunk** (<code>int</code>) – The minimum number of sentences per chunk.
- **min_characters_per_sentence** (<code>int</code>) – The minimum number of characters per sentence.
- **delim** (<code>Any</code>) – Delimiters to use for splitting. If None, default delimiters are used.
- **include_delim** (<code>str</code>) – Whether to include the delimiter in the chunks.
- **skip_window** (<code>int</code>) – The skip window for similarity calculations.
- **filter_window** (<code>int</code>) – The filter window for similarity calculations.
- **filter_polyorder** (<code>int</code>) – The polynomial order for similarity filtering.
- **filter_tolerance** (<code>float</code>) – The tolerance for similarity filtering.
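
As a rough intuition for what a similarity threshold does, the sketch below groups consecutive sentences while the cosine similarity of their vectors stays at or above the threshold. This is a deliberately simplified illustration, not Chonkie's algorithm: it ignores `similarity_window`, `skip_window`, and the filter parameters, the helper names are hypothetical, and the two-dimensional vectors are toy stand-ins for real embeddings.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_by_similarity(
    sentences: list[str], vectors: list[list[float]], threshold: float = 0.8
) -> list[list[str]]:
    """Start a new group whenever adjacent sentences fall below the threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(vectors[i - 1], vectors[i]) >= threshold:
            groups[-1].append(sentences[i])  # similar enough: same group
        else:
            groups.append([sentences[i]])  # topic shift: open a new group
    return groups


# Toy vectors: the first two point in nearly the same direction, the third does not.
toy_sentences = ["Cats purr.", "Kittens purr too.", "Stocks fell today."]
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
toy_groups = group_by_similarity(toy_sentences, toy_vectors, threshold=0.8)
```

In this reading, raising the threshold produces more, smaller groups, and lowering it merges more sentences together.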

#### run

```python
run(documents: list[Document]) -> dict[str, Any]
```

Splits a list of documents into smaller semantic chunks.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – The list of documents to split.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the "documents" key containing the list of chunks.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> ChonkieSemanticChunker
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>ChonkieSemanticChunker</code> – Deserialized component.

## haystack_integrations.components.preprocessors.chonkie.sentence_chunker

### ChonkieSentenceChunker

A Document Splitter that uses Chonkie's SentenceChunker to split documents.

Usage:

```python
from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import ChonkieSentenceChunker

chunker = ChonkieSentenceChunker(chunk_size=512)
documents = [Document(content="Hello world. This is a test.")]
result = chunker.run(documents=documents)
print(result["documents"])
```

#### __init__

```python
__init__(
    tokenizer: str = "character",
    chunk_size: int = 2048,
    chunk_overlap: int = 0,
    min_sentences_per_chunk: int = 1,
    min_characters_per_sentence: int = 12,
    approximate: bool = False,
    delim: Any = None,
    include_delim: str = "prev",
) -> None
```

Initializes the ChonkieSentenceChunker.

**Parameters:**

- **tokenizer** (<code>str</code>) – The tokenizer to use for chunking. Defaults to "character".
- **chunk_size** (<code>int</code>) – The maximum size of each chunk.
- **chunk_overlap** (<code>int</code>) – The overlap between consecutive chunks.
- **min_sentences_per_chunk** (<code>int</code>) – The minimum number of sentences per chunk.
- **min_characters_per_sentence** (<code>int</code>) – The minimum number of characters per sentence.
- **approximate** (<code>bool</code>) – Whether to use approximate chunking.
- **delim** (<code>Any</code>) – Delimiters to use for splitting. If None, default delimiters are used.
- **include_delim** (<code>str</code>) – Whether to include the delimiter in the chunks ("prev" or "next").
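
The interplay between delimiters and `chunk_size` can be sketched in two steps: split the text into sentences, keeping each delimiter attached to the sentence before it (one reading of `include_delim="prev"`), then pack whole sentences into chunks. The helper names and the delimiter set below are assumptions for illustration. This is not Chonkie's code, it counts characters only, and it ignores `chunk_overlap` and the minimum-length parameters.

```python
import re


def split_sentences(text: str, delims: tuple[str, ...] = (". ", "! ", "? ")) -> list[str]:
    """Split on delimiters, keeping each delimiter with the preceding sentence."""
    pattern = "(" + "|".join(re.escape(d) for d in delims) + ")"
    # With one capturing group, re.split alternates text and delimiter.
    pieces = re.split(pattern, text)
    sentences = [
        pieces[i] + (pieces[i + 1] if i + 1 < len(pieces) else "")
        for i in range(0, len(pieces), 2)
    ]
    return [s for s in sentences if s]


def pack_sentences(sentences: list[str], chunk_size: int) -> list[str]:
    """Greedily pack whole sentences into chunks of at most chunk_size characters.

    A single sentence longer than chunk_size is kept whole in this sketch.
    """
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > chunk_size:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks


sentences = split_sentences("Hello world. This is a test.")
small_chunks = pack_sentences(sentences, chunk_size=16)
one_chunk = pack_sentences(sentences, chunk_size=2048)
```

With the small limit each sentence becomes its own chunk, while the default-sized limit keeps the whole text together.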

#### run

```python
run(documents: list[Document]) -> dict[str, Any]
```

Splits a list of documents into smaller sentence-based chunks.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – The list of documents to split.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the "documents" key containing the list of chunks.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> ChonkieSentenceChunker
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>ChonkieSentenceChunker</code> – Deserialized component.

## haystack_integrations.components.preprocessors.chonkie.token_chunker

### ChonkieTokenChunker

A Document Splitter that uses Chonkie's TokenChunker to split documents.

Usage:

```python
from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import ChonkieTokenChunker

chunker = ChonkieTokenChunker(chunk_size=512, chunk_overlap=50)
documents = [Document(content="Hello world. This is a test.")]
result = chunker.run(documents=documents)
print(result["documents"])
```

#### __init__

```python
__init__(
    tokenizer: str = "character", chunk_size: int = 2048, chunk_overlap: int = 0
) -> None
```

Initializes the ChonkieTokenChunker.

**Parameters:**

- **tokenizer** (<code>str</code>) – The tokenizer to use for chunking. Defaults to "character".
- **chunk_size** (<code>int</code>) – The maximum size of each chunk.
- **chunk_overlap** (<code>int</code>) – The overlap between consecutive chunks.
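
How `chunk_size` and `chunk_overlap` interact is easiest to see as a sliding window whose step is `chunk_size - chunk_overlap`. The sketch below applies that window to a plain string, treating each character as one token in the spirit of the "character" default. The function name is hypothetical and the real chunker may handle edge cases differently.

```python
def chunk_with_overlap(tokens: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Slide a window of chunk_size over tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks: list[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the input
    return chunks


overlapping = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
disjoint = chunk_with_overlap("abcdefghij", chunk_size=4)
```

With an overlap of 1, each chunk starts with the last character of the previous one; with no overlap, the chunks simply partition the input.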

#### run

```python
run(documents: list[Document]) -> dict[str, Any]
```

Splits a list of documents into smaller token-based chunks.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – The list of documents to split.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the "documents" key containing the list of chunks.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> ChonkieTokenChunker
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>ChonkieTokenChunker</code> – Deserialized component.

integrations/chonkie/pydoc/config_docusaurus.yml

Lines changed: 4 additions & 1 deletion
@@ -1,6 +1,9 @@
 loaders:
   - modules:
-      - haystack_integrations.components.preprocessors.chonkie.preprocessor
+      - haystack_integrations.components.preprocessors.chonkie.recursive_chunker
+      - haystack_integrations.components.preprocessors.chonkie.semantic_chunker
+      - haystack_integrations.components.preprocessors.chonkie.sentence_chunker
+      - haystack_integrations.components.preprocessors.chonkie.token_chunker
     search_path: [../src]
 processors:
   - type: filter
