Commit f334c96

docs: sync Core Integrations API reference (hanlp) on Docusaurus (#11171)
Co-authored-by: anakin87 <44616784+anakin87@users.noreply.github.com>
1 parent: 2487503 · commit: f334c96

12 files changed

Lines changed: 612 additions & 779 deletions

docs-website/reference/integrations-api/hanlp.md

Lines changed: 51 additions & 65 deletions
@@ -5,11 +5,8 @@ description: "HanLP integration for Haystack"
 slug: "/integrations-hanlp"
 ---

-<a id="haystack_integrations.components.preprocessors.hanlp.chinese_document_splitter"></a>

-## Module haystack\_integrations.components.preprocessors.hanlp.chinese\_document\_splitter
-
-<a id="haystack_integrations.components.preprocessors.hanlp.chinese_document_splitter.ChineseDocumentSplitter"></a>
+## haystack_integrations.components.preprocessors.hanlp.chinese_document_splitter

 ### ChineseDocumentSplitter

@@ -28,6 +25,7 @@ Therefore, splitting by word means splitting by these multi-character tokens,
 not simply by single characters or spaces.

 ### Usage example
+
 ```python
 doc = Document(content=
     "这是第一句话,这是第二句话,这是第三句话。"
@@ -42,116 +40,104 @@ result = splitter.run(documents=[doc])
 print(result["documents"])
 ```

-<a id="haystack_integrations.components.preprocessors.hanlp.chinese_document_splitter.ChineseDocumentSplitter.__init__"></a>
-
-#### ChineseDocumentSplitter.\_\_init\_\_
+#### __init__

 ```python
-def __init__(split_by: Literal["word", "sentence", "passage", "page", "line",
-                               "period", "function"] = "word",
-             split_length: int = 1000,
-             split_overlap: int = 200,
-             split_threshold: int = 0,
-             respect_sentence_boundary: bool = False,
-             splitting_function: Callable | None = None,
-             granularity: Literal["coarse", "fine"] = "coarse") -> None
+__init__(
+    split_by: Literal[
+        "word", "sentence", "passage", "page", "line", "period", "function"
+    ] = "word",
+    split_length: int = 1000,
+    split_overlap: int = 200,
+    split_threshold: int = 0,
+    respect_sentence_boundary: bool = False,
+    splitting_function: Callable | None = None,
+    granularity: Literal["coarse", "fine"] = "coarse",
+) -> None
 ```

 Initialize the ChineseDocumentSplitter component.

-**Arguments**:
+**Parameters:**

-- `split_by`: The unit for splitting your documents. Choose from:
+- **split_by** (<code>Literal['word', 'sentence', 'passage', 'page', 'line', 'period', 'function']</code>) – The unit for splitting your documents. Choose from:
   - `word` for splitting by spaces (" ")
   - `period` for splitting by periods (".")
-  - `page` for splitting by form feed ("\f")
-  - `passage` for splitting by double line breaks ("\n\n")
-  - `line` for splitting each line ("\n")
+  - `page` for splitting by form feed ("\\f")
+  - `passage` for splitting by double line breaks ("\\n\\n")
+  - `line` for splitting each line ("\\n")
   - `sentence` for splitting by HanLP sentence tokenizer
-- `split_length`: The maximum number of units in each split.
-- `split_overlap`: The number of overlapping units for each split.
-- `split_threshold`: The minimum number of units per split. If a split has fewer units
-than the threshold, it's attached to the previous split.
-- `respect_sentence_boundary`: Choose whether to respect sentence boundaries when splitting by "word".
-If True, uses HanLP to detect sentence boundaries, ensuring splits occur only between sentences.
-- `splitting_function`: Necessary when `split_by` is set to "function".
-This is a function which must accept a single `str` as input and return a `list` of `str` as output,
-representing the chunks after splitting.
-- `granularity`: The granularity of Chinese word segmentation, either 'coarse' or 'fine'.
+- **split_length** (<code>int</code>) – The maximum number of units in each split.
+- **split_overlap** (<code>int</code>) – The number of overlapping units for each split.
+- **split_threshold** (<code>int</code>) – The minimum number of units per split. If a split has fewer units
+  than the threshold, it's attached to the previous split.
+- **respect_sentence_boundary** (<code>bool</code>) – Choose whether to respect sentence boundaries when splitting by "word".
+  If True, uses HanLP to detect sentence boundaries, ensuring splits occur only between sentences.
+- **splitting_function** (<code>Callable | None</code>) – Necessary when `split_by` is set to "function".
+  This is a function which must accept a single `str` as input and return a `list` of `str` as output,
+  representing the chunks after splitting.
+- **granularity** (<code>Literal['coarse', 'fine']</code>) – The granularity of Chinese word segmentation, either 'coarse' or 'fine'.

-**Raises**:
+**Raises:**

-- `ValueError`: If the granularity is not 'coarse' or 'fine'.
+- <code>ValueError</code> – If the granularity is not 'coarse' or 'fine'.

-<a id="haystack_integrations.components.preprocessors.hanlp.chinese_document_splitter.ChineseDocumentSplitter.run"></a>
-
-#### ChineseDocumentSplitter.run
+#### run

 ```python
-@component.output_types(documents=list[Document])
-def run(documents: list[Document]) -> dict[str, list[Document]]
+run(documents: list[Document]) -> dict[str, list[Document]]
 ```

 Split documents into smaller chunks.

-**Arguments**:
-
-- `documents`: The documents to split.
+**Parameters:**

-**Raises**:
+- **documents** (<code>list\[Document\]</code>) – The documents to split.

-- `RuntimeError`: If the Chinese word segmentation model is not loaded.
+**Returns:**

-**Returns**:
+- <code>dict\[str, list\[Document\]\]</code> – A dictionary containing the split documents.

-A dictionary containing the split documents.
+**Raises:**

-<a id="haystack_integrations.components.preprocessors.hanlp.chinese_document_splitter.ChineseDocumentSplitter.warm_up"></a>
+- <code>RuntimeError</code> – If the Chinese word segmentation model is not loaded.

-#### ChineseDocumentSplitter.warm\_up
+#### warm_up

 ```python
-def warm_up() -> None
+warm_up() -> None
 ```

 Warm up the component by loading the necessary models.

-<a id="haystack_integrations.components.preprocessors.hanlp.chinese_document_splitter.ChineseDocumentSplitter.chinese_sentence_split"></a>
-
-#### ChineseDocumentSplitter.chinese\_sentence\_split
+#### chinese_sentence_split

 ```python
-def chinese_sentence_split(text: str) -> list[dict[str, Any]]
+chinese_sentence_split(text: str) -> list[dict[str, Any]]
 ```

 Split Chinese text into sentences.

-**Arguments**:
-
-- `text`: The text to split.
+**Parameters:**

-**Returns**:
+- **text** (<code>str</code>) – The text to split.

-A list of split sentences.
+**Returns:**

-<a id="haystack_integrations.components.preprocessors.hanlp.chinese_document_splitter.ChineseDocumentSplitter.to_dict"></a>
+- <code>list\[dict\[str, Any\]\]</code> – A list of split sentences.

-#### ChineseDocumentSplitter.to\_dict
+#### to_dict

 ```python
-def to_dict() -> dict[str, Any]
+to_dict() -> dict[str, Any]
 ```

 Serializes the component to a dictionary.

-<a id="haystack_integrations.components.preprocessors.hanlp.chinese_document_splitter.ChineseDocumentSplitter.from_dict"></a>
-
-#### ChineseDocumentSplitter.from\_dict
+#### from_dict

 ```python
-@classmethod
-def from_dict(cls, data: dict[str, Any]) -> "ChineseDocumentSplitter"
+from_dict(data: dict[str, Any]) -> ChineseDocumentSplitter
 ```

 Deserializes the component from a dictionary.
-
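
Taken together, the documented signatures above cover the component's full lifecycle: construct, warm up, run. Below is a minimal usage sketch; the `hanlp-haystack` package name and the `from haystack_integrations.components.preprocessors.hanlp import ...` re-export are assumptions inferred from the module path in the diff, not something this page confirms.

```python
# Minimal usage sketch for the API documented above. The import path and the
# package name (`pip install hanlp-haystack`) are assumptions; the signatures
# and behavior comments come from the reference on this page.
from haystack import Document
from haystack_integrations.components.preprocessors.hanlp import ChineseDocumentSplitter

splitter = ChineseDocumentSplitter(
    split_by="word",                 # HanLP word tokens, not whitespace
    split_length=1000,               # max units per chunk
    split_overlap=200,               # units shared between consecutive chunks
    respect_sentence_boundary=True,  # cut only between HanLP-detected sentences
    granularity="coarse",            # coarse-grained Chinese segmentation
)
splitter.warm_up()  # loads the segmentation models, per the docstring

doc = Document(content="这是第一句话,这是第二句话,这是第三句话。")
result = splitter.run(documents=[doc])  # -> {"documents": [Document, ...]}
for chunk in result["documents"]:
    print(chunk.content)
```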

docs-website/reference_versioned_docs/version-2.18/integrations-api/hanlp.md

Lines changed: 51 additions & 64 deletions
(Same change as above, applied verbatim to the 2.18 versioned copy of the page; the only difference is that this file keeps its trailing blank line, hence 64 deletions instead of 65.)
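
For the `function` split mode and the serialization hooks, a hedged sketch follows; `split_on_chinese_period` is a hypothetical helper, and the same import-path assumption as in the previous sketch applies.

```python
# Sketch of split_by="function" plus the to_dict/from_dict round trip.
# split_on_chinese_period is a hypothetical helper, not part of the integration.
import re

from haystack import Document
from haystack_integrations.components.preprocessors.hanlp import ChineseDocumentSplitter


def split_on_chinese_period(text: str) -> list[str]:
    # Per the docs: must accept a single str and return the list of chunks.
    return [part for part in re.split(r"(?<=。)", text) if part]


fn_splitter = ChineseDocumentSplitter(
    split_by="function",
    splitting_function=split_on_chinese_period,
)
fn_splitter.warm_up()  # docs: run() raises RuntimeError if models are not loaded
chunks = fn_splitter.run(documents=[Document(content="第一句。第二句。")])["documents"]

# Round-trip serialization, shown without the callable since this page does not
# specify how a custom splitting_function is serialized.
splitter = ChineseDocumentSplitter(split_by="sentence", granularity="fine")
restored = ChineseDocumentSplitter.from_dict(splitter.to_dict())
```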
