- - `split_by`: The unit for splitting your documents. Choose from:
+ - **split_by** (<code>Literal['word', 'sentence', 'passage', 'page', 'line', 'period', 'function']</code>) – The unit for splitting your documents. Choose from:
    - `word` for splitting by spaces (" ")
    - `period` for splitting by periods (".")
-   - `page` for splitting by form feed ("\f")
-   - `passage` for splitting by double line breaks ("\n\n")
-   - `line` for splitting each line ("\n")
+   - `page` for splitting by form feed ("\\f")
+   - `passage` for splitting by double line breaks ("\\n\\n")
+   - `line` for splitting each line ("\\n")
    - `sentence` for splitting by HanLP sentence tokenizer
- - `split_length`: The maximum number of units in each split.
- - `split_overlap`: The number of overlapping units for each split.
- - `split_threshold`: The minimum number of units per split. If a split has fewer units
-   than the threshold, it's attached to the previous split.
- - `respect_sentence_boundary`: Choose whether to respect sentence boundaries when splitting by "word".
-   If True, uses HanLP to detect sentence boundaries, ensuring splits occur only between sentences.
- - `splitting_function`: Necessary when `split_by` is set to "function".
-   This is a function which must accept a single `str` as input and return a `list` of `str` as output,
-   representing the chunks after splitting.
- - `granularity`: The granularity of Chinese word segmentation, either 'coarse' or 'fine'.
+ - **split_length** (<code>int</code>) – The maximum number of units in each split.
+ - **split_overlap** (<code>int</code>) – The number of overlapping units for each split.
+ - **split_threshold** (<code>int</code>) – The minimum number of units per split. If a split has fewer units
+   than the threshold, it's attached to the previous split.
+ - **respect_sentence_boundary** (<code>bool</code>) – Choose whether to respect sentence boundaries when splitting by "word".
+   If True, uses HanLP to detect sentence boundaries, ensuring splits occur only between sentences.
+ - **splitting_function** (<code>Callable | None</code>) – Necessary when `split_by` is set to "function".
+   This is a function which must accept a single `str` as input and return a `list` of `str` as output,
+   representing the chunks after splitting.
+ - **granularity** (<code>Literal['coarse', 'fine']</code>) – The granularity of Chinese word segmentation, either 'coarse' or 'fine'.

- **Raises**:
+ **Raises:**

- - `ValueError`: If the granularity is not 'coarse' or 'fine'.
+ - <code>ValueError</code> – If the granularity is not 'coarse' or 'fine'.
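
The `splitting_function` contract in the docstring (a single `str` in, a `list` of `str` out) may be easier to picture with a small standalone sketch. The function name `split_on_full_width_period` is hypothetical and is not part of the patched component; it only illustrates a callable of the documented shape, here cutting after each full-width period ("。").

```python
from typing import List


def split_on_full_width_period(text: str) -> List[str]:
    """Hypothetical splitter: close a chunk after every full-width period."""
    chunks: List[str] = []
    current = ""
    for ch in text:
        current += ch
        if ch == "。":
            # The period stays attached to the chunk it terminates.
            chunks.append(current)
            current = ""
    if current:
        # Keep trailing text that has no closing period.
        chunks.append(current)
    return chunks
```

Under the documented contract, a callable like this would presumably be passed as `splitting_function` together with `split_by` set to "function"; how the component then treats the returned chunks is not shown in this diff.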