Commit 8aef2a4

docs: sync Core Integrations API reference (langdetect) on Docusaurus (#11674)
Co-authored-by: julian-risch <4181769+julian-risch@users.noreply.github.com>
1 parent cb30727 commit 8aef2a4

14 files changed

Lines changed: 2282 additions & 0 deletions

Lines changed: 163 additions & 0 deletions
@@ -0,0 +1,163 @@
---
title: "Langdetect"
id: integrations-langdetect
description: "Langdetect integration for Haystack"
slug: "/integrations-langdetect"
---

## haystack_integrations.components.classifiers.langdetect.document_language_classifier

### DocumentLanguageClassifier

Classifies the language of each document and adds it to its metadata.

Provide a list of languages during initialization. If the document's text doesn't match any of the
specified languages, the metadata value is set to "unmatched".
To route documents based on their language, use the MetadataRouter component after DocumentLanguageClassifier.
For routing plain text, use the TextLanguageRouter component instead.

### Usage example

```python
from haystack import Document, Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.classifiers.langdetect import DocumentLanguageClassifier
from haystack.components.routers import MetadataRouter
from haystack.components.writers import DocumentWriter

docs = [Document(id="1", content="This is an English document"),
        Document(id="2", content="Este es un documento en español")]

document_store = InMemoryDocumentStore()

p = Pipeline()
p.add_component(instance=DocumentLanguageClassifier(languages=["en"]), name="language_classifier")
p.add_component(
    instance=MetadataRouter(rules={
        "en": {
            "field": "meta.language",
            "operator": "==",
            "value": "en"
        }
    }),
    name="router")
p.add_component(instance=DocumentWriter(document_store=document_store), name="writer")
p.connect("language_classifier.documents", "router.documents")
p.connect("router.en", "writer.documents")

p.run({"language_classifier": {"documents": docs}})

written_docs = document_store.filter_documents()
assert len(written_docs) == 1
assert written_docs[0] == Document(id="1", content="This is an English document", meta={"language": "en"})
```
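
To route more than one language, the `rules` dictionary passed to MetadataRouter needs one entry per language code. The following is a minimal sketch that assumes the same rule shape as in the usage example; `build_language_rules` is a hypothetical helper and not part of Haystack or this integration.

```python
def build_language_rules(languages: list[str]) -> dict[str, dict[str, str]]:
    """Build one MetadataRouter-style rule per language code.

    Hypothetical helper: it only assembles plain dictionaries in the
    rule shape used by the usage example.
    """
    return {
        code: {"field": "meta.language", "operator": "==", "value": code}
        for code in languages
    }


rules = build_language_rules(["en", "es"])
print(list(rules))  # ['en', 'es']
```

The resulting dictionary could then be passed as `MetadataRouter(rules=rules)`, with each key serving as an output name in the same way `router.en` does in the usage example.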

#### __init__

```python
__init__(languages: list[str] | None = None) -> None
```

Initializes the DocumentLanguageClassifier component.

**Parameters:**

- **languages** (<code>list\[str\] | None</code>) – A list of ISO language codes.
  See the supported languages in [`langdetect` documentation](https://github.com/Mimino666/langdetect#languages).
  If not specified, defaults to ["en"].

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Classifies the language of each document and adds it to its metadata.

If the document's text doesn't match any of the languages specified at initialization,
sets the `language` metadata value to "unmatched".

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – A list of documents for language classification.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following key:
  - `documents`: A list of documents with an added `language` metadata field.

**Raises:**

- <code>TypeError</code> – If the input is not a list of Documents.
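
The fallback to "unmatched" can be illustrated without the library. The following is a sketch under stated assumptions, not the component's actual implementation: `label_language` and `toy_detect` are hypothetical names, and the toy detector stands in for a real language detection call.

```python
from typing import Callable


def label_language(text: str, languages: list[str], detect: Callable[[str], str]) -> str:
    """Return the detected language code if it is configured, else "unmatched"."""
    detected = detect(text)
    return detected if detected in languages else "unmatched"


def toy_detect(text: str) -> str:
    # Stand-in detector for illustration only: treats text containing "ñ" as Spanish.
    return "es" if "ñ" in text else "en"


texts = ["This is an English document", "Este es un documento en español"]
labels = [label_language(text, ["en"], toy_detect) for text in texts]
print(labels)  # ['en', 'unmatched']
```

With `languages=["en"]`, the Spanish text is detected as a language outside the configured list, so its label becomes "unmatched" rather than "es".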

## haystack_integrations.components.routers.langdetect.text_language_router

### TextLanguageRouter

Routes text strings to different output connections based on their language.

Provide a list of languages during initialization. If the text doesn't match any of the
specified languages, it is routed to the "unmatched" output.
For routing documents based on their language, use the DocumentLanguageClassifier component,
followed by the MetadataRouter.

### Usage example

```python
from haystack import Pipeline, Document
from haystack_integrations.components.routers.langdetect import TextLanguageRouter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever

document_store = InMemoryDocumentStore()
document_store.write_documents([Document(content="Elvis Presley was an American singer and actor.")])

p = Pipeline()
p.add_component(instance=TextLanguageRouter(languages=["en"]), name="text_language_router")
p.add_component(instance=InMemoryBM25Retriever(document_store=document_store), name="retriever")
p.connect("text_language_router.en", "retriever.query")

result = p.run({"text_language_router": {"text": "Who was Elvis Presley?"}})
assert result["retriever"]["documents"][0].content == "Elvis Presley was an American singer and actor."

result = p.run({"text_language_router": {"text": "ένα ελληνικό κείμενο"}})
assert result["text_language_router"]["unmatched"] == "ένα ελληνικό κείμενο"
```

#### __init__

```python
__init__(languages: list[str] | None = None) -> None
```

Initializes the TextLanguageRouter component.

**Parameters:**

- **languages** (<code>list\[str\] | None</code>) – A list of ISO language codes.
  See the supported languages in [`langdetect` documentation](https://github.com/Mimino666/langdetect#languages).
  If not specified, defaults to ["en"].

#### run

```python
run(text: str) -> dict[str, str]
```

Routes the text string to an output connection based on its language.

If the text doesn't match any of the specified languages, it is routed to the "unmatched" output.

**Parameters:**

- **text** (<code>str</code>) – A text string to route.

**Returns:**

- <code>dict\[str, str\]</code> – A dictionary in which the key is the language (or `"unmatched"`),
  and the value is the text.

**Raises:**

- <code>TypeError</code> – If the input is not a string.
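
The shape of the returned dictionary and the documented `TypeError` can be sketched in plain Python. This is an illustration under stated assumptions rather than the component's actual code: `route_text` and `toy_detect` are hypothetical names, and the toy detector stands in for a real language detection call.

```python
from typing import Callable


def route_text(text: str, languages: list[str], detect: Callable[[str], str]) -> dict[str, str]:
    """Return {language: text} for a configured language, else {"unmatched": text}."""
    if not isinstance(text, str):
        # Mirrors the documented TypeError for non-string input.
        raise TypeError("route_text expects a str as input.")
    detected = detect(text)
    key = detected if detected in languages else "unmatched"
    return {key: text}


def toy_detect(text: str) -> str:
    # Stand-in detector for illustration only: treats any non-ASCII text as Greek.
    return "en" if text.isascii() else "el"


english = route_text("Who was Elvis Presley?", ["en"], toy_detect)
greek = route_text("ένα ελληνικό κείμενο", ["en"], toy_detect)
print(english)  # {'en': 'Who was Elvis Presley?'}
print(greek)    # {'unmatched': 'ένα ελληνικό κείμενο'}
```

Because exactly one key is present per call, only the pipeline connection attached to that output name receives the text.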
