|
| 1 | +--- |
| 2 | +title: "Langdetect" |
| 3 | +id: integrations-langdetect |
| 4 | +description: "Langdetect integration for Haystack" |
| 5 | +slug: "/integrations-langdetect" |
| 6 | +--- |
| 7 | + |
| 8 | + |
| 9 | +## haystack_integrations.components.classifiers.langdetect.document_language_classifier |
| 10 | + |
| 11 | +### DocumentLanguageClassifier |
| 12 | + |
| 13 | +Classifies the language of each document and adds it to its metadata. |
| 14 | + |
| 15 | +Provide a list of languages during initialization. If the document's text doesn't match any of the |
| 16 | +specified languages, the `language` metadata value is set to "unmatched". |
| 17 | +To route documents based on their language, use the MetadataRouter component after DocumentLanguageClassifier. |
| 18 | +For routing plain text, use the TextLanguageRouter component instead. |
| 19 | + |
| 20 | +### Usage example |
| 21 | + |
| 22 | +```python |
| 23 | +from haystack import Document, Pipeline |
| 24 | +from haystack.document_stores.in_memory import InMemoryDocumentStore |
| 25 | +from haystack_integrations.components.classifiers.langdetect import DocumentLanguageClassifier |
| 26 | +from haystack.components.routers import MetadataRouter |
| 27 | +from haystack.components.writers import DocumentWriter |
| 28 | + |
| 29 | +docs = [Document(id="1", content="This is an English document"), |
| 30 | +        Document(id="2", content="Este es un documento en español")] |
| 31 | + |
| 32 | +document_store = InMemoryDocumentStore() |
| 33 | + |
| 34 | +p = Pipeline() |
| 35 | +p.add_component(instance=DocumentLanguageClassifier(languages=["en"]), name="language_classifier") |
| 36 | +p.add_component( |
| 37 | +    instance=MetadataRouter(rules={ |
| 38 | +        "en": { |
| 39 | +            "field": "meta.language", |
| 40 | +            "operator": "==", |
| 41 | +            "value": "en" |
| 42 | +        } |
| 43 | +    }), |
| 44 | +    name="router") |
| 45 | +p.add_component(instance=DocumentWriter(document_store=document_store), name="writer") |
| 46 | +p.connect("language_classifier.documents", "router.documents") |
| 47 | +p.connect("router.en", "writer.documents") |
| 48 | + |
| 49 | +p.run({"language_classifier": {"documents": docs}}) |
| 50 | + |
| 51 | +written_docs = document_store.filter_documents() |
| 52 | +assert len(written_docs) == 1 |
| 53 | +assert written_docs[0] == Document(id="1", content="This is an English document", meta={"language": "en"}) |
| 54 | +``` |
| 55 | + |
| 56 | +#### __init__ |
| 57 | + |
| 58 | +```python |
| 59 | +__init__(languages: list[str] | None = None) -> None |
| 60 | +``` |
| 61 | + |
| 62 | +Initializes the DocumentLanguageClassifier component. |
| 63 | + |
| 64 | +**Parameters:** |
| 65 | + |
| 66 | +- **languages** (<code>list\[str\] | None</code>) – A list of ISO language codes. |
| 67 | + See the supported languages in [`langdetect` documentation](https://github.com/Mimino666/langdetect#languages). |
| 68 | + If not specified, defaults to ["en"]. |
| 69 | + |
| 70 | +#### run |
| 71 | + |
| 72 | +```python |
| 73 | +run(documents: list[Document]) -> dict[str, list[Document]] |
| 74 | +``` |
| 75 | + |
| 76 | +Classifies the language of each document and adds it to its metadata. |
| 77 | + |
| 78 | +If the document's text doesn't match any of the languages specified at initialization, |
| 79 | +sets the `language` metadata value to "unmatched". |
| 80 | + |
| 81 | +**Parameters:** |
| 82 | + |
| 83 | +- **documents** (<code>list\[Document\]</code>) – A list of documents for language classification. |
| 84 | + |
| 85 | +**Returns:** |
| 86 | + |
| 87 | +- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following key: |
| 88 | +  - `documents`: A list of documents with an added `language` metadata field. |
| 89 | + |
| 90 | +**Raises:** |
| 91 | + |
| 92 | +- <code>TypeError</code> – If the input is not a list of Documents. |
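
The "unmatched" fallback described for `run` can be sketched without Haystack or `langdetect` installed. This is a minimal illustration under stated assumptions, not the component's actual implementation: `SimpleDoc`, `classify_documents`, and `fake_detect` are hypothetical names, and the detector is a stub standing in for a real language detector.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for a document: text content plus metadata."""
    content: str
    meta: dict = field(default_factory=dict)


def classify_documents(
    documents: list[SimpleDoc],
    detect: Callable[[str], str],
    languages: list[str] | None = None,
) -> dict[str, list[SimpleDoc]]:
    """Set meta["language"] to the detected code, or "unmatched" if not allowed."""
    if not isinstance(documents, list) or any(
        not isinstance(doc, SimpleDoc) for doc in documents
    ):
        raise TypeError("expected a list of SimpleDoc objects")
    allowed = languages or ["en"]  # mirrors the documented default
    for doc in documents:
        detected = detect(doc.content)
        doc.meta["language"] = detected if detected in allowed else "unmatched"
    return {"documents": documents}


def fake_detect(text: str) -> str:
    """Stub detector (assumption): only tells Spanish from English by one keyword."""
    return "es" if "español" in text else "en"


result = classify_documents(
    [
        SimpleDoc("This is an English document"),
        SimpleDoc("Este es un documento en español"),
    ],
    detect=fake_detect,
    languages=["en"],
)
labels = [doc.meta["language"] for doc in result["documents"]]
print(labels)
```

Because only `"en"` is listed, the Spanish document is labeled `"unmatched"` rather than `"es"`, which is what lets a downstream metadata rule such as `meta.language == "en"` filter it out.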
| 93 | + |
| 94 | +## haystack_integrations.components.routers.langdetect.text_language_router |
| 95 | + |
| 96 | +### TextLanguageRouter |
| 97 | + |
| 98 | +Routes text strings to different output connections based on their language. |
| 99 | + |
| 100 | +Provide a list of languages during initialization. If the text doesn't match any of the |
| 101 | +specified languages, it is routed to the "unmatched" output connection. |
| 102 | +For routing documents based on their language, use the DocumentLanguageClassifier component, |
| 103 | +followed by the MetadataRouter. |
| 104 | + |
| 105 | +### Usage example |
| 106 | + |
| 107 | +```python |
| 108 | +from haystack import Pipeline, Document |
| 109 | +from haystack_integrations.components.routers.langdetect import TextLanguageRouter |
| 110 | +from haystack.document_stores.in_memory import InMemoryDocumentStore |
| 111 | +from haystack.components.retrievers.in_memory import InMemoryBM25Retriever |
| 112 | + |
| 113 | +document_store = InMemoryDocumentStore() |
| 114 | +document_store.write_documents([Document(content="Elvis Presley was an American singer and actor.")]) |
| 115 | + |
| 116 | +p = Pipeline() |
| 117 | +p.add_component(instance=TextLanguageRouter(languages=["en"]), name="text_language_router") |
| 118 | +p.add_component(instance=InMemoryBM25Retriever(document_store=document_store), name="retriever") |
| 119 | +p.connect("text_language_router.en", "retriever.query") |
| 120 | + |
| 121 | +result = p.run({"text_language_router": {"text": "Who was Elvis Presley?"}}) |
| 122 | +assert result["retriever"]["documents"][0].content == "Elvis Presley was an American singer and actor." |
| 123 | + |
| 124 | +result = p.run({"text_language_router": {"text": "ένα ελληνικό κείμενο"}}) |
| 125 | +assert result["text_language_router"]["unmatched"] == "ένα ελληνικό κείμενο" |
| 126 | +``` |
| 127 | + |
| 128 | +#### __init__ |
| 129 | + |
| 130 | +```python |
| 131 | +__init__(languages: list[str] | None = None) -> None |
| 132 | +``` |
| 133 | + |
| 134 | +Initializes the TextLanguageRouter component. |
| 135 | + |
| 136 | +**Parameters:** |
| 137 | + |
| 138 | +- **languages** (<code>list\[str\] | None</code>) – A list of ISO language codes. |
| 139 | + See the supported languages in [`langdetect` documentation](https://github.com/Mimino666/langdetect#languages). |
| 140 | + If not specified, defaults to ["en"]. |
| 141 | + |
| 142 | +#### run |
| 143 | + |
| 144 | +```python |
| 145 | +run(text: str) -> dict[str, str] |
| 146 | +``` |
| 147 | + |
| 148 | +Routes the text string to an output connection based on its language. |
| 149 | + |
| 150 | +If the text doesn't match any of the specified languages, it is routed to the "unmatched" output. |
| 151 | + |
| 152 | +**Parameters:** |
| 153 | + |
| 154 | +- **text** (<code>str</code>) – A text string to route. |
| 155 | + |
| 156 | +**Returns:** |
| 157 | + |
| 158 | +- <code>dict\[str, str\]</code> – A dictionary in which the key is the language (or `"unmatched"`), |
| 159 | + and the value is the text. |
| 160 | + |
| 161 | +**Raises:** |
| 162 | + |
| 163 | +- <code>TypeError</code> – If the input is not a string. |
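
The return shape of `run` (a single key naming the language or `"unmatched"`, with the text as its value) can be sketched in plain Python. This is an illustrative sketch only, not the router's real code: `route_text` and `fake_detect` are hypothetical names, and the detector is a stub that recognizes Greek by its Unicode block.

```python
from __future__ import annotations

from typing import Callable


def route_text(
    text: str,
    detect: Callable[[str], str],
    languages: list[str] | None = None,
) -> dict[str, str]:
    """Return {language: text}, or {"unmatched": text} if the language is not listed."""
    if not isinstance(text, str):
        raise TypeError("expected a string")
    allowed = languages or ["en"]  # mirrors the documented default
    detected = detect(text)
    output = detected if detected in allowed else "unmatched"
    return {output: text}


def fake_detect(text: str) -> str:
    """Stub detector (assumption): "el" if any Greek-block character, else "en"."""
    is_greek = any("\u0370" <= ch <= "\u03ff" for ch in text)
    return "el" if is_greek else "en"


english = route_text("Who was Elvis Presley?", fake_detect, languages=["en"])
greek = route_text("ένα ελληνικό κείμενο", fake_detect, languages=["en"])
print(english)
print(greek)
```

Only one key is present in each result, so in a pipeline only the component connected to that output receives the text.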