| 1 | +--- |
| 2 | +title: "Tika" |
| 3 | +id: integrations-tika |
| 4 | +description: "Tika integration for Haystack" |
| 5 | +slug: "/integrations-tika" |
| 6 | +--- |
| 7 | + |
| 8 | + |
| 9 | +## haystack_integrations.components.converters.tika.converter |
| 10 | + |
| 11 | +### XHTMLParser |
| 12 | + |
| 13 | +Bases: <code>HTMLParser</code> |
| 14 | + |
| 15 | +Custom parser to extract pages from Tika XHTML content. |
| 16 | + |
| 17 | +#### __init__ |
| 18 | + |
| 19 | +```python |
| 20 | +__init__() -> None |
| 21 | +``` |
| 22 | + |
| 23 | +Initialize the XHTMLParser. |
| 24 | + |
| 25 | +#### handle_starttag |
| 26 | + |
| 27 | +```python |
| 28 | +handle_starttag(tag: str, attrs: list[tuple[str, str | None]]) -> None |
| 29 | +``` |
| 30 | + |
| 31 | +Identify the start of a page div. |
| 32 | + |
| 33 | +**Parameters:** |
| 34 | + |
| 35 | +- **tag** (<code>str</code>) – The HTML tag name. |
| 36 | +- **attrs** (<code>list\[tuple\[str, str | None\]\]</code>) – The HTML tag attributes. |
| 37 | + |
| 38 | +#### handle_endtag |
| 39 | + |
| 40 | +```python |
| 41 | +handle_endtag(tag: str) -> None |
| 42 | +``` |
| 43 | + |
| 44 | +Identify the end of a page div. |
| 45 | + |
| 46 | +**Parameters:** |
| 47 | + |
| 48 | +- **tag** (<code>str</code>) – The HTML tag name. |
| 49 | + |
| 50 | +#### handle_data |
| 51 | + |
| 52 | +```python |
| 53 | +handle_data(data: str) -> None |
| 54 | +``` |
| 55 | + |
| 56 | +Populate the page content. |
| 57 | + |
| 58 | +**Parameters:** |
| 59 | + |
| 60 | +- **data** (<code>str</code>) – The text content of an HTML node. |
| 61 | + |
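The three handler methods above follow the standard `html.parser.HTMLParser` callback pattern. As a rough illustration only, the sketch below shows how a parser of this kind could collect text page by page. It assumes that Tika wraps each page in `<div class="page">`; the class name `PageSplitter` and its attributes are hypothetical and are not the integration's actual code.

```python
from html.parser import HTMLParser


class PageSplitter(HTMLParser):
    """Hypothetical stand-in for XHTMLParser: collects the text of each page div."""

    def __init__(self) -> None:
        super().__init__()
        self.pages: list[str] = []      # one entry per completed page
        self._in_page = False           # True while inside a page div
        self._buffer: list[str] = []    # text fragments of the current page

    def handle_starttag(self, tag, attrs):
        # Assumption: a page starts at <div class="page">.
        if tag == "div" and ("class", "page") in attrs:
            self._in_page = True
            self._buffer = []

    def handle_endtag(self, tag):
        # Nested divs inside a page are not handled in this sketch.
        if tag == "div" and self._in_page:
            self.pages.append("".join(self._buffer).strip())
            self._in_page = False

    def handle_data(self, data):
        # Only keep text that belongs to the current page.
        if self._in_page:
            self._buffer.append(data)


xhtml = (
    '<body><div class="page"><p>First page</p></div>'
    '<div class="page"><p>Second page</p></div></body>'
)
parser = PageSplitter()
parser.feed(xhtml)
parser.close()
pages = parser.pages
```

Here `pages` ends up with one string per page div, in document order.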
| 62 | +### TikaDocumentConverter |
| 63 | + |
| 64 | +Converts files of different types to Documents using Apache Tika. |
| 65 | + |
| 66 | +This component uses [Apache Tika](https://tika.apache.org/) to parse files, so it |
| 67 | +requires a running Tika server. |
| 68 | +For options on running Tika, |
| 69 | +see the [official documentation](https://github.com/apache/tika-docker/blob/main/README.md#usage). |
| 70 | + |
| 71 | +Usage example: |
| 72 | + |
| 73 | +```python |
| 74 | +from haystack_integrations.components.converters.tika import TikaDocumentConverter |
| 75 | +from datetime import datetime |
| 76 | + |
| 77 | +converter = TikaDocumentConverter() |
| 78 | +results = converter.run( |
| 79 | +    sources=["sample.docx", "my_document.rtf", "archive.zip"], |
| 80 | +    meta={"date_added": datetime.now().isoformat()} |
| 81 | +) |
| 82 | +documents = results["documents"] |
| 83 | + |
| 84 | +print(documents[0].content) |
| 85 | +# >> 'This is a text from the docx file.' |
| 86 | +``` |
| 87 | + |
| 88 | +#### __init__ |
| 89 | + |
| 90 | +```python |
| 91 | +__init__( |
| 92 | +    tika_url: str = "http://localhost:9998/tika", store_full_path: bool = False |
| 93 | +) -> None |
| 94 | +``` |
| 95 | + |
| 96 | +Create a TikaDocumentConverter component. |
| 97 | + |
| 98 | +**Parameters:** |
| 99 | + |
| 100 | +- **tika_url** (<code>str</code>) – Tika server URL. |
| 101 | +- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document. |
| 102 | + If False, only the file name is stored. |
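To make the effect of `store_full_path` concrete, here is a minimal sketch of the choice it describes: keep the whole path or only the file name. The helper `path_for_meta` is hypothetical and only mirrors the documented behavior; it is not the converter's implementation, and the example path is invented for illustration.

```python
from pathlib import PurePosixPath


def path_for_meta(source: str, store_full_path: bool = False) -> str:
    """Hypothetical helper: choose the path value to store in a document's metadata."""
    path = PurePosixPath(source)
    # Full path when requested, otherwise only the final path component.
    return str(path) if store_full_path else path.name


full = path_for_meta("/data/reports/sample.docx", store_full_path=True)
short = path_for_meta("/data/reports/sample.docx")
```

With the default `store_full_path=False`, only `sample.docx` would be kept, which avoids leaking local directory layouts into stored metadata.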
| 103 | + |
| 104 | +#### run |
| 105 | + |
| 106 | +```python |
| 107 | +run( |
| 108 | +    sources: list[str | Path | ByteStream], |
| 109 | +    meta: dict[str, Any] | list[dict[str, Any]] | None = None, |
| 110 | +) -> dict[str, list[Document]] |
| 111 | +``` |
| 112 | + |
| 113 | +Convert files to Documents. |
| 114 | + |
| 115 | +**Parameters:** |
| 116 | + |
| 117 | +- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects. |
| 118 | +- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents. |
| 119 | +  This value can be either a list of dictionaries or a single dictionary. |
| 120 | +  If it's a single dictionary, its content is added to the metadata of all produced Documents. |
| 121 | +  If it's a list, its length must match the number of sources, because each dictionary is |
| 122 | +  paired with the source at the same index. |
| 123 | +  If `sources` contains ByteStream objects, their `meta` is also added to the output Documents. |
| 124 | + |
| 125 | +**Returns:** |
| 126 | + |
| 127 | +- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys: |
| 128 | +  - `documents`: Created Documents |
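The `meta` rules described for `run` can be sketched as a small normalization step: a single dictionary is repeated for every source, while a list is paired with the sources index by index. The helper `normalize_meta` below is a hypothetical illustration written for this page, not the function the converter actually calls.

```python
from typing import Any, Dict, List, Optional, Union

Meta = Union[Dict[str, Any], List[Dict[str, Any]], None]


def normalize_meta(meta: Meta, sources_count: int) -> List[Dict[str, Any]]:
    """Hypothetical helper: return one metadata dict per source."""
    if meta is None:
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # A single dict is copied so every Document gets its own instance.
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        raise ValueError("The length of the metadata list must match the number of sources.")
    return [dict(m) for m in meta]


sources = ["sample.docx", "my_document.rtf"]

# One shared dictionary: applied to every source.
shared = list(zip(sources, normalize_meta({"lang": "en"}, len(sources))))

# One dictionary per source: paired by position.
per_source = list(zip(sources, normalize_meta([{"id": 1}, {"id": 2}], len(sources))))
```

Validating the list length before zipping matters because `zip` would otherwise silently drop the unmatched sources or dictionaries.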