Commit d5791aa

docs: sync Core Integrations API reference (tika) on Docusaurus (#11677)
Co-authored-by: davidsbatista <7937824+davidsbatista@users.noreply.github.com>
1 parent 40db744 commit d5791aa

14 files changed

Lines changed: 1792 additions & 0 deletions

File tree

  • docs-website
    • reference_versioned_docs
      • version-2.18/integrations-api
      • version-2.19/integrations-api
      • version-2.20/integrations-api
      • version-2.21/integrations-api
      • version-2.22/integrations-api
      • version-2.23/integrations-api
      • version-2.24/integrations-api
      • version-2.25/integrations-api
      • version-2.26/integrations-api
      • version-2.27/integrations-api
      • version-2.28/integrations-api
      • version-2.29/integrations-api
      • version-2.30/integrations-api
    • reference/integrations-api
Lines changed: 128 additions & 0 deletions
@@ -0,0 +1,128 @@
---
title: "Tika"
id: integrations-tika
description: "Tika integration for Haystack"
slug: "/integrations-tika"
---

## haystack_integrations.components.converters.tika.converter

### XHTMLParser

Bases: <code>HTMLParser</code>

Custom parser to extract pages from Tika XHTML content.

#### __init__

```python
__init__() -> None
```

Initialize the XHTMLParser.

#### handle_starttag

```python
handle_starttag(tag: str, attrs: list[tuple[str, str | None]]) -> None
```

Identify the start of a page div.

**Parameters:**

- **tag** (<code>str</code>) – The HTML tag name.
- **attrs** (<code>list\[tuple\[str, str | None\]\]</code>) – The HTML tag attributes.

#### handle_endtag

```python
handle_endtag(tag: str) -> None
```

Identify the end of a page div.

**Parameters:**

- **tag** (<code>str</code>) – The HTML tag name.

#### handle_data

```python
handle_data(data: str) -> None
```

Populate the page content.

**Parameters:**

- **data** (<code>str</code>) – The text content of an HTML node.
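
As a rough illustration of how the three handlers can cooperate, here is a minimal sketch of an `HTMLParser` subclass that collects the text of each page div. The class name `PageCollector`, its attributes, and the assumption that pages are marked as `<div class="page">` are illustrative choices, not taken from this commit; the integration's actual parser may work differently.

```python
from __future__ import annotations

from html.parser import HTMLParser


class PageCollector(HTMLParser):
    """Hypothetical sketch: gather the text of each <div class="page"> block."""

    def __init__(self) -> None:
        super().__init__()
        self.pages: list[str] = []    # finished pages, in document order
        self._in_page = False         # True while inside a page div
        self._buffer: list[str] = []  # text chunks of the current page

    def handle_starttag(self, tag: str, attrs: list[tuple[str, str | None]]) -> None:
        # Assumption: a page starts at <div class="page">.
        if tag == "div" and ("class", "page") in attrs:
            self._in_page = True
            self._buffer = []

    def handle_endtag(self, tag: str) -> None:
        # Close the page on the first </div> seen while inside one.
        # Nested divs are not handled in this sketch.
        if tag == "div" and self._in_page:
            self.pages.append("".join(self._buffer).strip())
            self._in_page = False

    def handle_data(self, data: str) -> None:
        # Only text that appears inside a page div is kept.
        if self._in_page:
            self._buffer.append(data)


parser = PageCollector()
parser.feed(
    '<body><div class="page"><p>First page</p></div>'
    '<div class="page"><p>Second page</p></div></body>'
)
print(parser.pages)
# ['First page', 'Second page']
```

Keeping a single boolean flag is enough here because the sample input has no nested divs; a parser for real-world XHTML would likely need a depth counter instead.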

### TikaDocumentConverter

Converts files of different types to Documents using Apache Tika.

This component uses [Apache Tika](https://tika.apache.org/) for parsing the files and, therefore,
requires a running Tika server.
For more options on running Tika,
see the [official documentation](https://github.com/apache/tika-docker/blob/main/README.md#usage).

Usage example:

```python
from haystack_integrations.components.converters.tika import TikaDocumentConverter
from datetime import datetime

converter = TikaDocumentConverter()
results = converter.run(
    sources=["sample.docx", "my_document.rtf", "archive.zip"],
    meta={"date_added": datetime.now().isoformat()}
)
documents = results["documents"]

print(documents[0].content)
# >> 'This is a text from the docx file.'
```

#### __init__

```python
__init__(
    tika_url: str = "http://localhost:9998/tika", store_full_path: bool = False
) -> None
```

Create a TikaDocumentConverter component.

**Parameters:**

- **tika_url** (<code>str</code>) – Tika server URL.
- **store_full_path** (<code>bool</code>) – If True, the full path of the file is stored in the metadata of the document.
  If False, only the file name is stored.
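
To make the effect of `store_full_path` concrete, the sketch below mimics the described behavior using only the standard library. The helper `file_path_meta` and the `file_path` metadata key are assumptions made for illustration; this commit does not show how the converter stores the value internally.

```python
import os


def file_path_meta(path: str, store_full_path: bool = False) -> dict[str, str]:
    """Hypothetical helper: decide which path information goes into metadata."""
    # Keep the path untouched when the full path is requested,
    # otherwise reduce it to the bare file name.
    value = path if store_full_path else os.path.basename(path)
    return {"file_path": value}


print(file_path_meta("/data/reports/sample.docx"))
# {'file_path': 'sample.docx'}
print(file_path_meta("/data/reports/sample.docx", store_full_path=True))
# {'file_path': '/data/reports/sample.docx'}
```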

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, list[Document]]
```

Convert files to Documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths or ByteStream objects.
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
  This value can be either a list of dictionaries or a single dictionary.
  If it's a single dictionary, its content is added to the metadata of all produced Documents.
  If it's a list, the length of the list must match the number of sources, because the two lists will
  be zipped.
  If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
  - `documents`: Created Documents
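
The rules for `meta` can be sketched as a small normalization step that runs before sources and metadata are paired. The function `normalize_meta` below is a hypothetical stand-in written for this example, not the converter's own code, and raising `ValueError` on a length mismatch is an assumption.

```python
from __future__ import annotations

from typing import Any


def normalize_meta(
    meta: dict[str, Any] | list[dict[str, Any]] | None, sources_count: int
) -> list[dict[str, Any]]:
    """Hypothetical helper: return exactly one metadata dict per source."""
    if meta is None:
        # No metadata given: every source gets an empty dict.
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # A single dict is copied so each source gets its own instance.
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        # Assumed behavior: reject lists that cannot be zipped one-to-one.
        raise ValueError("The length of the meta list must match the number of sources.")
    return meta


sources = ["sample.docx", "my_document.rtf"]
shared = normalize_meta({"language": "en"}, len(sources))
per_source = normalize_meta([{"page": 1}, {"page": 2}], len(sources))

# Pair each source with its metadata, as the zipping described above implies.
for source, source_meta in zip(sources, per_source):
    print(source, source_meta)
```

Normalizing first keeps the pairing loop trivial: once every source has its own dictionary, a plain `zip` cannot silently drop entries.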
