Commit 4cc6a86

Sync Core Integrations API reference (kreuzberg) on Docusaurus (#10886)
Co-authored-by: anakin87 <44616784+anakin87@users.noreply.github.com>
1 parent 0defe99 commit 4cc6a86

10 files changed

Lines changed: 1520 additions & 0 deletions

Lines changed: 152 additions & 0 deletions
@@ -0,0 +1,152 @@
---
title: "Kreuzberg"
id: integrations-kreuzberg
description: "Kreuzberg integration for Haystack"
slug: "/integrations-kreuzberg"
---

## haystack_integrations.components.converters.kreuzberg.converter

### KreuzbergConverter

Converts files to Documents using [Kreuzberg](https://docs.kreuzberg.dev/).

Kreuzberg is a document intelligence framework that extracts text from
PDFs, Office documents, images, and 75+ other formats. All processing
is performed locally with no external API calls.

**Usage Example:**

```python
from haystack_integrations.components.converters.kreuzberg import (
    KreuzbergConverter,
)

converter = KreuzbergConverter()
result = converter.run(sources=["document.pdf", "report.docx"])
documents = result["documents"]
```

You can also pass kreuzberg's `ExtractionConfig` to customize extraction:

```python
from haystack_integrations.components.converters.kreuzberg import (
    KreuzbergConverter,
)
from kreuzberg import ExtractionConfig, OcrConfig

# Request Markdown output and OCR with the Tesseract backend (English).
converter = KreuzbergConverter(
    config=ExtractionConfig(
        output_format="markdown",
        ocr=OcrConfig(backend="tesseract", language="eng"),
    ),
)
```

**Token reduction** can be configured via
`ExtractionConfig(token_reduction=TokenReductionConfig(mode="moderate"))`
to reduce output size for LLM consumption. Five levels are available:
`"off"`, `"light"`, `"moderate"`, `"aggressive"`, `"maximum"`.
The reduced text appears directly in `Document.content`.

**Image preprocessing for OCR** can be tuned via
`OcrConfig(tesseract_config=TesseractConfig(preprocessing=ImagePreprocessingConfig(...)))`
with options for target DPI, auto-rotate, deskew, denoise,
contrast enhancement, and binarization method.

#### __init__

```python
__init__(
    *,
    config: ExtractionConfig | None = None,
    config_path: str | Path | None = None,
    store_full_path: bool = False,
    batch: bool = True,
    easyocr_kwargs: dict[str, Any] | None = None
) -> None
```

Create a `KreuzbergConverter` component.

**Parameters:**

- **config** (<code>ExtractionConfig | None</code>) – An optional `kreuzberg.ExtractionConfig` object to customize
  extraction behavior. Use this to set output format, OCR backend
  and language, force-OCR mode, per-page extraction, chunking,
  keyword extraction, and other kreuzberg options. If not provided,
  kreuzberg's defaults are used.
  See the [kreuzberg API reference](https://docs.kreuzberg.dev/reference/api-python/)
  for the full list of configuration options.
- **config_path** (<code>str | Path | None</code>) – Path to a kreuzberg configuration file (`.toml`, `.yaml`, or
  `.json`). Cannot be used together with `config`.
- **store_full_path** (<code>bool</code>) – If `True`, the full file path is stored in the Document metadata.
  If `False`, only the file name is stored.
- **batch** (<code>bool</code>) – If `True`, use kreuzberg's batch extraction APIs, which leverage
  Rust's rayon thread pool for parallel processing. If `False`,
  sources are extracted one at a time.
- **easyocr_kwargs** (<code>dict\[str, Any\] | None</code>) – Optional keyword arguments to pass to EasyOCR when using the
  `"easyocr"` backend. Supports GPU, beam width, model storage,
  and other EasyOCR-specific options.
  See the [EasyOCR documentation](https://www.jaided.ai/easyocr/documentation/)
  for the full list of supported arguments.
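
Two of the rules above (`config` and `config_path` are mutually exclusive, and `store_full_path` decides whether the full path or only the file name is kept) can be sketched in plain Python. This is an illustration under stated assumptions, not the component's code: `check_config_args` and `stored_path` are hypothetical helpers, and raising `ValueError` is an assumption, since the reference only says the two options cannot be combined.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any


def check_config_args(config: Any | None, config_path: str | Path | None) -> None:
    """Reject the documented invalid combination of `config` and `config_path`.

    Hypothetical helper; the real exception type may differ.
    """
    if config is not None and config_path is not None:
        raise ValueError("Pass either `config` or `config_path`, not both.")


def stored_path(source: str | Path, store_full_path: bool) -> str:
    """Return the full path, or only the file name, as a string."""
    path = Path(source)
    return str(path) if store_full_path else path.name


print(stored_path("reports/2024/summary.pdf", store_full_path=False))  # summary.pdf
```

Keeping only the file name is often the safer default when Documents are shared, because local directory layouts do not leak into metadata.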

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serialize this component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> KreuzbergConverter
```

Deserialize this component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>KreuzbergConverter</code> – Deserialized component.
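
The round trip that `to_dict` and `from_dict` are meant to support can be illustrated with a small, library-free sketch. `ToyConverter` is a hypothetical stand-in, and the `type` / `init_parameters` layout follows Haystack's general component serialization convention. The exact dictionary produced by `KreuzbergConverter`, in particular how an `ExtractionConfig` is encoded, is not described in this reference and is not assumed here.

```python
from __future__ import annotations

from typing import Any


class ToyConverter:
    """Hypothetical stand-in for a component with simple init parameters."""

    def __init__(self, *, store_full_path: bool = False, batch: bool = True) -> None:
        self.store_full_path = store_full_path
        self.batch = batch

    def to_dict(self) -> dict[str, Any]:
        # Record the class path and the arguments needed to rebuild the object.
        return {
            "type": f"{type(self).__module__}.{type(self).__name__}",
            "init_parameters": {
                "store_full_path": self.store_full_path,
                "batch": self.batch,
            },
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "ToyConverter":
        # Rebuild the object from the stored init parameters.
        return cls(**data["init_parameters"])


original = ToyConverter(store_full_path=True, batch=False)
restored = ToyConverter.from_dict(original.to_dict())
assert restored.to_dict() == original.to_dict()
```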

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, list[Document]]
```

Convert files to Documents using Kreuzberg.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths, directory paths, or ByteStream objects to
  convert. Directory paths are expanded to their direct file children
  (non-recursive, sorted alphabetically).
- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents.
  This value can be either a list of dictionaries or a single
  dictionary. If it's a single dictionary, its content is added to
  the metadata of all produced Documents. If it's a list, the length
  of the list must match the number of sources, because the two
  lists will be zipped. If `sources` contains ByteStream objects,
  their `meta` will be added to the output Documents.

  **Note:** When directories are present in `sources`, `meta` must
  be a single dictionary (not a list), since the number of files in
  a directory is not known in advance.
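
The directory rule for `sources` (direct file children only, non-recursive, sorted alphabetically) can be sketched with the standard library. `expand_sources` is a hypothetical helper written for illustration; sorting by file name is one reading of "sorted alphabetically", and ByteStream inputs are left out so the sketch needs no third-party imports.

```python
from __future__ import annotations

from pathlib import Path


def expand_sources(sources: list[str | Path]) -> list[Path]:
    """Expand directories to their direct file children, sorted by name.

    Hypothetical helper illustrating the documented rule, not the
    component's implementation.
    """
    expanded: list[Path] = []
    for source in sources:
        path = Path(source)
        if path.is_dir():
            # Non-recursive: keep only files directly inside the directory.
            children = [child for child in path.iterdir() if child.is_file()]
            expanded.extend(sorted(children, key=lambda child: child.name))
        else:
            # Plain file paths are passed through unchanged.
            expanded.append(path)
    return expanded
```

Because one directory entry can turn into any number of files, the expanded list can be longer than `sources`, which is why a per-source `meta` list cannot be matched up once directories are involved.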

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following key:

  - `documents`: A list of created Documents.
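
The `meta` rules described for `run` (a single dictionary is applied to every Document, a list must match the sources one to one, and a list is not allowed when directories are present) can be sketched as follows. `normalize_meta` is a hypothetical helper and the `ValueError` type is an assumption; the component's own validation may differ.

```python
from __future__ import annotations

from typing import Any


def normalize_meta(
    meta: dict[str, Any] | list[dict[str, Any]] | None,
    num_sources: int,
    has_directories: bool = False,
) -> list[dict[str, Any]]:
    """Return one metadata dict per source, following the documented rules.

    Hypothetical helper; error types are assumptions.
    """
    if meta is None:
        return [{} for _ in range(num_sources)]
    if isinstance(meta, dict):
        # A single dict is copied onto every produced Document.
        return [dict(meta) for _ in range(num_sources)]
    if has_directories:
        raise ValueError("`meta` must be a single dict when `sources` contains directories.")
    if len(meta) != num_sources:
        raise ValueError("The `meta` list must have one entry per source.")
    return [dict(entry) for entry in meta]
```

For example, `normalize_meta({"lang": "en"}, 2)` yields two independent copies of the same dictionary, so changing one Document's metadata later does not affect the other.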
