| 1 | +--- |
| 2 | +title: "Kreuzberg" |
| 3 | +id: integrations-kreuzberg |
| 4 | +description: "Kreuzberg integration for Haystack" |
| 5 | +slug: "/integrations-kreuzberg" |
| 6 | +--- |
| 7 | + |
| 8 | + |
| 9 | +## haystack_integrations.components.converters.kreuzberg.converter |
| 10 | + |
| 11 | +### KreuzbergConverter |
| 12 | + |
| 13 | +Converts files to Documents using [Kreuzberg](https://docs.kreuzberg.dev/). |
| 14 | + |
| 15 | +Kreuzberg is a document intelligence framework that extracts text from |
| 16 | +PDFs, Office documents, images, and 75+ other formats. All processing |
| 17 | +is performed locally with no external API calls. |
| 18 | + |
| 19 | +**Usage Example:** |
| 20 | + |
| 21 | +```python |
| 22 | +from haystack_integrations.components.converters.kreuzberg import ( |
| 23 | + KreuzbergConverter, |
| 24 | +) |
| 25 | + |
| 26 | +converter = KreuzbergConverter() |
| 27 | +result = converter.run(sources=["document.pdf", "report.docx"]) |
| 28 | +documents = result["documents"] |
| 29 | +``` |
| 30 | + |
| 31 | +You can also pass kreuzberg's `ExtractionConfig` to customize extraction: |
| 32 | + |
| 33 | +```python |
| 34 | +from kreuzberg import ExtractionConfig, OcrConfig |
| 35 | + |
| 36 | +converter = KreuzbergConverter( |
| 37 | + config=ExtractionConfig( |
| 38 | + output_format="markdown", |
| 39 | + ocr=OcrConfig(backend="tesseract", language="eng"), |
| 40 | + ), |
| 41 | +) |
| 42 | +``` |
| 43 | + |
| 44 | +**Token reduction** can be configured via |
| 45 | +`ExtractionConfig(token_reduction=TokenReductionConfig(mode="moderate"))` |
| 46 | +to reduce output size for LLM consumption. Five levels are available: |
| 47 | +`"off"`, `"light"`, `"moderate"`, `"aggressive"`, `"maximum"`. |
| 48 | +The reduced text appears directly in `Document.content`. |
| 49 | + |
| 50 | +**Image preprocessing for OCR** can be tuned via |
| 51 | +`OcrConfig(tesseract_config=TesseractConfig(preprocessing=ImagePreprocessingConfig(...)))` |
| 52 | +with options for target DPI, auto-rotate, deskew, denoise, |
| 53 | +contrast enhancement, and binarization method. |
| 54 | + |
| 55 | +#### __init__ |
| 56 | + |
| 57 | +```python |
| 58 | +__init__( |
| 59 | + *, |
| 60 | + config: ExtractionConfig | None = None, |
| 61 | + config_path: str | Path | None = None, |
| 62 | + store_full_path: bool = False, |
| 63 | + batch: bool = True, |
| 64 | + easyocr_kwargs: dict[str, Any] | None = None |
| 65 | +) -> None |
| 66 | +``` |
| 67 | + |
| 68 | +Create a `KreuzbergConverter` component. |
| 69 | + |
| 70 | +**Parameters:** |
| 71 | + |
| 72 | +- **config** (<code>ExtractionConfig | None</code>) – An optional `kreuzberg.ExtractionConfig` object to customize |
| 73 | + extraction behavior. Use this to set output format, OCR backend |
| 74 | + and language, force-OCR mode, per-page extraction, chunking, |
| 75 | + keyword extraction, and other kreuzberg options. If not provided, |
| 76 | + kreuzberg's defaults are used. |
| 77 | + See the [kreuzberg API reference](https://docs.kreuzberg.dev/reference/api-python/) |
| 78 | + for the full list of configuration options. |
| 79 | +- **config_path** (<code>str | Path | None</code>) – Path to a kreuzberg configuration file (`.toml`, `.yaml`, or |
| 80 | + `.json`). Cannot be used together with `config`. |
| 81 | +- **store_full_path** (<code>bool</code>) – If `True`, the full file path is stored in the Document metadata. |
| 82 | + If `False`, only the file name is stored. |
| 83 | +- **batch** (<code>bool</code>) – If `True`, use kreuzberg's batch extraction APIs, which leverage |
| 84 | + Rust's rayon thread pool for parallel processing. If `False`, |
| 85 | + sources are extracted one at a time. |
| 86 | +- **easyocr_kwargs** (<code>dict\[str, Any\] | None</code>) – Optional keyword arguments to pass to EasyOCR when using the |
| 87 | + `"easyocr"` backend. Supports GPU, beam width, model storage, |
| 88 | + and other EasyOCR-specific options. |
| 89 | + See the [EasyOCR documentation](https://www.jaided.ai/easyocr/documentation/) |
| 90 | + for the full list of supported arguments. |
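The rule that `config` and `config_path` cannot be combined can be pictured as a simple guard at construction time. The sketch below is illustrative only: `check_config_args` is a hypothetical helper, and the exception type and message are assumptions, not taken from the integration's source.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any


def check_config_args(config: Any | None, config_path: str | Path | None) -> None:
    """Reject calls that supply both an in-memory config and a config file."""
    if config is not None and config_path is not None:
        # Assumption: the real component may raise a different exception
        # type or use a different message.
        raise ValueError("Pass either 'config' or 'config_path', not both.")
```

Passing neither argument is valid; per the parameter description above, kreuzberg's defaults are used in that case.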
| 91 | + |
| 92 | +#### to_dict |
| 93 | + |
| 94 | +```python |
| 95 | +to_dict() -> dict[str, Any] |
| 96 | +``` |
| 97 | + |
| 98 | +Serialize this component to a dictionary. |
| 99 | + |
| 100 | +**Returns:** |
| 101 | + |
| 102 | +- <code>dict\[str, Any\]</code> – Dictionary with serialized data. |
| 103 | + |
| 104 | +#### from_dict |
| 105 | + |
| 106 | +```python |
| 107 | +from_dict(data: dict[str, Any]) -> KreuzbergConverter |
| 108 | +``` |
| 109 | + |
| 110 | +Deserialize this component from a dictionary. |
| 111 | + |
| 112 | +**Parameters:** |
| 113 | + |
| 114 | +- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from. |
| 115 | + |
| 116 | +**Returns:** |
| 117 | + |
| 118 | +- <code>KreuzbergConverter</code> – Deserialized component. |
| 119 | + |
| 120 | +#### run |
| 121 | + |
| 122 | +```python |
| 123 | +run( |
| 124 | + sources: list[str | Path | ByteStream], |
| 125 | + meta: dict[str, Any] | list[dict[str, Any]] | None = None, |
| 126 | +) -> dict[str, list[Document]] |
| 127 | +``` |
| 128 | + |
| 129 | +Convert files to Documents using Kreuzberg. |
| 130 | + |
| 131 | +**Parameters:** |
| 132 | + |
| 133 | +- **sources** (<code>list\[str | Path | ByteStream\]</code>) – List of file paths, directory paths, or ByteStream objects to |
| 134 | + convert. Directory paths are expanded to their direct file children |
| 135 | + (non-recursive, sorted alphabetically). |
| 136 | +- **meta** (<code>dict\[str, Any\] | list\[dict\[str, Any\]\] | None</code>) – Optional metadata to attach to the Documents. |
| 137 | + This value can be either a list of dictionaries or a single |
| 138 | + dictionary. If it's a single dictionary, its content is added to |
| 139 | + the metadata of all produced Documents. If it's a list, the length |
| 140 | + of the list must match the number of sources, because the two |
| 141 | + lists will be zipped. If `sources` contains ByteStream objects, |
| 142 | + their `meta` will be added to the output Documents. |
| 143 | + |
| 144 | +**Note:** When directories are present in `sources`, `meta` must |
| 145 | +be a single dictionary (not a list), since the number of files in |
| 146 | +a directory is not known in advance. |
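The interaction between directory expansion and `meta` described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the converter's actual code: `expand_sources` and `normalize_meta` are hypothetical helper names, ByteStream sources are left out, and the error type is a guess.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any


def expand_sources(sources: list[str | Path]) -> list[Path]:
    """Expand directories to their direct file children, sorted by path."""
    expanded: list[Path] = []
    for source in sources:
        path = Path(source)
        if path.is_dir():
            # Non-recursive: subdirectories are skipped, not descended into.
            expanded.extend(sorted(p for p in path.iterdir() if p.is_file()))
        else:
            expanded.append(path)
    return expanded


def normalize_meta(
    sources: list[str | Path],
    meta: dict[str, Any] | list[dict[str, Any]] | None,
) -> list[dict[str, Any]]:
    """Return one metadata dict per expanded source."""
    expanded = expand_sources(sources)
    if meta is None:
        return [{} for _ in expanded]
    if isinstance(meta, dict):
        # A single dict is copied onto every produced Document.
        return [dict(meta) for _ in expanded]
    if any(Path(s).is_dir() for s in sources):
        # Assumption: the real component may raise a different exception type.
        raise ValueError("meta must be a single dict when sources contain directories")
    if len(meta) != len(sources):
        raise ValueError("meta list length must match the number of sources")
    return [dict(m) for m in meta]
```

With a single dictionary, every file found in a directory receives the same metadata, which is why a per-source list cannot be lined up against a directory whose file count is unknown in advance.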
| 147 | + |
| 148 | +**Returns:** |
| 149 | + |
| 150 | +- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following key: |
| 151 | + |
| 152 | +- `documents`: A list of created Documents. |