7 | 7 |
8 | 8 | --- |
9 | 9 |
10 | | -## Overview |
11 | | - |
12 | | -A [Haystack](https://haystack.deepset.ai/) integration for [Kreuzberg](https://docs.kreuzberg.dev/), a document intelligence framework that extracts text from PDFs, Office documents, images, and 75+ other file formats. All processing is performed locally — no external API calls. |
13 | | - |
14 | | -## Installation |
15 | | - |
16 | | -```console |
17 | | -pip install kreuzberg-haystack |
18 | | -``` |
19 | | - |
20 | | -Kreuzberg requires system OCR libraries for image-based extraction. See the [Kreuzberg installation docs](https://docs.kreuzberg.dev/) for platform-specific setup (e.g. Tesseract, EasyOCR). |
21 | | - |
22 | | -## Quick Start |
23 | | - |
24 | | -```python |
25 | | -from haystack_integrations.components.converters.kreuzberg import KreuzbergConverter |
26 | | - |
27 | | -converter = KreuzbergConverter() |
28 | | -result = converter.run(sources=["document.pdf", "report.docx"]) |
29 | | - |
30 | | -for doc in result["documents"]: |
31 | | - print(doc.content[:200]) |
32 | | -``` |
33 | | - |
34 | | -## Features |
35 | | - |
36 | | -- **75+ file formats** — PDF, DOCX, PPTX, XLSX, HTML, images, email, ebooks, and more |
37 | | -- **Local processing** — No external API calls; fully offline capable |
38 | | -- **Batch extraction** — Parallel processing via Rust rayon thread pool |
39 | | -- **Per-page splitting** — One Document per page for fine-grained retrieval |
40 | | -- **Built-in chunking** — Server-side chunking via kreuzberg's `ChunkingConfig` |
41 | | -- **Multiple output formats** — Plain text, Markdown, or HTML |
42 | | -- **OCR backends** — Tesseract and EasyOCR with configurable language and preprocessing |
43 | | -- **Token reduction** — Reduce output size for LLM consumption (5 levels) |
44 | | -- **Rich metadata** — Tables, images, annotations, keywords, quality scores, detected languages |
45 | | -- **Configuration files** — Load settings from TOML, YAML, or JSON files |
46 | | - |
47 | | -## Usage Examples |
48 | | - |
49 | | -### Basic Conversion |
50 | | - |
51 | | -```python |
52 | | -from haystack_integrations.components.converters.kreuzberg import KreuzbergConverter |
53 | | - |
54 | | -converter = KreuzbergConverter() |
55 | | -result = converter.run(sources=["report.pdf"]) |
56 | | -documents = result["documents"] |
57 | | -``` |
58 | | - |
59 | | -### Markdown Output with OCR |
60 | | - |
61 | | -```python |
62 | | -from kreuzberg import ExtractionConfig, OcrConfig |
63 | | - |
64 | | -converter = KreuzbergConverter( |
65 | | - config=ExtractionConfig( |
66 | | - output_format="markdown", |
67 | | - ocr=OcrConfig(backend="tesseract", language="eng"), |
68 | | - ), |
69 | | -) |
70 | | -result = converter.run(sources=["scanned_document.pdf"]) |
71 | | -``` |
72 | | - |
73 | | -### Per-Page Extraction |
74 | | - |
75 | | -```python |
76 | | -from kreuzberg import ExtractionConfig, PageConfig |
77 | | - |
78 | | -converter = KreuzbergConverter( |
79 | | - config=ExtractionConfig( |
80 | | - pages=PageConfig(extract_pages=True), |
81 | | - ), |
82 | | -) |
83 | | -result = converter.run(sources=["multi_page.pdf"]) |
84 | | -# One Document per page, with page_number in metadata |
85 | | -``` |
86 | | - |
87 | | -### Token Reduction for LLMs |
88 | | - |
89 | | -```python |
90 | | -from kreuzberg import ExtractionConfig, TokenReductionConfig |
91 | | - |
92 | | -converter = KreuzbergConverter( |
93 | | - config=ExtractionConfig( |
94 | | - token_reduction=TokenReductionConfig(mode="moderate"), |
95 | | - ), |
96 | | -) |
97 | | -``` |
98 | | - |
99 | | -### Directory Input |
100 | | - |
101 | | -```python |
102 | | -converter = KreuzbergConverter() |
103 | | -# Expands to all files in the directory (non-recursive, sorted) |
104 | | -result = converter.run(sources=["./documents/"]) |
105 | | -``` |
106 | | - |
107 | | -### In a Haystack Pipeline |
108 | | - |
109 | | -```python |
110 | | -from haystack import Pipeline |
111 | | -from haystack.components.preprocessors import DocumentCleaner |
112 | | -from haystack.components.writers import DocumentWriter |
113 | | -from haystack.document_stores.in_memory import InMemoryDocumentStore |
114 | | -from haystack_integrations.components.converters.kreuzberg import KreuzbergConverter |
115 | | - |
116 | | -document_store = InMemoryDocumentStore() |
117 | | - |
118 | | -pipeline = Pipeline() |
119 | | -pipeline.add_component("converter", KreuzbergConverter()) |
120 | | -pipeline.add_component("cleaner", DocumentCleaner()) |
121 | | -pipeline.add_component("writer", DocumentWriter(document_store=document_store)) |
122 | | - |
123 | | -pipeline.connect("converter.documents", "cleaner") |
124 | | -pipeline.connect("cleaner", "writer") |
125 | | - |
126 | | -pipeline.run({"converter": {"sources": ["report.pdf", "notes.docx"]}}) |
127 | | -``` |
128 | | - |
129 | | -### Configuration File |
130 | | - |
131 | | -```python |
132 | | -converter = KreuzbergConverter(config_path="kreuzberg.toml") |
133 | | -``` |
134 | | - |
135 | | -Where `kreuzberg.toml` might contain: |
136 | | - |
137 | | -```toml |
138 | | -output_format = "markdown" |
139 | | - |
140 | | -[ocr] |
141 | | -backend = "tesseract" |
142 | | -language = "eng+deu" |
143 | | -``` |
144 | | - |
145 | | -## API Reference |
146 | | - |
147 | | -### `KreuzbergConverter.__init__` |
148 | | - |
149 | | -| Parameter | Type | Default | Description | |
150 | | -|---|---|---|---| |
151 | | -| `config` | `ExtractionConfig \| None` | `None` | Kreuzberg extraction configuration object. Controls output format, OCR, chunking, keyword extraction, and more. | |
152 | | -| `config_path` | `str \| Path \| None` | `None` | Path to a kreuzberg config file (TOML/YAML/JSON). When both `config` and `config_path` are given, `config` takes precedence. | |
153 | | -| `store_full_path` | `bool` | `False` | If `True`, store full file paths in metadata. If `False`, store only the file name. | |
154 | | -| `batch` | `bool` | `True` | Use kreuzberg's batch APIs for parallel extraction via Rust rayon thread pool. | |
155 | | -| `easyocr_kwargs` | `dict \| None` | `None` | Extra keyword arguments for EasyOCR (GPU, beam width, model storage, etc.). | |
156 | | - |
157 | | -### `KreuzbergConverter.run` |
158 | | - |
159 | | -| Parameter | Type | Default | Description | |
160 | | -|---|---|---|---| |
161 | | -| `sources` | `list[str \| Path \| ByteStream]` | *(required)* | File paths, directory paths, or ByteStream objects to convert. | |
162 | | -| `meta` | `dict \| list[dict] \| None` | `None` | Metadata to attach to Documents. A single dict applies to all; a list is zipped with sources. | |
163 | | - |
164 | | -**Returns** a dict with: |
165 | | -- `documents` — `list[Document]`: Converted documents with content and metadata. |
166 | | -- `raw_extraction` — `list[dict]`: Serialized kreuzberg `ExtractionResult` for each source (useful for debugging). |
167 | | - |
168 | | -## Metadata Fields |
169 | | - |
170 | | -Each Document's `meta` dict may include the following fields (depending on source format and configuration): |
171 | | - |
172 | | -| Field | Type | Description | |
173 | | -|---|---|---| |
174 | | -| `file_path` | `str` | Source file name (or full path if `store_full_path=True`) | |
175 | | -| `mime_type` | `str` | Detected MIME type of the source | |
176 | | -| `file_extensions` | `list[str]` | Known extensions for the MIME type | |
177 | | -| `output_format` | `str` | Output format used (plain/markdown/html) | |
178 | | -| `result_format` | `str` | Result format from kreuzberg | |
179 | | -| `quality_score` | `float` | Extraction quality score (0.0–1.0) | |
180 | | -| `detected_languages` | `list[str]` | Languages detected in the content | |
181 | | -| `extracted_keywords` | `list[dict]` | Keywords with text, score, and algorithm | |
182 | | -| `table_count` | `int` | Number of tables extracted | |
183 | | -| `tables` | `list[dict]` | Table data (cells, markdown, page_number) | |
184 | | -| `image_count` | `int` | Number of images found | |
185 | | -| `images` | `list[dict]` | Image metadata (format, dimensions, page, description) | |
186 | | -| `annotations` | `list[dict]` | PDF annotations (type, content, page_number) | |
187 | | -| `processing_warnings` | `list[dict]` | Warnings from extraction (source, message) | |
188 | | -| `page_number` | `int` | Page number (per-page mode only) | |
189 | | -| `is_blank` | `bool` | Whether the page is blank (per-page mode only) | |
190 | | -| `chunk_index` | `int` | Chunk index (chunking mode only) | |
191 | | -| `total_chunks` | `int` | Total chunks (chunking mode only) | |
192 | | - |
193 | | -Format-specific metadata from kreuzberg (e.g. PDF title, author, page count) is also flattened into `meta`. |
194 | | - |
195 | | -## Supported Formats |
196 | | - |
197 | | -Kreuzberg supports 75+ file formats. You can query the available extractors at runtime: |
198 | | - |
199 | | -```python |
200 | | -KreuzbergConverter.supported_extractors() |
201 | | -KreuzbergConverter.supported_ocr_backends() |
202 | | -``` |
203 | | - |
204 | | -Common supported formats include: |
205 | | - |
206 | | -| Category | Formats | |
207 | | -|---|---| |
208 | | -| Documents | PDF, DOCX, DOC, ODT, RTF, EPUB | |
209 | | -| Spreadsheets | XLSX, XLS, ODS, CSV, TSV | |
210 | | -| Presentations | PPTX, PPT, ODP | |
211 | | -| Images | PNG, JPEG, TIFF, BMP, WebP (via OCR) | |
212 | | -| Web | HTML, XHTML, XML, Markdown | |
213 | | -| Email | EML, MSG | |
214 | | -| Code | Plain text, source code files | |
215 | | -| Archives | Extracts from contained documents | |
216 | | - |
217 | 10 | ## Contributing |
218 | 11 |
219 | 12 | Refer to the general [Contribution Guidelines](https://github.com/deepset-ai/haystack-core-integrations/blob/main/CONTRIBUTING.md). |
220 | 13 |
221 | | -To run tests locally: |
222 | | - |
223 | | -```console |
224 | | -# Install the integration in development mode |
225 | | -pip install -e ".[dev]" |
226 | | - |
227 | | -# Run unit tests |
228 | | -pytest tests/ |
229 | | -``` |
230 | | - |
231 | 14 | No external services or API keys are required — kreuzberg processes everything locally. For OCR tests, ensure Tesseract is installed on your system. |
232 | | - |
233 | | -## License |
234 | | - |
235 | | -`kreuzberg-haystack` is distributed under the terms of the [Apache-2.0](https://spdx.org/licenses/Apache-2.0.html) license. |
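
The `sources` and `meta` semantics described in the removed README text (a directory expands to its files, non-recursive and sorted; a single `meta` dict applies to every source, while a list is paired with the sources one by one) can be sketched in plain Python. The helper names `expand_sources` and `normalize_meta` are hypothetical and are not part of `kreuzberg-haystack`. Raising on a length mismatch is also an assumption, and the README does not say how the integration combines directory expansion with a `meta` list, so the two steps are kept separate here.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any


def expand_sources(sources: list[str | Path]) -> list[Path]:
    """Replace each directory with its direct child files, sorted by path.

    Hypothetical helper: mirrors the "non-recursive, sorted" wording only.
    """
    expanded: list[Path] = []
    for source in sources:
        path = Path(source)
        if path.is_dir():
            # iterdir() does not descend into subdirectories.
            expanded.extend(sorted(p for p in path.iterdir() if p.is_file()))
        else:
            expanded.append(path)
    return expanded


def normalize_meta(
    meta: dict[str, Any] | list[dict[str, Any]] | None,
    count: int,
) -> list[dict[str, Any]]:
    """Return one metadata dict per source.

    Hypothetical helper: None gives empty dicts, a single dict is copied
    for every source, and a list is matched to the sources by position.
    """
    if meta is None:
        return [{} for _ in range(count)]
    if isinstance(meta, dict):
        # Copy so that later mutation of one entry does not affect the others.
        return [dict(meta) for _ in range(count)]
    if len(meta) != count:
        # Assumption: a mismatched list is treated as an error.
        raise ValueError(f"expected {count} meta entries, got {len(meta)}")
    return [dict(entry) for entry in meta]
```

Copying the single dict per source, instead of reusing one object, keeps per-document metadata independent if a later step adds fields such as a page number.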