Commit 1a9aef7

Add Kreuzberg integration page (#425)
* Add Kreuzberg integration page
* Add Kreuzberg logo
1 parent 4333403 commit 1a9aef7

2 files changed

Lines changed: 118 additions & 0 deletions

integrations/kreuzberg.md

@@ -0,0 +1,118 @@
---
layout: integration
name: Kreuzberg
description: Locally convert 91+ document formats into Haystack Documents using Kreuzberg's Rust-core engine
authors:
  - name: deepset
    socials:
      github: deepset-ai
      twitter: deepset_ai
      linkedin: https://www.linkedin.com/company/deepset-ai/
pypi: https://pypi.org/project/kreuzberg-haystack
repo: https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/kreuzberg
type: Data Ingestion
report_issue: https://github.com/deepset-ai/haystack-core-integrations/issues
logo: /logos/kreuzberg.png
version: Haystack 2.0
toc: true
---

### **Table of Contents**
- [Overview](#overview)
- [Installation](#installation)
- [Usage](#usage)
- [Additional Features](#additional-features)
- [License](#license)

## Overview

[Kreuzberg](https://docs.kreuzberg.dev/) is a document intelligence framework with a Rust core that extracts text, tables, and metadata from 91+ file formats. Extraction runs entirely locally, with no external API calls.

This integration provides `KreuzbergConverter`, a Haystack component that converts files into Haystack `Document` objects with rich metadata. It supports parallel batch extraction using Rust's rayon thread pool for high throughput.

**Supported format categories:**
- **Documents**: PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, ODT, ODS, ODP, RTF, Pages, Keynote, Numbers, and more
- **Images (via OCR)**: PNG, JPEG, TIFF, GIF, BMP, WebP, JPEG 2000, SVG
- **Text/Markup**: Markdown, HTML, XML, LaTeX, Typst, JSON, YAML, reStructuredText, Jupyter notebooks
- **Email**: EML, MSG (with attachment extraction)
- **Archives**: ZIP, TAR, GZIP, 7Z (extracts and processes contents recursively)
- **eBooks & Academic**: EPUB, BibTeX, DocBook, JATS

## Installation

```bash
pip install kreuzberg-haystack
```

## Usage

### Components

This integration introduces one component:

- **`KreuzbergConverter`**: Converts files and directories into Haystack `Document` objects. Accepts file paths, directory paths, and `ByteStream` objects as input.

### Basic Usage

```python
from haystack_integrations.components.converters.kreuzberg import KreuzbergConverter

converter = KreuzbergConverter()
result = converter.run(sources=["report.pdf", "notes.docx"])
documents = result["documents"]
```
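
To see what the converter produced, it can help to print a short summary of each returned document. The sketch below relies only on the `content` and `meta` attributes that Haystack `Document` objects expose. `StubDocument`, `summarize`, and the sample metadata are hypothetical stand-ins so the snippet runs without any files on disk; the metadata keys Kreuzberg actually emits may differ.

```python
from dataclasses import dataclass, field


@dataclass
class StubDocument:
    """Hypothetical stand-in exposing the same two attributes as a Haystack Document."""

    content: str
    meta: dict = field(default_factory=dict)


def summarize(documents, preview_chars=40):
    """Return one line per document: character count, sorted meta keys, content preview."""
    lines = []
    for index, doc in enumerate(documents):
        text = doc.content or ""
        preview = text[:preview_chars].replace("\n", " ")
        keys = ", ".join(sorted(doc.meta))
        lines.append(f"[{index}] {len(text)} chars | meta: {keys} | {preview}")
    return lines


# Sample data; the metadata keys here are invented for illustration.
sample = [
    StubDocument(content="Quarterly report\nRevenue grew.", meta={"file_path": "report.pdf"}),
    StubDocument(content="Meeting notes", meta={"file_path": "notes.docx", "page_count": 2}),
]

for line in summarize(sample):
    print(line)
```

With real output, pass `result["documents"]` from the converter instead of `sample`.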

### Markdown Output with OCR

Use `ExtractionConfig` to customize output format, OCR backend, and other extraction settings:

```python
from haystack_integrations.components.converters.kreuzberg import KreuzbergConverter
from kreuzberg import ExtractionConfig, OcrConfig

converter = KreuzbergConverter(
    config=ExtractionConfig(
        output_format="markdown",
        ocr=OcrConfig(backend="tesseract", language="eng"),
    ),
)
result = converter.run(sources=["scanned_document.pdf"])
documents = result["documents"]
```

### In a Pipeline

```python
from haystack import Pipeline
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.converters.kreuzberg import KreuzbergConverter

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component("converter", KreuzbergConverter())
pipeline.add_component("splitter", DocumentSplitter(split_by="sentence", split_length=5))
pipeline.add_component("writer", DocumentWriter(document_store=document_store))

pipeline.connect("converter", "splitter")
pipeline.connect("splitter", "writer")

pipeline.run({"converter": {"sources": ["report.pdf", "presentation.pptx"]}})
```

## Additional Features

- **Per-page extraction**: Create one `Document` per page using `PageConfig(extract_pages=True)`
- **Chunking**: Split documents by token count with configurable overlap via `ChunkingConfig`
- **Token reduction**: Reduce token count with modes from `"light"` to `"maximum"` via `TokenReductionConfig`
- **Rich metadata**: Quality scores, detected languages, extracted keywords, table data, and PDF annotations
- **Batch processing**: Parallel extraction enabled by default; set `batch=False` for sequential mode
- **Config from file**: Load extraction settings from a TOML, YAML, or JSON file via `config_path`
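
The chunking entry above mentions splitting by token count with configurable overlap. As a plain-Python illustration of that idea only, overlapping windows can be sketched as follows. This is not Kreuzberg's implementation: `chunk_with_overlap` is a hypothetical helper, and whitespace-separated words stand in for real tokens.

```python
def chunk_with_overlap(tokens, max_tokens, overlap):
    """Split a token list into windows of at most max_tokens items.

    Each window after the first repeats the last `overlap` tokens of the
    previous window, so context is carried across chunk boundaries.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    step = max_tokens - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


tokens = "a b c d e f g h".split()
chunks = chunk_with_overlap(tokens, max_tokens=4, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

A larger overlap keeps more shared context between neighbouring chunks at the cost of producing more chunks for the same input.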

For the full configuration reference and format support matrix, see the [Kreuzberg documentation](https://docs.kreuzberg.dev/).

## License

`kreuzberg-haystack` is distributed under the terms of the [Apache-2.0](https://spdx.org/licenses/Apache-2.0.html) license.

logos/kreuzberg.png

354 KB