Commit 1ad1493

fix: address PR review feedback from anakin87

- Replace Datadog nightly failure notification with Slack via deepset-ai/notify-slack-action
- Remove virtualenv version pin from hatch install (fixed upstream)
- Trim README.md to match standard integration format (chroma)
- Remove kreuzberg.md (auto-generated on release)
- Remove "unit" pytest marker (not used in this repo)

1 parent 35459d2 commit 1ad1493

4 files changed

Lines changed: 9 additions & 402 deletions

.github/workflows/kreuzberg.yml

Lines changed: 9 additions & 7 deletions
@@ -47,7 +47,7 @@ jobs:
           python-version: ${{ matrix.python-version }}
 
       - name: Install Hatch
-        run: pip install hatch "virtualenv<21.0.0"
+        run: pip install hatch
 
       - name: Lint
         if: matrix.python-version == '3.10' && runner.os == 'Linux'
@@ -69,10 +69,12 @@ jobs:
           hatch -e test env run -- uv pip install git+https://github.com/deepset-ai/haystack.git@main
           hatch run test:unit
 
-      - name: Send event to Datadog for nightly failures
-        if: failure() && github.event_name == 'schedule'
-        uses: ./.github/actions/send_failure
+
+  notify-slack-on-failure:
+    needs: run
+    if: failure() && github.event_name == 'schedule'
+    runs-on: ubuntu-slim
+    steps:
+      - uses: deepset-ai/notify-slack-action@v1
         with:
-          title: |
-            Core integrations nightly tests failure: ${{ github.workflow }}
-          api-key: ${{ secrets.CORE_DATADOG_API_KEY }}
+          slack-webhook-url: ${{ secrets.SLACK_WEBHOOK_URL_NOTIFICATIONS }}
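
The new job passes a webhook URL to deepset-ai/notify-slack-action. What that action does internally is not shown in this commit. As a rough illustration only, and assuming it ultimately posts to a Slack incoming webhook (which accepts a JSON body with a `text` field), an equivalent request could be built as below. The function name, message wording, and placeholder URLs are hypothetical.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, workflow: str, run_url: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for a Slack incoming webhook.

    Hypothetical helper: the message format is an assumption, not the
    format used by deepset-ai/notify-slack-action.
    """
    payload = {"text": f"Nightly tests failed: {workflow}\n{run_url}"}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder values; in CI the URL would come from a repository secret.
    req = build_slack_request(
        "https://hooks.slack.com/services/T000/B000/XXXX",
        "Test / kreuzberg",
        "https://github.com/example/repo/actions/runs/1",
    )
    print(req.get_method(), req.full_url)
    # To actually send the message: urllib.request.urlopen(req, timeout=10)
```

Building the request separately from sending it keeps the sketch free of network side effects, so the payload can be inspected without a real webhook.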

integrations/kreuzberg/README.md

Lines changed: 0 additions & 221 deletions
@@ -7,229 +7,8 @@
 
 ---
 
-## Overview
-
-A [Haystack](https://haystack.deepset.ai/) integration for [Kreuzberg](https://docs.kreuzberg.dev/), a document intelligence framework that extracts text from PDFs, Office documents, images, and 75+ other file formats. All processing is performed locally — no external API calls.
-
-## Installation
-
-```console
-pip install kreuzberg-haystack
-```
-
-Kreuzberg requires system OCR libraries for image-based extraction. See the [Kreuzberg installation docs](https://docs.kreuzberg.dev/) for platform-specific setup (e.g. Tesseract, EasyOCR).
-
-## Quick Start
-
-```python
-from haystack_integrations.components.converters.kreuzberg import KreuzbergConverter
-
-converter = KreuzbergConverter()
-result = converter.run(sources=["document.pdf", "report.docx"])
-
-for doc in result["documents"]:
-    print(doc.content[:200])
-```
-
-## Features
-
-- **75+ file formats** — PDF, DOCX, PPTX, XLSX, HTML, images, email, ebooks, and more
-- **Local processing** — No external API calls; fully offline capable
-- **Batch extraction** — Parallel processing via Rust rayon thread pool
-- **Per-page splitting** — One Document per page for fine-grained retrieval
-- **Built-in chunking** — Server-side chunking via kreuzberg's `ChunkingConfig`
-- **Multiple output formats** — Plain text, Markdown, or HTML
-- **OCR backends** — Tesseract and EasyOCR with configurable language and preprocessing
-- **Token reduction** — Reduce output size for LLM consumption (5 levels)
-- **Rich metadata** — Tables, images, annotations, keywords, quality scores, detected languages
-- **Configuration files** — Load settings from TOML, YAML, or JSON files
-
-## Usage Examples
-
-### Basic Conversion
-
-```python
-from haystack_integrations.components.converters.kreuzberg import KreuzbergConverter
-
-converter = KreuzbergConverter()
-result = converter.run(sources=["report.pdf"])
-documents = result["documents"]
-```
-
-### Markdown Output with OCR
-
-```python
-from kreuzberg import ExtractionConfig, OcrConfig
-
-converter = KreuzbergConverter(
-    config=ExtractionConfig(
-        output_format="markdown",
-        ocr=OcrConfig(backend="tesseract", language="eng"),
-    ),
-)
-result = converter.run(sources=["scanned_document.pdf"])
-```
-
-### Per-Page Extraction
-
-```python
-from kreuzberg import ExtractionConfig, PageConfig
-
-converter = KreuzbergConverter(
-    config=ExtractionConfig(
-        pages=PageConfig(extract_pages=True),
-    ),
-)
-result = converter.run(sources=["multi_page.pdf"])
-# One Document per page, with page_number in metadata
-```
-
-### Token Reduction for LLMs
-
-```python
-from kreuzberg import ExtractionConfig, TokenReductionConfig
-
-converter = KreuzbergConverter(
-    config=ExtractionConfig(
-        token_reduction=TokenReductionConfig(mode="moderate"),
-    ),
-)
-```
-
-### Directory Input
-
-```python
-converter = KreuzbergConverter()
-# Expands to all files in the directory (non-recursive, sorted)
-result = converter.run(sources=["./documents/"])
-```
-
-### In a Haystack Pipeline
-
-```python
-from haystack import Pipeline
-from haystack.components.preprocessors import DocumentCleaner
-from haystack.components.writers import DocumentWriter
-from haystack.document_stores.in_memory import InMemoryDocumentStore
-from haystack_integrations.components.converters.kreuzberg import KreuzbergConverter
-
-document_store = InMemoryDocumentStore()
-
-pipeline = Pipeline()
-pipeline.add_component("converter", KreuzbergConverter())
-pipeline.add_component("cleaner", DocumentCleaner())
-pipeline.add_component("writer", DocumentWriter(document_store=document_store))
-
-pipeline.connect("converter.documents", "cleaner")
-pipeline.connect("cleaner", "writer")
-
-pipeline.run({"converter": {"sources": ["report.pdf", "notes.docx"]}})
-```
-
-### Configuration File
-
-```python
-converter = KreuzbergConverter(config_path="kreuzberg.toml")
-```
-
-Where `kreuzberg.toml` might contain:
-
-```toml
-output_format = "markdown"
-
-[ocr]
-backend = "tesseract"
-language = "eng+deu"
-```
-
-## API Reference
-
-### `KreuzbergConverter.__init__`
-
-| Parameter | Type | Default | Description |
-|---|---|---|---|
-| `config` | `ExtractionConfig \| None` | `None` | Kreuzberg extraction configuration object. Controls output format, OCR, chunking, keyword extraction, and more. |
-| `config_path` | `str \| Path \| None` | `None` | Path to a kreuzberg config file (TOML/YAML/JSON). When both `config` and `config_path` are given, `config` takes precedence. |
-| `store_full_path` | `bool` | `False` | If `True`, store full file paths in metadata. If `False`, store only the file name. |
-| `batch` | `bool` | `True` | Use kreuzberg's batch APIs for parallel extraction via Rust rayon thread pool. |
-| `easyocr_kwargs` | `dict \| None` | `None` | Extra keyword arguments for EasyOCR (GPU, beam width, model storage, etc.). |
-
-### `KreuzbergConverter.run`
-
-| Parameter | Type | Default | Description |
-|---|---|---|---|
-| `sources` | `list[str \| Path \| ByteStream]` | *(required)* | File paths, directory paths, or ByteStream objects to convert. |
-| `meta` | `dict \| list[dict] \| None` | `None` | Metadata to attach to Documents. A single dict applies to all; a list is zipped with sources. |
-
-**Returns** a dict with:
-- `documents` `list[Document]`: Converted documents with content and metadata.
-- `raw_extraction` `list[dict]`: Serialized kreuzberg `ExtractionResult` for each source (useful for debugging).
-
-## Metadata Fields
-
-Each Document's `meta` dict may include the following fields (depending on source format and configuration):
-
-| Field | Type | Description |
-|---|---|---|
-| `file_path` | `str` | Source file name (or full path if `store_full_path=True`) |
-| `mime_type` | `str` | Detected MIME type of the source |
-| `file_extensions` | `list[str]` | Known extensions for the MIME type |
-| `output_format` | `str` | Output format used (plain/markdown/html) |
-| `result_format` | `str` | Result format from kreuzberg |
-| `quality_score` | `float` | Extraction quality score (0.0–1.0) |
-| `detected_languages` | `list[str]` | Languages detected in the content |
-| `extracted_keywords` | `list[dict]` | Keywords with text, score, and algorithm |
-| `table_count` | `int` | Number of tables extracted |
-| `tables` | `list[dict]` | Table data (cells, markdown, page_number) |
-| `image_count` | `int` | Number of images found |
-| `images` | `list[dict]` | Image metadata (format, dimensions, page, description) |
-| `annotations` | `list[dict]` | PDF annotations (type, content, page_number) |
-| `processing_warnings` | `list[dict]` | Warnings from extraction (source, message) |
-| `page_number` | `int` | Page number (per-page mode only) |
-| `is_blank` | `bool` | Whether the page is blank (per-page mode only) |
-| `chunk_index` | `int` | Chunk index (chunking mode only) |
-| `total_chunks` | `int` | Total chunks (chunking mode only) |
-
-Format-specific metadata from kreuzberg (e.g. PDF title, author, page count) is also flattened into `meta`.
-
-## Supported Formats
-
-Kreuzberg supports 75+ file formats. You can query the available extractors at runtime:
-
-```python
-KreuzbergConverter.supported_extractors()
-KreuzbergConverter.supported_ocr_backends()
-```
-
-Common supported formats include:
-
-| Category | Formats |
-|---|---|
-| Documents | PDF, DOCX, DOC, ODT, RTF, EPUB |
-| Spreadsheets | XLSX, XLS, ODS, CSV, TSV |
-| Presentations | PPTX, PPT, ODP |
-| Images | PNG, JPEG, TIFF, BMP, WebP (via OCR) |
-| Web | HTML, XHTML, XML, Markdown |
-| Email | EML, MSG |
-| Code | Plain text, source code files |
-| Archives | Extracts from contained documents |
-
 ## Contributing
 
 Refer to the general [Contribution Guidelines](https://github.com/deepset-ai/haystack-core-integrations/blob/main/CONTRIBUTING.md).
 
-To run tests locally:
-
-```console
-# Install the integration in development mode
-pip install -e ".[dev]"
-
-# Run unit tests
-pytest tests/
-```
-
 No external services or API keys are required — kreuzberg processes everything locally. For OCR tests, ensure Tesseract is installed on your system.
-
-## License
-
-`kreuzberg-haystack` is distributed under the terms of the [Apache-2.0](https://spdx.org/licenses/Apache-2.0.html) license.
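
The removed README described the `meta` argument of `KreuzbergConverter.run` as "a single dict applies to all; a list is zipped with sources". A minimal pure-Python sketch of that rule follows. The helper name `normalize_meta` is hypothetical, and raising on a length mismatch is an assumption, since the README does not say how mismatched lists are handled. This is not the integration's actual code.

```python
from __future__ import annotations

from typing import Any


def normalize_meta(
    meta: dict[str, Any] | list[dict[str, Any]] | None,
    n_sources: int,
) -> list[dict[str, Any]]:
    """Return one metadata dict per source (hypothetical helper)."""
    if meta is None:
        # No metadata supplied: every source gets an empty dict.
        return [{} for _ in range(n_sources)]
    if isinstance(meta, dict):
        # A single dict applies to all sources; copy it so later edits
        # to one Document's metadata do not leak into the others.
        return [dict(meta) for _ in range(n_sources)]
    if len(meta) != n_sources:
        # Assumption: a mismatched list is rejected rather than truncated.
        raise ValueError("meta list must have one entry per source")
    # A list is paired with the sources position by position.
    return [dict(m) for m in meta]


if __name__ == "__main__":
    sources = ["report.pdf", "notes.docx"]
    for src, m in zip(sources, normalize_meta({"project": "demo"}, len(sources))):
        print(src, m)
```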
