feat: add Kreuzberg document converter integration by v-tan · Pull Request #2927 · deepset-ai/haystack-core-integrations

v-tan · 2026-03-06T12:14:16Z

Related Issues

part of Add Kreuzberg document converter integration #2926

Proposed Changes:

Adds a KreuzbergConverter component under integrations/kreuzberg/.

Component design:

Implements the @component protocol with sources and meta inputs, documents and raw_extraction outputs
Delegates extraction to kreuzberg's sync APIs (extract_file_sync, batch_extract_files_sync) — async pipeline support is deferred
ByteStream inputs are written to a temp file before extraction (kreuzberg requires file paths)
Batch mode (default) uses kreuzberg's Rust rayon thread pool for parallel extraction; can be disabled for single-file fallback
Three extraction modes: whole-document (default), per-page (PageConfig), and chunked (ChunkingConfig) — each produces different Document granularity and metadata shape
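
The ByteStream handling described above can be sketched with the standard library alone. This is a minimal illustration, not the converter's actual code: `extract_from_bytes` and `read_text_stub` are hypothetical names, and the stub stands in for any extractor that only accepts file paths.

```python
import tempfile
from pathlib import Path
from typing import Callable


def extract_from_bytes(data: bytes, extract_path: Callable[[Path], str], suffix: str = "") -> str:
    """Write in-memory bytes to a temp file, run a path-only extractor, then clean up."""
    # delete=False so the file can be reopened by name on every platform (notably Windows).
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(data)
        tmp_path = Path(tmp.name)
    try:
        return extract_path(tmp_path)
    finally:
        # Remove the temp file even if extraction fails.
        tmp_path.unlink(missing_ok=True)


def read_text_stub(path: Path) -> str:
    # Hypothetical stand-in for a real extractor that only accepts file paths.
    return path.read_text(encoding="utf-8")
```

Passing a `suffix` keeps the file extension available to extractors that detect the format from the file name.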

Serialization:

ExtractionConfig is serialized via kreuzberg's config_to_json / config_merge utilities
config_path stored as POSIX string for cross-platform compatibility
easyocr_kwargs serialized as plain dict
Round-trip tested via to_dict / from_dict
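
As a rough sketch of the path handling and round-trip check listed above: `ConverterSketch` is a hypothetical stand-in (not the real component), and the file name is made up. It only illustrates storing a path via `as_posix()` and verifying that `to_dict` / `from_dict` are inverse operations.

```python
from pathlib import Path, PureWindowsPath
from typing import Any, Dict, Optional


class ConverterSketch:
    """Hypothetical stand-in for a component with a config_path init parameter."""

    def __init__(self, config_path: Optional[str] = None, batch: bool = True):
        self.config_path = Path(config_path) if config_path is not None else None
        self.batch = batch

    def to_dict(self) -> Dict[str, Any]:
        return {
            "type": "ConverterSketch",
            "init_parameters": {
                # as_posix() always emits forward slashes, even for Windows paths.
                "config_path": self.config_path.as_posix() if self.config_path is not None else None,
                "batch": self.batch,
            },
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ConverterSketch":
        return cls(**data["init_parameters"])


# A Windows-style path serializes with forward slashes.
print(PureWindowsPath(r"configs\extraction.toml").as_posix())  # configs/extraction.toml
```

A round-trip test then amounts to asserting `ConverterSketch.from_dict(c.to_dict()).to_dict() == c.to_dict()` for a configured instance `c`.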

Metadata handling:

Kreuzberg's ExtractionResult fields are flattened into Document.meta
Overlapping keys (quality_score, output_format, keywords) are taken from the result's top-level fields so they are not duplicated in Document.meta
User-provided meta is deep-copied to prevent mutation across documents
Format-specific metadata (e.g. PDF title, author, page count) is included when available
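
The metadata rules above could look roughly like the following. This is a sketch under assumptions: `build_meta` and its arguments are hypothetical, plain dicts stand in for kreuzberg's result types, and letting user meta take precedence on key collisions is an assumption rather than something stated in this PR.

```python
import copy
from typing import Any, Dict, Optional

# Keys that can appear both at the top level of a result and inside its metadata.
OVERLAP_KEYS = ("quality_score", "output_format", "keywords")


def build_meta(
    result_metadata: Dict[str, Any],
    top_level: Dict[str, Any],
    user_meta: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Merge extraction metadata and user meta into one flat dict for a document."""
    # Start from the format-specific metadata, minus keys owned by the top level.
    meta = {k: v for k, v in result_metadata.items() if k not in OVERLAP_KEYS}
    # Overlapping keys come from the top-level fields only, so each appears once.
    for key in OVERLAP_KEYS:
        if top_level.get(key) is not None:
            meta[key] = top_level[key]
    # Deep-copy user meta so nested values are never shared between documents.
    # Assumption: user-provided keys win on collisions.
    meta.update(copy.deepcopy(user_meta or {}))
    return meta
```

Without the deep copy, appending to a nested list in one document's meta would silently change every other document built from the same user dict.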

Files added:

integrations/kreuzberg/
├── src/haystack_integrations/components/converters/kreuzberg/
│   ├── __init__.py
│   └── converter.py          # ~800 lines (component + helpers)
├── tests/
│   ├── test_converter.py     # ~1200 lines (unit tests)
│   └── test_converter_integration.py  # ~600 lines (integration tests)
├── pyproject.toml
├── pydoc/config_docusaurus.yml
├── README.md
└── LICENSE.txt
.github/workflows/kreuzberg.yml   # CI: nightly + PR triggers
.github/labeler.yml               # label entry
README.md                         # inventory table entry

How did you test it?

hatch run test:unit — serialization round-trips, metadata extraction, error handling, edge cases (empty files, unsupported formats, blank pages), batch vs single-file modes
hatch run test:integration — end-to-end extraction with real PDF, DOCX, HTML, TXT fixtures; pipeline integration with DocumentCleaner and DocumentWriter
No external services required — kreuzberg processes locally. OCR tests require Tesseract installed.

Notes for the reviewer

converter.py is ~800 lines because it handles three extraction modes, metadata flattening from kreuzberg's rich result types, and serialization of kreuzberg's config objects — open to splitting if preferred
The raw_extraction output exposes serialized ExtractionResult for debugging without requiring kreuzberg as a dependency downstream
kreuzberg's error handling uses error codes (get_last_error_code, get_error_details) — the converter logs warnings and skips failed files rather than raising
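
The "log and skip" behaviour from the last note can be sketched as below. `convert_all` and the `extract` callable are hypothetical, and an exception is used as a stand-in for whatever failure signal the real extractor provides; no kreuzberg API is called here.

```python
import logging
from typing import Callable, Iterable, List, Tuple

logger = logging.getLogger(__name__)


def convert_all(sources: Iterable[str], extract: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Return (source, text) pairs, skipping sources whose extraction fails."""
    converted = []
    for source in sources:
        try:
            converted.append((source, extract(source)))
        except Exception as exc:  # broad on purpose: one bad file must not abort the run
            logger.warning("Could not convert %s, skipping it. Error: %s", source, exc)
    return converted
```

The trade-off is that callers only learn about failures from the logs, so the number of returned documents can be smaller than the number of sources.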

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.

CLAassistant · 2026-03-06T12:14:24Z

All committers have signed the CLA.

anakin87

Thank you for the contribution!

I left some initial comments. Feel free to address them.

I'll continue with the review later/in the next few days.

v-tan · 2026-03-12T08:50:55Z

Thank you, Anakin87! I'll handle the comments and update asap!

anakin87

I did a deeper review and left some comments.

Feel free to simplify the implementation where possible (it's quite a bit of code 🙂).

v-tan · 2026-03-16T09:24:05Z

Thank you for the review and your time, @anakin87 ! I understand the need for simplification.
I'll update the PR accordingly.

v-tan · 2026-03-18T21:27:50Z

Hi!
I simplified the converter:

Removed the raw_extraction_result export: I think users can serialize kreuzberg's ExtractionResult type directly.
Simplified metadata serialization a bit by no longer explicitly copying data from the result into it. Duplicate writes were removed too.
Removed tables from metadata. Tables for all formats are exposed within the content or result.metadata, depending on the requested output_format.
Unit tests went down from 79 to roughly 42-43, without losing coverage. Integration tests amount to the same number.
Some methods were inlined into other methods in converter.py, without losing readability.
Utils were reduced and moved to their own module.

anakin87 · 2026-03-19T10:30:14Z

@v-tan I left some final comments. Thank you for your work on this integration!

- Move FIXTURES_DIR constant to a `fixtures_dir` fixture in conftest.py
- Move _make_mock_result helper to a `make_mock_result` factory fixture
- Add default `converter` fixture for KreuzbergConverter()
- Update 19 tests to use converter fixture instead of inline construction
- Update 13 tests to use make_mock_result fixture instead of helper function

Kreuzberg's batch APIs return error results as ExtractionResult with empty metadata and None quality_score instead of raising exceptions. Previously these were silently passed through as valid Documents.

- Add _is_batch_error() to utils.py to detect error results
- Add _collect_batch_results() static method to filter errors with structured warning logs (classify_error + error_code_name)
- Fix LogRecord conflict: rename 'name' kwarg to 'code_name' in both batch and sequential error logging paths
- Add 3 unit tests for _is_batch_error detection logic
- Add 3 integration tests for corrupt file/bytestream filtering
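
A minimal sketch of the detection and filtering described in this commit, assuming the heuristic stated above (empty metadata and a None quality_score mark an error result). `ResultSketch` is a hypothetical stand-in, not kreuzberg's ExtractionResult, and the helper names only mirror the ones in the commit message.

```python
import logging
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ResultSketch:
    """Hypothetical stand-in for an extraction result; not kreuzberg's real type."""

    content: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    quality_score: Optional[float] = None


def is_batch_error(result: ResultSketch) -> bool:
    # Error results are assumed to carry empty metadata and no quality score.
    return not result.metadata and result.quality_score is None


def collect_batch_results(results: List[ResultSketch], logger: logging.Logger) -> List[ResultSketch]:
    """Keep successful results and log a warning for each error result."""
    kept = []
    for index, result in enumerate(results):
        if is_batch_error(result):
            # "code_name" rather than "name": stdlib logging rejects extra keys
            # that collide with LogRecord attributes such as "name".
            logger.warning("Skipping failed result %d", index, extra={"code_name": "UNKNOWN"})
            continue
        kept.append(result)
    return kept
```

With the standard library, `logger.warning("x", extra={"name": "y"})` raises `KeyError` because `name` is already an attribute of `LogRecord`. The rename to `code_name` presumably avoids the same kind of collision.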

anakin87

Thank you for all the work.

I tried the integration locally and it runs fast!

v-tan · 2026-03-20T11:31:42Z

Thank you! I enjoyed the review process as well.

v-tan requested a review from a team as a code owner March 6, 2026 12:14

v-tan requested review from anakin87 and removed request for a team March 6, 2026 12:14

github-actions Bot added the topic:CI and type:documentation labels Mar 6, 2026

v-tan mentioned this pull request Mar 6, 2026

Add Kreuzberg document converter integration #2926

Closed

anakin87 requested changes Mar 10, 2026

Comment thread .github/workflows/kreuzberg.yml Outdated

Comment thread .github/workflows/kreuzberg.yml Outdated

Comment thread integrations/kreuzberg/README.md

Comment thread integrations/kreuzberg/kreuzberg.md Outdated

Comment thread integrations/kreuzberg/pyproject.toml Outdated

anakin87 requested changes Mar 13, 2026

anakin87 reviewed Mar 17, 2026

Comment thread integrations/kreuzberg/src/haystack_integrations/components/converters/kreuzberg/converter.py Outdated

feat: add Kreuzberg document converter integration

63dd2e8

v-tan force-pushed the feat/kreuzberg-converter branch from 17464d6 to 63dd2e8 March 18, 2026 21:03

refactor: simplify from_dict() deserialization logic

5e7ea71

- Remove unnecessary defensive copy of data dict
- Remove intermediate init_params variable
- Access data["init_parameters"] directly since default_from_dict handles missing key errors

test: fix double-roundtrip assertion in serialization test

8143b2c

- Save config JSON before from_dict mutates data in place
- Aligns test with in-place mutation pattern used by all Haystack converters
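
To illustrate why the test has to save the config JSON first, here is a small sketch. `ConfigSketch` and the config keys are hypothetical; the only behaviour taken from the commit message is that from_dict mutates its input dict in place.

```python
import json
from typing import Any, Dict


class ConfigSketch:
    """Hypothetical component whose from_dict mutates its input dict in place."""

    def __init__(self, config: Dict[str, Any]):
        self.config = config

    def to_dict(self) -> Dict[str, Any]:
        # The nested config is stored as a JSON string.
        return {
            "type": "ConfigSketch",
            "init_parameters": {"config": json.dumps(self.config, sort_keys=True)},
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ConfigSketch":
        params = data["init_parameters"]
        # In-place mutation: the JSON string in the caller's dict becomes a dict.
        params["config"] = json.loads(params["config"])
        return cls(**params)


def roundtrip_is_stable(component: ConfigSketch) -> bool:
    data = component.to_dict()
    # Snapshot before from_dict mutates data; comparing against data afterwards
    # would compare a JSON string with an already-parsed dict.
    saved_json = data["init_parameters"]["config"]
    restored = ConfigSketch.from_dict(data)
    return restored.to_dict()["init_parameters"]["config"] == saved_json
```

The alternative would be a defensive copy inside from_dict, which the earlier refactor commit removed on purpose.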

v-tan requested a review from anakin87 March 18, 2026 21:33