
feat: add Kreuzberg document converter integration #2927

Merged
anakin87 merged 10 commits into deepset-ai:main from
kreuzberg-dev:feat/kreuzberg-converter
Mar 20, 2026

Conversation

@v-tan
Contributor

@v-tan v-tan commented Mar 6, 2026

Related Issues

Proposed Changes:

Adds a KreuzbergConverter component under integrations/kreuzberg/.

Component design:

  • Implements the @component protocol with sources and meta inputs, documents and raw_extraction outputs
  • Delegates extraction to kreuzberg's sync APIs (extract_file_sync, batch_extract_files_sync) — async pipeline support is deferred
  • ByteStream inputs are written to a temp file before extraction (kreuzberg requires file paths)
  • Batch mode (default) uses kreuzberg's Rust rayon thread pool for parallel extraction; can be disabled for single-file fallback
  • Three extraction modes: whole-document (default), per-page (PageConfig), and chunked (ChunkingConfig) — each produces different Document granularity and metadata shape
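For reviewers, the ByteStream handling described above can be sketched roughly as follows. This is an illustration of the temp-file pattern only, not the code in converter.py: `extract_from_bytes` and `read_text_file` are hypothetical names, and `read_text_file` stands in for a path-based extractor such as kreuzberg's `extract_file_sync`.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def extract_from_bytes(data: bytes, extract: Callable[[str], str], suffix: str = "") -> str:
    """Write in-memory bytes to a temp file, run a path-based extractor, clean up."""
    # mkstemp returns an open file descriptor plus the path; the suffix lets
    # extension-based format detection keep working.
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        # The handle is closed here, so the extractor can reopen the path on any platform.
        return extract(path)
    finally:
        # Remove the temp file whether extraction succeeded or failed.
        os.unlink(path)


def read_text_file(path: str) -> str:
    """Hypothetical stand-in for a real path-based extractor."""
    return Path(path).read_text(encoding="utf-8")


text = extract_from_bytes(b"hello", read_text_file, suffix=".txt")
```

Closing the handle before calling the extractor matters mostly on Windows, where an open temp file cannot always be reopened by path.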

Serialization:

  • ExtractionConfig is serialized via kreuzberg's config_to_json / config_merge utilities
  • config_path stored as POSIX string for cross-platform compatibility
  • easyocr_kwargs serialized as plain dict
  • Round-trip tested via to_dict / from_dict
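A minimal sketch of the config_path round trip, assuming a simplified component with only two init parameters. `ConverterSketch`, the `batch` parameter name, and the file name `kreuzberg.toml` are placeholders for illustration; the real component also serializes ExtractionConfig through kreuzberg's own utilities, which are not reproduced here.

```python
from pathlib import Path, PureWindowsPath
from typing import Any, Dict, Optional, Union


class ConverterSketch:
    """Hypothetical, stripped-down stand-in for the converter's serialization."""

    def __init__(self, config_path: Optional[Union[str, Path]] = None, batch: bool = True):
        self.config_path = Path(config_path) if config_path is not None else None
        self.batch = batch

    def to_dict(self) -> Dict[str, Any]:
        # as_posix() always uses forward slashes, so the serialized form does
        # not depend on the platform that produced it.
        path = self.config_path.as_posix() if self.config_path is not None else None
        return {
            "type": "ConverterSketch",
            "init_parameters": {"config_path": path, "batch": self.batch},
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ConverterSketch":
        return cls(**data["init_parameters"])


original = ConverterSketch(config_path=Path("configs") / "kreuzberg.toml")
serialized = original.to_dict()
restored = ConverterSketch.from_dict(serialized)

# A Windows-style path serializes to the same forward-slash form.
windows_form = PureWindowsPath(r"configs\kreuzberg.toml").as_posix()
```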

Metadata handling:

  • Kreuzberg's ExtractionResult fields are flattened into Document.meta
  • Overlapping keys (quality_score, output_format, keywords) are taken from the result's top-level fields to avoid duplication
  • User-provided meta is deep-copied to prevent mutation across documents
  • Format-specific metadata (e.g. PDF title, author, page count) is included when available
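The metadata rules above could look roughly like this. It is a sketch under assumptions: `build_meta` is a hypothetical helper, and the plain dicts stand in for kreuzberg's ExtractionResult fields, whose exact shape is not reproduced here.

```python
import copy
from typing import Any, Dict

# Keys that appear both in the nested metadata and as top-level result fields.
OVERLAP_KEYS = ("quality_score", "output_format", "keywords")


def build_meta(
    user_meta: Dict[str, Any],
    result_metadata: Dict[str, Any],
    top_level: Dict[str, Any],
) -> Dict[str, Any]:
    """Merge user meta with flattened extraction metadata for one Document."""
    # Deep copy so nested values (lists, dicts) are not shared between Documents.
    meta = copy.deepcopy(user_meta)
    # Flatten format-specific metadata, skipping keys that have a top-level twin.
    for key, value in result_metadata.items():
        if key not in OVERLAP_KEYS:
            meta[key] = value
    # Overlapping keys come from the top-level fields only, when present.
    for key in OVERLAP_KEYS:
        if top_level.get(key) is not None:
            meta[key] = top_level[key]
    return meta


shared = {"tags": ["report"]}
first = build_meta(shared, {"title": "Q1", "quality_score": 0.1}, {"quality_score": 0.9})
first["tags"].append("mutated")  # does not leak into `shared` or later Documents
second = build_meta(shared, {"title": "Q2"}, {})
```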

Files added:

integrations/kreuzberg/
├── src/haystack_integrations/components/converters/kreuzberg/
│   ├── __init__.py
│   └── converter.py          # ~800 lines (component + helpers)
├── tests/
│   ├── test_converter.py     # ~1200 lines (unit tests)
│   └── test_converter_integration.py  # ~600 lines (integration tests)
├── pyproject.toml
├── pydoc/config_docusaurus.yml
├── README.md
└── LICENSE.txt
.github/workflows/kreuzberg.yml   # CI: nightly + PR triggers
.github/labeler.yml               # label entry
README.md                         # inventory table entry

How did you test it?

  • hatch run test:unit — serialization round-trips, metadata extraction, error handling, edge cases (empty files, unsupported formats, blank pages), batch vs single-file modes
  • hatch run test:integration — end-to-end extraction with real PDF, DOCX, HTML, TXT fixtures; pipeline integration with DocumentCleaner and DocumentWriter
  • No external services required — kreuzberg processes locally. OCR tests require Tesseract installed.

Notes for the reviewer

  • converter.py is ~800 lines because it handles three extraction modes, metadata flattening from kreuzberg's rich result types, and serialization of kreuzberg's config objects — open to splitting if preferred
  • The raw_extraction output exposes serialized ExtractionResult for debugging without requiring kreuzberg as a dependency downstream
  • kreuzberg's error handling uses error codes (get_last_error_code, get_error_details) — the converter logs warnings and skips failed files rather than raising
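The log-and-skip behaviour in the last note can be sketched as below. `convert_all` and `fake_extract` are hypothetical names, and a plain exception stands in for kreuzberg's error reporting, which this sketch does not call.

```python
import logging
from typing import Callable, List

logger = logging.getLogger(__name__)


def convert_all(paths: List[str], extract: Callable[[str], str]) -> List[str]:
    """Convert every path that can be converted; warn about and skip the rest."""
    documents = []
    for path in paths:
        try:
            documents.append(extract(path))
        except Exception as exc:  # one bad file should not abort the whole run
            logger.warning("Could not convert %s. Skipping it. Error: %s", path, exc)
    return documents


def fake_extract(path: str) -> str:
    """Hypothetical extractor that fails on one specific input."""
    if path == "corrupt.pdf":
        raise ValueError("unreadable file")
    return path.upper()


converted = convert_all(["a.txt", "corrupt.pdf", "b.txt"], fake_extract)
```

The trade-off is that a failed file only shows up in the logs, so callers that need strict behaviour would have to compare the number of sources with the number of returned documents.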

Checklist

@v-tan v-tan requested a review from a team as a code owner March 6, 2026 12:14
@v-tan v-tan requested review from anakin87 and removed request for a team March 6, 2026 12:14
@CLAassistant

CLAassistant commented Mar 6, 2026

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added the topic:CI and type:documentation labels Mar 6, 2026
Member

@anakin87 anakin87 left a comment

Thank you for the contribution!

I left some initial comments. Feel free to address them.

I'll continue with the review later/in the next few days.

Comment thread .github/workflows/kreuzberg.yml Outdated
Comment thread .github/workflows/kreuzberg.yml Outdated
Comment thread integrations/kreuzberg/README.md
Comment thread integrations/kreuzberg/kreuzberg.md Outdated
Comment thread integrations/kreuzberg/pyproject.toml Outdated
@v-tan
Contributor Author

v-tan commented Mar 12, 2026

Thank you, @anakin87! I'll address the comments and update the PR as soon as possible.

Member

@anakin87 anakin87 left a comment

I did a deeper review and left some comments.

Feel free to simplify the implementation where possible (it's quite a bit of code 🙂).

Comment thread integrations/kreuzberg/README.md Outdated
Comment thread integrations/kreuzberg/uv.lock Outdated
Comment thread integrations/kreuzberg/tests/test_files/sample.html
Comment thread integrations/kreuzberg/tests/test_converter.py Outdated
@v-tan
Contributor Author

v-tan commented Mar 16, 2026

Thank you for the review and your time, @anakin87! I understand the need for simplification.
I'll update the PR accordingly.

@v-tan v-tan force-pushed the feat/kreuzberg-converter branch from 17464d6 to 63dd2e8 Compare March 18, 2026 21:03
- Remove unnecessary defensive copy of data dict
- Remove intermediate init_params variable
- Access data["init_parameters"] directly since
  default_from_dict handles missing key errors
@v-tan
Contributor Author

v-tan commented Mar 18, 2026

Hi!
I simplified the converter:

  • Removed the raw_extraction_result export: I think users can serialize kreuzberg's ExtractionResult type directly.
  • Simplified metadata serialization by no longer copying fields from the result into the metadata explicitly, and removed duplicate writes.
  • Removed tables from the metadata. Tables are available for all formats in the content or in result.metadata, depending on the requested output_format.
  • Reduced the unit tests from 79 to roughly 42-43 without losing coverage. The number of integration tests is unchanged.
  • Inlined some methods in converter.py without losing readability.
  • Trimmed the utility helpers and moved them into their own module.

- Save config JSON before from_dict mutates data in place
- Aligns test with in-place mutation pattern used by all
  Haystack converters
@v-tan v-tan requested a review from anakin87 March 18, 2026 21:33
Comment thread integrations/kreuzberg/tests/test_converter.py Outdated
@anakin87
Member

@v-tan I left some final comments. Thank you for your work on this integration!

v-tan added 5 commits March 19, 2026 23:21
- Move FIXTURES_DIR constant to a `fixtures_dir` fixture in conftest.py
- Move _make_mock_result helper to a `make_mock_result` factory fixture
- Add default `converter` fixture for KreuzbergConverter()
- Update 19 tests to use converter fixture instead of inline construction
- Update 13 tests to use make_mock_result fixture instead of helper function
Kreuzberg's batch APIs return error results as ExtractionResult with
empty metadata and None quality_score instead of raising exceptions.
Previously these were silently passed through as valid Documents.

- Add _is_batch_error() to utils.py to detect error results
- Add _collect_batch_results() static method to filter errors with
  structured warning logs (classify_error + error_code_name)
- Fix LogRecord conflict: rename 'name' kwarg to 'code_name' in
  both batch and sequential error logging paths
- Add 3 unit tests for _is_batch_error detection logic
- Add 3 integration tests for corrupt file/bytestream filtering
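
As an illustration of the two fixes in this commit, here is a rough sketch. `FakeResult`, `is_batch_error`, and `collect_batch_results` are simplified stand-ins, not the helpers in utils.py. The stdlib `logging` module is used to show the same kind of conflict: it rejects `extra` keys that collide with built-in LogRecord attributes such as `name`.

```python
import logging
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

logger = logging.getLogger("kreuzberg_sketch")
logger.setLevel(logging.WARNING)


@dataclass
class FakeResult:
    """Hypothetical stand-in for an extraction result."""
    content: str = ""
    metadata: Dict[str, Any] = field(default_factory=dict)
    quality_score: Optional[float] = None


def is_batch_error(result: FakeResult) -> bool:
    # Error results are described as having empty metadata and no quality score.
    return not result.metadata and result.quality_score is None


def collect_batch_results(sources: List[str], results: List[FakeResult]) -> List[FakeResult]:
    kept = []
    for source, result in zip(sources, results):
        if is_batch_error(result):
            # "code_name" is safe; "name" would collide with LogRecord.name.
            logger.warning("Skipping %s", source, extra={"code_name": "unknown"})
            continue
        kept.append(result)
    return kept


def reserved_key_is_rejected() -> bool:
    """Return True if logging refuses an `extra` key named 'name'."""
    try:
        logger.warning("extraction failed", extra={"name": "unknown"})
    except KeyError:
        return True
    return False


good = FakeResult("text", {"mime_type": "text/plain"}, 0.8)
kept = collect_batch_results(["good.txt", "corrupt.pdf"], [good, FakeResult()])
```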
@v-tan v-tan force-pushed the feat/kreuzberg-converter branch from df65ceb to 143e122 Compare March 19, 2026 19:51
Comment thread integrations/kreuzberg/pyproject.toml
Comment thread integrations/kreuzberg/pyproject.toml
Comment thread integrations/kreuzberg/pyproject.toml Outdated
Member

@anakin87 anakin87 left a comment

Thank you for all the work.

I tried the integration locally and it runs fast!

@anakin87 anakin87 merged commit bf0810f into deepset-ai:main Mar 20, 2026
13 checks passed
@v-tan
Contributor Author

v-tan commented Mar 20, 2026

Thank you! I enjoyed the review process as well.

Labels

topic:CI type:documentation Improvements or additions to documentation

3 participants