feat: add Kreuzberg document converter integration #2927
anakin87 merged 10 commits into deepset-ai:main from […]
anakin87
left a comment
Thank you for the contribution!
I left some initial comments. Feel free to address them.
I'll continue with the review later/in the next few days.
Thank you, Anakin87! I'll handle the comments and update asap!
anakin87
left a comment
I did a deeper review and left some comments.
Feel free to simplify the implementation where possible (it's quite a bit of code 🙂).
Thank you for the review and your time, @anakin87! I understand the need for simplification.
17464d6 to 63dd2e8
- Remove unnecessary defensive copy of data dict
- Remove intermediate init_params variable
- Access `data["init_parameters"]` directly since `default_from_dict` handles missing key errors
Hi!
- Save config JSON before `from_dict` mutates data in place
- Aligns test with in-place mutation pattern used by all Haystack converters
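To illustrate why the test has to snapshot the config JSON first, here is a minimal sketch of a `from_dict` that rewrites its input in place. `SketchConverter` and the `use_cache` key are hypothetical stand-ins, not the PR's `KreuzbergConverter` or a real kreuzberg option, and the sketch uses only the standard library.

```python
import json


class SketchConverter:
    """Hypothetical stand-in for a converter with a JSON-serialized config."""

    def __init__(self, config=None):
        self.config = config or {}

    def to_dict(self):
        # The config is stored as a JSON string inside init_parameters.
        return {
            "type": "SketchConverter",
            "init_parameters": {"config": json.dumps(self.config, sort_keys=True)},
        }

    @classmethod
    def from_dict(cls, data):
        params = data["init_parameters"]
        # In-place mutation: the JSON string in the caller's dict is
        # replaced by the parsed object before the instance is built.
        if isinstance(params.get("config"), str):
            params["config"] = json.loads(params["config"])
        return cls(**params)


original = SketchConverter(config={"use_cache": False})  # assumed key name
data = original.to_dict()

# Snapshot the serialized config BEFORE from_dict touches `data`.
saved_config_json = data["init_parameters"]["config"]

restored = SketchConverter.from_dict(data)
```

After the call, `data["init_parameters"]["config"]` is a dict rather than a string, so a round-trip assertion has to compare against `saved_config_json`, not against whatever is left in `data`.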
@v-tan I left some final comments. Thank you for your work on this integration!
- Move FIXTURES_DIR constant to a `fixtures_dir` fixture in conftest.py
- Move _make_mock_result helper to a `make_mock_result` factory fixture
- Add default `converter` fixture for KreuzbergConverter()
- Update 19 tests to use converter fixture instead of inline construction
- Update 13 tests to use make_mock_result fixture instead of helper function
Kreuzberg's batch APIs return error results as `ExtractionResult` with empty metadata and `None` quality_score instead of raising exceptions. Previously these were silently passed through as valid Documents.
- Add `_is_batch_error()` to utils.py to detect error results
- Add `_collect_batch_results()` static method to filter errors with structured warning logs (classify_error + error_code_name)
- Fix LogRecord conflict: rename 'name' kwarg to 'code_name' in both batch and sequential error logging paths
- Add 3 unit tests for `_is_batch_error` detection logic
- Add 3 integration tests for corrupt file/bytestream filtering
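The detection rule and the logging fix described in this commit can be sketched with the standard library alone. This is a guess at the logic, not the PR's code: `FakeResult` is a hypothetical stand-in for kreuzberg's `ExtractionResult`, the function bodies are assumptions, and the `"UNKNOWN"` code name is a placeholder. The last helper shows the stdlib behaviour that presumably motivated the rename: `logging` refuses an `extra` key called `name` because `LogRecord` already has that attribute.

```python
import logging
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

logger = logging.getLogger("kreuzberg_sketch")
logger.addHandler(logging.NullHandler())  # keep the sketch silent
logger.setLevel(logging.WARNING)


@dataclass
class FakeResult:
    """Hypothetical stand-in for an extraction result."""

    content: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    quality_score: Optional[float] = None


def is_batch_error(result: FakeResult) -> bool:
    # Assumed rule from the commit message: empty metadata AND no quality score.
    return not result.metadata and result.quality_score is None


def collect_batch_results(sources: List[str], results: List[FakeResult]) -> List[FakeResult]:
    kept = []
    for source, result in zip(sources, results):
        if is_batch_error(result):
            # 'code_name' is safe as an extra key; 'name' would collide
            # with the LogRecord attribute of the same name.
            logger.warning("Skipping %s: extraction failed", source,
                           extra={"code_name": "UNKNOWN"})
            continue
        kept.append(result)
    return kept


def name_kwarg_conflicts() -> bool:
    """Return True if logging rejects an extra key called 'name'."""
    try:
        logger.warning("extraction failed", extra={"name": "UNKNOWN"})
    except KeyError:
        return True
    return False


good = FakeResult(content="text", metadata={"format": "pdf"}, quality_score=0.9)
bad = FakeResult(content="")
filtered = collect_batch_results(["a.pdf", "corrupt.pdf"], [good, bad])
```

Here `filtered` keeps only `good`, and the failed source is reported through a warning instead of an exception, matching the skip-and-log behaviour described in the PR.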
df65ceb to 143e122
anakin87
left a comment
Thank you for all the work.
I tried the integration locally and it runs fast!
Thank you! I enjoyed the review process as well.
Related Issues
Proposed Changes:
Adds a `KreuzbergConverter` component under `integrations/kreuzberg/`.

Component design:
- `@component` protocol with `sources` and `meta` inputs, `documents` and `raw_extraction` outputs
- […] (`extract_file_sync`, `batch_extract_files_sync`); async pipeline support is deferred
- `ByteStream` inputs are written to a temp file before extraction (kreuzberg requires file paths)
- […] (`PageConfig`), and chunked (`ChunkingConfig`); each produces different Document granularity and metadata shape

Serialization:
- `ExtractionConfig` is serialized via kreuzberg's `config_to_json` / `config_merge` utilities
- `config_path` stored as POSIX string for cross-platform compatibility
- `easyocr_kwargs` serialized as plain dict
- […] `to_dict` / `from_dict`

Metadata handling:
- `ExtractionResult` fields are flattened into `Document.meta`
- […] (`quality_score`, `output_format`, `keywords`) are handled from top-level fields to avoid duplication
- […] `meta` is deep-copied to prevent mutation across documents

Files added:
How did you test it?
- `hatch run test:unit`: serialization round-trips, metadata extraction, error handling, edge cases (empty files, unsupported formats, blank pages), batch vs single-file modes
- `hatch run test:integration`: end-to-end extraction with real PDF, DOCX, HTML, TXT fixtures; pipeline integration with `DocumentCleaner` and `DocumentWriter`

Notes for the reviewer
- `converter.py` is ~800 lines because it handles three extraction modes, metadata flattening from kreuzberg's rich result types, and serialization of kreuzberg's config objects; open to splitting if preferred
- `raw_extraction` output exposes serialized `ExtractionResult` for debugging without requiring kreuzberg as a dependency downstream
- […] (`get_last_error_code`, `get_error_details`); the converter logs warnings and skips failed files rather than raising

Checklist
- […] `fix:`, `feat:`, `build:`, `chore:`, `ci:`, `docs:`, `style:`, `refactor:`, `perf:`, `test:`.
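On the `ByteStream` handling noted under Component design: the write-to-temp-file step can be sketched as below. This is a minimal illustration using only the standard library, not the converter's actual code. `extract_from_bytes` and its `extract` callback are hypothetical names; the callback stands in for any extractor that accepts only file paths.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def extract_from_bytes(data: bytes, extract: Callable[[Path], str], suffix: str = "") -> str:
    """Write `data` to a temporary file, run a path-based extractor, clean up."""
    fd, name = tempfile.mkstemp(suffix=suffix)
    tmp_path = Path(name)
    try:
        # Close the handle before extraction so the extractor can reopen
        # the file on platforms that lock open files.
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        return extract(tmp_path)
    finally:
        # Remove the temp file even if extraction raised.
        tmp_path.unlink(missing_ok=True)


# Usage with a trivial extractor that reads the file back as text.
text = extract_from_bytes(b"hello", lambda path: path.read_text(), suffix=".txt")
```

Passing a `suffix` matters for extractors that infer the format from the file extension; whether kreuzberg does so is not stated in this PR.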