feat: allow mime_type to be guessed for ByteStream#9573
feat: allow mime_type to be guessed for ByteStream#9573julian-risch merged 14 commits intodeepset-ai:mainfrom
Conversation
julian-risch
left a comment
There was a problem hiding this comment.
Hello @kanenorman Thank you for opening this pull request!
I left a few comments and also saw that three unit tests are failing here:
FAILED test/components/converters/test_multi_file_converter.py::TestMultiFileConverter::test_run_error_handling - haystack.core.errors.PipelineRuntimeError: The following component failed to run:
Component name: 'router'
Component type: 'FileTypeRouter'
Error: [Errno 2] No such file or directory: '/home/runner/work/haystack/haystack/test/test_files/non_existent.txt'
FAILED test/components/routers/test_file_router.py::TestFileTypeRouter::test_unlisted_extensions - FileNotFoundError: [Errno 2] No such file or directory: '/home/runner/work/haystack/haystack/test/test_files/audio/ignored.mp3'
FAILED test/components/routers/test_file_router.py::TestFileTypeRouter::test_no_extension - FileNotFoundError: [Errno 2] No such file or directory: '/home/runner/work/haystack/haystack/test/test_files/txt/doc_2'
Please let me know if you need help looking into those. Other than need the pull request only needs a release note and we describe how you can generate one here. Please let me know in case you need help with that.
I was wondering why you wrote in the pull request description that it is a breaking change? With guess_mime_type being False as default and with FileTypeRouter using guess_mime_type True, the behavior remains the same, no?
Other than that the pull request looks quite good to me already! 🙂
|
|
||
| def get_bytestream_from_source(source: Union[str, Path, ByteStream]) -> ByteStream: | ||
| def get_bytestream_from_source( | ||
| source: Union[str, Path, ByteStream], guess_mime_type: Optional[bool] = False |
There was a problem hiding this comment.
Given that we have False as a default value, there is no reason to make this Optional. Let's change the type to bool
| source: Union[str, Path, ByteStream], guess_mime_type: Optional[bool] = False | |
| source: Union[str, Path, ByteStream], guess_mime_type: bool = False |
There was a problem hiding this comment.
and in from_file_path you also use bool instead of Optional[bool]
| normalize_metadata(({"a": 1},), sources_count=1) | ||
|
|
||
|
|
||
| def test_get_bytestream_from_path_object(tmp_path): |
There was a problem hiding this comment.
In addition to the test cases here, I suggest we add a test case for FileTypeRouter in test/components/routers/test_file_router.py checking that the parameter additional_mimetypes still works as expected. We're doing mimetypes.add_type(mime, ext) there and need to check that when get_bytestream_from_source it's taking into account these additional mimetypes, for example: {"application/vnd.openxmlformats-officedocument.wordprocessingml.document": ".docx"}.
|
Thanks for the feedback @julian-risch. I’ll get the changes implemented shortly.
You’re absolutely right. I misspoke. Thanks for pointing that out! 😄 Regarding the failing unit test, I’ve identified the issue and will push a fix. Before proceeding, I reviewed the latest haystack-ai source code (v2.15.1) and noticed that Here’s a minimal example: python-version: 3.11.3 from haystack.components.routers import FileTypeRouter
router = FileTypeRouter(mime_types=[r'text/plain'])
# No meta - does not raise error
router.run(sources=["non_existent.txt"])
# → {'text/plain': [PosixPath('non_existent.txt')]}
# With meta - raises FileNotFoundError
router.run(sources=["non_existent.txt"], meta={"spam": "eggs"})
# → FileNotFoundError: [Errno 2] No such file or directory: 'non_existent.txt'This feels inconsistent. Should metadata change how we treat missing files? Should the component:
Happy to align with whatever behavior is intended, just wanted to clarify before committing a fix. |
|
Hi @kanenorman thanks for your patience! We discussed in the team and concluded that for now we would like to keep the inconsistency and not fix it in this PR. That makes this PR a bit more complex but doesn't break any current behavior. |
Pull Request Test Coverage Report for Build 16437756318Warning: This coverage report may be inaccurate.This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls |
julian-risch
left a comment
There was a problem hiding this comment.
Looks good to me! 👍 Congratulations on your first pull request merged into Haystack!
The pull request now doesn't introduce any breaking changes, which is great. How you handled windows/unix mimetype differences in the test makes sense to me.
* feat(bytestream): add guess_mime_type parameter * refactor(FileTypeRouter): refactor guess mimetype * feat(bytestream): add guess_mime_type to util * style(ruff): add trailing whitespace * fix: fix type annotation * test(file_type_router): add test for additional_mimetypes param * fix(file_type_router): non-existent file behavior * feat(file_type_router): add release notes * fix(file_type_router): remove unused logger * style: fix ruff formatting magic values * test(bytestream): handle windows/unix mimetype differences --------- Co-authored-by: Julian Risch <julian.risch@deepset.ai>
Related Issues
Proposed Changes:
guess_mime_typeparameter toByteStreamFileTypeRouterguess_mime_typeparameter tohaystack.components.converters.utils.get_bytestream_from_sourceHow did you test it?
ByteStreamget_bytestream_from_sourceFileTypeRouterNotes for the reviewer
This is a breaking change so would like feedback 😄
Checklist
fix:,feat:,build:,chore:,ci:,docs:,style:,refactor:,perf:,test:and added!in case the PR includes breaking changes.