Skip to content

Supporting MuPDF file recognizer#4481

Closed
JorjMcKie wants to merge 1 commit into
mainfrom
support-mupdf-recognizer
Closed

Supporting MuPDF file recognizer#4481
JorjMcKie wants to merge 1 commit into
mainfrom
support-mupdf-recognizer

Conversation

@JorjMcKie
Copy link
Copy Markdown
Collaborator

We previously used slightly different logic for opening documents from files versus from memory. This fix strives to always use MuPDF's file type recognition and thus become independent from the value of file extensions as much as possible. This works for almost all document types. Exceptions are "txt", "fb2" and "mobi", where either a valid file extension must be present, or the respective filetype must be provided.

We previously used slightly different logic for opening documents from files versus from memory.
This fix strives to always use MuPDF's file type recognition and thus become independent from the value of file extensions as much as possible.
This works for almost all document types. Exceptions are "txt", "fb2" and "mobi", where either a valid file extension must be present, or the respective filetype must be provided.
@JorjMcKie JorjMcKie force-pushed the support-mupdf-recognizer branch from 1fb4d0d to 2c90d6c Compare May 1, 2025 09:14
Comment thread src/__init__.py
Comment on lines +2979 to +2980
if not handler:
raise FileDataError("Failed to open stream as {magic}")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this contradict the new docs which says:
:arg str filetype: A string specifying the type of document. Will be ignored in most [#f8]_ cases for :ref:a supported document type<Supported_File_Types>.
?
Shouldn't we should get a handler with fz_recognize_document_stream_content() here, so that the contents overrides the supplied magic where possible?

Or simply use fz_open_document_with_stream(), i.e. don't create an intermediate handler? I.e. remove the entire if magic: ... block in the code?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The recognizer seems to always detect the type except in 3 cases: "txt", "fb2", "mobi". This means we have identical behavior between file-open versus stream-open.
Only those three cases still require the correct extension, respectively filetype. Everything else disregards file extension, respectively needs no filetype-help.

The two cases "fb2" and "mobi" can be resolved with a little improved recognizer, which will hopefully happen soon.

The case "txt" remains an exception because of principle reasons.

This is the background of my new logic.

Copy link
Copy Markdown
Collaborator

@jamie-lemon jamie-lemon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs side of things is fine. No typos - all good! :)

@JorjMcKie JorjMcKie closed this May 14, 2025
@github-actions github-actions Bot locked and limited conversation to collaborators May 14, 2025
@JorjMcKie JorjMcKie deleted the support-mupdf-recognizer branch June 12, 2025 11:21
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants