Supporting MuPDF file recognizer by JorjMcKie · Pull Request #4481 · pymupdf/PyMuPDF

JorjMcKie · 2025-04-30T15:48:49Z

We previously used slightly different logic for opening documents from files versus from memory. This fix always tries MuPDF's file type recognition first, which makes opening a document as independent of the file extension as possible. This works for almost all document types. The exceptions are "txt", "fb2" and "mobi": for these, either a valid file extension must be present or the respective filetype must be provided.
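To make the resolution order described above concrete, here is a minimal pure-Python sketch. It is not MuPDF's recognizer and not the code in this PR: the signature table, the `sniff` and `resolve_type` helpers, and the `ValueError` are all hypothetical stand-ins chosen for illustration. It only shows the idea that content is checked first and that a file extension or filetype matters solely for the three types that cannot be recognized from content.

```python
import os

# Hypothetical signature table; MuPDF's real recognizer is far more thorough.
SIGNATURES = {
    b"%PDF-": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
}

# Types the description lists as not recognizable from content alone.
NEEDS_HINT = {"txt", "fb2", "mobi"}


def sniff(data: bytes):
    """Return a type name if the leading bytes match a known signature."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None


def resolve_type(data: bytes, filename: str = "", filetype: str = ""):
    """Content first; fall back to filetype, then to the file extension."""
    detected = sniff(data)
    if detected:
        return detected  # extension and filetype are ignored here
    hint = filetype or os.path.splitext(filename)[1].lstrip(".")
    hint = hint.lower()
    if hint in NEEDS_HINT:
        return hint
    raise ValueError("cannot determine document type")
```

In this sketch a PDF saved as `report.txt` still resolves to `"pdf"`, while plain text resolves to `"txt"` only when the extension or the filetype says so.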

julian-smith-artifex-com · 2025-05-01T09:54:28Z

+                        if not handler:
+                            raise FileDataError("Failed to open stream as {magic}")


Does this contradict the new docs, which say:
:arg str filetype: A string specifying the type of document. Will be ignored in most [#f8]_ cases for :ref:`a supported document type<Supported_File_Types>`.
?
Shouldn't we get a handler with fz_recognize_document_stream_content() here, so that the content overrides the supplied magic where possible?

Or simply use fz_open_document_with_stream(), i.e. don't create an intermediate handler? I.e. remove the entire if magic: ... block in the code?
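The difference between the two policies raised in this comment can be sketched in plain Python. This is an illustration under assumptions, not the PR's code: the `HANDLERS` table, `recognize_content`, and both `handler_*` functions are hypothetical, `FileDataError` is a local stand-in for the exception named in the diff, and the hint-first function is only one possible reading of the two quoted lines.

```python
class FileDataError(RuntimeError):
    """Local stand-in for the exception named in the quoted diff."""


# Hypothetical handler registry keyed by type name.
HANDLERS = {"pdf": "pdf-handler", "png": "png-handler", "txt": "txt-handler"}


def recognize_content(data: bytes):
    """Toy content recognizer: returns a handler or None."""
    if data.startswith(b"%PDF-"):
        return HANDLERS["pdf"]
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return HANDLERS["png"]
    return None


def handler_hint_first(data: bytes, magic: str = ""):
    """A supplied hint must resolve to a handler, otherwise fail."""
    if magic:
        handler = HANDLERS.get(magic)
        if not handler:
            raise FileDataError(f"Failed to open stream as {magic}")
        return handler
    return recognize_content(data)


def handler_content_first(data: bytes, magic: str = ""):
    """Content overrides the hint; the hint is only a fallback."""
    handler = recognize_content(data) or HANDLERS.get(magic)
    if not handler:
        raise FileDataError(f"Failed to open stream as {magic}")
    return handler
```

With PDF bytes and an unusable hint such as `"bogus"`, the hint-first variant raises while the content-first variant still returns the PDF handler, which is the behavior the docs wording ("will be ignored in most cases") suggests.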

JorjMcKie · 2025-05-01

The recognizer seems to always detect the type except in three cases: "txt", "fb2" and "mobi". This means file-open and stream-open now behave identically.
Only those three cases still require the correct file extension (when opening a file) or the correct filetype (when opening a stream). Everything else disregards the file extension and needs no filetype hint.

The two cases "fb2" and "mobi" can be resolved with a slightly improved recognizer, which will hopefully happen soon.

The case "txt" remains an exception as a matter of principle.

This is the background of my new logic.

jamie-lemon

Docs side of things is fine. No typos - all good! :)

JorjMcKie requested review from jamie-lemon and julian-smith-artifex-com April 30, 2025 15:48

JorjMcKie force-pushed the support-mupdf-recognizer branch from 44500cd to 1fb4d0d Compare April 30, 2025 18:00

JorjMcKie force-pushed the support-mupdf-recognizer branch from 1fb4d0d to 2c90d6c Compare May 1, 2025 09:14

julian-smith-artifex-com reviewed May 1, 2025

jamie-lemon reviewed May 1, 2025

JorjMcKie closed this May 14, 2025

github-actions Bot locked and limited conversation to collaborators May 14, 2025

JorjMcKie deleted the support-mupdf-recognizer branch June 12, 2025 11:21

JorjMcKie wants to merge 1 commit into main from support-mupdf-recognizer
