diff --git a/docs/document.rst b/docs/document.rst index 85b5e4c23..55f1c2746 100644 --- a/docs/document.rst +++ b/docs/document.rst @@ -176,15 +176,11 @@ For details on **embedded files** refer to Appendix 3. * If ``stream`` is given, then the document is created from memory. * If ``stream`` is `None`, then a document is created from the file given by ``filename``. - :arg str,pathlib filename: A UTF-8 string or ``pathlib.Path`` object containing a file path. The document type is always determined from the file content. The ``filetype`` parameter can be used to ensure that the detected type is as expected or, respectively, to force treating any file as plain text. + :arg str,pathlib filename: A UTF-8 string or ``pathlib.Path`` object containing a file path. The document type is (almost [#f8]_) always determined from the file content. The ``filetype`` parameter can be used to force treating any file as plain text. For plain text files, there is no unambiguous way to recognize the content. Therefore the file extension or the ``filetype`` parameter must be given. - :arg bytes,bytearray,BytesIO stream: A memory area containing file data. The document type is **always** detected from the data content. The ``filetype`` parameter is ignored except for undetected data content. In that case only, using ``filetype="txt"`` will treat the data as containing plain text. + :arg bytes,bytearray,BytesIO stream: A memory area containing file data. With few exceptions [#f8]_, the document type is detected from the data content. - :arg str filetype: A string specifying the type of document. This may be anything looking like a filename (e.g. "x.pdf"), in which case MuPDF uses the extension to determine the type, or a mime type like ``application/pdf``. Just using strings like "pdf" or ".pdf" will also work. Can be omitted for :ref:`a supported document type`. - - If opening a file name / path only, it will be used to ensure that the detected type is as expected. 
An exception is raised for a mismatch. Using `filetype="txt"` will treat any file as containing plain text. - - When opening from memory, this parameter is ignored except for undetected data content. Only in that case, using ``filetype="txt"`` will treat the data as containing plain text. + :arg str filetype: A string specifying the type of document. Will be ignored in most [#f8]_ cases for :ref:`a supported document type`. Text-based files usually have no unambiguous way to recognize the content. Therefore the file extension or the ``filetype`` parameter (especially when opening from memory) must usually be given. :arg rect_like rect: a rectangle specifying the desired page size. This parameter is only meaningful for documents with a variable page layout ("reflowable" documents), like e-books or HTML, and ignored otherwise. If specified, it must be a non-empty, finite rectangle with top-left coordinates (0, 0). Together with parameter *fontsize*, each page will be accordingly laid out and hence also determine the number of pages. @@ -208,13 +204,12 @@ For details on **embedded files** refer to Appendix 3. >>> # from a file >>> doc = pymupdf.open("some.xps") - >>> # handle wrong extension - >>> doc = pymupdf.open("some.file", filetype="xps") # assert expected type - >>> doc = pymupdf.open("some.file", filetype="txt") # treat as plain text + >>> # handle wrong / missing extension when required + >>> doc = pymupdf.open("some.file", filetype="mobi") # treat as MOBI e-book >>> >>> # from memory - >>> doc = pymupdf.open(stream=mem_area) # works for any supported type - >>> doc = pymupdf.open(stream=unknown-type, filetype="txt") # treat as plain text + >>> doc = pymupdf.open(stream=mem_area) # works for most supported types + >>> doc = pymupdf.open(stream=ambiguous, filetype="mobi") # treat as MOBI e-book >>> >>> # new empty PDF >>> doc = pymupdf.open() @@ -2211,4 +2206,6 @@ Other Examples .. [#f7] This only works under certain conditions. 
For example, if there is normal text covered by some image on top of it, then this is undetectable and the respective text is **not** removed. Similar is true for white text on white background, and so on. +.. [#f8] Almost all supported document types -- including all images -- are detected by MuPDF's built-in content recognizer. Exceptions are many text-based formats like plain text, program source code, etc. which have no unambiguous way for content identification. The e-book formats MOBI (extension ``.mobi``) and FictionBook (extension ``.fb2``) are two other exceptions which will probably be covered by the recognition feature soon. In these cases, the respective file extensions **must** be present - or (especially when opening from memory) the ``filetype`` must specify the document type. + .. include:: footer.rst diff --git a/docs/how-to-open-a-file.rst b/docs/how-to-open-a-file.rst index 7cfb5012b..92aee2321 100644 --- a/docs/how-to-open-a-file.rst +++ b/docs/how-to-open-a-file.rst @@ -38,22 +38,19 @@ To open a file, do the following: File Recognizer: Opening with :index:`a Wrong File Extension ` """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" -If you have a document with a wrong file extension for its type, do not worry: it will still be opened correctly, thanks to the integrated file "content recognizer". +If you have a document with a wrong file extension for its type, do not worry: it will still be opened correctly, thanks to the integrated file "content recognizer" in the base library. This component looks at the actual data in the file using a number of heuristics -- independent of the file extension. This of course is also true for file names **without** an extension. Here is a list of details about how the file content recognizer works: -* When opening from a file name, use the ``filetype`` parameter if you need to make sure that the created :ref:`Document` is of the expected type. 
An exception is raised for any mismatch. +* Whether opening from a file name or from memory, the recognizer will determine the correct document type in most cases. It neither needs nor looks at the file extension, which is not available anyway when opening from memory. -* Text files are an exception: they do not contain recognizable internal structures at all. Here, the file extension ".txt" and the ``filetype`` parameter continue to play a role and are used to create a "Tex" document. Correspondingly, text files with other / no extensions, can successfully be opened using `filetype="txt"`. +* Text files are an exception: they do not contain recognizable internal structures at all. Here, the file extension ".txt" and the ``filetype`` parameter continue to play a role and are used to create a "Text" document. Correspondingly, text files with other or no extensions can successfully be opened using `filetype="txt"`. -* Using `filetype="txt"` will treat **any** file as containing plain text when opened from a file name / path -- even when its content is a supported document type. +* Currently, two e-book formats, FictionBook and MOBI, are not automatically recognized. They require the extensions ".fb2" and ".mobi" respectively. Use the ``filetype`` parameter accordingly to open them from memory. -* When opening from a stream, the file content recognizer will ignore the ``filetype`` parameter entirely for known file types -- even in case of a mismatch or when `filetype="txt"` was specified. - - * Streams with a known file type cannot be opened as plain text. - * Specifying ``filetype`` currently only has an effect when no match was found. Then using ``filetype="txt"`` will treat the file as containing plain text. +* Using `filetype="txt"` will treat **any** file as containing plain text -- even when its content is a supported document type. 
---------- diff --git a/src/__init__.py b/src/__init__.py index fbcae3c44..d5d61c648 100644 --- a/src/__init__.py +++ b/src/__init__.py @@ -2922,8 +2922,6 @@ def __init__(self, filename=None, stream=None, filetype=None, rect=None, width=0 else: raise TypeError(f"bad stream: {type(stream)=}.") stream = self.stream - if not (filename or filetype): - filename = 'pdf' else: self.stream = None @@ -2951,6 +2949,17 @@ def __init__(self, filename=None, stream=None, filetype=None, rect=None, width=0 w = r.x1 - r.x0 h = r.y1 - r.y0 + if from_file: + _, magic2 = os.path.splitext(filename) + if magic2.startswith("."): + magic2 = magic2[1:] + else: + magic2 = "" + if isinstance(filetype, str): + magic = filetype + else: + magic = "" + if stream is not None: assert isinstance(stream, (bytes, memoryview)) if len(stream) == 0: @@ -2962,65 +2971,56 @@ def __init__(self, filename=None, stream=None, filetype=None, rect=None, width=0 buffer_ = mupdf.fz_new_buffer_from_copied_data(c) data = mupdf.fz_open_buffer(buffer_) else: - # Pass raw bytes data to mupdf.fz_open_memory(). This assumes - # that the bytes string will not be modified; i think the - # original PyMuPDF code makes the same assumption. Presumably - # setting self.stream above ensures that the bytes will not be - # garbage collected? data = mupdf.fz_open_memory(mupdf.python_buffer_data(c), len(c)) - magic = filename - if not magic: - magic = filetype - # fixme: pymupdf does: - # handler = fz_recognize_document(gctx, filetype); - # if (!handler) raise ValueError( MSG_BAD_FILETYPE) - # but prefer to leave fz_open_document_with_stream() to raise. 
+ try: - doc = mupdf.fz_open_document_with_stream(magic, data) + if magic: + handler = mupdf.ll_fz_recognize_document(magic) + if not handler: + raise FileDataError(f"Failed to open stream as {magic}") + accel = mupdf.FzStream() + archive = mupdf.FzArchive(None) + doc = mupdf.ll_fz_document_handler_open( + handler, + data.m_internal, + accel.m_internal, + archive.m_internal, + None, # recognize_state + ) + doc = mupdf.FzDocument(doc) + else: + doc = mupdf.fz_open_document_with_stream(magic, data) except Exception as e: if g_exceptions_verbose > 1: exception_info() raise FileDataError('Failed to open stream') from e else: if filename: - if not filetype: + if magic == "txt": + handler = mupdf.ll_fz_recognize_document(magic) + else: + stream = mupdf.FzStream(filename) + handler = mupdf.ll_fz_recognize_document_stream_content(stream.m_internal, magic) + if not handler and magic2: + handler = mupdf.ll_fz_recognize_document_stream_content(stream.m_internal, magic2) + if handler: + #log( f'{handler.open=}') + #log( f'{dir(handler.open)=}') try: - doc = mupdf.fz_open_document(filename) + accel = mupdf.FzStream() + archive = mupdf.FzArchive(None) + doc = mupdf.ll_fz_document_handler_open( + handler, + stream.m_internal, + accel.m_internal, + archive.m_internal, + None, # recognize_state + ) except Exception as e: if g_exceptions_verbose > 1: exception_info() - raise FileDataError(f'Failed to open file {filename!r}.') from e + raise FileDataError(f'Failed to open file {filename!r}') from e + doc = mupdf.FzDocument(doc) else: - handler = mupdf.ll_fz_recognize_document(filetype) - if handler: - if handler.open: - #log( f'{handler.open=}') - #log( f'{dir(handler.open)=}') - try: - stream = mupdf.FzStream(filename) - accel = mupdf.FzStream() - archive = mupdf.FzArchive(None) - if mupdf_version_tuple >= (1, 24, 8): - doc = mupdf.ll_fz_document_handler_open( - handler, - stream.m_internal, - accel.m_internal, - archive.m_internal, - None, # recognize_state - ) - else: - doc = 
mupdf.ll_fz_document_open_fn_call( - handler.open, - stream.m_internal, - accel.m_internal, - archive.m_internal, - ) - except Exception as e: - if g_exceptions_verbose > 1: exception_info() - raise FileDataError(f'Failed to open file {filename!r} as type {filetype!r}.') from e - doc = mupdf.FzDocument( doc) - else: - assert 0 - else: - raise ValueError( MSG_BAD_FILETYPE) + raise ValueError(MSG_BAD_FILETYPE) else: pdf = mupdf.PdfDocument() doc = mupdf.FzDocument(pdf) diff --git a/tests/resources/fb2-file.fb2 b/tests/resources/fb2-file.fb2 new file mode 100644 index 000000000..ad5f56d70 --- /dev/null +++ b/tests/resources/fb2-file.fb2 @@ -0,0 +1,64 @@ + + + + + computers + + Chris + Clark + + Sample FB2 book + +

Short sample of a FictionBook2 book with simple metadata. Based on test_book.md from https://github.com/clach04/sample_reading_media

+
+ ebook,sample,markdown,fb2,FictionBook2 +
+ + + clach04 + https://github.com/clach04/sample_reading_media + + + vim and scite + https://github.com/clach04/sample_reading_media + 1.0 + +

Initial version, written by hand.

+
+
+
+ + + <p>This is a title</p> + + +
+ + <p>Test Header h1</p> + + +

A test paragraph.

+

Another test paragraph.

+
+ +
+ + <p>Another Test Header h1</p> + + +
+ + <p>A Test Header h2</p> + + +
+ + <p>A Test Header h3</p> + + +

Yet more copy

+
+
+
+ +
diff --git a/tests/resources/mobi-file.mobi b/tests/resources/mobi-file.mobi new file mode 100644 index 000000000..fe3d4689a Binary files /dev/null and b/tests/resources/mobi-file.mobi differ diff --git a/tests/resources/svg-file.svg b/tests/resources/svg-file.svg new file mode 100644 index 000000000..b07bf4f2a --- /dev/null +++ b/tests/resources/svg-file.svg @@ -0,0 +1,18 @@ + + + + + + + + + + + + + + + + + + diff --git a/tests/resources/xps-file.xps b/tests/resources/xps-file.xps new file mode 100644 index 000000000..05a2b4a75 Binary files /dev/null and b/tests/resources/xps-file.xps differ diff --git a/tests/test_general.py b/tests/test_general.py index a3236d389..ffe3c6f8a 100644 --- a/tests/test_general.py +++ b/tests/test_general.py @@ -100,8 +100,6 @@ def test_annot_clean_contents(): annot = page.add_highlight_annot((10, 10, 20, 20)) # the annotation appearance will not start with command b"q" - - # invoke appearance stream cleaning and reformatting annot.clean_contents() @@ -133,19 +131,10 @@ def test_pdfstring(): def test_open_exceptions(): - try: - pymupdf.open(filename, filetype="xps") - except RuntimeError as e: - assert repr(e).startswith("FileDataError") - else: - assert 0 + if pymupdf.mupdf_version_tuple < (1, 26): + print(f'Not testing open_exceptions because {pymupdf.mupdf_version=} < 1.26') + return - try: - pymupdf.open(filename, filetype="xxx") - except Exception as e: - assert repr(e).startswith("ValueError") - else: - assert 0 try: pymupdf.open("x.y") @@ -155,12 +144,53 @@ def test_open_exceptions(): assert 0 try: - pymupdf.open(stream=b"", filetype="pdf") + pymupdf.open(stream=b"") except RuntimeError as e: assert repr(e).startswith("EmptyFileError") else: assert 0 + doc = pymupdf.open(filename, filetype="txt") + assert doc.metadata["format"] == "Text" + + testfile = os.path.join(scriptdir, "resources", "Bezier.epub") + doc = pymupdf.open(testfile) + assert doc.metadata["format"] == "EPUB" + doc = 
pymupdf.open(stream=pathlib.Path(testfile).read_bytes()) + assert doc.metadata["format"] == "EPUB" + + testfile = os.path.join(scriptdir, "resources", "svg-file.svg") + doc = pymupdf.open(testfile) + assert doc.metadata["format"] == "SVG" + doc = pymupdf.open(stream=pathlib.Path(testfile).read_bytes()) + assert doc.metadata["format"] == "SVG" + + testfile = os.path.join(scriptdir, "resources", "xps-file.xps") + doc = pymupdf.open(testfile) + assert doc.metadata["format"] == "XPS" + doc = pymupdf.open(stream=pathlib.Path(testfile).read_bytes()) + assert doc.metadata["format"] == "XPS" + + testfile = os.path.join(scriptdir, "resources", "nur-ruhig.jpg") + doc = pymupdf.open(testfile) + assert doc.metadata["format"] == "Image" + doc = pymupdf.open(stream=pathlib.Path(testfile).read_bytes()) + assert doc.metadata["format"] == "Image" + + # FictionBook2 still requires filetype or correct extension! + testfile = os.path.join(scriptdir, "resources", "fb2-file.fb2") + doc = pymupdf.open(testfile) + assert doc.metadata["format"] == "FictionBook2" + doc = pymupdf.open(stream=pathlib.Path(testfile).read_bytes(), filetype="fb2") + assert doc.metadata["format"] == "FictionBook2" + + # MOBI still requires filetype or correct extension! + testfile = os.path.join(scriptdir, "resources", "mobi-file.mobi") + doc = pymupdf.open(testfile) + assert doc.metadata["format"] == "MOBI" + doc = pymupdf.open(stream=pathlib.Path(testfile).read_bytes(), filetype="mobi") + assert doc.metadata["format"] == "MOBI" + def test_bug1945(): pdf = pymupdf.open(f'{scriptdir}/resources/bug1945.pdf') @@ -1380,21 +1410,6 @@ def check(filename=None, stream=None, filetype=None, exception=None): eregex = re.escape(f'Cannot open empty file: filename={path!r}.') check(path, exception=(etype, eregex)) - path = f'{resources}/1.pdf' - filetype = 'xps' - etype = pymupdf.FileDataError - # 2023-12-12: On OpenBSD, for some reason the SWIG catch code only catches - # the exception as FzErrorBase. 
- etype2 = 'FzErrorBase' if platform.system() == 'OpenBSD' else 'FzErrorFormat' - eregex = ( - # With a sysinstall with separate MuPDF install, we get - # `mupdf.FzErrorFormat` instead of `pymupdf.mupdf.FzErrorFormat`. So - # we just search for the former. - re.escape(f'mupdf.{etype2}: code=7: cannot recognize zip archive'), - re.escape(f'pymupdf.FileDataError: Failed to open file {path!r} as type {filetype!r}.'), - ) - check(path, filetype=filetype, exception=(etype, eregex)) - path = f'{resources}/chinese-tables.pickle' etype = pymupdf.FileDataError etype2 = 'FzErrorBase' if platform.system() == 'OpenBSD' else 'FzErrorUnsupported' @@ -1546,7 +1561,7 @@ def test_3859(): def test_3905(): data = b'A,B,C,D\r\n1,2,1,2\r\n2,2,1,2\r\n' try: - document = pymupdf.open(stream=data) + document = pymupdf.open(stream=data, filetype="pdf") except pymupdf.FileDataError as e: pass else: