Commit 40d5dbd

docs: document OCR support and required pymupdf.layout import for PyMuPDF4LLM
1 parent b49a9a0 commit 40d5dbd

2 files changed

Lines changed: 37 additions & 12 deletions

File tree

docs/images/layout-ocr-flow.png

2 Bytes

docs/pymupdf-layout/index.rst

Lines changed: 37 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -138,28 +138,53 @@ Now we can happily load Office files and convert them as follows::
138138
OCR support
139139
~~~~~~~~~~~~~~~~~
140140

141-
The new layout-sensitive |PyMuPDF4LLM| version also evaluates whether a page would benefit from applying OCR to it. If its heuristics come to this conclusion, the built-in Tesseract-OCR module is automatically invoked. Its results are then handled like normal page content.
142-
143-
If a page contains (roughly) no text at all, but is covered with images or many character-sized vectors, a check is made using `OpenCV <https://pypi.org/project/opencv-python/>`_ whether text is *probably* detectable on the page at all. This is done to tell apart image-based text from ordinary pictures (like photographs).
141+
**Critical: Import pymupdf.layout FIRST**
142+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
144143

145-
If the page does contain text but too many characters are unreadable (like "�����"), OCR is also executed, but **for the affected text areas only** -- not the full page. This way, we avoid losing already existing text and other content like images and vectors.
144+
.. code-block:: python
145+
:emphasize-lines: 1
146146
147-
For these heuristics to work we need both, an existing :ref:`Tesseract installation <installation_ocr>` and the availability of `OpenCV <https://pypi.org/project/opencv-python/>`_ in the Python environment. If either is missing, no OCR is attempted at all.
147+
import pymupdf.layout # REQUIRED FIRST - enables OCR decision tree
148+
import pymupdf4llm # Now OCR heuristics are active
148149
149-
The decision tree for whether OCR is actually used or not depends on the following:
150+
md_text = pymupdf4llm.to_markdown("scanned.pdf")
151+
# Auto: detects image pages → OCR → markdown
150152
151-
1. :ref:`PyMuPDF Layout is imported <pymupdf_layout_using>`
153+
.. warning::
154+
Without ``import pymupdf.layout``, OCR is **never** attempted,
155+
even if Tesseract and OpenCV are installed.
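The import-order rule above can be checked at runtime. The following is a minimal sketch, not part of PyMuPDF: ``layout_imported_first`` is a hypothetical helper, and it assumes that the key order of ``sys.modules`` reflects the order in which modules were first imported.

```python
import sys
from typing import Iterable


def layout_imported_first(module_names: Iterable[str]) -> bool:
    """Return True if pymupdf.layout was imported before pymupdf4llm.

    `module_names` is expected in import order, for example the keys of
    sys.modules (dicts preserve insertion order). Hypothetical helper.
    """
    names = list(module_names)
    if "pymupdf.layout" not in names:
        return False  # layout was never imported
    if "pymupdf4llm" not in names:
        return True  # layout is loaded; pymupdf4llm may still follow
    return names.index("pymupdf.layout") < names.index("pymupdf4llm")


if __name__ == "__main__":
    # Inspect the current interpreter state.
    print(layout_imported_first(sys.modules))
```

Calling the helper after your imports gives a quick sanity check; it does not inspect any PyMuPDF internals.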
152156

153-
2. In the :ref:`PyMuPDF4LLM API <pymupdf4llm-api>` you have `use_ocr` enabled (this is set to `True` by default)
157+
**Complete Requirements** (all must be satisfied)
158+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
154159

155-
3. :ref:`Tesseract is correctly installed <installation_ocr>`
160+
.. list-table:: OCR Decision Prerequisites
161+
:widths: 15 85
162+
:header-rows: 1
156163

157-
4. `OpenCV <https://pypi.org/project/opencv-python/>`_ is available in your Python environment
164+
* - Check
165+
- Requirement
166+
* - 1. Layout
167+
- :ref:`PyMuPDF Layout is imported <pymupdf_layout_using>`
168+
* - 2. OCR API
169+
- ``use_ocr`` is enabled in the :ref:`PyMuPDF4LLM API <pymupdf4llm-api>` (it is ``True`` by default)
170+
* - 3. Tesseract
171+
- :ref:`Tesseract OCR is correctly installed <installation_ocr>`
172+
* - 4. OpenCV
173+
- Available in the Python environment (``pip install opencv-python``)
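The environment-related prerequisites in the table can be probed before converting a document. This is a rough sketch under stated assumptions: ``module_available`` and ``ocr_prerequisites`` are hypothetical helpers, and the Tesseract probe (a ``TESSDATA_PREFIX`` variable or a ``tesseract`` executable on ``PATH``) is only an approximation of how an installation might be detected.

```python
import importlib.util
import os
import shutil


def module_available(name: str) -> bool:
    """Return True if `name` can be located without fully importing it.

    For dotted names, find_spec imports the parent package first and
    raises ModuleNotFoundError if that parent is missing.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


def ocr_prerequisites() -> dict:
    """Collect a rough status report for the OCR prerequisites."""
    # Assumption: Tesseract is considered present if TESSDATA_PREFIX is set
    # or a `tesseract` executable is on PATH.
    tesseract_found = bool(os.environ.get("TESSDATA_PREFIX")) or (
        shutil.which("tesseract") is not None
    )
    return {
        "pymupdf.layout": module_available("pymupdf.layout"),
        "pymupdf4llm": module_available("pymupdf4llm"),
        "opencv": module_available("cv2"),
        "tesseract": tesseract_found,
    }


if __name__ == "__main__":
    for name, ok in ocr_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

The report only says whether the pieces can be found; whether ``pymupdf.layout`` was actually imported before ``pymupdf4llm`` and whether ``use_ocr`` is enabled still depend on your own code.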
158174

175+
**Smart OCR Heuristics** (Detailed)
176+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
159177

160-
.. image:: ../images/layout-ocr-flow.png
178+
The new layout-sensitive |PyMuPDF4LLM| version also evaluates whether a page would benefit from OCR. If its heuristics conclude that it would, the built-in Tesseract-OCR module is invoked automatically and its results are handled like normal page content.
161179

162-
----
180+
If a page contains (roughly) **no text at all** but is covered with **images or many character-sized vectors**, `OpenCV <https://pypi.org/project/opencv-python/>`_ is used to check whether text is *probably* detectable on the page at all. This distinguishes **image-based text** from ordinary pictures such as photographs.
181+
182+
If the page **does contain text** but **too many characters are unreadable** (like "�����"), OCR is also executed, but **for the affected text areas only** – not the full page. This way, we avoid losing already existing text and other content like images and vectors.
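To illustrate the "too many unreadable characters" idea, here is a minimal sketch that measures the share of Unicode replacement characters (U+FFFD) in extracted text. It is not the library's actual heuristic: the function names and the ``threshold`` value of 0.1 are assumptions chosen for illustration.

```python
REPLACEMENT_CHAR = "\ufffd"  # shown as "�" when a character cannot be decoded


def unreadable_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are U+FFFD."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    unreadable = sum(1 for c in chars if c == REPLACEMENT_CHAR)
    return unreadable / len(chars)


def needs_partial_ocr(text: str, threshold: float = 0.1) -> bool:
    """Flag a text area for OCR when too many characters are unreadable.

    `threshold` is a hypothetical cut-off, not a documented library value.
    """
    return unreadable_ratio(text) > threshold


if __name__ == "__main__":
    print(needs_partial_ocr("Readable paragraph of text."))
    print(needs_partial_ocr("\ufffd\ufffd\ufffd\ufffd\ufffd"))
```

Applying such a check per text area rather than per page mirrors the behaviour described above: only the affected areas are re-read, so existing text, images and vectors are kept.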
183+
184+
**OCR Decision Tree**
185+
^^^^^^^^^^^^^^^^^^^^
186+
187+
.. image:: ../images/layout-ocr-flow.png
163188

164189
.. _pymupdf_layout_and_pymupdf4llm_api:
165190
