docs/pymupdf-layout/index.rst
Lines changed: 37 additions & 12 deletions
@@ -138,28 +138,53 @@ Now we can happily load Office files and convert them as follows::
138
138
OCR support
139
139
~~~~~~~~~~~~~~~~~
140
140
141
-
The new layout-sensitive |PyMuPDF4LLM| version also evaluates whether a page would benefit from applying OCR to it. If its heuristics come to this conclusion, the built-in Tesseract-OCR module is automatically invoked. Its results are then handled like normal page content.
142
-
143
-
If a page contains (roughly) no text at all, but is covered with images or many character-sized vectors, a check is made using `OpenCV <https://pypi.org/project/opencv-python/>`_ whether text is *probably* detectable on the page at all. This is done to tell apart image-based text from ordinary pictures (like photographs).
141
+
**Critical: Import pymupdf.layout FIRST**
142
+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
144
143
145
-
If the page does contain text but too many characters are unreadable (like "�����"), OCR is also executed, but **for the affected text areas only** -- not the full page. This way, we avoid losing already existing text and other content like images and vectors.
144
+
.. code-block:: python
145
+
:emphasize-lines: 1
146
146
147
-
For these heuristics to work we need both, an existing :ref:`Tesseract installation <installation_ocr>` and the availability of `OpenCV <https://pypi.org/project/opencv-python/>`_ in the Python environment. If either is missing, no OCR is attempted at all.
147
+
import pymupdf.layout # REQUIRED FIRST - enables OCR decision tree
148
+
import pymupdf4llm # Now OCR heuristics are active
148
149
149
-
The decision tree for whether OCR is actually used or not depends on the following:
150
+
md_text = pymupdf4llm.to_markdown("scanned.pdf")
151
+
# Auto: detects image pages → OCR → markdown
150
152
151
-
1. :ref:`PyMuPDF Layout is imported <pymupdf_layout_using>`
153
+
.. warning::
154
+
Without ``import pymupdf.layout``, OCR is **never** attempted,
155
+
even if Tesseract and OpenCV are installed.
152
156
153
-
2. In the :ref:`PyMuPDF4LLM API <pymupdf4llm-api>` you have `use_ocr` enabled (this is set to `True` by default)
157
+
**Complete Requirements** (all must be satisfied)
158
+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
154
159
155
-
3. :ref:`Tesseract is correctly installed <installation_ocr>`
160
+
.. list-table:: OCR Decision Prerequisites
161
+
:widths: 15 85
162
+
:header-rows: 1
156
163
157
-
4. `OpenCV <https://pypi.org/project/opencv-python/>`_ is available in your Python environment
164
+
* - Check
165
+
- Requirement
166
+
* - 1. Layout
167
+
- :ref:`PyMuPDF Layout is imported <pymupdf_layout_using>`
168
+
* - 2. OCR API
169
+
- ``use_ocr`` is enabled in the :ref:`PyMuPDF4LLM API <pymupdf4llm-api>` (it is ``True`` by default)
170
+
* - 3. Tesseract
171
+
- :ref:`Tesseract OCR is correctly installed <installation_ocr>`
172
+
* - 4. OpenCV
173
+
- Available in the Python environment (``pip install opencv-python``)
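The four prerequisites above can be checked up front, before any document is converted. The following is a minimal sketch, not part of PyMuPDF: the helper names ``module_available`` and ``check_ocr_prerequisites`` are hypothetical, and looking for a ``tesseract`` executable on ``PATH`` is only an assumed, rough proxy for "Tesseract is correctly installed" (the installation page remains authoritative). Checking ``pymupdf.layout`` this way imports the parent ``pymupdf`` package as a side effect.

```python
import importlib.util
import shutil


def module_available(name: str) -> bool:
    """Return True if the module can be located without fully importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or has no usable spec.
        return False


def check_ocr_prerequisites() -> dict:
    """Report which OCR prerequisites appear to be satisfied (hypothetical helper)."""
    return {
        "pymupdf.layout": module_available("pymupdf.layout"),
        "pymupdf4llm": module_available("pymupdf4llm"),
        "opencv (cv2)": module_available("cv2"),
        # Assumption: an executable on PATH is used as a rough proxy only.
        "tesseract on PATH": shutil.which("tesseract") is not None,
    }


if __name__ == "__main__":
    for item, ok in check_ocr_prerequisites().items():
        print(f"{item}: {'ok' if ok else 'missing'}")
```

If any entry is reported as missing, OCR should not be expected to run, regardless of the ``use_ocr`` setting.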
158
174
175
+
**Smart OCR Heuristics** (Detailed)
176
+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
159
177
160
-
.. image:: ../images/layout-ocr-flow.png
178
+
The new layout-sensitive |PyMuPDF4LLM| version also evaluates whether a page would benefit from OCR. If its heuristics conclude that it would, the built-in Tesseract OCR module is invoked automatically, and its results are handled like normal page content.
161
179
162
-
----
180
+
If a page contains (roughly) **no text at all**, but is covered with **images or many character-sized vectors**, `OpenCV <https://pypi.org/project/opencv-python/>`_ is used to check whether text is *probably* detectable on the page at all. This check distinguishes **image-based text** from ordinary pictures (like photographs).
181
+
182
+
If the page **does contain text** but **too many characters are unreadable** (like "�����"), OCR is also executed, but **for the affected text areas only** – not the full page. This way, we avoid losing already existing text and other content like images and vectors.
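The two heuristics described above can be pictured as a small decision function. This is an illustrative sketch only, not the library's implementation: the ``PageStats`` fields, the function name ``ocr_decision``, and all threshold values are assumptions chosen for the example, and the ``looks_like_text`` flag stands in for the OpenCV-based check.

```python
from dataclasses import dataclass

REPLACEMENT_CHAR = "\ufffd"  # the "unreadable character" shown in the text above


@dataclass
class PageStats:
    text: str              # text extracted from the page
    image_coverage: float  # assumed: fraction of the page covered by images/vectors
    looks_like_text: bool  # assumed: outcome of an image-analysis text check


def ocr_decision(stats: PageStats, min_chars: int = 10,
                 min_coverage: float = 0.5, max_bad_ratio: float = 0.1) -> str:
    """Return 'full_page', 'text_areas_only' or 'none' (hypothetical thresholds)."""
    chars = [c for c in stats.text if not c.isspace()]

    # Case 1: (roughly) no text. OCR the whole page only if it is mostly
    # images/vectors and the image check suggests text is detectable.
    if len(chars) < max(min_chars, 1):
        if stats.image_coverage >= min_coverage and stats.looks_like_text:
            return "full_page"
        return "none"

    # Case 2: text exists, but too many characters are unreadable.
    bad = sum(1 for c in chars if c == REPLACEMENT_CHAR)
    if bad / len(chars) > max_bad_ratio:
        return "text_areas_only"

    return "none"
```

Restricting the second case to the affected text areas mirrors the stated goal of keeping existing text, images and vectors intact.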