docs/pymupdf4llm/api.rst
+1 -1 (1 addition & 1 deletion)
@@ -119,7 +119,7 @@ The PyMuPDF4LLM API
119
119
* `(top, bottom)` yields `(0, top, 0, bottom)`.
120
120
* To always read full pages **(default)**, use `margins=0`.
121
121
122
-
:arg int ocr_dpi: |PyMuPDFLayoutMode_Valid| specify the desired image resolution in dots per inch for applying OCR to the intermediate image of the page. Default value is 300. Only relevant if the page has been determined to profit from OCR (no or few text, most of the page covered by images or character-like vectors, etc.). Larger values do not usually increase the OCR precision. There also is a risk of over-sharpening the image which may decrease OCR precision. So the default value should probably be sufficiently high - in many cases you should see satisfactory results already with values of 150 or 200. Be aware that this value has a O(n²) impact on processing time and memory requirements.
122
+
:arg int ocr_dpi: |PyMuPDFLayoutMode_Valid| specify the desired image resolution in dots per inch for applying OCR to the intermediate image of the page. Default value is 300. Only relevant if the page has been determined to profit from OCR (no or few text, most of the page covered by images or character-like vectors, etc.). Larger values do not usually increase the OCR precision. There also is a risk of over-sharpening the image which may decrease OCR precision. So the default value should probably be sufficiently high - in many cases you should see satisfactory results already with values of 150 or 200. Be aware that this value has a quadratic impact on processing time and memory requirements.
123
123
124
124
:arg callable ocr_function: |PyMuPDFLayoutMode_Valid| if you want to provide your own :ref:`OCR function <pymupdf_layout_ocr_engines>`, specify it here. If omitted (`None`), one of the available built-in OCR engines will be used.
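The "quadratic impact" wording in the ``ocr_dpi`` change can be checked with a little arithmetic. The sketch below is not PyMuPDF4LLM code: the helper name ``rendered_pixels`` and the US Letter page size (612 x 792 points) are illustrative assumptions. It only relies on the fact that a PDF point is 1/72 inch, so both image dimensions scale linearly with the DPI value.

```python
# Illustrative arithmetic only; rendered_pixels is a hypothetical helper,
# not part of PyMuPDF or PyMuPDF4LLM.

def rendered_pixels(width_pt, height_pt, dpi):
    """Return the pixel count of a page rendered at the given DPI.

    PDF page sizes are given in points (1 point = 1/72 inch), so each
    dimension is multiplied by dpi / 72.
    """
    scale = dpi / 72.0
    return round(width_pt * scale) * round(height_pt * scale)


# Assumed example page: US Letter, 612 x 792 points.
for dpi in (150, 200, 300):
    pixels = rendered_pixels(612, 792, dpi)
    # 3 bytes per pixel for an uncompressed RGB image without alpha
    print(f"dpi={dpi}: {pixels} pixels, about {pixels * 3 / 1e6:.1f} MB as raw RGB")
```

Doubling the value from 150 to 300 quadruples the number of pixels the OCR engine has to process, which is why lower values such as 150 or 200 can be noticeably cheaper.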
docs/pymupdf4llm/ocr-plugins.rst
+10 -6 (10 additions & 6 deletions)
@@ -1,7 +1,10 @@
1
+
.. include:: ../header.rst
2
+
3
+
1
4
Default OCR Functions
2
5
======================
3
6
4
-
PyMuPDF4LLM supports default OCR functions. They come in the form of plugins that are present in its `ocr` subpackage. They are based on currently 3 popular OCR engines, Tesseract OCR, RapidOCR and PaddleOCR. Some engines can be combined to make use of their strengths and mitigate their weaknesses. For example, Tesseract OCR is very good at **recognizing** text, while RapidOCR is better at **detecting** text bounding boxes in images with complex backgrounds. By combining the two engines, we can achieve better overall OCR results while at the samne time also reducing the overall OCR processing time.
7
+
PyMuPDF4LLM supports default OCR functions. They come in the form of plugins that are present in its `ocr` subpackage. They are based on currently 3 popular OCR engines, Tesseract OCR, RapidOCR and PaddleOCR. Some engines can be combined to make use of their strengths and mitigate their weaknesses. For example, Tesseract OCR is very good at **recognizing** text, while RapidOCR is better at **detecting** text bounding boxes in images with complex backgrounds. By combining the two engines, we can achieve better overall OCR results while at the same time also reducing the overall OCR processing time.
5
8
6
9
Here is an overview of the available default plugins:
7
10
@@ -15,7 +18,7 @@ rapidtess_api RapidOCR + Tesseract OCR Uses RapidOCR for text **detection** an
15
18
paddletess_api PaddleOCR + Tesseract OCR Uses PaddleOCR for text **detection** and Tesseract OCR for text **recognition**
18
-
If not explicitely selected via the `ocr_function` parameter, PyMuPDF4LLM will check the availability of the three OCR engines and pick one of the above plugins in the following order of preference:
21
+
If not explicitly selected via the `ocr_function` parameter, PyMuPDF4LLM will check the availability of the three OCR engines and pick one of the above plugins in the following order of preference:
19
22
20
23
1. `rapidtess_api` (if both RapidOCR and Tesseract OCR are available)
21
24
2. `paddletess_api` (if both PaddleOCR and Tesseract OCR are available)
@@ -38,7 +41,7 @@ The provided default plugins use the following **"hybrid"** OCR approach:
38
41
39
42
In this way, all original content (text and other elements) is preserved and only **augmented** with the newly recognized text. This allows for a more accurate and complete text extraction while also preserving the original document structure and formatting as much as possible. It also allows for a more efficient OCR processing since only the non-extractable text is processed by the OCR engine. This can significantly reduce the overall processing time.
40
43
41
-
It also increases the chances for a successful layout detection, because other original content like vectors remains intact and is not rendered to pixels.
44
+
It also increases the chances for a successful layout detection, because other original content like vectors remain intact and will not be rendered to pixels.
42
45
43
46
Forcing the Choice of a Default Plugin
44
47
---------------------------------------
@@ -69,7 +72,7 @@ If you want to use your own OCR function, you can do so as follows::
69
72
70
73
Your plugin must accept at least the ``page`` parameter which is a PyMuPDF Page object. The other parameters are optional. The plugin must create (or extend) the text of the passed-in page object by simply inserting text (using any of PyMuPDF's text insertion methods). No return values expected.
71
74
72
-
Be prepared to accept ``None`` or a PyMuPDF Pixmap object as the `pixmap` parameter, which is the rendered image of the page if provided. Parameters ``dpi`` and ``language`` are passed thru from the respective function parameters.
75
+
Be prepared to accept ``None`` or a PyMuPDF Pixmap object as the `pixmap` parameter, which is the rendered image of the page if provided. Parameters ``dpi`` and ``language`` are passed through from the respective function parameters.
73
76
74
77
75
78
Selecting Pages for OCR
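The custom-plugin contract described in the hunk above (accept ``page``, tolerate ``pixmap`` being ``None``, receive ``dpi`` and ``language``, insert text, return nothing) might be sketched as follows. This is a minimal sketch under stated assumptions, not the library's implementation: ``recognize_words``, the extra ``recognizer`` parameter, and the default values ``dpi=300`` and ``language="eng"`` are hypothetical. It also assumes that a passed-in pixmap was rendered at ``dpi``. Only ``Page.get_pixmap(dpi=...)`` and ``Page.insert_text(..., render_mode=3)`` are existing PyMuPDF calls.

```python
def recognize_words(pixmap, language):
    """Hypothetical stand-in for a real OCR engine call.

    Expected to return (x_px, y_px, text) tuples in pixel coordinates.
    This placeholder recognizes nothing.
    """
    return []


def my_ocr_plugin(page, pixmap=None, dpi=300, language="eng",
                  recognizer=recognize_words):
    """Sketch of a custom OCR plugin; defaults here are assumptions."""
    if pixmap is None:
        # No rendered image was passed in, so render the page ourselves.
        pixmap = page.get_pixmap(dpi=dpi)

    # Convert pixel coordinates back to PDF points (1 point = 1/72 inch).
    # Assumes the pixmap was rendered at `dpi`.
    scale = 72.0 / dpi

    for x_px, y_px, text in recognizer(pixmap, language):
        # render_mode=3 inserts invisible text, the mode that marks OCR spans.
        page.insert_text((x_px * scale, y_px * scale), text, render_mode=3)
    # No return value is expected by the caller.
```

Keeping the recognition step behind a replaceable callable makes it possible to exercise the coordinate handling without an OCR engine installed.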
@@ -110,7 +113,7 @@ The reason is one of the following values:
110
113
* "chars_bad": more than 10% of all characters are illegible (i.e. Replacement Unicode characters)
111
114
* "ocr_spans": there exist text spans created from previous OCR executions (render mode 3)
112
115
* "vec_text": there exist suspected vector-based glyphs
113
-
* "img_text": there exist images which probably contains recognizable text
116
+
* "img_text": there exist images which (probably) contain recognizable text
114
117
115
118
Based on this analysis, PyMuPDF4LLM will decide whether to invoke or skip OCR for a page. This is done to optimize processing time and resource usage by only performing OCR when it is likely to yield additional text content that cannot be extracted by other means.
116
119
@@ -129,7 +132,7 @@ You can override this logic in the following ways:
129
132
analysis = analyze_page(page)
130
133
if not analysis["needs_ocr"]:
131
134
return None
132
-
135
+
133
136
# if OCR is recommended, you can decide differently based on your own insights, e.g.
134
137
if analysis["reason"] == "ocr_spans":
135
138
# we might want to accept previous OCR:
@@ -140,3 +143,4 @@ You can override this logic in the following ways: