|
32 | 32 |
|
33 | 33 | ## Why PyMuPDF? |
34 | 34 |
|
35 | | -- **Fast** — powered by [MuPDF](https://mupdf.com/) , a best-in-class C rendering engine |
| 35 | +- **Fast** — powered by [MuPDF](https://mupdf.com/), a best-in-class C rendering engine |
36 | 36 | - **Accurate** — pixel-perfect text extraction with font, color, and position metadata |
37 | 37 | - **Versatile** — read, write, annotate, redact, merge, split, and convert documents |
38 | 38 | - **LLM-ready** — native Markdown output via [PyMuPDF4LLM](https://pypi.org/project/pymupdf4llm/) for RAG and AI pipelines |
@@ -513,13 +513,13 @@ for rect in locations: |
513 | 513 |
|
514 | 514 | ### `get_images` shows no images but I can clearly see charts in the PDF. Why? |
515 | 515 |
|
516 | | -Charts and diagrams created by tools like matplotlib, Excel, or R are typically rendered as vector graphics (PDF drawing commands), not raster images. `get_images ` only lists embedded raster image objects and will not detect vector graphics. To capture these, rasterise the entire page with `page.get_pixmap()`. |
| 516 | +Charts and diagrams created by tools like matplotlib, Excel, or R are typically rendered as vector graphics (PDF drawing commands), not raster images. `get_images` only lists embedded raster image objects and will not detect vector graphics. To capture these, rasterise the entire page with `page.get_pixmap()`. |
517 | 517 |
|
518 | 518 |
|
519 | 519 |
|
520 | 520 | ### How does OCR work in PyMuPDF? Does it require a separate Tesseract installation? |
521 | 521 |
|
522 | | -PyMuPDF uses Tesseract for OCR, but Tesseract's C++ code is compiled directly into MuPDF — it is not called as an external subprocess. The only external requirement is the **Tesseract language data files** (`tessdata`). Over 100 languages are supported. There is no Python-level pytesseract dependency. |
| 522 | +PyMuPDF uses MuPDF's built-in Tesseract-based OCR support, so there is no Python-level `pytesseract` dependency. However, PyMuPDF still needs access to the **Tesseract language data files** (`tessdata`), and automatic tessdata discovery may invoke the `tesseract` executable (for example, to list available languages) if you do not explicitly provide a tessdata path. In practice, the recommended setup is to either install Tesseract so discovery works automatically, or configure the tessdata location yourself via the `tessdata` parameter or the `TESSDATA_PREFIX` environment variable. Over 100 languages are supported. |
523 | 523 |
|
524 | 524 | ```python |
525 | 525 | import pymupdf |
@@ -740,7 +740,7 @@ Full installation guide, API reference, cookbook, and tutorial at **[pymupdf.rea |
740 | 740 |
|
741 | 741 | | Project | Description | |
742 | 742 | |---|---| |
743 | | -| [PyMuPDF4LLM](https://github.com/pymupdf/pymupdf4llm) | TLLM/RAG-optimised Markdown and JSON extraction | |
| 743 | +| [PyMuPDF4LLM](https://github.com/pymupdf/pymupdf4llm) | LLM/RAG-optimised Markdown and JSON extraction | |
744 | 744 | | [PyMuPDF Pro](https://pymupdf.io/pro) | Adds Office and HWP document support | |
745 | 745 | | [pymupdf-fonts](https://pypi.org/project/pymupdf-fonts/) | Extended font collection for PyMuPDF text output | |
746 | 746 |
|
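The OCR change in this diff says the tessdata location can be supplied explicitly or through the `TESSDATA_PREFIX` environment variable. A minimal sketch of that lookup order follows. The helper name `resolve_tessdata` and its `env` parameter are hypothetical and not part of PyMuPDF, and the sketch deliberately skips the automatic discovery step that may invoke the `tesseract` executable.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_tessdata(explicit: Optional[str] = None,
                     env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Hypothetical helper (not a PyMuPDF API): choose a tessdata folder.

    Lookup order assumed here: the explicit argument first, then the
    TESSDATA_PREFIX environment variable. Returns None when neither
    points at an existing directory.
    """
    # Allow a custom mapping so the lookup can be exercised without
    # touching the real process environment.
    env = os.environ if env is None else env
    for candidate in (explicit, env.get("TESSDATA_PREFIX")):
        if candidate and Path(candidate).is_dir():
            return str(Path(candidate))
    return None
```

A non-`None` result could then be passed as the `tessdata` parameter mentioned in the updated paragraph. A `None` result means the caller would be relying on PyMuPDF's own discovery.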
|