Commit 11e26ac

pymupdf4llm updates
1 parent 8cdc0e2 commit 11e26ac

2 files changed

Lines changed: 70 additions & 4 deletions

docs/pymupdf4llm/api.rst

Lines changed: 67 additions & 1 deletion
@@ -44,6 +44,7 @@ The |PyMuPDF4LLM| API
4444
detect_bg_color: bool = True, \
4545
dpi: int = 150, \
4646
use_ocr: bool = True, \
47+
ocr_language: str = "eng", \
4748
ocr_dpi: int = 400, \
4849
embed_images: bool = False, \
4950
extract_words: bool = False, \
@@ -82,6 +83,8 @@ The |PyMuPDF4LLM| API
8283

8384
:arg bool use_ocr: |PyMuPDFLayoutMode_Valid| use :ref:`OCR capability <pymupdf_layout_ocr_support>` to help analyse the page.
8485

86+
:arg str ocr_language: |PyMuPDFLayoutMode_Valid| specify the language to be used by the Tesseract OCR engine. Default is "eng" (English). Make sure that the respective language data files are installed. Remember to use correct Tesseract language codes. Multiple languages can be specified by concatenating the respective codes with a plus sign "+", for example "eng+deu" for English and German.
87+
8588
:arg int ocr_dpi: |PyMuPDFLayoutMode_Valid| specify the desired image resolution in dots per inch for applying OCR to the intermediate image of the page. Default value is 400. Only relevant if the page has been determined to profit from OCR (little or no text, most of the page covered by images or character-like vectors, etc.). Larger values may improve OCR precision but increase memory requirements and processing time. There is also a risk of over-sharpening the image, which may decrease OCR precision. The default value should therefore be high enough in most cases.
8689

8790
:arg bool embed_images: like `write_images`, but images will be included in the markdown text as base64-encoded strings. Mutually exclusive with `write_images` and ignores `image_path`. This may drastically increase the size of your markdown text.
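As a rough illustration of the OCR parameters in this hunk, the sketch below assembles the keyword arguments for a call. The helper `build_ocr_kwargs` and the file name `input.pdf` are hypothetical and not part of the package; only the parameter names `use_ocr`, `ocr_language`, `ocr_dpi` and the "+" joining rule come from the text above. The actual call is left as a comment because it needs the package, the layout mode and the Tesseract language data to be installed.

```python
def build_ocr_kwargs(languages, dpi=400):
    """Hypothetical helper: assemble the OCR keyword arguments described above."""
    if not languages:
        raise ValueError("at least one Tesseract language code is required")
    return {
        "use_ocr": True,
        # Multiple Tesseract codes are concatenated with a plus sign.
        "ocr_language": "+".join(languages),
        "ocr_dpi": dpi,
    }


kwargs = build_ocr_kwargs(["eng", "deu"])
print(kwargs["ocr_language"])

# Assumed usage (not executed here):
# import pymupdf4llm
# md = pymupdf4llm.to_markdown("input.pdf", **kwargs)
```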
@@ -128,7 +131,7 @@ The |PyMuPDF4LLM| API
128131

129132
- **"toc_items"** - a list of Table of Contents items pointing to this page. Each item of this list has the format `[lvl, title, pagenumber]`, where `lvl` is the hierarchy level, `title` is a string and `pagenumber` is a 1-based page number.
130133

131-
- **"tables"** - a list of tables on this page. Each item is a dictionary with keys "bbox", "row_count" and "col_count". Key "bbox" is a `pymupdf.Rect` in tuple format of the table's position on the page.
134+
- **"tables"** - |PyMuPDFLayoutMode_EmptyList| a list of tables on this page. Each item is a dictionary with keys "bbox", "row_count" and "col_count". Key "bbox" is a `pymupdf.Rect` in tuple format of the table's position on the page.
132135

133136
- **"images"** - |PyMuPDFLayoutMode_EmptyList| a list of images on the page. This is a copy of page method :meth:`Page.get_image_info`.
134137

@@ -138,6 +141,17 @@ The |PyMuPDF4LLM| API
138141

139142
- **"words"** - |PyMuPDFLayoutMode_EmptyList| if `extract_words=True` was used. This is a list of tuples `(x0, y0, x1, y1, "wordstring", bno, lno, wno)` as delivered by `page.get_text("words")`. The **sequence** of these tuples however is the same as produced in the markdown text string and thus honors multi-column text. This is also true for text in tables: words are extracted in the sequence of table row cells.
140143

144+
- **"text"** - page content as |Markdown| text.
145+
146+
- **"page_boxes"** - |PyMuPDFLayoutMode_Valid| a list of dictionaries representing the layout boundary boxes. Each dictionary has the following structure::
147+
148+
{
149+
"index": int, # 0-based index of the box in reading sequence
150+
"class": str, # one of "text", "picture", "table", etc.
151+
"bbox": [x0, y0, x1, y1], # boundary box coordinates
152+
"pos": (start, stop) # 0-based integers: bbox_text = chunk["text"][start:stop]
153+
}
154+
141155
:arg float page_height: specify a desired page height. For relevance see the `page_width` parameter. If using the default `None`, the document will appear as one large page with a width of `page_width`. Consequently, no markdown page separators will occur in this case (except the final one), and only one page chunk will be returned.
142156

143157
:arg bool page_separators: if ``True`` inserts a string ``--- end of page=n ---`` at the end of each page output. Intended for debugging purposes. The page number is 0-based. The separator string is wrapped with line breaks. Default is ``False``.
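For debugging, the separators can also be used to split the output back into pages. The following is a sketch only: `split_pages` is a hypothetical helper, the sample string is hand-written, and the regular expression assumes the separator appears exactly as ``--- end of page=n ---`` with any number of surrounding line breaks.

```python
import re

# Assumed separator format, based on the description above.
SEPARATOR = re.compile(r"\n*--- end of page=(\d+) ---\n*")


def split_pages(markdown):
    """Map 0-based page numbers to the text preceding each separator."""
    pages = {}
    start = 0
    for match in SEPARATOR.finditer(markdown):
        pages[int(match.group(1))] = markdown[start:match.start()]
        start = match.end()
    return pages


sample = (
    "Page one text\n\n--- end of page=0 ---\n\n"
    "Page two text\n\n--- end of page=1 ---\n\n"
)
pages = split_pages(sample)
print(pages)
```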
@@ -168,6 +182,12 @@ The |PyMuPDF4LLM| API
168182

169183
:arg Document,str doc: the file, to be specified either as a file path string, or as a |PyMuPDF| :class:`Document` (created via `pymupdf.open`). In order to use `pathlib.Path` specifications, Python file-like objects, documents in memory etc. you **must** use a |PyMuPDF| :class:`Document`.
170184

185+
:arg bool use_ocr: |PyMuPDFLayoutMode_Valid| use :ref:`OCR capability <pymupdf_layout_ocr_support>` to help analyse the page.
186+
187+
:arg str ocr_language: |PyMuPDFLayoutMode_Valid| specify the language to be used by the Tesseract OCR engine. Default is "eng" (English). Make sure that the respective language data files are installed. Remember to use correct Tesseract language codes. Multiple languages can be specified by concatenating the respective codes with a plus sign "+", for example "eng+deu" for English and German.
188+
189+
:arg int ocr_dpi: |PyMuPDFLayoutMode_Valid| specify the desired image resolution in dots per inch for applying OCR to the intermediate image of the page. Default value is 400. Only relevant if the page has been determined to profit from OCR (little or no text, most of the page covered by images or character-like vectors, etc.). Larger values may improve OCR precision but increase memory requirements and processing time. There is also a risk of over-sharpening the image, which may decrease OCR precision. The default value should therefore be high enough in most cases.
190+
171191
:arg bool header: boolean to switch on/off page header content. This parameter controls whether to include or omit the header content from all the document pages. Useful if the document has repetitive header content which doesn't add any value to the overall extraction data. Default is `True` meaning that header content will be written.
172192

173193
:arg bool footer: boolean to switch on/off page footer content. This parameter controls whether to include or omit the footer content from all the document pages. Useful if the document has repetitive footer content which doesn't add any value to the overall extraction data. Default is `True` meaning that footer content will be written.
@@ -180,6 +200,28 @@ The |PyMuPDF4LLM| API
180200

181201
:arg bool show_progress: Default is `False`. A value of `True` displays a progress bar as pages are being converted. Package `tqdm <https://pypi.org/project/tqdm/>`_ is used if installed, otherwise the built-in text based progress bar is used.
182202

203+
:arg bool page_chunks: if `True` the output will be a list of `Document.page_count` dictionaries (one per page). Each dictionary has the following structure:
204+
205+
- **"metadata"** - a dictionary consisting of the document's metadata :attr:`Document.metadata`, enriched with additional keys **"file_path"** (the file name), **"page_count"** (number of pages in document), and **"page_number"** (1-based page number).
206+
207+
- **"toc_items"** - a list of Table of Contents items pointing to this page. Each item of this list has the format `[lvl, title, pagenumber]`, where `lvl` is the hierarchy level, `title` is a string and `pagenumber` is a 1-based page number.
208+
209+
- **"tables"** - empty list.
210+
- **"images"** - empty list.
211+
- **"graphics"** - empty list.
212+
- **"words"** - empty list.
213+
214+
- **"text"** - page content as plain text.
215+
216+
- **"page_boxes"** - a list of dictionaries representing the layout boundary boxes. Each dictionary has the following structure::
217+
218+
{
219+
"index": int, # 0-based index of the box in reading sequence
220+
"class": str, # one of "text", "picture", "table", etc.
221+
"bbox": [x0, y0, x1, y1], # boundary box coordinates
222+
"pos": (start, stop) # 0-based integers: bbox_text = chunk["text"][start:stop]
223+
}
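One possible way to consume a list of page chunks is to group the text of all layout boxes by their class. This is a sketch under assumptions: `texts_by_class` is a hypothetical helper and `chunks` is hand-built sample data that only mimics the keys described above (`"metadata"`, `"text"`, `"page_boxes"`).

```python
from collections import defaultdict


def texts_by_class(chunks):
    """Group (page_number, text) pairs by layout box class across all page chunks."""
    grouped = defaultdict(list)
    for chunk in chunks:
        page_no = chunk["metadata"]["page_number"]  # 1-based, per the description
        for box in chunk["page_boxes"]:
            start, stop = box["pos"]
            grouped[box["class"]].append((page_no, chunk["text"][start:stop]))
    return dict(grouped)


# Hand-built sample with one page; offsets and coordinates are illustrative only.
chunks = [
    {
        "metadata": {"page_number": 1},
        "text": "Intro text\nA  B\n1  2\n",
        "page_boxes": [
            {"index": 0, "class": "text", "bbox": [72, 72, 300, 90], "pos": (0, 10)},
            {"index": 1, "class": "table", "bbox": [72, 100, 300, 140], "pos": (11, 20)},
        ],
    }
]

grouped = texts_by_class(chunks)
print(grouped["table"])
```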
224+
183225

184226
.. method:: to_json(doc: pymupdf.Document | str, *, **kwargs) -> str
185227

@@ -189,6 +231,12 @@ The |PyMuPDF4LLM| API
189231

190232
:arg Document,str doc: the file, to be specified either as a file path string, or as a |PyMuPDF| :class:`Document` (created via `pymupdf.open`). In order to use `pathlib.Path` specifications, Python file-like objects, documents in memory etc. you **must** use a |PyMuPDF| :class:`Document`.
191233

234+
:arg bool use_ocr: |PyMuPDFLayoutMode_Valid| use :ref:`OCR capability <pymupdf_layout_ocr_support>` to help analyse the page.
235+
236+
:arg str ocr_language: |PyMuPDFLayoutMode_Valid| specify the language to be used by the Tesseract OCR engine. Default is "eng" (English). Make sure that the respective language data files are installed. Remember to use correct Tesseract language codes. Multiple languages can be specified by concatenating the respective codes with a plus sign "+", for example "eng+deu" for English and German.
237+
238+
:arg int ocr_dpi: |PyMuPDFLayoutMode_Valid| specify the desired image resolution in dots per inch for applying OCR to the intermediate image of the page. Default value is 400. Only relevant if the page has been determined to profit from OCR (little or no text, most of the page covered by images or character-like vectors, etc.). Larger values may improve OCR precision but increase memory requirements and processing time. There is also a risk of over-sharpening the image, which may decrease OCR precision. The default value should therefore be high enough in most cases.
239+
192240
:arg int image_dpi: specify the desired image resolution in dots per inch. Default value is 150. Only relevant if one of the parameters `write_images=True` or `embed_images=True` is used.
193241

194242
:arg str image_format: specify the desired image format via its extension. Default is "png" (portable network graphics). Another popular format may be "jpg". Possible values are all :ref:`supported output formats <Supported_File_Types>`. Only relevant if one of the parameters `write_images=True` or `embed_images=True` is used.
@@ -206,6 +254,24 @@ The |PyMuPDF4LLM| API
206254
:arg list pages: optional, the pages to consider for output (caution: specify 0-based page numbers). If omitted (`None`) all pages are processed. Specify any valid Python sequence containing integers between `0` and `page_count - 1`.
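Because `pages` expects 0-based numbers, a small conversion step can help when page numbers come from a human-facing source. The helper `to_zero_based` below is hypothetical and only illustrates the range rule stated above (integers between `0` and `page_count - 1`).

```python
def to_zero_based(page_numbers, page_count):
    """Convert 1-based page numbers to the 0-based values expected by `pages`."""
    result = []
    for number in page_numbers:
        if not 1 <= number <= page_count:
            raise ValueError(f"page {number} is outside 1..{page_count}")
        result.append(number - 1)
    return result


pages = to_zero_based([1, 3, 5], page_count=10)
print(pages)
```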
207255

208256

257+
.. method:: get_key_values(doc: pymupdf.Document | str) -> list[dict]
258+
259+
Parse the document if it is a **Form PDF** and extract key-value pairs from all form fields (widgets).
260+
261+
Please note that this method is only relevant for PDF documents that contain widgets. Otherwise, an empty list will be returned.
262+
263+
The function is always available, regardless of whether you are using :ref:`PyMuPDF Layout <pymupdf-layout>` or not.
264+
265+
Each dictionary item has the following structure::
266+
267+
{
268+
"field_name": str, # the full name of the form field, components separated by dots
269+
{
270+
"value": str, # the field value as string
271+
"pages": list # list of 0-based page numbers where the field appears
272+
}
273+
}
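The structure shown above leaves the exact nesting open. The sketch below therefore rests on an explicit assumption: each list item is a dictionary mapping a full field name to a dictionary with the keys `"value"` and `"pages"`. The helper `fields_by_page` and the sample data are hypothetical and should be adjusted to the actual return value.

```python
def fields_by_page(items):
    """Index form field values by 0-based page number (assumed item layout)."""
    pages = {}
    for item in items:
        # Assumption: {field_name: {"value": str, "pages": list}}
        for name, info in item.items():
            for page_no in info["pages"]:
                pages.setdefault(page_no, {})[name] = info["value"]
    return pages


# Hand-built sample data following the assumed layout.
sample = [
    {"applicant.name": {"value": "Jane Doe", "pages": [0]}},
    {"applicant.signature_date": {"value": "2024-01-31", "pages": [0, 2]}},
]

index = fields_by_page(sample)
print(index)
```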
274+
209275
.. note::
210276

211277
Please see `this site <https://github.com/pymupdf/pymupdf4llm/discussions/327>`_ for more background and the current status of further improvements regarding usage with :ref:`PyMuPDF Layout <pymupdf-layout>`.

docs/pymupdf4llm/index.rst

Lines changed: 3 additions & 3 deletions
@@ -36,11 +36,11 @@ Features
3636
Functionality
3737
--------------------
3838

39-
- This package converts the pages of a file to text in **Markdown** format using |PyMuPDF|.
39+
- This package converts the pages of a file to plain text or to text in **Markdown** format using |PyMuPDF|.
4040

41-
- Standard text and tables are detected, brought in the right reading sequence and then together converted to **GitHub**-compatible **Markdown** text.
41+
- Standard text and tables are detected, arranged in the correct reading sequence and then converted together to **GitHub**-compatible **Markdown** text. Tables in plain text output mode are rendered using the `tabulate <https://pypi.org/project/tabulate/>`_ package.
4242

43-
- Header lines are identified via the font size and appropriately prefixed with one or more `#` tags.
43+
- Header lines are identified via the font size and appropriately prefixed with one or more `#` tags. When using the package together with `PyMuPDF Layout <https://pypi.org/project/pymupdf-layout/>`_, titles, section headers, page headers and page footers are detected.
4444

4545
- Bold, italic, mono-spaced text and code blocks are detected and formatted accordingly. The same applies to ordered and unordered lists.
4646
