Commit 7e6c61c

Updates PyMuPDF4LLM section with further info on boxclass.
1 parent 64104e2 commit 7e6c61c

1 file changed

Lines changed: 37 additions & 2 deletions

docs/pymupdf4llm/api.rst

@@ -152,6 +152,8 @@ The PyMuPDF4LLM API
152152
"pos": (start, stop), # 0-based integers: bbox_text = chunk["text"][start:stop]
153153
}
154154

155+
See: :ref:`box classes <pymupdf4llm-api-boxclasses>`
156+
155157
:arg float page_height: specify a desired page height. For relevance see the `page_width` parameter. If using the default `None`, the document will appear as one large page with a width of `page_width`. Consequently in this case, no markdown page separators will occur (except the final one), respectively only one page chunk will be returned.
156158

157159
:arg bool page_separators: if ``True`` inserts a string ``--- end of page=n ---`` at the end of each page output. Intended for debugging purposes. The page number is 0-based. The separator string is wrapped with line breaks. Default is ``False``.
@@ -220,11 +222,13 @@ The PyMuPDF4LLM API
220222
"bbox": [x0, y0, x1, y1], # boundary box coordinates
221223
"pos": (start, stop), # 0-based integers: bbox_text = chunk["text"][start:stop]
222224
}
225+
226+
See: :ref:`box classes <pymupdf4llm-api-boxclasses>`
223227

224228

225229
.. method:: to_json(doc: pymupdf.Document | str, *, **kwargs) -> str
226230

227-
Parses the document and the specified pages and converts the result into a |JSON|-formatted string.
231+
Parses the document and the specified pages and converts the result into a `JSON formatted string <https://docs.pdf4llm.com/python/reference/JSON-schema>`_.
228232

229233
:arg Document,str doc: the file, to be specified either as a file path string, or as a |PyMuPDF| :class:`Document` (created via `pymupdf.open`). In order to use `pathlib.Path` specifications, Python file-like objects, documents in memory etc. you **must** use a |PyMuPDF| :class:`Document`.
230234

@@ -246,10 +250,41 @@ The PyMuPDF4LLM API
246250

247251
:arg bool embed_images: store image binaries for "picture" boundary boxes. Base64-encoded images are included in the JSON output. Ignores `image_path` if used. This may drastically increase the size of your JSON text.
248252

249-
:arg bool write_images: store image files "picture" boundary boxes.when encountering images, image files will be created from the respective page area and stored in the specified folder. Any text contained in these areas will still be included in the text output.
253+
:arg bool write_images: store image files for "picture" boundary boxes. When encountering images, image files will be created from the respective page area and stored in the specified folder. Any text contained in these areas will still be included in the text output.
250254

251255
:arg list pages: optional, the pages to consider for output (caution: specify 0-based page numbers). If omitted (`None`) all pages are processed. Specify any valid Python sequence containing integers between `0` and `page_count - 1`.
252256

257+
:rtype: str
258+
259+
See `JSON Schema <https://docs.pdf4llm.com/python/reference/JSON-schema>`_ for the structure of the output JSON string.
260+
261+
262+
.. _pymupdf4llm-api-boxclasses:
263+
264+
.. note::
265+
266+
**About box classes**
267+
268+
If `page_chunks = True`, the objects returned by `to_markdown` and `to_text` contain a list of dictionaries, `page_boxes`, representing the layout boundary boxes. Within each dictionary, the key ``class`` indicates the type of content in that box.
269+
270+
The return object for `to_json` contains a similar key called ``boxclass``.
271+
272+
The possible string values for this ``class`` / ``boxclass`` key are:
273+
274+
.. code-block:: bash
275+
276+
text
277+
picture
278+
table
279+
caption
280+
title
281+
section-header
282+
page-header
283+
page-footer
284+
list-item
285+
footnote
286+
formula
287+
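As a rough illustration of how the ``class`` key might be used, the sketch below groups the text of each layout box by its box class. It relies only on the ``class`` and ``pos`` keys described above; the helper name ``text_by_box_class`` and the hand-made ``sample_chunk`` are hypothetical and stand in for a real page chunk.

```python
from collections import defaultdict


def text_by_box_class(chunk):
    """Group the text of every layout box in one page chunk by its box class.

    Hypothetical helper: assumes the chunk has a "text" string and a
    "page_boxes" list whose entries carry "class" and "pos" keys.
    """
    grouped = defaultdict(list)
    for box in chunk.get("page_boxes", []):
        start, stop = box["pos"]  # 0-based slice into chunk["text"]
        grouped[box["class"]].append(chunk["text"][start:stop])
    return dict(grouped)


# Hand-made stand-in for one page chunk (values are illustrative only).
sample_chunk = {
    "text": "# Title\n\nBody text.\n",
    "page_boxes": [
        {"class": "title", "bbox": [72, 72, 300, 96], "pos": (0, 7)},
        {"class": "text", "bbox": [72, 110, 300, 130], "pos": (9, 19)},
    ],
}

print(text_by_box_class(sample_chunk))
```

With the library installed, the same helper could presumably be applied to each element of the list returned by ``pymupdf4llm.to_markdown(doc, page_chunks=True)``, for instance to skip ``page-header`` and ``page-footer`` boxes before passing text to a downstream step.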
253288
254289
.. _pymupdf4llm-api-layout:
255290
