:arg float page_height: the desired page height. For its relevance, see the `page_width` parameter. With the default `None`, the document is treated as one large page with a width of `page_width`. Consequently, no markdown page separators occur (except the final one), and only one page chunk is returned.
:arg bool page_separators: if ``True`` inserts a string ``--- end of page=n ---`` at the end of each page output. Intended for debugging purposes. The page number is 0-based. The separator string is wrapped with line breaks. Default is ``False``.
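As a rough illustration of how the separator described above could be used when debugging, the following sketch splits a markdown string on ``--- end of page=n ---`` lines. The helper name `split_pages` and the sample text are hypothetical and not part of the library; the exact line breaks around the separator are an assumption, so the pattern tolerates any surrounding whitespace.

```python
import re

# Separator format taken from the parameter description; surrounding
# whitespace is matched loosely because the exact wrapping is an assumption.
PAGE_SEP = re.compile(r"\s*--- end of page=(\d+) ---\s*")


def split_pages(markdown: str) -> dict[int, str]:
    """Map each 0-based page number to the text preceding its separator."""
    pages = {}
    pos = 0
    for match in PAGE_SEP.finditer(markdown):
        pages[int(match.group(1))] = markdown[pos:match.start()]
        pos = match.end()
    return pages


# Hand-written sample standing in for real output produced with
# page_separators=True.
sample = (
    "# Title\n\nFirst page text.\n\n--- end of page=0 ---\n\n"
    "Second page text.\n\n--- end of page=1 ---\n\n"
)
pages = split_pages(sample)
print(sorted(pages))
```

Text after the last separator is ignored here, which matches the description that every page, including the final one, ends with a separator.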
- Parses the document and the specified pages and converts the result into a |JSON|-formatted string.
+ Parses the document and the specified pages and converts the result into a `JSON-formatted string <https://docs.pdf4llm.com/python/reference/JSON-schema>`_.
:arg Document,str doc: the file, to be specified either as a file path string, or as a |PyMuPDF| :class:`Document` (created via `pymupdf.open`). In order to use `pathlib.Path` specifications, Python file-like objects, documents in memory etc. you **must** use a |PyMuPDF| :class:`Document`.
@@ -246,10 +250,41 @@ The PyMuPDF4LLM API
:arg bool embed_images: store image binaries for "picture" boundary boxes. Base64-encoded images are included in the JSON output. Ignores `image_path` if used. This may drastically increase the size of your JSON text.
- :arg bool write_images: store image files "picture" boundary boxes.when encountering images, image files will be created from the respective page area and stored in the specified folder. Any text contained in these areas will still be included in the text output.
+ :arg bool write_images: store image files for "picture" boundary boxes. When encountering images, image files will be created from the respective page area and stored in the specified folder. Any text contained in these areas will still be included in the text output.
:arg list pages: optional, the pages to consider for output (caution: specify 0-based page numbers). If omitted (`None`) all pages are processed. Specify any valid Python sequence containing integers between `0` and `page_count - 1`.
+ :rtype: str
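Because the `pages` argument expects 0-based numbers, it can help to check a page selection before passing it on. The sketch below is only an illustration: `check_pages` is a hypothetical helper, not a library function, and how the library itself reacts to out-of-range values is not stated here.

```python
def check_pages(pages, page_count):
    """Return a list of valid 0-based page numbers.

    None means "all pages", mirroring the documented default.
    """
    if pages is None:
        return list(range(page_count))
    result = []
    for pno in pages:
        # Valid values are integers between 0 and page_count - 1.
        if not isinstance(pno, int) or not 0 <= pno < page_count:
            raise ValueError(
                f"page number {pno!r} not in range 0..{page_count - 1}"
            )
        result.append(pno)
    return result


# Converting 1-based page numbers (as a reader would count them) to the
# 0-based numbers that the `pages` argument expects.
zero_based = check_pages([n - 1 for n in (1, 3)], page_count=10)
print(zero_based)
```

Any Python sequence of integers works as input, for example a `range` object.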
+
+ See `JSON Schema <https://docs.pdf4llm.com/python/reference/JSON-schema>`_ for the structure of the output JSON string.
+
+
+ .. _pymupdf4llm-api-boxclasses:
+
+ .. note::
+
+ **About box classes**
+
+ If `page_chunks = True`, the return objects for `to_markdown` and `to_text` contain a list of dictionaries, `page_boxes`, representing the layout boundary boxes; within each, the key ``class`` indicates the type of box content.
+
+ The return object for `to_json` contains a similar key called ``boxclass``.
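To make the two key names concrete, here is a minimal sketch that counts box classes in either kind of output. It walks the data recursively, so it assumes nothing about the surrounding structure. The sample data is invented for illustration: the nesting keys ``pages`` and ``boxes`` and the value ``"text"`` are assumptions, while ``"picture"`` is the class named in the parameter descriptions above.

```python
import json
from collections import Counter


def count_key_values(obj, key):
    """Count string values stored under `key` anywhere in nested dicts/lists."""
    counts = Counter()
    stack = [obj]
    while stack:
        item = stack.pop()
        if isinstance(item, dict):
            if isinstance(item.get(key), str):
                counts[item[key]] += 1
            stack.extend(item.values())
        elif isinstance(item, list):
            stack.extend(item)
    return counts


# Hypothetical page chunks, shaped like to_markdown output with
# page_chunks=True: each chunk carries a "page_boxes" list.
chunks = [
    {"page_boxes": [{"class": "picture"}, {"class": "text"}]},
    {"page_boxes": [{"class": "text"}]},
]
md_counts = count_key_values(chunks, "class")

# Hypothetical JSON string standing in for to_json output.
json_text = json.dumps({"pages": [{"boxes": [{"boxclass": "picture"}]}]})
json_counts = count_key_values(json.loads(json_text), "boxclass")

print(md_counts, json_counts)
```

The recursive walk is a deliberate choice for this sketch: it keeps working if the boxes sit at a different nesting depth than assumed.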
+
+ The possible string values for the ``class`` / ``boxclass`` key are: