Commit 4e38ff9

Update textpage.rst
1 parent 8cdc0e2 commit 4e38ff9

1 file changed

Lines changed: 31 additions & 25 deletions

docs/textpage.rst

@@ -98,7 +98,7 @@ For a description of what this class is all about, see Appendix 2.
9898

9999
.. method:: extractJSON(sort=False)
100100

101-
Textpage content as a JSON string. Created by `json.dumps(TextPage.extractDICT())`. It is included for backlevel compatibility. You will probably use this method ever only for outputting the result to some file. The method detects binary image data and converts them to base64 encoded strings.
101+
Textpage content as a JSON string. Created by `json.dumps(TextPage.extractDICT())`. It is included for backward compatibility. You will probably only ever use this method to write the result of :meth:`TextPage.extractDICT` to a file. The method detects binary image data and converts it to base64-encoded strings.
102102

103103
:arg bool sort: (new in v1.19.1) sort the output by vertical, then horizontal coordinates. In many cases, this should suffice to generate a "natural" reading order.
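The base64 conversion mentioned above can be pictured with the standard library alone. The following is only a sketch of the idea, not PyMuPDF's actual implementation: the helper name `encode_bytes` is hypothetical, and the sample dictionary (including the `"image"` key and block type 1) is a hand-made stand-in assumed for illustration.

```python
import base64
import json


def encode_bytes(obj):
    """Fallback for json.dumps: turn binary data into a base64 string."""
    if isinstance(obj, (bytes, bytearray)):
        return base64.b64encode(obj).decode("ascii")
    raise TypeError(f"not JSON serializable: {type(obj).__name__}")


# Hand-made stand-in for a dictionary output with one image block
# (key names and values are assumptions for this sketch).
page_dict = {
    "width": 595.0,
    "height": 842.0,
    "blocks": [{"type": 1, "bbox": (0, 0, 10, 10), "image": b"\x89PNG"}],
}

# Binary values are replaced by base64 text while dumping.
json_text = json.dumps(page_dict, default=encode_bytes)

# Reading the JSON back requires decoding the base64 string again.
restored = json.loads(json_text)
image_bytes = base64.b64decode(restored["blocks"][0]["image"])
```

A consumer of the JSON string therefore has to decode image data explicitly, which is one reason to prefer the dictionary output for in-memory processing.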
104104

@@ -164,9 +164,9 @@ For a description of what this class is all about, see Appendix 2.
164164

165165
Structure of Dictionary Outputs
166166
--------------------------------
167-
Methods :meth:`TextPage.extractDICT`, :meth:`TextPage.extractJSON`, :meth:`TextPage.extractRAWDICT`, and :meth:`TextPage.extractRAWJSON` return dictionaries, containing the page's text and image content. The dictionary structures of all four methods are almost equal. They strive to map the text page's information hierarchy of blocks, lines, spans and characters as precisely as possible, by representing each of these by its own sub-dictionary:
167+
Methods :meth:`TextPage.extractDICT`, :meth:`TextPage.extractJSON`, :meth:`TextPage.extractRAWDICT`, and :meth:`TextPage.extractRAWJSON` return dictionaries containing the page's vector graphics, text and image content. The dictionary structures of all four methods are almost identical. They strive to map the text page's information hierarchy of blocks, lines, spans and characters as precisely as possible, by representing each of these by its own sub-dictionary:
168168

169-
* A **page** consists of a list of **block dictionaries**.
169+
* A **page** consists of a list of **block dictionaries** for images, vectors and text.
170170
* A (text) **block** consists of a list of **line dictionaries**.
171171
* A **line** consists of a list of **span dictionaries**.
172172
* A **span** either consists of the text itself or, for the RAW variants, a list of **character dictionaries**.
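The hierarchy described in this list can be walked with three nested loops. This is a minimal sketch on a hand-made dictionary; the key names `"blocks"`, `"lines"`, `"spans"`, `"text"` and `"chars"` are assumed here for illustration, and `iter_span_texts` is a hypothetical helper, not a library function.

```python
def iter_span_texts(page_dict):
    """Yield the text of every span, for both plain and RAW style spans."""
    for block in page_dict.get("blocks", []):
        # Image and vector blocks carry no lines, so they are skipped here.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                if "text" in span:
                    yield span["text"]
                else:
                    # RAW variant: rebuild the text from character dictionaries.
                    yield "".join(ch["c"] for ch in span.get("chars", []))


# Hand-made sample: one text block in plain form, one in RAW form,
# and one non-text block (all values are illustrative).
sample_page = {
    "blocks": [
        {"type": 0, "lines": [{"spans": [{"text": "Hello"}]}]},
        {"type": 0, "lines": [{"spans": [{"chars": [{"c": "H"}, {"c": "i"}]}]}]},
        {"type": 3, "bbox": (0, 0, 10, 10)},
    ]
}

span_texts = list(iter_span_texts(sample_page))
```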
@@ -214,18 +214,18 @@ Block dictionaries come in different formats for **vector blocks**, **image bloc
214214

215215
**Vector block:**
216216

217-
=============== =========================================================================================================================
217+
========== ==========================================================================================================================================
218218
**Key**    **Value**
219-
=============== =========================================================================================================================
220-
type 3 = vector (``int``)
221-
bbox vector bbox on page (:data:`rect_like`)
222-
number block count (``int``)
223-
stroked either stroked (``True``) or filled (``False``) (``bool``)
224-
isrect whether the vector is axis-parallel (``bool``). Can be a line or a rectangle. Curves or diagonal lines are ``False``.
225-
continues whether the vector is (not the last) part of a sequence of vectors in a *path* (``bool``).
226-
color sRGB integer, e.g. 0xRRGGBB (``int``).
227-
alpha Transparency, a value in ``range(256)`` (``int``).
228-
=============== =========================================================================================================================
219+
========== ==========================================================================================================================================
220+
type       3 = vector (``int``)
221+
bbox       vector bbox on page (:data:`rect_like`)
222+
number     block count (``int``)
223+
stroked    either stroked (``True``) or filled (``False``) (``bool``)
224+
isrect     whether the vector is axis-parallel (``bool``). Can be a line or a rectangle. Curves and non-axis-parallel lines are ``False``.
225+
continues  whether the vector is (not the last) part of a sequence of vectors in a *path* (``bool``).
226+
color      sRGB integer, e.g. 0xRRGGBB (``int``).
227+
alpha      Transparency, a value in ``range(256)`` (``int``).
228+
========== ==========================================================================================================================================
229229

230230
This information is a true subset of the output of :meth:`Page.get_drawings`. Its advantage is its speed (because it is extracted alongside one :ref:`TextPage` creation) and the fact that vector blocks are included in the overall page content sequence together with text and images.
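As an illustration of how vector blocks might be consumed, the sketch below picks stroked, axis-parallel vectors out of a block list and splits their `color` integer into RGB components. The block keys follow the table above; the helper names `rgb_from_int` and `axis_parallel_strokes` and the sample values are hypothetical.

```python
def rgb_from_int(color):
    """Split an sRGB integer 0xRRGGBB into an (r, g, b) tuple."""
    return (color >> 16) & 0xFF, (color >> 8) & 0xFF, color & 0xFF


def axis_parallel_strokes(blocks):
    """Return (bbox, rgb) pairs for stroked, axis-parallel vector blocks."""
    found = []
    for block in blocks:
        if block.get("type") != 3:  # 3 = vector block
            continue
        if block["stroked"] and block["isrect"]:
            found.append((block["bbox"], rgb_from_int(block["color"])))
    return found


# Hand-made sample blocks; all values are illustrative only.
sample_blocks = [
    {"type": 3, "bbox": (50, 100, 300, 100), "number": 0, "stroked": True,
     "isrect": True, "continues": False, "color": 0xFF0000, "alpha": 255},
    {"type": 3, "bbox": (50, 120, 90, 160), "number": 1, "stroked": False,
     "isrect": True, "continues": False, "color": 0x00FF00, "alpha": 255},
    {"type": 0, "bbox": (50, 200, 300, 220), "number": 2, "lines": []},
]

rules = axis_parallel_strokes(sample_blocks)
```

Such a filter could, for example, serve as a cheap first pass when looking for table gridlines, with :meth:`Page.get_drawings` consulted only where more detail is needed.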
231231

@@ -376,17 +376,23 @@ Bits 1 thru 4 are font properties, i.e. encoded in the font program. Please note
376376

377377
*"char_flags"* is an integer, which represents extra character properties:
378378

379-
* bit 0: strikeout.
380-
* bit 1: underline.
381-
* bit 2: synthetic (always 0, see char dictionary).
382-
* bit 3: filled.
383-
* bit 4: stroked.
384-
* bit 5: clipped.
379+
* bit 0, (``mupdf.FZ_STEXT_STRIKEOUT`` = 1). Text is struck out. Only meaningful if the extraction flag bit :data:`TEXT_COLLECT_STYLES` is set.
380+
* bit 1, (``mupdf.FZ_STEXT_UNDERLINE`` = 2). Text is underlined. Only meaningful if the extraction flag bit :data:`TEXT_COLLECT_STYLES` is set.
381+
* bit 2, (``mupdf.FZ_STEXT_SYNTHETIC`` = 4). Always 0. Shown as ``synthetic=True`` in character dictionary if it is a **generated** space.
382+
* bit 3, (``mupdf.FZ_STEXT_BOLD`` = 8). Text is bold. Set in addition to the font flag. Also set for "fake bold" if extraction flag bit :data:`TEXT_COLLECT_STYLES` is set.
383+
* bit 4, (``mupdf.FZ_STEXT_FILLED`` = 16). The glyphs of the text are **"filled"** graphics (the default).
384+
* bit 5, (``mupdf.FZ_STEXT_STROKED`` = 32). The glyphs of the text are **"stroked"** graphics.
385+
* bit 6, (``mupdf.FZ_STEXT_CLIPPED`` = 64). This is clipped text and can only be present if extraction flag bit :data:`TEXT_MEDIABOX_CLIP` was **not** set.
386+
* bit 7, (``mupdf.FZ_STEXT_UNICODE_IS_CID`` = 128). Only set if the extraction flag bit :data:`TEXT_USE_CID_FOR_UNKNOWN_UNICODE` is used.
387+
* bit 8, (``mupdf.FZ_STEXT_UNICODE_IS_GID`` = 256). Only set if the extraction flag bit :data:`TEXT_USE_GID_FOR_UNKNOWN_UNICODE` is used.
388+
* bit 9, (``mupdf.FZ_STEXT_SYNTHETIC_LARGE`` = 512). Currently not used in PyMuPDF.
389+
390+
For example, text that is neither filled nor stroked is invisible. This can be tested like this::
385391

386-
For example if not filled and not stroked (`if not (char_flags & 2**3 & 2**4):
387-
...`) then the text will be invisible.
392+
>>> if not span["char_flags"] & (mupdf.FZ_STEXT_FILLED | mupdf.FZ_STEXT_STROKED):
393+
...     print(f"invisible text {span['text']=}")
388394

389-
(`char_flags` is new in v1.25.2.)
395+
.. note:: The text layer of an OCR-ed page is usually (not always!) written as "ignored" text -- which means it is neither filled nor stroked. This is however not the only way to make text invisible. A better, but still incomplete invisibility check is the condition ``span["alpha"] == 0``.
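To make the bit layout listed above easier to inspect, the sketch below decodes a `char_flags` integer into readable names. The numeric values are copied from the list above so that the snippet runs without PyMuPDF; the mapping `CHAR_FLAG_NAMES` and both helper functions are hypothetical names introduced for illustration.

```python
# Bit values as listed above (normally available as mupdf.FZ_STEXT_* constants).
CHAR_FLAG_NAMES = {
    1: "strikeout",
    2: "underline",
    4: "synthetic",
    8: "bold",
    16: "filled",
    32: "stroked",
    64: "clipped",
    128: "unicode_is_cid",
    256: "unicode_is_gid",
    512: "synthetic_large",
}

FILLED = 16
STROKED = 32


def describe_char_flags(char_flags):
    """Return the names of all bits set in char_flags, lowest bit first."""
    return [name for bit, name in CHAR_FLAG_NAMES.items() if char_flags & bit]


def neither_filled_nor_stroked(char_flags):
    """True if both the 'filled' and the 'stroked' bit are cleared."""
    # Combine the two masks with OR first; AND-ing the masks themselves
    # would always give 0 because they share no bits.
    return (char_flags & (FILLED | STROKED)) == 0


bold_filled = describe_char_flags(8 | 16)
```

As the note above points out, a cleared filled and stroked bit is only one of several ways text can end up invisible, so this check should be treated as a heuristic.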
390396

391397

392398
Character Dictionary for :meth:`extractRAWDICT`
@@ -397,11 +403,11 @@ Character Dictionary for :meth:`extractRAWDICT`
397403
=============== ===========================================================
398404
origin          character's left baseline point, :data:`point_like`
399405
bbox            character rectangle, :data:`rect_like`
400-
synthetic       bool.
406+
synthetic       bool. ``True`` if character is a generated space.
401407
c               the character (unicode)
402408
=============== ===========================================================
403409

404-
(`synthetic` is new in v1.25.3.)
410+
Key `"synthetic"` is new in v1.25.3. It is `True` if the character is a **generated space** -- i.e., not part of the original text, but created by MuPDF to fill gaps between words. Please note that this can only happen if extraction flag bit :data:`TEXT_INHIBIT_SPACES` is **not** set.
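One possible use of the `"synthetic"` key is to recover the text exactly as stored in the document, without generated spaces. The sketch below works on a hand-made list of character dictionaries (bbox and origin omitted for brevity); the helper name `original_text` is hypothetical.

```python
def original_text(chars):
    """Join characters, leaving out spaces that were generated by MuPDF."""
    return "".join(ch["c"] for ch in chars if not ch.get("synthetic", False))


# Hand-made character dictionaries: the space between the two words
# is marked as generated (values are illustrative only).
sample_chars = [
    {"c": "H", "synthetic": False},
    {"c": "i", "synthetic": False},
    {"c": " ", "synthetic": True},
    {"c": "y", "synthetic": False},
    {"c": "o", "synthetic": False},
    {"c": "u", "synthetic": False},
]

with_spaces = "".join(ch["c"] for ch in sample_chars)
without_generated = original_text(sample_chars)
```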
405411

406412
This image shows the relationship between a character's bbox and its quad: |textpagechar|
407413
