Commit 447ff2c

Rework textpage_ocr
For partial OCR, we previously added text content only from OCR'd images on the page. We now redact legible text and let the OCR engine recognize the remaining page content, which includes images as before but also vector graphics simulating text.
1 parent 1d272d7 commit 447ff2c

4 files changed

Lines changed: 173 additions & 106 deletions

docs/page.rst

Lines changed: 13 additions & 10 deletions
Original file line number | Diff line number | Diff line change
@@ -1515,34 +1515,37 @@ In a nutshell, this is what you can do with PyMuPDF:
15151515

15161516
.. method:: get_textpage_ocr(flags=3, language="eng", dpi=72, full=False, tessdata=None)
15171517

1518-
**Optical Character Recognition** (**OCR**) technology can be used to extract text data for documents where text is in a raster image format throughout the page. Use this method to **OCR** a page for text extraction.
1518+
**Optical Character Recognition** (**OCR**) technology can be used to extract text data for pages where text is in raster image or vector graphic format. Use this method to **OCR** a page for subsequent text extraction.
15191519

1520-
This method returns a :ref:`TextPage` for the page that includes OCRed text. MuPDF will invoke Tesseract-OCR if this method is used. Otherwise this is a normal :ref:`TextPage` object.
1520+
This method returns a :ref:`TextPage` for the page that includes OCRed text. MuPDF will invoke Tesseract-OCR if this method is used.
15211521

15221522
:arg int flags: indicator bits controlling the content available for subsequent text extractions and searches -- see the parameter of :meth:`Page.get_text`.
15231523
:arg str language: the expected language(s). Use "+"-separated values if multiple languages are expected, "eng+spa" for English and Spanish.
15241524
:arg int dpi: the desired resolution in dots per inch. Influences recognition quality (and execution time).
1525-
:arg bool full: whether to OCR the full page, or just the displayed images.
1526-
:arg str tessdata: The name of Tesseract's language support folder `tessdata`. If omitted, this information must be present as environment variable `TESSDATA_PREFIX`. Can be determined by function :meth:`get_tessdata`.
1525+
:arg bool full: whether to OCR the full page, or only page areas that contain no legible text.
1526+
:arg str tessdata: The name of Tesseract's language support folder `tessdata`. If omitted, the name is determined using function :meth:`get_tessdata`.
15271527

1528-
.. note:: This method does **not** support a clip parameter -- OCR will always happen for the complete page rectangle.
1528+
.. note:: This method does **not** support a clip parameter -- OCR (full or partial) will always happen for the complete page rectangle.
15291529

15301530
:returns:
15311531

15321532
a :ref:`TextPage`. Execution may be significantly longer than :meth:`Page.get_textpage`.
15331533

1534-
For a full page OCR, **all text** will have the font "GlyphlessFont" from Tesseract. In case of partial OCR, normal text will keep its properties, and only text coming from images will have the GlyphlessFont.
1534+
For ``full=True`` OCR, **all text** will have the font "GlyphLessFont" from Tesseract. In case of partial OCR (``full=False``), legible normal text will keep its properties, and only recognized text will have the GlyphLessFont.
15351535

1536-
.. note::
1537-
1538-
**OCRed text is only available** to PyMuPDF's text extractions and searches if their `textpage` parameter specifies the output of this method.
1536+
Recognized / OCR text will follow (legible) normal text for partial OCR and will thus not be in reading order. Establishing reading order is -- as always -- your responsibility.
1537+
1538+
.. note::
1539+
1540+
Text extraction results, including any OCR, are stored in the returned :ref:`TextPage`. To access them, you must use the ``textpage`` parameter in all subsequent text extraction and search methods.
15391541

1540-
`This Jupyter notebook <https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/jupyter-notebooks/partial-ocr.ipynb>`_ walks through an example for using OCR textpages.
1542+
`This Jupyter notebook <https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/jupyter-notebooks/partial-ocr.ipynb>`_ walks through an example for using OCR textpages.
15411543

15421544
|history_begin|
15431545

15441546
* New in v.1.19.0
15451547
* Changed in v1.19.1: support full and partial OCRing a page.
1548+
* Changed in v1.27.2: For partial OCR, **all** page areas outside legible text are now OCRed, not just those within images. This means that OCR will now also be performed for vector graphics, and for text containing illegible characters.
15461549

15471550
|history_end|
15481551
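The calling pattern described in the documentation above can be sketched as follows. This is a minimal illustration and not part of the commit: the helper name `ocr_page_text` and the `dpi=150` default are assumptions, and `page` is expected to be a PyMuPDF `Page` (a working Tesseract language data folder is needed for a real call). It uses only `Page.get_textpage_ocr` and the `textpage` parameter of `Page.get_text`.

```python
def ocr_page_text(page, full=False, dpi=150, language="eng"):
    """Return the page's plain text, including OCR-recognized text.

    `page` is assumed to be a pymupdf.Page; the helper name is hypothetical.
    """
    # Build the OCR textpage once; this is the expensive step.
    textpage = page.get_textpage_ocr(language=language, dpi=dpi, full=full)
    # OCR results are only visible to extractions that receive this textpage.
    return page.get_text("text", textpage=textpage)
```

With `full=False`, recognized text follows the legible text in the result, so any reading-order logic still has to be applied by the caller.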

src/utils.py

Lines changed: 113 additions & 55 deletions
@@ -14,10 +14,6 @@
1414
from . import pymupdf
1515
except Exception:
1616
import pymupdf
17-
try:
18-
from . import mupdf
19-
except Exception:
20-
import mupdf
2117

2218
_format_g = pymupdf.format_g
2319

@@ -322,80 +318,142 @@ def get_textpage_ocr(
322318
full: bool = False,
323319
tessdata: str = None,
324320
) -> pymupdf.TextPage:
325-
"""Create a Textpage from combined results of normal and OCR text parsing.
321+
"""Create a Textpage from the OCR version of the page.
322+
323+
OCR can be executed for the full page image, or (the default) only
324+
for areas that are not covered by readable digital text.
326325
327326
Args:
328327
flags: (int) control content becoming part of the result.
329328
language: (str) specify expected language(s). Default is "eng" (English).
330329
dpi: (int) resolution in dpi, default 72.
331-
full: (bool) whether to OCR the full page image, or only its images (default)
330+
full: (bool) whether to OCR the full page, or to keep legible text
331+
tessdata: (str) path to Tesseract language data files. If None, the
332+
built-in function is used to find the path.
332333
"""
333334
pymupdf.CheckParent(page)
334335
tessdata = pymupdf.get_tessdata(tessdata)
335336

337+
# Ensure 0xFFFD is not suppressed
338+
flags = (
339+
flags
340+
& ~pymupdf.TEXT_USE_CID_FOR_UNKNOWN_UNICODE # pylint: disable=no-member
341+
& ~pymupdf.TEXT_USE_GID_FOR_UNKNOWN_UNICODE # pylint: disable=no-member
342+
)
343+
336344
def full_ocr(page, dpi, language, flags):
337-
zoom = dpi / 72
338-
mat = pymupdf.Matrix(zoom, zoom)
339-
pix = page.get_pixmap(matrix=mat)
345+
"""Perform OCR for the full page image."""
346+
pix = page.get_pixmap(dpi=dpi)
347+
# create a 1-page PDF with an OCR text layer.
340348
ocr_pdf = pymupdf.Document(
341-
"pdf",
342-
pix.pdfocr_tobytes(
343-
compress=False,
344-
language=language,
345-
tessdata=tessdata,
346-
),
347-
)
349+
stream=pix.pdfocr_tobytes(
350+
compress=False,
351+
language=language,
352+
tessdata=tessdata,
353+
),
354+
)
348355
ocr_page = ocr_pdf.load_page(0)
349356
unzoom = page.rect.width / ocr_page.rect.width
350357
ctm = pymupdf.Matrix(unzoom, unzoom) * page.derotation_matrix
351358
tpage = ocr_page.get_textpage(flags=flags, matrix=ctm)
352-
ocr_pdf.close()
353-
pix = None
359+
360+
# associate the textpage with the original page
354361
tpage.parent = weakref.proxy(page)
355362
return tpage
356363

364+
def partial_ocr(page, dpi, language, flags):
365+
"""Perform OCR for parts of the page without legible text.
366+
367+
We create a temporary PDF for which we can freely redact text.
368+
"""
369+
doc = page.parent
370+
371+
# make temporary PDF with the passed-in page
372+
temp_pdf = pymupdf.open()
373+
temp_pdf.insert_pdf(doc, from_page=page.number, to_page=page.number)
374+
temp_page = temp_pdf.load_page(0)
375+
temp_page.remove_rotation() # avoid OCR problems with rotated pages
376+
377+
# extract text bboxes from the page
378+
tp = temp_page.get_textpage(flags=flags)
379+
blocks = tp.extractDICT()["blocks"]
380+
381+
"""
382+
For partial OCR we need a TextPage that contains legible text only.
383+
Illegible text must be passed to the OCR engine.
384+
"""
385+
# Select spans with illegible text. If present, remove them first.
386+
fffd_spans = [
387+
s["bbox"]
388+
for b in blocks
389+
if b["type"] == 0
390+
for l in b["lines"]
391+
for s in l["spans"]
392+
if chr(0xFFFD) in s["text"]
393+
]
394+
if fffd_spans:
395+
for bbox in fffd_spans:
396+
temp_page.add_redact_annot(bbox)
397+
temp_page.apply_redactions(
398+
images=pymupdf.PDF_REDACT_IMAGE_NONE, # pylint: disable=no-member
399+
graphics=pymupdf.PDF_REDACT_LINE_ART_NONE, # pylint: disable=no-member
400+
text=pymupdf.PDF_REDACT_TEXT_REMOVE, # pylint: disable=no-member
401+
)
402+
# Extract text again, now without the unreadable spans.
403+
tp = temp_page.get_textpage(flags=flags)
404+
blocks = tp.extractDICT()["blocks"]
405+
# We also need a fresh copy of the original page.
406+
temp_pdf.insert_pdf(doc, from_page=page.number, to_page=page.number)
407+
temp_page = temp_pdf.load_page(-1)
408+
temp_page.remove_rotation() # avoid OCR problems with rotated pages
409+
410+
span_bboxes = [
411+
s["bbox"]
412+
for b in blocks
413+
if b["type"] == 0
414+
for l in b["lines"]
415+
for s in l["spans"]
416+
if chr(0xFFFD) not in s["text"]
417+
]
418+
419+
# Remove digital text by redacting the span bboxes.
420+
# Then OCR the remainder of the page.
421+
for bbox in span_bboxes:
422+
temp_page.add_redact_annot(bbox)
423+
424+
# only remove text, no images, no vectors
425+
temp_page.apply_redactions(
426+
images=pymupdf.PDF_REDACT_IMAGE_NONE, # pylint: disable=no-member
427+
graphics=pymupdf.PDF_REDACT_LINE_ART_NONE, # pylint: disable=no-member
428+
text=pymupdf.PDF_REDACT_TEXT_REMOVE, # pylint: disable=no-member
429+
)
430+
pix = temp_page.get_pixmap(dpi=dpi)
431+
# matrix = pymupdf.Rect(pix.irect).torect(page.rect)
432+
433+
# OCR the redacted page
434+
ocr_pdf = pymupdf.open(
435+
stream=pix.pdfocr_tobytes(
436+
compress=False,
437+
language=language,
438+
tessdata=tessdata,
439+
),
440+
)
441+
ocr_page = ocr_pdf[0]
442+
443+
# Extend the original textpage with OCR-ed text.
444+
ocr_page.extend_textpage(tp, flags=pymupdf.TEXT_ACCURATE_BBOXES)
445+
446+
# associate the textpage with the original page
447+
tp.parent = weakref.proxy(page)
448+
return tp
449+
357450
# if OCR for the full page, OCR its pixmap @ desired dpi
358451
if full:
359452
return full_ocr(page, dpi, language, flags)
360453

361454
# For partial OCR, make a normal textpage, then extend it with text that
362-
# is OCRed from each image.
363-
# Because of this, we need the images flag bit set ON.
364-
tpage = page.get_textpage(flags=flags)
365-
for block in page.get_text("dict", flags=pymupdf.TEXT_PRESERVE_IMAGES)["blocks"]:
366-
if block["type"] != 1: # only look at images
367-
continue
368-
bbox = pymupdf.Rect(block["bbox"])
369-
if bbox.width <= 3 or bbox.height <= 3: # ignore tiny stuff
370-
continue
371-
try:
372-
pix = pymupdf.Pixmap(block["image"]) # get image pixmap
373-
if pix.n - pix.alpha != 3: # we need to convert this to RGB!
374-
pix = pymupdf.Pixmap(pymupdf.csRGB, pix)
375-
if pix.alpha: # must remove alpha channel
376-
pix = pymupdf.Pixmap(pix, 0)
377-
imgdoc = pymupdf.Document(
378-
"pdf",
379-
pix.pdfocr_tobytes(language=language, tessdata=tessdata),
380-
) # pdf with OCRed page
381-
imgpage = imgdoc.load_page(0) # read image as a page
382-
pix = None
383-
# compute matrix to transform coordinates back to that of 'page'
384-
imgrect = imgpage.rect # page size of image PDF
385-
shrink = pymupdf.Matrix(1 / imgrect.width, 1 / imgrect.height)
386-
mat = shrink * block["transform"]
387-
imgpage.extend_textpage(tpage, flags=0, matrix=mat)
388-
imgdoc.close()
389-
except (RuntimeError, mupdf.FzErrorBase):
390-
if 0 and g_exceptions_verbose:
391-
# Don't show exception info here because it can happen in
392-
# normal operation (see test_3842b).
393-
pymupdf.exception_info()
394-
tpage = None
395-
pymupdf.message("Falling back to full page OCR")
396-
return full_ocr(page, dpi, language, flags)
397-
398-
return tpage
455+
# is OCRed from the rest of the page.
456+
return partial_ocr(page, dpi, language, flags)
399457

400458

401459
def get_text(
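The span selection inside `partial_ocr` can be illustrated without PyMuPDF. The sketch below assumes a block list shaped like the `"blocks"` value of `extractDICT()` as used in the diff (text blocks have `type == 0` and carry `lines` and `spans`). The function name `split_span_bboxes` and the sample data are hypothetical.

```python
REPLACEMENT_CHAR = chr(0xFFFD)  # Unicode replacement character, treated as illegible


def split_span_bboxes(blocks):
    """Partition text span bboxes into (legible, illegible) lists."""
    legible, illegible = [], []
    for block in blocks:
        if block["type"] != 0:  # skip non-text blocks such as images
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                if REPLACEMENT_CHAR in span["text"]:
                    illegible.append(span["bbox"])
                else:
                    legible.append(span["bbox"])
    return legible, illegible


# Hypothetical sample: one text block with two spans, plus one image block.
sample_blocks = [
    {
        "type": 0,
        "lines": [
            {
                "spans": [
                    {"text": "Hello", "bbox": (0, 0, 10, 10)},
                    {"text": "W\ufffdrld", "bbox": (10, 0, 20, 10)},
                ]
            }
        ],
    },
    {"type": 1, "bbox": (0, 20, 50, 50)},
]

legible, illegible = split_span_bboxes(sample_blocks)
print(legible)
print(illegible)
```

In the commit, the illegible spans are redacted first so that they drop out of the text page, and the legible spans are then redacted on a fresh copy of the page so that only the remaining content is rendered and passed to the OCR engine.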
Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
1+
NIST SP 800-223
2+
3+
High-Performance Computing Security
4+
February 2024
5+
6+
7+
iii
8+
Table of Contents
9+
1. Introduction ...................................................................................................................................1
10+
2. HPC System Reference Architecture and Main Components ............................................................2
11+
2.1.1. Components of the High-Performance Computing Zone ............................................................. 3
12+
2.1.2. Components of the Data Storage Zone ........................................................................................ 4
13+
2.1.3. Parallel File System ....................................................................................................................... 4
14+
2.1.4. Archival and Campaign Storage .................................................................................................... 5
15+
2.1.5. Burst Buffer .................................................................................................................................. 5
16+
2.1.6. Components of the Access Zone .................................................................................................. 6
17+
2.1.7. Components of the Management Zone ....................................................................................... 6
18+
2.1.8. General Architecture and Characteristics .................................................................................... 6
19+
2.1.9. Basic Services ................................................................................................................................ 7
20+
2.1.10. Configuration Management ....................................................................................................... 7
21+
2.1.11. HPC Scheduler and Workflow Management .............................................................................. 7
22+
2.1.12. HPC Software .............................................................................................................................. 8
23+
2.1.13. User Software ............................................................................................................................. 8
24+
2.1.14. Site-Provided Software and Vendor Software ........................................................................... 8
25+
2.1.15. Containerized Software in HPC .................................................................................................. 9
26+
3. HPC Threat Analysis...................................................................................................................... 10
27+
3.2.1. Access Zone Threats ................................................................................................................... 11
28+
3.2.2. Management Zone Threats ........................................................................................................ 11
29+
3.2.3. High-Performance Computing Zone Threats .............................................................................. 12
30+
3.2.4. Data Storage Zone Threats ......................................................................................................... 12
31+
4. HPC Security Posture, Challenges, and Recommendations ............................................................. 14
32+
5. Conclusions .................................................................................................................................. 19
33+
2.1. Main COMPONENNS..........cccccssccccssssccccssssccccssnsecccsssseeccessseeecsessseecssaseecsessseeceessseecseeaseecsessseeesessseeessstseeesD
34+
3.1. Key HPC Security Characteristics and Use REquireMent............cccsscccessscesseceessecesseeesssecesseeestteessteee LO
35+
3.2. Threats to HPC FUNCTION ZONES..........cesccesscesscesscesscesecsssesssecssscesscesscsseessessescesssesscessessssssssssssssssssees LO
36+
3.3. Other Threats ........cccccsccsscssscsssccssscssscssscssscsssesssesssscssscssessseesseeseessesscsssssssessessesssessssssssssssssssssssseesesLO
37+
4.1. HPC Access Control via Network SEgMeNtatiOn ..........:ccccscccsssccessecesssecesecesssecessecessecessteessecessteessee LO
38+
4.2. Compute Node Sanitization ...........cccccessccssssccessecessecesseecssseccseecsseecsseecesseesessscssssesssescssssessssesssessses
39+
LD
40+
4.3. Data Integrity Protection ............cccccccccccccsssssssscececccesssssssseceeccessscssssseeeccesesssssssseeesessssssstsssesesssssssesLOD
41+
4.4. SECUFING CONTAINELSS ........ccccssccccssssccccessseeccesseeccsssssecceesssecceessseccesssseecsessseeccsssssescsssssescssssssscssssesesesLO
42+
4.5. Achieving Security While Maintaining HPC Performance. ..........cc:cccssccsssseesssecessecesssecessecessseesseeesee LZ
43+
4.6. Challenges to HPC Security TOols...........c:ccccssccssseccsssecesseecesecessseccsseecssseecsseecseseecssesesstscssssesssessssessse LZ
