README.md (12 additions & 4 deletions)
@@ -16,13 +16,19 @@ If you have [OCRmyPDF](https://github.com/ocrmypdf/OCRmyPDF) installed, you can
dpsprep -O3 input.djvu
-You can also skip translating the text layer (it is sometimes not translated well) and redo the OCR (rather than launching the `ocrmypdf` CLI, we use the API directly and accept options in JSON format):
+You can also skip translating the text layer (it is sometimes not being translated well) and redo the OCR (rather than launching the `ocrmypdf` CLI, we use the API directly and accept options in JSON format):
Consult the man file ([online](https://github.com/kcroker/dpsprep/wiki/dpsprep.1)) for details; there are a lot of options.
+Sometimes the pages of scanned books are saved as color images. For PDF, saving bitonal page backgrounds as RGB images can inflate the file by an order of magnitude (see [below](#compression)). We try to infer the color mode of each page, but that sometimes gives inefficient results. In such cases, we can force the color mode as follows:
-See the next section for different ways to run the program.
+dpsprep --mode bitonal input.djvu start.pdf
+
+
In case we want to preserve the cover page as-is, we can use ranges:
For details on these and other options, as well as the allowed range syntax, consult the man file ([online](https://github.com/kcroker/dpsprep/wiki/dpsprep.1)).
## Installation
@@ -88,6 +94,8 @@ If you want `dpsprep` to be able to use `ocrmypdf` from `pipx`'s isolated enviro
### Compression
+PDF files full of images cannot be compressed as efficiently as DjVu, leading to files that are hundreds of megabytes in size. Fortunately, books are often bitonal, which allows for efficient compression schemes like `group4` or `jbig2`. Unfortunately, in badly digitized books the scanned images may be saved as color JPEG files, which can be partially mitigated using `--mode bitonal` (possibly for only a range of pages).
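
To give a rough sense of why the color mode matters, here is a minimal sketch (not part of `dpsprep`) that compares the uncompressed size of a single page stored as 24-bit RGB and as 1-bit bitonal. The page dimensions and the helper `raw_size_bytes` are assumptions made for illustration; actual file sizes depend on the codec and on the page content.

```python
import math


def raw_size_bytes(width, height, bits_per_pixel):
    """Uncompressed image size, with each row padded to a whole number of bytes."""
    row_bytes = math.ceil(width * bits_per_pixel / 8)
    return row_bytes * height


# Hypothetical page: A4 scanned at 300 dpi (an assumption, not a dpsprep default).
WIDTH, HEIGHT = 2480, 3508

rgb = raw_size_bytes(WIDTH, HEIGHT, 24)      # 8 bits for each of R, G, B
bitonal = raw_size_bytes(WIDTH, HEIGHT, 1)   # 1 bit per pixel

print(f"RGB:     {rgb} bytes")
print(f"bitonal: {bitonal} bytes")
print(f"ratio:   {rgb / bitonal:.1f}x")
```

Before any compression is applied, the bitonal page is already 24 times smaller than the RGB one, which is consistent with the order-of-magnitude difference mentioned in the README.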
+
We perform compression in two stages:
* The first one is the default compression provided by [Pillow](https://github.com/python-pillow/Pillow). For bitonal images, [the PDF generation code says](https://github.com/python-pillow/Pillow/blob/a088d54509e42e4eeed37d618b42d775c0d16ef5/src/PIL/PdfImagePlugin.py#L138C16-L138C16) that, if `libtiff` is available, `group4` compression is used.