Commit 76eb8cf

Merge pull request #52 from kcroker/feature/improve-options
Improve usability via command-line options
2 parents 6a66e20 + 54fba47 commit 76eb8cf

12 files changed

Lines changed: 502 additions & 140 deletions

CHANGELOG.md

Lines changed: 61 additions & 51 deletions
@@ -7,159 +7,169 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ## Unreleased

-* Restructure the code into smaller chunks
-* General maintenance work
+### Additions
+
+* Implement page ranges for `--mode`, `--dpi` and `--quality`.
+* Add a `--socr` ("streamlined" OCR) option that abbreviates `--ocr '{"language": ["eng", "grc"]}'` to `--socr eng,grc`.
+* Add a `-f` short variant for `--overwrite`.
+
+### Changes
+
+* Deprecate the short overwrite flag `-o` in favor of `-f`.
+* Warnings and errors are now logged to stderr.
+* Restructure the code into smaller chunks.
+* General maintenance work.

 ## 2.5.4 - 2026-04-24

-* Run `uv` security audit and update some dependencies
+* Run `uv` security audit and update some dependencies.

 ## 2.5.3 - 2026-03-25

-* Fix broken workflow without text layer translation
-* Shorter names for temporary directories
-* Code maintenance
+* Fix broken workflow without text layer translation.
+* Shorter names for temporary directories.
+* Code maintenance.

 ## 2.5.2 - 2026-03-25

-* Relax dependency versions
+* Relax dependency versions.

 ## 2.5.1 - 2026-03-14

-* Allow manually configuring PDF page resolution (DPI)
+* Allow manually configuring PDF page resolution (DPI).

 ## 2.5.0 - 2026-03-13

-* Account for DjVu file resolution
-* Simplify image diffing and regenerate better-quality fixtures
+* Account for DjVu file resolution.
+* Simplify image diffing and regenerate better-quality fixtures.

 ## 2.4.2 - 2026-02-24

-* Fix issue where only the main process has its logger configured
+* Fix issue where only the main process has its logger configured.

 ## 2.4.1 - 2026-02-24

-* Fix compatibility issues with the new OCRmyPDF API
-* Remove support for Python 3.10
+* Fix compatibility issues with the new OCRmyPDF API.
+* Remove support for Python 3.10.

 ## 2.4.0 - 2026-02-24

-* Migrate to `uv` from `pyenv` + `poetry`
-* Update dependencies
+* Migrate to `uv` from `pyenv` + `poetry`.
+* Update dependencies.

 ## 2.3.1 - 2025-10-28

-* Fix mixed-up email format
+* Fix mixed-up email format.

 ## 2.3.0 - 2025-10-28

-* Remove support for Python 3.9
-* Migrate to standardized `pyproject.toml`
-* Update dependencies
+* Remove support for Python 3.9.
+* Migrate to standardized `pyproject.toml`.
+* Update dependencies.

 ## 2.2.15 - 2025-07-02

-* Add support for installation via `pipx`
+* Add support for installation via `pipx`.

 ## 2.2.14 - 2025-05-27

-* Improve installation notes
-* Bump djvulibre-python version
+* Improve installation notes.
+* Bump djvulibre-python version.

 ## 2.2.13 - 2025-02-12

-* Fail-safe quality settings for non-JPEG images
+* Fail-safe quality settings for non-JPEG images.

 ## 2.2.12 - 2025-01-27

-* Update pytest_image_diff and fix newly broken tests
+* Update pytest_image_diff and fix newly broken tests.

 ## 2.2.11 - 2025-01-26

-* Update dependencies
+* Update dependencies.

 ## 2.2.10 - 2024-10-25

-* Improve interface with OCRmyPDF
-* Fix CI build
+* Improve interface with OCRmyPDF.
+* Fix CI build.

 ## 2.2.9 - 2024-10-25

-* Improve type hints
-* Update dependencies
+* Improve type hints.
+* Update dependencies.

 ## 2.2.8 - 2024-10-18

-* Support single characters in the text layer
+* Support single characters in the text layer.

 ## 2.2.7 - 2024-08-27

-* Improve tab and newline handling
+* Improve tab and newline handling.

 ## 2.2.6 - 2024-08-05

-* Fix accidental whitespace removal from text blocks
+* Fix accidental whitespace removal from text blocks.

 ## 2.2.5 - 2024-07-20

-* Re-add ability to force the image mode (RGB/Grayscale/Monochrome)
+* Re-add ability to force the image mode (RGB/Grayscale/Monochrome).

 ## 2.2.4 - 2024-02-24

-* Update dependencies
+* Update dependencies.

 ## 2.2.3 - 2023-12-09

-* Fix CI build
-* Ignore invalid UTF-8 sequences
-* Ignore unrecognized page titles in the outline (#23)
+* Fix CI build.
+* Ignore invalid UTF-8 sequences.
+* Ignore unrecognized page titles in the outline (#23).

 ## 2.2.2 - 2023-10-29

-* Update dependencies
+* Update dependencies.

 ## 2.2.1 - 2023-11-06

-* Handle invalid PDF pages
-* Fix exception in text layer processing (#20)
+* Handle invalid PDF pages.
+* Fix exception in text layer processing (#20).

 ## 2.2.0 - 2023-10-28

-* Add options for disabling the text layer and for directly running OCR
+* Add options for disabling the text layer and for directly running OCR.

 ## 2.1.5 - 2023-10-27

-* Fix inverted colors in images (#16)
+* Fix inverted colors in images (#16).

 ## 2.1.4 - 2023-10-06

-* Fix typo in logging code
+* Fix typo in logging code.

 ## 2.1.3 - 2023-10-06

-* Improve logging
+* Improve logging.

 ## 2.1.2 - 2023-10-02

-* Accidental version bump
+* Accidental version bump.

 ## 2.1.1 - 2023-10-02

-* Remove debug code
+* Remove debug code.

 ## 2.1.0 - 2023-10-02

-* Add support for OCRmyPDF
+* Add support for OCRmyPDF.

 ## 2.0.2 - 2023-08-03

-* Update some other dependencies
-* Replace `python-djvulibre` with `djvulibre-python`
+* Update some other dependencies.
+* Replace `python-djvulibre` with `djvulibre-python`.

 ## 2.0.1 - 2023-06-22

-* Minor improvements in packaging
+* Minor improvements in packaging.

 ## 2.0.0 - 2023-05-04

-* Fully rewrite
+* Fully rewrite.
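
Note on the `--socr` entry: the changelog describes it as an abbreviation of the JSON form accepted by `--ocr`. A minimal sketch of that expansion follows, assuming the option value is a plain comma-separated list of language codes. The function name `expand_socr` is hypothetical and not taken from the dpsprep sources, and the actual implementation may differ.

```python
import json


def expand_socr(languages: str) -> str:
    """Expand 'eng,grc' into the JSON string that --ocr would accept.

    Hypothetical helper for illustration only; not part of dpsprep.
    """
    # Split on commas and drop surrounding whitespace and empty items.
    codes = [code.strip() for code in languages.split(",") if code.strip()]
    if not codes:
        raise ValueError("expected at least one language code")
    return json.dumps({"language": codes})


print(expand_socr("eng,grc"))
```

Under these assumptions, `--socr rus,eng,grc` would carry the same information as `--ocr '{"language": ["rus", "eng", "grc"]}'`.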

README.md

Lines changed: 12 additions & 4 deletions
@@ -16,13 +16,19 @@ If you have [OCRmyPDF](https://github.com/ocrmypdf/OCRmyPDF) installed, you can

 dpsprep -O3 input.djvu

-You can also skip translating the text layer (it is sometimes not translated well) and redo the OCR (rather than launching the `ocrmypdf` CLI, we use the API directly and accept options in JSON format):
+You can also skip translating the text layer (it is sometimes not translated well) and redo the OCR (rather than launching the `ocrmypdf` CLI, we use the API directly and accept options in JSON format):

-dpsprep --ocr '{"language": ["rus", "eng"]}' input.djvu
+dpsprep --socr rus,eng,grc input.djvu

-Consult the man file ([online](https://github.com/kcroker/dpsprep/wiki/dpsprep.1)) for details; there are a lot of options.
+Sometimes the pages of scanned books are saved as color images. For PDF, saving bitonal page backgrounds as RGB images can inflate the file by an order of magnitude (see [below](#compression)). We try to infer the color mode of each page; however, the result is sometimes inefficient. In such cases, we can force the color mode as follows:

-See the next section for different ways to run the program.
+dpsprep --mode bitonal input.djvu start.pdf
+
+If we want to preserve the cover page as-is, we can use ranges:
+
+dpsprep --mode bitonal[2-end] input.djvu start.pdf
+
+For details on these and other options, as well as the allowed range syntax, consult the man file ([online](https://github.com/kcroker/dpsprep/wiki/dpsprep.1)).

 ## Installation

@@ -88,6 +94,8 @@ If you want `dpsprep` to be able to use `ocrmypdf` from `pipx`'s isolated enviro

 ### Compression

+PDF files full of images cannot be compressed as efficiently as DjVu, leading to files that are hundreds of megabytes in size. Fortunately, books are often bitonal, which allows for efficient compression like `group4` or `jbig2`. Unfortunately, in badly digitized books the scanned images may be saved as color JPEG files, which can be partially mitigated using `--mode bitonal` (possibly for only a range of pages).
+
 We perform compression in two stages:

 * The first one is the default compression provided by [Pillow](https://github.com/python-pillow/Pillow). For bitonal images, [the PDF generation code says](https://github.com/python-pillow/Pillow/blob/a088d54509e42e4eeed37d618b42d775c0d16ef5/src/PIL/PdfImagePlugin.py#L138C16-L138C16) that, if `libtiff` is available, `group4` compression is used.

docs/examples.man

Lines changed: 6 additions & 2 deletions
@@ -12,13 +12,17 @@ Produce an output file using a large pool of workers:
 .IP
 dpsprep --pool=16 input.djvu
 .P
-Force bitonal images:
+Force all pages to be bitonal:
 .IP
 dpsprep --mode bitonal input.djvu
 .P
+Force bitonal pages but leave the cover page as-is (can be useful with badly digitized books):
+.IP
+dpsprep --mode bitonal[2-end] input.djvu
+.P
 Produce an output file by disregarding the text layer and running OCRmyPDF instead:
 .IP
-dpsprep --ocr '{"language": ["rus", "eng"]}' input.djvu
+dpsprep --socr rus,eng,grc input.djvu
 .P
 Simply disregard the text layer without OCR:
 .IP
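
Note on the range examples: the man page defines the allowed range syntax, which is not reproduced in this diff. As a rough illustration only, the sketch below parses the single form that appears in the examples, `value[start-end]`, where the end may be the literal `end`. The function name `parse_ranged_option`, the 1-based inclusive page numbering and the fallback to all pages are assumptions, not the project's actual parser.

```python
import re

# Assumed grammar: "value" or "value[start-end]"; "end" means the last page.
_SPEC = re.compile(
    r"^(?P<value>[^\[\]]+)(?:\[(?P<start>\d+)-(?P<end>\d+|end)\])?$"
)


def parse_ranged_option(spec: str, page_count: int) -> tuple[str, range]:
    """Return the option value and the 1-based pages it applies to.

    Hypothetical helper for illustration only; not part of dpsprep.
    """
    match = _SPEC.match(spec)
    if match is None:
        raise ValueError(f"unrecognized option value: {spec!r}")
    if match["start"] is None:
        # No range given: assume the value applies to every page.
        return match["value"], range(1, page_count + 1)
    start = int(match["start"])
    end = page_count if match["end"] == "end" else int(match["end"])
    if not 1 <= start <= end <= page_count:
        raise ValueError(f"page range out of bounds: {spec!r}")
    return match["value"], range(start, end + 1)


value, pages = parse_ranged_option("bitonal[2-end]", page_count=10)
print(value, list(pages))
```

With a ten-page document, `bitonal[2-end]` would select pages 2 through 10 and leave the cover page untouched, which matches the intent described in the example above.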
