Commit 27be51c

Squashed commit of the following:
commit 0dd4925
Author: Jeremy Singer-Vine <jsvine@gmail.com>
Date:   Thu Jun 12 07:31:46 2025 -0400

    Update CITATION.cff

commit c6a24be
Author: Jeremy Singer-Vine <jsvine@gmail.com>
Date:   Thu Jun 12 07:23:30 2025 -0400

    Bump version to 0.11.7

commit 51f3065
Author: Jeremy Singer-Vine <jsvine@gmail.com>
Date:   Thu Jun 12 07:21:29 2025 -0400

    Update CHANGELOG.md

commit 738f6f0
Author: Jeremy Singer-Vine <jsvine@gmail.com>
Date:   Wed Jun 11 23:40:50 2025 -0400

    Add test for CLI auto-help

commit b88907f
Author: mara004 <geisserml@gmail.com>
Date:   Fri May 2 23:07:05 2025 +0200

    Minor cleanup around pypdfium2 integration

commit 7e364e6
Author: Jeremy Singer-Vine <jsvine@gmail.com>
Date:   Wed Jun 11 22:24:28 2025 -0400

    Add Page.trimbox, .bleedbox, .artbox (jsvine#1313)

    Thanks to @samuelbradshaw for the suggestion!

commit 4c7e092
Author: Jeremy Singer-Vine <jsvine@gmail.com>
Date:   Fri May 16 08:20:30 2025 -0400

    Upgrade pdfminer.six from 20250327 to 20250506

    ... and adjust color handling accordingly.

commit 3e0d4df
Author: Jeremy Singer-Vine <jsvine@gmail.com>
Date:   Wed Jun 11 23:26:09 2025 -0400

    Run make format

commit cd6fd70
Author: nobody <github2@invisiblehand.church>
Date:   Mon May 19 08:31:53 2025 -0400

    Auto-add --help if CLI run w/o args

    (Commit message edited by @jsvine.)

commit 02ff431
Author: Jeremy Singer-Vine <jsvine@gmail.com>
Date:   Thu Mar 27 23:21:17 2025 -0400

    Tiny tweaks to CHANGELOG.md

commit 8cd8e48
Author: Jeremy Singer-Vine <jsvine@gmail.com>
Date:   Thu Mar 27 23:15:41 2025 -0400

    Bump version to 0.11.6

commit 44b078c
Author: Jeremy Singer-Vine <jsvine@gmail.com>
Date:   Thu Mar 27 23:15:06 2025 -0400

    Update CHANGELOG.md

commit e15ed98
Author: Jeremy Singer-Vine <jsvine@gmail.com>
Date:   Thu Mar 27 22:44:25 2025 -0400

    Fix bug w/ use_text_flow=True extractions (jsvine#1279)

    ... related to flows where text bounces between lines.

    h/t @samuelbradshaw

commit f2ad942
Author: Jeremy Singer-Vine <jsvine@gmail.com>
Date:   Thu Mar 27 22:00:14 2025 -0400

    Add another oss-fuzz test case, already fixed

commit 748ff31
Author: Jeremy Singer-Vine <jsvine@gmail.com>
Date:   Thu Mar 27 21:58:17 2025 -0400

    More broadly handle RecursionError, via oss-fuzz

commit 9148810
Author: Jeremy Singer-Vine <jsvine@gmail.com>
Date:   Thu Mar 27 21:57:21 2025 -0400

    Fix unhandled None in do_PDFStream, via oss-fuzz

commit 3fcb493
Author: Jeremy Singer-Vine <jsvine@gmail.com>
Date:   Thu Mar 27 21:31:06 2025 -0400

    Bump pdfminer.six to version 20250327

commit 7e28e76
Author: Jeremy Singer-Vine <jsvine@gmail.com>
Date:   Tue Mar 25 23:03:13 2025 -0400

    Remove test_issue_1089 (jsvine#1263)

    @booxter makes a good point that the test is platform-specific. The
    issue has been resolved, and it's not expected to return, so I think
    provisionally OK to remove this test.

commit 630f30e
Author: Jeremy Singer-Vine <jsvine@gmail.com>
Date:   Tue Mar 25 22:52:47 2025 -0400

    pragma:nocover exceptions no longer raised by pdfminer.six

commit 12a73a2
Author: Jeremy Singer-Vine <jsvine@gmail.com>
Date:   Tue Mar 25 22:52:16 2025 -0400

    Bump pdfminer.six to version 20250324

commit 6349adb
Author: Jeremy Singer-Vine <jsvine@gmail.com>
Date:   Mon Feb 10 22:09:28 2025 -0500

    Add escapechar for .to_csv(...)

commit 980494a
Author: Jeremy Singer-Vine <jsvine@gmail.com>
Date:   Mon Feb 10 21:54:10 2025 -0500

    Use csv.QUOTE_MINIMAL for .to_csv(...)

commit 47a7ab8
Author: Jeremy Singer-Vine <jsvine@gmail.com>
Date:   Mon Feb 10 21:53:17 2025 -0500

    Update exception handler

commit 8f5f498
Author: Jeremy Singer-Vine <jsvine@gmail.com>
Date:   Sun Feb 9 17:23:37 2025 -0500

    Fix wrong exception expectation in test

commit 43ccc5b
Author: Jeremy Singer-Vine <jsvine@gmail.com>
Date:   Sun Feb 9 16:23:57 2025 -0500

    Catch exceptions from pdfminer and malformed PDFs

    ... thanks to OSS-Fuzz and @ennamarie19

    Cf.: google/oss-fuzz#12949

commit a77808a
Merge: c562774 5d47d5a
Author: Jeremy Singer-Vine <jsvine@gmail.com>
Date:   Sun Feb 2 11:16:58 2025 -0500

    Merge pull request jsvine#1270 from mara004/patch-1

    test_issue_1089: update wording regarding pypdfium2

commit 5d47d5a
Author: mara004 <geisserml@gmail.com>
Date:   Sun Feb 2 16:27:53 2025 +0100

    test_issue_1089: update wording regarding pypdfium2

    See jsvine#1089 (comment) for background

commit c562774
Author: Jeremy Singer-Vine <jsvine@gmail.com>
Date:   Wed Jan 1 10:21:18 2025 -0500

    Bump version to 0.11.5

commit 4af0e1d
Author: Jeremy Singer-Vine <jsvine@gmail.com>
Date:   Wed Jan 1 10:21:00 2025 -0500

    Update CHANGELOG.md

commit 7c63541
Author: Jeremy Singer-Vine <jsvine@gmail.com>
Date:   Wed Jan 1 10:26:04 2025 -0500

    Add thanks to @stolarczyk in README.md

commit 078df97
Author: Jeremy Singer-Vine <jsvine@gmail.com>
Date:   Tue Dec 31 09:11:32 2024 -0500

    Fix jsvine#1237 (tf → table_settings)

    h/t @n-traore

    And thanks to @cmdlineluser for the nudge.

commit 6e54799
Author: Jeremy Singer-Vine <jsvine@gmail.com>
Date:   Sat Dec 28 12:13:32 2024 -0500

    Add thanks to @brandonrobertz (jsvine#1235)

commit 69d010a
Author: Jeremy Singer-Vine <jsvine@gmail.com>
Date:   Sun Dec 15 23:24:31 2024 -0500

    Add initial test/docs for `format --text` (jsvine#1235)

commit e0ee254
Merge: 28d4f50 f3f2b57
Author: Jeremy Singer-Vine <jsvine@gmail.com>
Date:   Sun Dec 15 23:07:14 2024 -0500

    Merge pull request jsvine#1235 from brandonrobertz/add-text-output-mode

    Add a --format text option

commit f3f2b57
Author: Brandon Roberts <brandon@bxroberts.org>
Date:   Tue Dec 10 14:21:22 2024 -0800

    Add a --format text option

    I use this regularly because pdfplumber has among the best layout
    preserving methods for PDFs, especially machine generated ones.
    Exposing the page output via CLI lets me use pdfplumber as a general
    purpose PDF-to-text tool.

    Usage: pdfplumber --format text file.pdf > file.txt

commit 28d4f50
Merge: ea3b3e5 2073164
Author: Jeremy Singer-Vine <jsvine@gmail.com>
Date:   Sun Dec 8 23:10:15 2024 -0500

    Merge PR jsvine#1195

commit 2073164
Author: Jeremy Singer-Vine <jsvine@gmail.com>
Date:   Sun Dec 8 22:55:30 2024 -0500

    Appease linter

commit c80c78d
Author: Michal Stolarczyk <stolarczyk.michal93@gmail.com>
Date:   Fri Nov 22 16:48:19 2024 +0100

    add a test to cover raise_unicode_errors parameter

commit 1e4b48a
Author: Jeremy Singer-Vine <jsvine@gmail.com>
Date:   Fri Nov 22 08:18:11 2024 -0500

    Run 'make format' and ignore code line-length

commit 138abab
Author: Michal Stolarczyk <stolarczyk.michal93@gmail.com>
Date:   Wed Nov 13 18:34:35 2024 +0100

    rename warn_unicode_error to raise_unicode_errors for clarity

    additionally change the default accordingly

commit ea3b3e5
Merge: 6ef62c9 8542adb
Author: Jeremy Singer-Vine <jsvine@gmail.com>
Date:   Sun Nov 10 22:47:33 2024 -0500

    Merge pull request jsvine#1221 from erghelium/develop

    Fix broken link to Anssi Nurminen's master's thesis in the README.md

commit 8542adb
Author: Guilherme <101049490+erghelium@users.noreply.github.com>
Date:   Sun Nov 10 18:19:04 2024 -0300

    Fix broken link to Anssi Nurminen's master's thesis in README

commit 6ef62c9
Author: Jeremy Singer-Vine <jsvine@gmail.com>
Date:   Wed Oct 2 21:11:38 2024 -0400

    Add `name` property to `image` objects (jsvine#1201)

    h/t @djr2015

commit 396c5e3
Author: Michal Stolarczyk <stolarczyk.michal93@gmail.com>
Date:   Fri Aug 30 10:24:39 2024 +0200

    warn on unicode decoding errors in PDF annotations

    in some cases the annotations may contain some junk that hinders
    annotations processing altogether. This allows to ignore the error and
    warn instead, which is configurable via warn_unicode_error arguments
    in the PDF initializer and/or open() method.
1 parent 5a65a03 commit 27be51c

42 files changed

Lines changed: 269 additions & 95 deletions

CHANGELOG.md

Lines changed: 38 additions & 0 deletions
@@ -2,6 +2,44 @@
 
 All notable changes to this project will be documented in this file. The format is based on [Keep a Changelog](http://keepachangelog.com/).
 
+## [0.11.7] - 2025-06-12
+
+### Added
+- Add access to `Page.trimbox`, `Page.bleedbox`, and `Page.artbox` (h/t @samuelbradshaw). ([#1313](https://github.com/jsvine/pdfplumber/issues/1313) + [7e364e6](https://github.com/jsvine/pdfplumber/commit/7e364e6193c6e8bafa9b46587c0fdd4a46405399))
+
+### Changed
+- Upgrade `pdfminer.six` from `20250327` to `20250506`. ([4c7e092](https://github.com/jsvine/pdfplumber/commit/4c7e092))
+
+### Removed
+- Remove `stroking_pattern` and `non_stroking_pattern` object attributes, due to changes in `pdfminer.six`. ([4c7e092](https://github.com/jsvine/pdfplumber/commit/4c7e092))
+
+## [0.11.6] - 2025-03-27
+### Changed
+- Upgrade `pdfminer.six` from `20231228` to `20250327` ([3fcb493](https://github.com/jsvine/pdfplumber/commit/3fcb493) + [12a73a2](https://github.com/jsvine/pdfplumber/commit/12a73a2))
+- Use csv.QUOTE_MINIMAL for .to_csv(...) ([980494a](https://github.com/jsvine/pdfplumber/commit/980494a))
+
+
+### Fixed
+- Fix bug with `use_text_flow=True` text extraction (h/t @samuelbradshaw)([#1279](https://github.com/jsvine/pdfplumber/issues/1279) + [e15ed98](https://github.com/jsvine/pdfplumber/commit/e15ed98))
+- Catch exceptions from pdfminer and malformed PDFs ([43ccc5b](https://github.com/jsvine/pdfplumber/commit/43ccc5b))
+- More broadly handle RecursionError ([748ff31](https://github.com/jsvine/pdfplumber/commit/748ff31))
+
+### Removed
+- Remove test_issue_1089 ([#1263](https://github.com/jsvine/pdfplumber/issues/1263) + [7e28e76](https://github.com/jsvine/pdfplumber/commit/7e28e76))
+
+## [0.11.5] - 2025-01-01
+
+### Added
+
+- Add `--format text` options to CLI (in addition to previously-available `csv` and `json`) (h/t @brandonrobertz). ([#1235](https://github.com/jsvine/pdfplumber/pull/1235))
+- Add `raise_unicode_errors: bool` parameter to `pdfplumber.open()` to allow bypassing `UnicodeDecodeError`s in annotation-parsing and generate warnings instead (h/t @stolarczyk). ([#1195](https://github.com/jsvine/pdfplumber/issues/1195))
+- Add `name` property to `image` objects (h/t @djr2015). ([#1201](https://github.com/jsvine/pdfplumber/discussions/1201))
+
+### Fixed
+
+- Fix `PageImage.debug_tablefinder(...)` so that its main keyword argument is named the same (`table_settings=`) as other related `Page` methods (h/t @n-traore). ([#1237](https://github.com/jsvine/pdfplumber/issues/1237))
+
+
 ## [0.11.4] - 2024-08-18
 
 ### Fixed

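Reviewer note on the `raise_unicode_errors` entry under 0.11.5: the behavior it describes corresponds to the annotation-decoding hunk in `pdfplumber/page.py` further down. The following is a minimal standalone sketch of that fallback, not pdfplumber's code. `decode_annotation_value` is a hypothetical helper, returning `None` on suppressed failure is this sketch's choice, and the `True` default is an assumption.

```python
import warnings
from typing import Optional


def decode_annotation_value(
    key: str, value: bytes, raise_unicode_errors: bool = True
) -> Optional[str]:
    """Try UTF-8, then UTF-16; if both fail, either re-raise or warn.

    Hypothetical helper. Returns None when decoding failed and errors
    are suppressed (an assumption made for this sketch only).
    """
    try:
        return value.decode("utf-8")
    except UnicodeDecodeError:
        try:
            return value.decode("utf-16")
        except UnicodeDecodeError:
            if raise_unicode_errors:
                raise
            warnings.warn(
                f"Could not decode {key} of annotation. {key} will be missing."
            )
            return None


# UTF-16 bytes start with a byte-order mark that is invalid UTF-8,
# so the first attempt fails and the second succeeds.
print(decode_annotation_value("title", "hi".encode("utf-16")))  # hi
```

Passing `raise_unicode_errors=False` turns the second failure into a warning, which is the "bypass" the changelog entry refers to.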
CITATION.cff

Lines changed: 2 additions & 2 deletions
@@ -1,8 +1,8 @@
 cff-version: 1.2.0
 title: pdfplumber
 type: software
-version: 0.11.4
-date-released: "2024-08-07"
+version: 0.11.7
+date-released: "2025-06-12"
 authors:
   - family-names: "Singer-Vine"
     given-names: "Jeremy"

README.md

Lines changed: 6 additions & 2 deletions
@@ -47,7 +47,7 @@ The output will be a CSV containing info about every character, line, and rectan
 
 | Argument | Description |
 |----------|-------------|
-|`--format [format]`| `csv` or `json`. The `json` format returns more information; it includes PDF-level and page-level metadata, plus dictionary-nested attributes.|
+|`--format [format]`| `csv`, `json`, or `text`. The `csv` and `json` formats return information about each object. Of those two, the `json` format returns more information; it includes PDF-level and page-level metadata, plus dictionary-nested attributes. The `text` option returns a plain-text representation of the PDF, using `Page.extract_text(layout=True)`.|
 |`--pages [list of pages]`| A space-delimited, `1`-indexed list of pages or hyphenated page ranges. E.g., `1, 11-15`, which would return data for pages 1, 11, 12, 13, 14, and 15.|
 |`--types [list of object types to extract]`| Choices are `char`, `rect`, `line`, `curve`, `image`, `annot`, et cetera. Defaults to all available.|
 |`--laparams`| A JSON-formatted string (e.g., `'{"detect_vertical": true}'`) to pass to `pdfplumber.open(..., laparams=...)`.|
@@ -274,6 +274,7 @@ Additionally, both `pdfplumber.PDF` and `pdfplumber.Page` provide access to seve
 |`bits`| The number of bits per color component; e.g., 8 corresponds to 255 possible values for each color component (R, G, and B in an RGB color space).|
 |`stream`| Pixel values of the image, as a `pdfminer.pdftypes.PDFStream` object.|
 |`imagemask`| A nullable boolean; if `True`, "specifies that the image data is to be used as a stencil mask for painting in the current color."|
+|`name`| "The name by which this image XObject is referenced in the XObject subdictionary of the current resource dictionary." [🔗](https://ghostscript.com/~robin/pdf_reference17.pdf#page=340) |
 |`mcid`| The [marked content](https://ghostscript.com/~robin/pdf_reference17.pdf#page=850) section ID for this image if any (otherwise `None`). *Experimental attribute.*|
 |`tag`| The [marked content](https://ghostscript.com/~robin/pdf_reference17.pdf#page=850) section tag for this image if any (otherwise `None`). *Experimental attribute.*|
 |`object_type`| "image"|
@@ -354,7 +355,7 @@ Note: The methods above are built on Pillow's [`ImageDraw` methods](http://pillo
 
 ## Extracting tables
 
-`pdfplumber`'s approach to table detection borrows heavily from [Anssi Nurminen's master's thesis](http://dspace.cc.tut.fi/dpub/bitstream/handle/123456789/21520/Nurminen.pdf?sequence=3), and is inspired by [Tabula](https://github.com/tabulapdf/tabula-extractor/issues/16). It works like this:
+`pdfplumber`'s approach to table detection borrows heavily from [Anssi Nurminen's master's thesis](https://trepo.tuni.fi/bitstream/handle/123456789/21520/Nurminen.pdf?sequence=3), and is inspired by [Tabula](https://github.com/tabulapdf/tabula-extractor/issues/16). It works like this:
 
 1. For any given PDF page, find the lines that are (a) explicitly defined and/or (b) implied by the alignment of words on the page.
 2. Merge overlapping, or nearly-overlapping, lines.
@@ -567,6 +568,9 @@ Many thanks to the following users who've contributed ideas, features, and fixes
 - [Quentin André](https://github.com/QuentinAndre11)
 - [Léo Roux](https://github.com/leorouxx)
 - [@wodny](https://github.com/wodny)
+- [Michal Stolarczyk](https://github.com/stolarczyk)
+- [Brandon Roberts](https://github.com/brandonrobertz)
+- [@ennamarie19](https://github.com/ennamarie19)
 
 ## Contributing

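Reviewer note on the new `text` format: per the README row above, it is built on `Page.extract_text(layout=True)`, one call per page. The sketch below shows that loop as a function returning a string. `pages_to_text` and `FakePage` are hypothetical names for illustration and not part of pdfplumber. The only interface assumed is an iterable of objects with an `extract_text(layout=...)` method.

```python
from typing import Any, Iterable, List


def pages_to_text(pages: Iterable[Any]) -> str:
    """Collect each page's layout-preserving text, joined by newlines.

    Hypothetical helper; the CLI change in this commit prints each page
    instead of building a string.
    """
    chunks: List[str] = []
    for page in pages:
        chunks.append(page.extract_text(layout=True))
    return "\n".join(chunks)


class FakePage:
    """Stand-in for a pdfplumber page, only so the sketch runs on its own."""

    def __init__(self, text: str) -> None:
        self._text = text

    def extract_text(self, layout: bool = False) -> str:
        return self._text


print(pages_to_text([FakePage("page one"), FakePage("page two")]))
```

With pdfplumber installed, passing `pdf.pages` from `pdfplumber.open("file.pdf")` to this function should give roughly what `pdfplumber --format text file.pdf` prints.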
pdfplumber/_version.py

Lines changed: 1 addition & 1 deletion
@@ -1,2 +1,2 @@
-version_info = (0, 11, 4)
+version_info = (0, 11, 7)
 __version__ = ".".join(map(str, version_info))

pdfplumber/cli.py

Lines changed: 7 additions & 1 deletion
@@ -8,6 +8,9 @@
 
 from .pdf import PDF
 
+if len(sys.argv) == 1:
+    sys.argv.append("--help")
+
 
 def parse_page_spec(p_str: str) -> List[int]:
     if "-" in p_str:
@@ -37,7 +40,7 @@ def parse_args(args_raw: List[str]) -> argparse.Namespace:
         action="store_true",
     )
 
-    parser.add_argument("--format", choices=["csv", "json"], default="csv")
+    parser.add_argument("--format", choices=["csv", "json", "text"], default="csv")
 
     parser.add_argument("--types", nargs="+")
 
@@ -109,6 +112,9 @@ def main(args_raw: List[str] = sys.argv[1:]) -> None:
             include_attrs=args.include_attrs,
             exclude_attrs=args.exclude_attrs,
         )
+        elif args.format == "text":
+            for page in pdf.pages:
+                print(page.extract_text(layout=True))
         else:
             pdf.to_json(
                 sys.stdout,

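Reviewer note on the auto-help hunk: the module-level check appends `--help` when the program is invoked with no arguments. A side-effect-free sketch of the same idea follows. `with_default_help` and `build_parser` are hypothetical names, and the parser is a reduced illustration in which only `--format` mirrors the real CLI.

```python
import argparse
from typing import List


def with_default_help(argv: List[str]) -> List[str]:
    """Return argv, with "--help" appended when no arguments were given.

    argv follows the sys.argv convention (argv[0] is the program name).
    The commit mutates sys.argv in place; this sketch returns a new list.
    """
    if len(argv) == 1:
        return argv + ["--help"]
    return argv


def build_parser() -> argparse.ArgumentParser:
    # Reduced parser for illustration; not the full pdfplumber CLI.
    parser = argparse.ArgumentParser(prog="pdfplumber-sketch")
    parser.add_argument("infile", nargs="?")
    parser.add_argument("--format", choices=["csv", "json", "text"], default="csv")
    return parser


print(with_default_help(["prog"]))  # ['prog', '--help']
args = build_parser().parse_args(with_default_help(["prog", "file.pdf"])[1:])
print(args.format)  # csv
```

One plausible reason the real check sits at module level: `main`'s default argument `sys.argv[1:]` is evaluated when the function is defined, so the append has to run before that `def`.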
pdfplumber/container.py

Lines changed: 7 additions & 1 deletion
@@ -170,7 +170,13 @@ def to_csv(
 
         cols = CSV_COLS_REQUIRED + list(filter(serializer.attr_filter, non_req_cols))
 
-        w = csv.DictWriter(stream, fieldnames=cols, extrasaction="ignore")
+        w = csv.DictWriter(
+            stream,
+            fieldnames=cols,
+            extrasaction="ignore",
+            quoting=csv.QUOTE_MINIMAL,
+            escapechar="\\",
+        )
         w.writeheader()
         w.writerows(serialized)
 

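Reviewer note on the `DictWriter` change: the sketch below uses the same writer options on made-up rows to show what `extrasaction="ignore"` and `csv.QUOTE_MINIMAL` do. `rows_to_csv` and the sample data are hypothetical. Nothing here comes from a real PDF.

```python
import csv
import io
from typing import Any, Dict, List


def rows_to_csv(rows: List[Dict[str, Any]], cols: List[str]) -> str:
    """Serialize rows using the DictWriter options adopted in this commit."""
    stream = io.StringIO()
    writer = csv.DictWriter(
        stream,
        fieldnames=cols,
        extrasaction="ignore",  # keys not listed in cols are dropped silently
        quoting=csv.QUOTE_MINIMAL,  # quote only fields that need quoting
        escapechar="\\",
    )
    writer.writeheader()
    writer.writerows(rows)
    return stream.getvalue()


# Illustrative rows: one field contains the delimiter, one row has an extra key.
sample = [
    {"object_type": "char", "text": "a,b", "unlisted": 1},
    {"object_type": "char", "text": "plain"},
]
print(rows_to_csv(sample, ["object_type", "text"]))
```

`csv.QUOTE_MINIMAL` is already the `csv` module's default quoting mode, so spelling it out mostly documents intent. The `escapechar` argument is the substantive addition.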
pdfplumber/convert.py

Lines changed: 2 additions & 2 deletions
@@ -109,8 +109,8 @@ def do_dict(self, obj: Dict[str, Any]) -> Dict[str, Any]:
         else:
             return {k: self.serialize(v) for k, v in obj.items()}
 
-    def do_PDFStream(self, obj: Any) -> Dict[str, str]:
-        return {"rawdata": to_b64(obj.rawdata)}
+    def do_PDFStream(self, obj: Any) -> Dict[str, Optional[str]]:
+        return {"rawdata": to_b64(obj.rawdata) if obj.rawdata else None}
 
     def do_PSLiteral(self, obj: PSLiteral) -> str:
         return decode_text(obj.name)

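Reviewer note on `do_PDFStream`: the new guard returns `None` instead of passing missing raw data to the base64 encoder. A standalone sketch follows. The `to_b64` body here is an assumption about what such a helper does (bytes to an ASCII base64 string), and `serialize_rawdata` is a hypothetical name.

```python
import base64
from typing import Dict, Optional


def to_b64(data: bytes) -> str:
    # Assumed behavior for illustration: bytes in, ASCII base64 text out.
    return base64.b64encode(data).decode("ascii")


def serialize_rawdata(rawdata: Optional[bytes]) -> Dict[str, Optional[str]]:
    """Encode stream bytes, or emit None when there is nothing to encode."""
    return {"rawdata": to_b64(rawdata) if rawdata else None}


print(serialize_rawdata(b"abc"))  # {'rawdata': 'YWJj'}
print(serialize_rawdata(None))  # {'rawdata': None}
```

Because the guard is a truthiness check, empty bytes (`b""`) also map to `None`, not to an empty string.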
pdfplumber/display.py

Lines changed: 16 additions & 8 deletions
@@ -9,6 +9,7 @@
 from . import utils
 from ._typing import T_bbox, T_num, T_obj, T_obj_list, T_point, T_seq
 from .table import T_table_settings, Table, TableFinder, TableSettings
+from .utils.exceptions import MalformedPDFException
 
 if TYPE_CHECKING:  # pragma: nocover
     import pandas as pd
@@ -52,7 +53,11 @@ def get_page_image(
         stream.seek(0)
         src = stream
 
-    pdfium_doc = pypdfium2.PdfDocument(src, password=password)
+    try:
+        pdfium_doc = pypdfium2.PdfDocument(src, password=password)
+    except pypdfium2.PdfiumError as e:
+        raise MalformedPDFException(e)
+
     pdfium_page = pdfium_doc.get_page(page_ix)
 
     img: PIL.Image.Image = pdfium_page.render(
@@ -64,8 +69,6 @@ def get_page_image(
         # Non-modifiable arguments
         prefer_bgrx=True,
     ).to_pil()
-    # In theory `autoclose` when creating it should make it close...
-    # automatically. In practice this does not seem to be the case.
     pdfium_doc.close()
 
     return img.convert("RGB")
@@ -334,12 +337,17 @@ def debug_table(
         return self
 
     def debug_tablefinder(
-        self, tf: Optional[Union[TableFinder, TableSettings, T_table_settings]] = None
+        self,
+        table_settings: Optional[
+            Union[TableFinder, TableSettings, T_table_settings]
+        ] = None,
     ) -> "PageImage":
-        if isinstance(tf, TableFinder):
-            finder = tf
-        elif tf is None or isinstance(tf, (TableSettings, dict)):
-            finder = self.page.debug_tablefinder(tf)
+        if isinstance(table_settings, TableFinder):
+            finder = table_settings
+        elif table_settings is None or isinstance(
+            table_settings, (TableSettings, dict)
+        ):
+            finder = self.page.debug_tablefinder(table_settings)
         else:
             raise ValueError(
                 "Argument must be instance of TableFinder"

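Reviewer note on the `get_page_image` hunk: it translates a backend-specific error into a library-level exception. The sketch below shows that pattern with stand-ins only. `BackendError`, `open_document`, and `open_or_raise_malformed` are hypothetical, and the local `MalformedPDFException` merely imitates the class imported in the diff.

```python
class MalformedPDFException(Exception):
    """Stand-in for the exception class imported in the diff."""


class BackendError(Exception):
    """Hypothetical stand-in for a rendering-backend error."""


def open_document(src: bytes) -> bytes:
    # Hypothetical loader: rejects input that does not start like a PDF.
    if not src.startswith(b"%PDF-"):
        raise BackendError("Failed to load document")
    return src


def open_or_raise_malformed(src: bytes) -> bytes:
    """Translate backend failures into one library-level exception type."""
    try:
        return open_document(src)
    except BackendError as e:
        # The commit writes `raise MalformedPDFException(e)`; adding
        # `from e` is this sketch's choice, to make the cause explicit.
        raise MalformedPDFException(e) from e


try:
    open_or_raise_malformed(b"not a pdf")
except MalformedPDFException as exc:
    print(type(exc.__cause__).__name__)  # BackendError
```

Callers then need to catch only the library's exception type, whatever backend raised the original error.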
pdfplumber/page.py

Lines changed: 34 additions & 42 deletions
@@ -1,3 +1,4 @@
+import numbers
 import re
 from functools import lru_cache
 from typing import (
@@ -13,6 +14,7 @@
     Union,
 )
 from unicodedata import normalize as normalize_unicode
+from warnings import warn
 
 from pdfminer.converter import PDFPageAggregator
 from pdfminer.layout import (
@@ -34,6 +36,7 @@
 from .structure import PDFStructTree, StructTreeMissing
 from .table import T_table_settings, Table, TableFinder, TableSettings
 from .utils import decode_text, resolve_all, resolve_and_decode
+from .utils.exceptions import MalformedPDFException, PdfminerException
 from .utils.text import TextMap
 
 lt_pat = re.compile(r"^LT")
@@ -64,6 +67,7 @@
     "stroke",
     "stroking_color",
     "stream",
+    "name",
     "mcid",
     "tag",
 ]
@@ -96,29 +100,6 @@ def fix_fontname_bytes(fontname: bytes) -> str:
     return str(prefix)[2:-1] + suffix_new
 
 
-def separate_pattern(
-    color: Tuple[Any, ...]
-) -> Tuple[Optional[Tuple[Union[float, int], ...]], Optional[str]]:
-    if isinstance(color[-1], PSLiteral):
-        return (color[:-1] or None), decode_text(color[-1].name)
-    else:
-        return color, None
-
-
-def normalize_color(
-    color: Any,
-) -> Tuple[Optional[Tuple[Union[float, int], ...]], Optional[str]]:
-    if color is None:
-        return (None, None)
-    elif isinstance(color, tuple):
-        tuplefied = color
-    elif isinstance(color, list):
-        tuplefied = tuple(color)
-    else:
-        tuplefied = (color,)
-    return separate_pattern(tuplefied)
-
-
 def tuplify_list_kwargs(kwargs: Dict[str, Any]) -> Dict[str, Any]:
     return {
         key: (tuple(value) if isinstance(value, list) else value)
@@ -182,6 +163,10 @@ def _normalize_box(box_raw: T_bbox, rotation: T_num = 0) -> T_bbox:
     # conventionally specified by their lower-left and upperright
     # corners, it is acceptable to specify any two diagonally opposite
     # corners."
+    if not all(isinstance(x, numbers.Number) for x in box_raw):  # pragma: nocover
+        raise MalformedPDFException(
+            f"Bounding box contains non-number coordinate(s): {box_raw}"
+        )
     x0, x1 = sorted((box_raw[0], box_raw[2]))
     y0, y1 = sorted((box_raw[1], box_raw[3]))
     if rotation in [90, 270]:
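Reviewer note on the `_normalize_box` hunk: the added guard rejects boxes with non-numeric coordinates before the corners are sorted. A reduced sketch of the validate-then-sort step follows, with rotation handling deliberately omitted. `check_and_sort_box` is a hypothetical name, and the exception class is a local stand-in.

```python
import numbers
from typing import Any, Sequence, Tuple


class MalformedPDFException(Exception):
    """Stand-in for the exception class imported in the diff."""


def check_and_sort_box(box_raw: Sequence[Any]) -> Tuple[Any, Any, Any, Any]:
    """Validate a four-value box, then order it as (x0, y0, x1, y1).

    Any two diagonally opposite corners are accepted as input.
    Rotation handling from the real function is left out of this sketch.
    """
    if not all(isinstance(x, numbers.Number) for x in box_raw):
        raise MalformedPDFException(
            f"Bounding box contains non-number coordinate(s): {box_raw}"
        )
    x0, x1 = sorted((box_raw[0], box_raw[2]))
    y0, y1 = sorted((box_raw[1], box_raw[3]))
    return (x0, y0, x1, y1)


# Corners given as upper-right then lower-left still normalize cleanly.
print(check_and_sort_box((612, 792, 0, 0)))  # (0, 0, 612, 792)
```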
@@ -231,11 +216,14 @@ def get_attr(key: str, default: Any = None) -> Any:
 
         self.mediabox = _invert_box(mb_raw, mb_height)
 
-        if "CropBox" in page_obj.attrs:
-            self.cropbox = _invert_box(
-                _normalize_box(get_attr("CropBox"), self.rotation), mb_height
-            )
-        else:
+        for box_name in ["CropBox", "TrimBox", "BleedBox", "ArtBox"]:
+            if box_name in page_obj.attrs:
+                box_normalized = _invert_box(
+                    _normalize_box(get_attr(box_name), self.rotation), mb_height
+                )
+                setattr(self, box_name.lower(), box_normalized)
+
+        if "CropBox" not in page_obj.attrs:
             self.cropbox = self.mediabox
 
         # Page.bbox defaults to self.mediabox, but can be altered by Page.crop(...)
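Reviewer note on the box loop: each optional page box present in the page dictionary becomes a lowercase attribute, and only `cropbox` falls back to `mediabox`. The sketch below imitates that control flow on a plain dict. `PageBoxes` is a hypothetical class, and the normalize/invert steps of the real code are replaced by a simple `tuple(...)`.

```python
from typing import Any, Dict, Tuple

Box = Tuple[float, float, float, float]


class PageBoxes:
    """Hypothetical container showing how optional boxes become attributes."""

    def __init__(self, attrs: Dict[str, Any], mediabox: Box) -> None:
        self.mediabox = mediabox
        for box_name in ["CropBox", "TrimBox", "BleedBox", "ArtBox"]:
            if box_name in attrs:
                # The real code normalizes and inverts the box at this point.
                setattr(self, box_name.lower(), tuple(attrs[box_name]))
        if "CropBox" not in attrs:
            # Only cropbox gets a fallback; other absent boxes stay unset here.
            self.cropbox = self.mediabox


page = PageBoxes({"TrimBox": [10, 10, 602, 782]}, mediabox=(0, 0, 612, 792))
print(page.cropbox)  # (0, 0, 612, 792)
print(page.trimbox)  # (10, 10, 602, 782)
print(hasattr(page, "artbox"))  # False
```

In this sketch, boxes missing from the dictionary are never set. Whether pdfplumber defines a default for them elsewhere is not visible in the hunk.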
@@ -274,7 +262,10 @@ def layout(self) -> LTPage:
             laparams=self.pdf.laparams,
         )
         interpreter = PDFPageInterpreter(self.pdf.rsrcmgr, device)
-        interpreter.process_page(self.page_obj)
+        try:
+            interpreter.process_page(self.page_obj)
+        except Exception as e:
+            raise PdfminerException(e)
         self._layout: LTPage = device.get_result()
         return self._layout
 
@@ -306,7 +297,15 @@ def parse(annot: T_obj) -> T_obj:
                     try:
                         extras[k] = v.decode("utf-8")
                     except UnicodeDecodeError:
-                        extras[k] = v.decode("utf-16")
+                        try:
+                            extras[k] = v.decode("utf-16")
+                        except UnicodeDecodeError:
+                            if self.pdf.raise_unicode_errors:
+                                raise
+                            warn(
+                                f"Could not decode {k} of annotation."
+                                f" {k} will be missing."
+                            )
 
             parsed = {
                 "page_number": self.page_number,
@@ -376,13 +375,6 @@ def process_attr(item: Tuple[str, Any]) -> Optional[Tuple[str, Any]]:
             if hasattr(obj, cs):
                 attr[cs] = resolve_and_decode(getattr(obj, cs).name)
 
-        for color_attr, pattern_attr in [
-            ("stroking_color", "stroking_pattern"),
-            ("non_stroking_color", "non_stroking_pattern"),
-        ]:
-            if color_attr in attr:
-                attr[color_attr], attr[pattern_attr] = normalize_color(attr[color_attr])
-
         if isinstance(obj, (LTChar, LTTextContainer)):
             text = obj.get_text()
             attr["text"] = (
@@ -396,15 +388,15 @@ def process_attr(item: Tuple[str, Any]) -> Optional[Tuple[str, Any]]:
             # directly expose .stroking_color and .non_stroking_color
             # for LTChar objects (unlike, e.g., LTRect objects).
             gs = obj.graphicstate
-            attr["stroking_color"], attr["stroking_pattern"] = normalize_color(
-                gs.scolor
+            attr["stroking_color"] = (
+                gs.scolor if isinstance(gs.scolor, tuple) else (gs.scolor,)
             )
-            attr["non_stroking_color"], attr["non_stroking_pattern"] = normalize_color(
-                gs.ncolor
+            attr["non_stroking_color"] = (
+                gs.ncolor if isinstance(gs.ncolor, tuple) else (gs.ncolor,)
             )
 
             # Handle (rare) byte-encoded fontnames
-            if isinstance(attr["fontname"], bytes):
+            if isinstance(attr["fontname"], bytes):  # pragma: nocover
                 attr["fontname"] = fix_fontname_bytes(attr["fontname"])
 
         elif isinstance(obj, (LTCurve,)):

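Reviewer note on the color handling: with `normalize_color` removed, the `LTChar` branch now only wraps non-tuple values in a 1-tuple. The expression is isolated below as a hypothetical helper, `as_color_tuple`, to make its edge cases easy to check.

```python
from typing import Any, Tuple


def as_color_tuple(color: Any) -> Tuple[Any, ...]:
    """Pass tuples through unchanged; wrap anything else in a 1-tuple."""
    return color if isinstance(color, tuple) else (color,)


print(as_color_tuple(0.5))  # (0.5,)
print(as_color_tuple((1, 0, 0)))  # (1, 0, 0)
print(as_color_tuple(None))  # (None,)
```

Compared with the removed `normalize_color`, this expression wraps a list whole instead of converting it to a tuple, and `None` becomes `(None,)` where the old function produced `None` as the color value.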