Commit e3417d7

fix: Fix for Pillow error when extracting PNG images (#3998)
When I tried to partition a PNG file and extract images, I got an error from Pillow:

```
WARNING  unstructured:pdf_image_utils.py:230 Image Extraction Error: Skipping the failed image
Traceback (most recent call last):
  File "/Users/austin/.pyenv/versions/unstructured/lib/python3.10/site-packages/PIL/JpegImagePlugin.py", line 666, in _save
    rawmode = RAWMODE[im.mode]
KeyError: 'RGBA'
```

The issue is that a PNG can carry an alpha (transparency) channel, and an image in `RGBA` mode cannot be saved in JPEG format. We can fix this with a quick conversion to `RGB` before saving. I added a PNG test case that now passes with this fix.
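The failure described above can be reproduced with Pillow alone, outside of `unstructured`. The following is a minimal sketch, assuming Pillow is installed; the in-memory test image and the helper name `save_as_jpeg` are illustrative and are not part of the commit. Depending on the Pillow version, the direct save may surface as an `OSError` chained from the `KeyError`, so the sketch catches both.

```python
from io import BytesIO

from PIL import Image


def save_as_jpeg(image: Image.Image) -> bytes:
    """Encode an image as JPEG, dropping the alpha channel first if present.

    Hypothetical helper mirroring the conversion added in this commit.
    """
    if image.mode == "RGBA":
        image = image.convert("RGB")
    buffered = BytesIO()
    image.save(buffered, format="JPEG")
    return buffered.getvalue()


# A small RGBA image stands in for a region cropped from a PNG page.
rgba = Image.new("RGBA", (32, 32), (255, 0, 0, 128))

# Saving the RGBA image directly as JPEG is expected to fail.
try:
    rgba.save(BytesIO(), format="JPEG")
    direct_save_failed = False
except (OSError, KeyError):
    direct_save_failed = True

# Converting to RGB first lets the JPEG encoder accept the image.
jpeg_bytes = save_as_jpeg(rgba)
```

The conversion is applied only when the mode is `RGBA`, so images that are already JPEG-compatible are left untouched.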
1 parent b814ece commit e3417d7

4 files changed

Lines changed: 16 additions & 1 deletion

CHANGELOG.md

Lines changed: 9 additions & 0 deletions
@@ -1,3 +1,12 @@
+## 0.17.7-dev0
+
+### Enhancements
+
+### Features
+
+### Fixes
+- **Fix image extraction for PNG files.** When `extract_image_block_to_payload` is True, and the image is a PNG, we get a Pillow error. We need to remove the PNG transparency layer before saving the image.
+
 ## 0.17.6
 
 ### Enhancements

test_unstructured/partition/pdf_image/test_pdf_image_utils.py

Lines changed: 1 addition & 0 deletions
@@ -73,6 +73,7 @@ def test_convert_pdf_to_image_raises_error(filename=example_doc_path("embedded-i
     [
         (example_doc_path("pdf/layout-parser-paper-fast.pdf"), False),
         (example_doc_path("img/layout-parser-paper-fast.jpg"), True),
+        (example_doc_path("img/english-and-korean.png"), True),
     ],
 )
 @pytest.mark.parametrize("element_category_to_save", [ElementType.IMAGE, ElementType.TABLE])

unstructured/__version__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.17.6"  # pragma: no cover
+__version__ = "0.17.7-dev0"  # pragma: no cover

unstructured/partition/pdf_image/pdf_image_utils.py

Lines changed: 5 additions & 0 deletions
@@ -204,6 +204,11 @@ def save_elements(
 image_path = image_paths[page_index]
 image = Image.open(image_path)
 cropped_image = image.crop(padded_bbox)
+
+# PNG images with transparency need to be converted before saving
+if cropped_image.mode == "RGBA":
+    cropped_image = cropped_image.convert("RGB")
+
 if extract_image_block_to_payload:
     buffered = BytesIO()
     cropped_image.save(buffered, format="JPEG")
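One consequence of `convert("RGB")` worth keeping in mind: it simply drops the alpha channel, so fully transparent pixels keep whatever color values they stored. If that matters for the extracted images, an alternative is to composite the image onto a solid background first. The sketch below illustrates that alternative; it is not what this commit does, and the helper name `flatten_to_rgb` and the white default background are assumptions made for the example.

```python
from PIL import Image


def flatten_to_rgb(image: Image.Image, background=(255, 255, 255)) -> Image.Image:
    """Return an RGB copy of the image, blending any alpha onto a solid color.

    Hypothetical helper; not part of unstructured.
    """
    if image.mode != "RGBA":
        return image.convert("RGB")
    canvas = Image.new("RGB", image.size, background)
    # Use the alpha channel as the paste mask so transparent areas
    # show the background color instead of the hidden RGB values.
    canvas.paste(image, mask=image.getchannel("A"))
    return canvas


# Two red pixels: the first fully transparent, the second fully opaque.
sample = Image.new("RGBA", (2, 1))
sample.putpixel((0, 0), (255, 0, 0, 0))
sample.putpixel((1, 0), (255, 0, 0, 255))

flattened = flatten_to_rgb(sample)
dropped = sample.convert("RGB")
```

Here `flattened` shows the background where the source was transparent, while `dropped` exposes the stored red value in the same position.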
