Explanation
As explained in #2536, some PDFs are generated in such a way that every page references all images in the file. As a result, looping over the images of a page returns every image in the PDF, regardless of whether the image is actually displayed on that page. To reproduce this, convert a DOCX document to PDF with LibreOffice (this can be done programmatically, in batch conversions, via the soffice CLI).
As @pubpub-zz pointed out in the discussion, this default behaviour must be kept for various reasons, so it would be better to have a function like is_image_displayed(image_id:str) which tells whether the given image is displayed.
Code Example
[IMPORTANT] The following script came out of a brainstorming session with an AI (via OpenCode), so treat it as an example rather than a definitive solution. Also:
- the AI was asked to implement it as a standalone function, for validation purposes
- it currently only handles images referenced through XObjects. Inline images, caching, and any other features are to be added later
- it was inspired by @stefan6419846 's comment "The ImageFile class knows its corresponding objects, arbitrary objects can be retrieved from the associated PdfDocCommon instance."
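The XObject-only check described in the notes above comes down to collecting the operands of `Do` operators in a page's content stream. The following is a minimal sketch of that step, not a definitive implementation: it works on plain `(operands, operator)` tuples shaped like the ones the script reads from `ContentStream.operations`, and `drawn_xobject_names` is a hypothetical helper name. Collecting all names in one pass also means the stream only has to be parsed once per page.

```python
from typing import Iterable, Sequence, Set, Tuple

# (operands, operator) pairs, shaped like the entries the script
# reads from ContentStream.operations (operator given as bytes).
Operation = Tuple[Sequence[object], bytes]


def drawn_xobject_names(operations: Iterable[Operation]) -> Set[str]:
    """Return the names painted by `Do` operators (hypothetical helper)."""
    names: Set[str] = set()
    for operands, operator in operations:
        # `Do` takes a single operand: the name of the XObject to paint.
        if operator == b"Do" and operands:
            names.add(str(operands[0]))
    return names


# Hand-written operations standing in for a parsed content stream.
sample_operations = [
    ([], b"q"),
    ([100, 0, 0, 100, 72, 600], b"cm"),
    (["/Im0"], b"Do"),
    ([], b"Q"),
]
drawn = drawn_xobject_names(sample_operations)
print(drawn)
```

A name that appears in the page resources but not in this set is referenced without being painted directly by the page. Form XObjects and inline images are not followed here, matching the limitation stated above.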
How would your feature be used? (Remove this if it is not applicable.)
from pypdf import PdfReader
from pypdf.generic import ContentStream


def is_image_displayed(
    pdf_reader: PdfReader,
    page_number: int,
    image_id: str
) -> bool:
    """
    Check if an XObject image is actually used on a PDF page.
    Only checks direct XObjects (no nested forms/inline images).

    Args:
        pdf_reader: PdfReader instance (PdfDocCommon base class)
        page_number: Page number (0-based index)
        image_id: Image name as reported by page.images (e.g., 'Image1.jp2')

    Returns:
        True if image appears in content stream Do operators, False otherwise

    Note:
        TODO: Cache content stream operations to avoid reparsing
    """
    # Get page
    page = pdf_reader.pages[page_number]

    # Access raw /Contents directly (get_contents() may return empty)
    raw_contents = page.get('/Contents', None)
    if not raw_contents:
        return False

    # Parse operations from raw stream
    contents = ContentStream(raw_contents, pdf_reader)

    # Find the requested image among the images referenced by this page
    for img in page.images:
        # Match by reported name (Stefan's indirect_reference idea is not used yet)
        if img.name == image_id:
            # Look for "/XObjectName Do" in operations
            for operands, operator in contents.operations:
                if operator == b"Do":
                    xobj_name = str(operands[0])
                    # Compare base names (without extension like .jp2)
                    img_base = img.name.split('.')[0].lstrip('/')
                    xobj_base = xobj_name.lstrip('/')
                    if img_base == xobj_base:
                        return True  # Image is displayed!
            # Image is referenced by the page but never painted
            return False

    # No image with this name is referenced by the page
    return False


if __name__ == "__main__":
    reader = PdfReader("example.pdf")
    print(f"Total pages: {len(reader.pages)}")
    print("\n--- Testing all images ---\n")

    for page_num, page in enumerate(reader.pages):
        print(f"\n=== Page {page_num + 1} ===")
        total_images = list(page.images)
        displayed = []
        not_displayed = []

        for img in total_images:
            is_used = is_image_displayed(reader, page_num, img.name)
            if is_used:
                displayed.append(img.name)
            else:
                not_displayed.append(img.name)

        if total_images:
            print(f"\nTotal images referenced: {len(total_images)}")
            print(f"Images displayed: {displayed}")
            if not_displayed:
                print(f"Not displayed: {not_displayed}")
As a validation example, this PDF contains an image on page 1 and an image on page 3, but since it was converted from DOCX via LibreOffice, the image references are repeated on every page. With this method, only the first and last pages are reported as displaying an image each:
example.pdf
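The per-page report from the validation can also be expressed as a pure function over names, which makes the matching rule easy to test without a PDF. This is a sketch under assumptions: `split_by_display` is a hypothetical helper, the image names are taken to be in the form the script compares (an XObject name plus a file extension, such as `Image1.jp2`), and the set of painted names is assumed to have been collected from the page's `Do` operators beforehand.

```python
from typing import Iterable, List, Set, Tuple


def split_by_display(
    image_names: Iterable[str], drawn_names: Set[str]
) -> Tuple[List[str], List[str]]:
    """Split image names into (displayed, not_displayed) for one page.

    Hypothetical helper: `drawn_names` holds XObject names with their
    leading slash, e.g. {'/Image1'}.
    """
    displayed: List[str] = []
    not_displayed: List[str] = []
    for name in image_names:
        # Drop only the last extension, so a dotted XObject name such as
        # 'Im1.0' survives when the image is reported as 'Im1.0.png'.
        stem = name.rsplit(".", 1)[0]
        # Accept either the full name or the extension-less stem.
        candidates = {"/" + name.lstrip("/"), "/" + stem.lstrip("/")}
        if candidates & drawn_names:
            displayed.append(name)
        else:
            not_displayed.append(name)
    return displayed, not_displayed


# Page 1 of a document where both images are referenced but only one is painted.
shown, hidden = split_by_display(["Image1.jpg", "Image2.png"], {"/Image1"})
print(shown, hidden)
```

Splitting on the last dot rather than the first is a deliberate choice here. It can still produce a false match if an image name without an extension ends in a dotted suffix, so comparing indirect references instead of names would be the more robust long-term approach.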