Extract images from specific pages with wrong references #3737

@andreasntr

Description

Explanation

As explained in #2536, some PDFs are generated in such a way that every page references all images in the file. As a result, looping over the images of a page returns every image in the PDF, with no check of whether each image is actually displayed on that page. To reproduce this, take a DOCX document and convert it to PDF via LibreOffice (this can be done programmatically via the soffice CLI in batch conversions).
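For anyone who wants to reproduce such a file, the conversion step can be sketched as follows. This is a minimal sketch, assuming `soffice` is on the PATH; the helper names `build_soffice_command` and `convert_to_pdf` are hypothetical and only serve the illustration.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_soffice_command(docx_path: str, out_dir: str) -> List[str]:
    """Build the headless LibreOffice command that converts one file to PDF."""
    return [
        "soffice",
        "--headless",
        "--convert-to", "pdf",
        "--outdir", out_dir,
        docx_path,
    ]


def convert_to_pdf(docx_path: str, out_dir: str) -> Path:
    """Run the conversion and return the expected path of the resulting PDF."""
    if shutil.which("soffice") is None:
        raise RuntimeError("soffice was not found on PATH")
    subprocess.run(build_soffice_command(docx_path, out_dir), check=True)
    # LibreOffice keeps the input file name and swaps the extension.
    return Path(out_dir) / (Path(docx_path).stem + ".pdf")
```

Calling `convert_to_pdf("report.docx", "out")` should leave `out/report.pdf` behind, which can then be opened with `PdfReader` to inspect the per-page image references.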

As @pubpub-zz pointed out in the discussion, this default behaviour must be kept for various reasons, so it would be better to have a function like is_image_displayed(image_id: str) which tells whether the given image is actually displayed.

Code Example

[IMPORTANT] The following script is the result of a brainstorming session with an AI (via OpenCode), so take it as an example and not as a definitive solution. Also:

  • the AI was asked to implement this as a standalone function for validation purposes
  • it currently only handles images referenced by XObjects. Inline images, caching and any other features are to be added later
  • it was inspired by @stefan6419846 's comment "The ImageFile class knows its corresponding objects, arbitrary objects can be retrieved from the associated PdfDocCommon instance."
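The caching mentioned in the list above could look roughly like this: collect the names used by `Do` operators once per page, then answer each image lookup with a set membership test. This is a hedged sketch, not pypdf API. The helpers `displayed_xobject_names` and `image_base_name` are hypothetical, and the sample list only imitates the `(operands, operator)` pairs that a parsed content stream yields.

```python
from typing import Iterable, Sequence, Set, Tuple

# One parsed content stream operation: (operands, operator).
Operation = Tuple[Sequence[object], bytes]


def displayed_xobject_names(operations: Iterable[Operation]) -> Set[str]:
    """Return the names passed to Do operators, without the leading slash."""
    names: Set[str] = set()
    for operands, operator in operations:
        if operator == b"Do" and operands:
            names.add(str(operands[0]).lstrip("/"))
    return names


def image_base_name(image_name: str) -> str:
    """Drop the file extension (assumed to follow the last dot) and any slash."""
    return image_name.rsplit(".", 1)[0].lstrip("/")


# Hypothetical operations for a page that draws only /Im1.
sample_operations = [
    ([], b"q"),
    (["/Im1"], b"Do"),
    ([], b"Q"),
]

used = displayed_xobject_names(sample_operations)
print(image_base_name("Im1.jpg") in used)  # True
print(image_base_name("Im2.png") in used)  # False
```

With this shape, the content stream is parsed once per page, and the set can be stored per page number instead of being rebuilt for every image.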

How would your feature be used?

from pypdf import PdfReader
from pypdf.generic import ContentStream

def is_image_displayed(
    pdf_reader: 'PdfReader',
    page_number: int,
    image_id: str
) -> bool:
    """
    Check if an XObject image is actually used on a PDF page.
    
    Only checks direct XObjects (no nested forms/inline images).
    
    Args:
        pdf_reader: PdfReader instance (PdfDocCommon base class)
        page_number: Page number (0-based index)
        image_id: Image name as reported by page.images (e.g., 'Image1.jpg')
    
    Returns:
        True if image appears in content stream Do operators, False otherwise
    
    Note:
        TODO: Cache content stream operations to avoid reparsing
    """
    
    # Get page
    page = pdf_reader.pages[page_number]
    
    # Access raw /Contents directly (get_contents() may return empty)
    raw_contents = page.get('/Contents', None)
    if not raw_contents:
        return False
    
    # Parse operations from raw stream
    contents = ContentStream(raw_contents, pdf_reader)
    
    # Check each image on this page for Do operator usage
    for img in page.images:
        # Find the requested image among those referenced by the page
        if img.name == image_id:
            # Look for "Do /XObjectName" in operations
            for operands, operator in contents.operations:
                if operator == b"Do":
                    xobj_name = str(operands[0])
                    
                    # Compare base names (drop only the trailing extension like .jp2)
                    img_base = img.name.rsplit('.', 1)[0].lstrip('/')
                    xobj_base = xobj_name.lstrip('/')
                    
                    if img_base == xobj_base:
                        return True  # Image is displayed!
            
            return False
    
    return False


if __name__ == "__main__":
    reader = PdfReader("example.pdf")
    
    print(f"Total pages: {len(reader.pages)}")
    print("\n--- Testing all images ---\n")
    
    for page_num, page in enumerate(reader.pages):
        print(f"\n=== Page {page_num + 1} ===")
        
        total_images = list(page.images)
        displayed = []
        not_displayed = []
        
        for img in total_images:
            is_used = is_image_displayed(reader, page_num, img.name)
            if is_used:
                displayed.append(img.name)
            else:
                not_displayed.append(img.name)
        
        if total_images:
            print(f"\nTotal images referenced: {len(total_images)}")
            print(f"Images displayed: {displayed}")
            if not_displayed:
                print(f"Not displayed: {not_displayed}")

As a validation example, the attached PDF contains one image on page 1 and one image on page 3, but since it has been converted from DOCX via LibreOffice, the image references are repeated across all pages. With this method, only the first and last pages are reported as displaying an image each.

example.pdf

    Labels

    is-feature: A feature request
    workflow-images: From a user's perspective, image handling is the affected feature/workflow
