Extract images from specific pages with wrong references

## Explanation

As explained in #2536, some PDFs are generated in such a way that every page references all images in the file. This means that looping over the images of a page returns every image in the PDF, without checking whether each one is actually displayed on that page. To reproduce, take a DOCX document and convert it to PDF via LibreOffice (this can be done programmatically via the `soffice` CLI in batch conversions).
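
To make the symptom concrete, here is a minimal sketch that counts the images reported per page. The helper name `count_images_per_page` is hypothetical; it only relies on `reader.pages` and `page.images`, which pypdf's `PdfReader` provides. On an affected file, every page would be expected to report the same count regardless of what is actually drawn.

```python
from typing import Any, List


def count_images_per_page(reader: Any) -> List[int]:
    """Return how many images each page reports via ``page.images``.

    ``reader`` is expected to behave like ``pypdf.PdfReader``: it needs a
    ``pages`` sequence whose items expose an iterable ``images`` attribute.
    """
    # list() mirrors how the script in this issue materialises page.images.
    return [len(list(page.images)) for page in reader.pages]
```

With pypdf installed, calling `count_images_per_page(PdfReader("example.pdf"))` should show whether a given file is affected.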

As @pubpub-zz pointed out in the discussion, this default behaviour must be kept for various reasons, so it would be better to have a function like `is_image_displayed(image_id: str)` that tells whether the given image is displayed.

## Code Example

[IMPORTANT] The following script came out of a brainstorming session with an AI (via OpenCode), so treat it as an example and not a definitive solution. Also:
- the AI was asked to implement it as a standalone function for validation purposes
- it currently only handles images referenced by XObjects; inline images, caching, and other features are to be added later
- this was inspired by @stefan6419846 's comment "_The ImageFile class knows its corresponding objects, arbitrary objects can be retrieved from the associated PdfDocCommon instance._"


```python
from pypdf import PdfReader
from pypdf.generic import ContentStream

def is_image_displayed(
    pdf_reader: 'PdfReader',
    page_number: int,
    image_id: str
) -> bool:
    """
    Check if an XObject image is actually used on a PDF page.
    
    Only checks direct XObjects (no nested forms/inline images).
    
    Args:
        pdf_reader: PdfReader instance (PdfDocCommon base class)
        page_number: Page number (0-based index)
        image_id: Image name as exposed by ImageFile.name (e.g., 'Image1.jpg')
    
    Returns:
        True if image appears in content stream Do operators, False otherwise
    
    Note:
        TODO: Cache content stream operations to avoid reparsing
    """
    
    # Get page
    page = pdf_reader.pages[page_number]
    
    # Access raw /Contents directly (get_contents() may return empty)
    raw_contents = page.get('/Contents', None)
    if not raw_contents:
        return False
    
    # Parse operations from raw stream
    contents = ContentStream(raw_contents, pdf_reader)
    
    # Check each image on this page for Do operator usage
    for img in page.images:
        # Find the requested image among those referenced by the page
        if img.name == image_id:
            # Look for "Do /XObjectName" in operations
            for operands, operator in contents.operations:
                if operator == b"Do":
                    xobj_name = str(operands[0])
                    
                    # Compare base names (without extension like .jp2)
                    img_base = img.name.split('.')[0].lstrip('/')
                    xobj_base = xobj_name.lstrip('/')
                    
                    if img_base == xobj_base:
                        return True  # Image is displayed!
            
            return False
    
    return False


if __name__ == "__main__":
    reader = PdfReader("example.pdf")
    
    print(f"Total pages: {len(reader.pages)}")
    print("\n--- Testing all images ---\n")
    
    for page_num, page in enumerate(reader.pages):
        print(f"\n=== Page {page_num + 1} ===")
        
        total_images = list(page.images)
        displayed = []
        not_displayed = []
        
        for img in total_images:
            is_used = is_image_displayed(reader, page_num, img.name)
            if is_used:
                displayed.append(img.name)
            else:
                not_displayed.append(img.name)
        
        if total_images:
            print(f"\nTotal images referenced: {len(total_images)}")
            print(f"Images displayed: {displayed}")
            if not_displayed:
                print(f"Not displayed: {not_displayed}")
```

As a validation example, the PDF below contains one image on page 1 and one on page 3, but since it was converted from DOCX via LibreOffice, the image references are repeated across pages. With `is_image_displayed`, only the first and last pages are reported as containing an image each.

[example.pdf](https://github.com/user-attachments/files/26871971/example.pdf)
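
One way to address the caching TODO in the script above is to collect the names used by `Do` operators once per page and then test membership. The sketch below is an illustration under stated assumptions, not pypdf API: `collect_do_names` and `is_name_displayed` are hypothetical helpers, and `operations` is assumed to be a sequence of `(operands, operator)` pairs shaped like `ContentStream.operations`, with the operator given as bytes. As in the script, nested form XObjects and inline images are not handled.

```python
from typing import Iterable, Sequence, Set, Tuple

# Assumed shape of one parsed content stream operation.
Operation = Tuple[Sequence[object], bytes]


def collect_do_names(operations: Iterable[Operation]) -> Set[str]:
    """Return the XObject names painted by ``Do`` operators, without the leading slash."""
    names: Set[str] = set()
    for operands, operator in operations:
        if operator == b"Do" and operands:
            names.add(str(operands[0]).lstrip("/"))
    return names


def is_name_displayed(image_name: str, do_names: Set[str]) -> bool:
    """Check an image name such as ``"Im0.png"`` against the collected names.

    Mirrors the comparison in the script: the extension and any leading
    slash are dropped before comparing.
    """
    base = image_name.split(".")[0].lstrip("/")
    return base in do_names
```

Computing the set once per page would let a loop over `page.images` run a cheap membership test instead of scanning the parsed operations again for each image.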
