Page.get_image_info() is returning an empty array. Need a reliable way to detect non-text PDF pages #5017

@manikantaaddagatla

Description of the bug

Hi Team,

I have a page where:

  1. page.get_image_info() returns an empty list, whereas pymupdf4llm.to_markdown gives 2 page blocks with class 'picture'.
  2. page.get_text() returns an empty result.

Please find the PDF file attached.
Thanks for the support.

bug_page1.pdf

How to reproduce the bug

Run the Python code below:

import pymupdf
import pymupdf4llm

input_path = "<path_to_file>"
doc = pymupdf.open(input_path)

for page in doc:
    print(f"text {page.get_text()}")

    image_infos = page.get_image_info(xrefs=True)
    print(f"page number {page.number} image_infos : {image_infos}")

documentText = pymupdf4llm.to_markdown(doc, page_chunks=True)
print(f"document text : {documentText}")

Versions:
PyMuPDF 1.27.2.2
pymupdf-layout 1.27.2.2
pymupdf4llm 1.27.2.2
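
For the "reliable way to detect non-text pages" part of the question, a minimal sketch follows. It rests on an assumption this report does not confirm: that the 'picture' blocks are vector drawings rather than embedded images, which would explain why page.get_image_info() finds nothing. The names PageStats, collect_stats, classify_page and scan are hypothetical helpers, and the labels and rules are illustrative only. The PyMuPDF calls used are page.get_text("words"), page.get_image_info() and page.get_drawings().

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Counts of the content kinds found on one page."""
    words: int
    images: int
    drawings: int


def collect_stats(page) -> PageStats:
    """Gather counts from a pymupdf.Page (or any object with the same methods)."""
    return PageStats(
        words=len(page.get_text("words")),   # extractable words
        images=len(page.get_image_info()),   # images displayed on the page
        drawings=len(page.get_drawings()),   # vector graphics paths
    )


def classify_page(stats: PageStats) -> str:
    """Hypothetical labels; adjust the rules to your own documents."""
    if stats.words > 0:
        return "text"
    if stats.images > 0:
        return "image-only"
    if stats.drawings > 0:
        return "vector-only"
    return "empty"


def scan(path: str) -> list[tuple[int, str]]:
    """Classify every page of a PDF. Requires PyMuPDF to be installed."""
    import pymupdf  # imported lazily so the helpers above work without it

    with pymupdf.open(path) as doc:
        return [(page.number, classify_page(collect_stats(page))) for page in doc]
```

Calling scan("<path_to_file>") returns one (page number, label) pair per page. A page that reports "vector-only" has no extractable text and no embedded images but does contain drawing paths, which would match the symptoms described above if the assumption holds.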

PyMuPDF version

1.27.2.2

Operating system

MacOS

Python version

3.14

Labels

not a bug (not a bug / user error / unable to reproduce)
