bug/incorrect text extraction by partition_pdf with hi_res strategy #4092

@VishwaRajput

Describe the bug
The text of one extracted element does not match the text in the PDF.
I have a PDF that contains tables. I am extracting elements for my RAG application with the partition_pdf function, using the hi_res strategy with the yolox model. The text is simple and repeats throughout the PDF, but the model misses one particular spot: the actual text is "AUTOSAR Administration", while the element text returned by partition_pdf is "Teton".

To Reproduce
from unstructured.documents.elements import ListItem, NarrativeText, Table, Title
from unstructured.partition.pdf import partition_pdf

# Partition the PDF with the hi_res strategy and the yolox layout model.
elements = partition_pdf(
    filename="mypdf.pdf",
    strategy="hi_res",
    infer_table_structure=True,
    model_name="yolox",
)

# Print each element's page, type, and text, plus type-specific details.
for i, element in enumerate(elements):
    print(f"\nElement {i+1}:")
    print(f"  Page Number: {element.metadata.page_number}")
    print(f"  Type: {type(element).__name__}")
    print(f"  Text: {element.text}")

    if isinstance(element, Table):
        print(f"  This is a Table element. \n {element.metadata.text_as_html}")
    elif isinstance(element, Title):
        print(f"  This is a Title element. Category Depth: {element.metadata.category_depth}")
    elif isinstance(element, NarrativeText):
        print("  This is a Narrative Text element.")
    elif isinstance(element, ListItem):
        print("  This is a List Item element.")

Expected behavior
The table HTML should be as below:

2007-01-31 |2.1.0AUTOSAR Administratione Harmonization of the document with other specifications (e.g. RTE) e Introduction of a new concept to support calibration and measurement - harmonized with RTE e Description of needs of the Software Component Template toward AUTOSAR services and of the interaction of the Software Component Template and
2006-05-18 |2.0.0AUTOSAR AdministrationSecond
2005-05-09| 1.0.0| AUTOSAR AdministrationInitial release

But instead it is as below (note the "Teton" in the first line):

2007-01-31 |2.1.0Tetone Harmonization of the document with other specifications (e.g. RTE) e Introduction of a new concept to support calibration and measurement - harmonized with RTE e Description of needs of the Software Component Template toward AUTOSAR services and of the interaction of the Software Component Template and
2006-05-18 |2.0.0AUTOSAR AdministrationSecond
2005-05-09| 1.0.0| AUTOSAR AdministrationInitial release
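
Since the misread phrase is expected in every row, a quick check like the following could flag the affected rows automatically. This is only a minimal sketch: `rows_missing_phrase` is a hypothetical helper, not part of unstructured, and the rows are copied from the output above with the first one shortened.

```python
def rows_missing_phrase(rows, phrase):
    """Return the indices of rows that do not contain `phrase`."""
    return [i for i, row in enumerate(rows) if phrase not in row]


# Rows as returned by partition_pdf in this report (first row shortened).
actual_rows = [
    "2007-01-31 |2.1.0Tetone Harmonization of the document with other specifications (e.g. RTE)",
    "2006-05-18 |2.0.0AUTOSAR AdministrationSecond",
    "2005-05-09| 1.0.0| AUTOSAR AdministrationInitial release",
]

# Every row should name "AUTOSAR Administration"; list the ones that do not.
suspect = rows_missing_phrase(actual_rows, "AUTOSAR Administration")
print(suspect)  # [0]
```

Only the first row is flagged, which matches the spot where "Teton" appears.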

Screenshots
Image
