Header row(s) not extracting properly #46

@bhoppeadoy

I'm trying to extract the table in this PDF: ads7828-4.pdf

Here's what I'm doing:

from gmft.auto import AutoTableDetector, AutoTableFormatter, AutoFormatConfig
from gmft.pdf_bindings import PyPDFium2Document

detector = AutoTableDetector()
config = AutoFormatConfig()
config.verbosity = 3
config.semantic_spanning_cells = True
#config.enable_multi_header = True
#config.large_table_assumption = True
formatter = AutoTableFormatter(config)

doc = PyPDFium2Document("ads7828-4.pdf")

tables = []
for page in doc:
    tables += detector.extract(page)

from IPython.display import display

ft = formatter.extract(tables[0])
display(ft.visualize(filter=[5]))
df = ft.df()

print("\nTable Headers:")
print(df.columns.tolist())

Visualizing the header row, it looks like it's capturing several extra empty cells.

Image

And so the headers come out like so:

Image

What I want is for the headers to come out like this:

Image

I'm not sure what I can tweak to clean up how it detects the header cells. Any suggestions?
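
In the meantime, one workaround I'm considering is cleaning up the DataFrame after extraction instead of changing how the header is detected. This is only a rough sketch using plain pandas: it assumes the extra header cells show up in `df` as columns with a blank name and no cell content, and that the column index is flat (not a MultiIndex). The helper name `drop_empty_header_columns` and the `example` frame are made up for illustration; they are not part of gmft, and the real headers are only visible in the screenshots above.

```python
import pandas as pd


def drop_empty_header_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Drop columns whose header is blank AND whose cells are all blank.

    Hypothetical helper, not part of gmft. Assumes a flat column index.
    """
    keep = []
    for i, name in enumerate(df.columns):
        header_blank = pd.isna(name) or str(name).strip() == ""
        # Select by position so duplicate (blank) column names are handled.
        col = df.iloc[:, i]
        cells_blank = col.isna() | (col.astype(str).str.strip() == "")
        # Keep the column unless both its header and every cell are blank.
        keep.append(not (header_blank and cells_blank.all()))
    return df.loc[:, keep]


# Made-up stand-in for the output of ft.df().
example = pd.DataFrame(
    [["a", None, "1", ""], ["b", None, "2", ""]],
    columns=["Name", "", "Value", ""],
)
cleaned = drop_empty_header_columns(example)
print(cleaned.columns.tolist())
```

This would not help if the extra cells split real header text across columns, so I'd still prefer a formatter-side setting if one exists.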

Labels: structure accuracy (issue related to recognizing table structure, "format")
