I'm trying to extract the table in this PDF: ads7828-4.pdf
Here's what I'm doing:
from gmft.auto import AutoFormatConfig, AutoTableDetector, AutoTableFormatter
from gmft.pdf_bindings import PyPDFium2Document
from IPython.display import display

detector = AutoTableDetector()
config = AutoFormatConfig()
config.verbosity = 3
config.semantic_spanning_cells = True
#config.enable_multi_header = True
#config.large_table_assumption = True
formatter = AutoTableFormatter(config)

# Collect the tables detected on every page
doc = PyPDFium2Document("ads7828-4.pdf")
tables = []
for page in doc:
    tables += detector.extract(page)

ft = formatter.extract(tables[0])
display(ft.visualize(filter=[5]))
df = ft.df()
print("\nTable Headers:")
print(df.columns.tolist())
Visualizing the header row, it looks like it's capturing several extra empty cells.

And so the headers come out like so:

What I want is for the headers to come out like this:

I'm not sure what I can tweak to clean up how it detects the header cells. Any suggestions?
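One stopgap I've considered is filtering out the blank headers after extraction instead of fixing the detection. Here's a minimal sketch of that idea. The helper names and the sample header list are made up for illustration, and it assumes the extra cells show up as empty, whitespace-only, None, or NaN column names, which I haven't confirmed:

```python
import math


def is_blank_header(name):
    """Return True for a header that carries no text (None, NaN, or whitespace)."""
    if name is None:
        return True
    if isinstance(name, float) and math.isnan(name):
        return True
    return str(name).strip() == ""


def header_keep_mask(columns):
    """Return one boolean per column: True where the header has text."""
    return [not is_blank_header(c) for c in columns]


# Hypothetical header list standing in for df.columns.tolist();
# these are NOT the real headers from the PDF.
columns = ["PARAMETER", "", "MIN", None, "TYP", "  ", "MAX", "UNITS"]
mask = header_keep_mask(columns)
cleaned = [c for c, keep in zip(columns, mask) if keep]
print(cleaned)
```

With the real DataFrame, the same mask could be passed to df.loc[:, mask] to drop those columns. That would only be safe if the blank-header columns are empty all the way down, so I'd still prefer a config option that fixes the header detection itself.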