Commit 199f255

feat: Track the table extraction method (#4346)
<!-- CURSOR_SUMMARY -->
> [!NOTE]
> **Low Risk**
> Adds a new optional metadata field and wires it through PDF partitioning; behavior is additive and low risk aside from potential downstream consumers expecting a fixed metadata schema.
>
> **Overview**
> **Adds table-provenance metadata for extracted tables.** `ElementMetadata` now includes an optional `table_extraction_method` (e.g., `grid`, `tatr`, `vlm`) and includes it in metadata consolidation.
>
> During PDF partitioning, the value is propagated from each `LayoutElement` onto the resulting element metadata, enabling downstream consumers to identify which table-extraction algorithm produced a given table. Version is bumped to `0.22.26` and the changelog is updated accordingly.
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit fe6135a. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
1 parent f51769b commit 199f255
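As a rough illustration of how a downstream consumer might use the new field, the sketch below tallies table elements by extraction method. The `count_tables_by_method` helper and the `SimpleNamespace` stand-ins are hypothetical and not part of the library. Only the `table_extraction_method` attribute name and the `grid`/`tatr`/`vlm` values come from this commit.

```python
from collections import Counter
from types import SimpleNamespace


def count_tables_by_method(elements):
    """Count Table-like elements by the algorithm that extracted them."""
    counts = Counter()
    for element in elements:
        if getattr(element, "category", None) != "Table":
            continue
        # getattr with a default tolerates metadata objects that predate the field.
        method = getattr(element.metadata, "table_extraction_method", None)
        counts[method or "unknown"] += 1
    return counts


# Hypothetical stand-ins for partitioned elements (not real library objects).
elements = [
    SimpleNamespace(category="Table", metadata=SimpleNamespace(table_extraction_method="tatr")),
    SimpleNamespace(category="Table", metadata=SimpleNamespace(table_extraction_method="grid")),
    SimpleNamespace(category="Table", metadata=SimpleNamespace()),
    SimpleNamespace(category="NarrativeText", metadata=SimpleNamespace()),
]
table_counts = count_tables_by_method(elements)
```

Reading the field defensively matters here because it is optional: consumers that assume a fixed metadata schema should treat a missing or `None` value as "method not recorded".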

4 files changed

Lines changed: 14 additions & 1 deletion

CHANGELOG.md

Lines changed: 6 additions & 0 deletions
@@ -1,3 +1,9 @@
+## 0.22.26
+
+### Enhancements
+
+- Add `table_extraction_method` field to `ElementMetadata` to track which algorithm produced a table (grid, tatr, vlm). Propagated from `LayoutElement` during PDF partitioning.
+
 ## 0.22.25
 
 ### Enhancements

unstructured/__version__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.22.25" # pragma: no cover
+__version__ = "0.22.26" # pragma: no cover

unstructured/documents/elements.py

Lines changed: 4 additions & 0 deletions
@@ -213,6 +213,7 @@ class ElementMetadata:
     text_as_html: Optional[str]
     is_extracted: Optional[str]
     table_as_cells: Optional[dict[str, str | int]]
+    table_extraction_method: Optional[str] # "grid", "tatr", or "vlm"
 
     # -- used for TableChunk elements to enable table reconstruction --
     table_id: Optional[str]
@@ -267,6 +268,7 @@ def __init__(
         signature: Optional[str] = None,
         subject: Optional[str] = None,
         table_as_cells: Optional[dict[str, str | int]] = None,
+        table_extraction_method: Optional[str] = None,
         table_id: Optional[str] = None,
         chunk_index: Optional[int] = None,
         num_carried_over_header_rows: Optional[int] = None,
@@ -320,6 +322,7 @@ def __init__(
         self.subject = subject
         self.text_as_html = text_as_html
         self.table_as_cells = table_as_cells
+        self.table_extraction_method = table_extraction_method
         self.table_id = table_id
         self.chunk_index = chunk_index
         self.num_carried_over_header_rows = num_carried_over_header_rows
@@ -548,6 +551,7 @@ def field_consolidation_strategies(cls) -> dict[str, ConsolidationStrategy]:
             "subject": cls.FIRST,
             "text_as_html": cls.STRING_CONCATENATE,
             "table_as_cells": cls.FIRST, # -- only occurs in Table --
+            "table_extraction_method": cls.FIRST,
             "table_id": cls.DROP, # -- added by chunking, not before --
             "chunk_index": cls.DROP, # -- added by chunking, not before --
             "num_carried_over_header_rows": cls.DROP, # -- added by chunking, not before --
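The last hunk registers the new field under the `FIRST` consolidation strategy. As a minimal sketch of what a "first value wins" strategy could look like, assuming metadata is modelled as plain dicts: `consolidate_first` is a hypothetical helper for illustration, not the library's implementation.

```python
def consolidate_first(metadatas, field_name):
    """Return the first non-None value of `field_name`, or None if no entry has one."""
    for metadata in metadatas:
        value = metadata.get(field_name)
        if value is not None:
            return value
    return None


# Hypothetical per-element metadata dicts being merged into one chunk.
pre_chunk = [
    {"text_as_html": None},
    {"table_extraction_method": "grid"},
    {"table_extraction_method": "vlm"},
]
chosen = consolidate_first(pre_chunk, "table_extraction_method")
```

Under this reading, a chunk built from several elements would carry the extraction method of the first element that has one, mirroring how `table_as_cells` is handled in the same mapping.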

unstructured/partition/pdf.py

Lines changed: 3 additions & 0 deletions
@@ -1443,6 +1443,9 @@ def document_to_element_list(
 element.metadata.last_modified = last_modification_date
 element.metadata.text_as_html = getattr(layout_element, "text_as_html", None)
 element.metadata.table_as_cells = getattr(layout_element, "table_as_cells", None)
+element.metadata.table_extraction_method = getattr(
+    layout_element, "table_extraction_method", None
+)
 
 if (isinstance(element, Title) and element.metadata.category_depth is None) and (
     has_headline
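The propagation above relies on `getattr` with a `None` default, so layout elements that never set the attribute still yield well-formed metadata. A small self-contained sketch of that pattern follows. `copy_table_provenance` and the `SimpleNamespace` objects are hypothetical stand-ins, not names from the codebase.

```python
from types import SimpleNamespace


def copy_table_provenance(layout_element, metadata):
    """Copy the extraction method onto metadata, defaulting to None when absent."""
    metadata.table_extraction_method = getattr(
        layout_element, "table_extraction_method", None
    )
    return metadata


# One layout element records its method; the other predates the attribute.
with_method = copy_table_provenance(
    SimpleNamespace(table_extraction_method="vlm"), SimpleNamespace()
)
without_method = copy_table_provenance(SimpleNamespace(), SimpleNamespace())
```

The default keeps the change additive: non-table elements and layout objects from older inference versions simply end up with `None` rather than raising `AttributeError`.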
