Commit d570f46

Fix sort_page_element: ensure that sorting is stable and not random (#3978)
sort_page_element() used the element id as the final sort key. As a result, two executions of the same code on the same file produced different results: the order of the elements was random. This made it impossible to write stable unit tests, for example, or to obtain reproducible results.
1 parent dfa17bd commit d570f46

3 files changed

Lines changed: 24 additions & 1 deletion

CHANGELOG.md

Lines changed: 3 additions & 0 deletions
@@ -5,6 +5,9 @@
 ### Features
 
 ### Fixes
+- The sort_page_element() use the element id to sort the elements.
+Two executions of the same code, on the same file, produce different results. The order of the elements is random.
+This makes it impossible to write stable unit tests, for example, or to obtain reproducible results.
 - **Do not use NLP to determine element types for extracted elements with hi_res.** This avoids extraneous Title elements in hi_res outputs. This only applies to *extracted* elements, meaning text objects that are found outside of Object Detection objects which get mapped to *inferred* elements. (*extracted* and *inferred* elements get merged together to form the list of `Element`s returned by `pdf_partition()`)
 
 ## 0.17.5

test_unstructured/partition/pdf_image/test_pdf.py

Lines changed: 21 additions & 0 deletions
@@ -1603,3 +1603,24 @@ def test_partition_pdf_with_specified_ocr_agents(mocker):
 
     assert spy.call_args_list[0][1] == {"language": "eng", "ocr_agent_module": OCR_AGENT_TESSERACT}
     assert spy.call_args_list[1][1] == {"language": "en", "ocr_agent_module": OCR_AGENT_PADDLE}
+
+
+def test_reproductible_pdf_loader():
+    from glob import glob
+
+    for f in glob(example_doc_path("pdf/layout-parser-paper.pdf")):
+        elements_1 = pdf.partition_pdf(
+            filename=f,
+            strategy=PartitionStrategy.AUTO,
+            infer_table_structure=False,
+        )
+        for _ in range(4):
+            elements_2 = pdf.partition_pdf(
+                filename=f,
+                strategy=PartitionStrategy.AUTO,
+                infer_table_structure=False,
+            )
+            for e1, e2 in zip(elements_1, elements_2):
+                assert e1.text == e2.text, f"load two time {f=} return differents results"
+        else:
+            break
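The test compares the two runs with `zip`, which stops at the shorter sequence, so a run that returned fewer elements would not be detected. A stricter check could look like the minimal sketch below. `assert_reproducible` is a hypothetical helper, not part of the commit, and `partition` is a stand-in callable that returns element texts rather than a call to `partition_pdf` itself.

```python
from typing import Callable, Sequence


def assert_reproducible(partition: Callable[[], Sequence[str]], runs: int = 4) -> None:
    """Hypothetical helper: call `partition` repeatedly and compare each result
    against the first one, checking both length and order."""
    baseline = list(partition())
    for run in range(1, runs + 1):
        current = list(partition())
        # zip() alone would silently ignore a length mismatch, so check it first.
        assert len(current) == len(baseline), (
            f"run {run}: got {len(current)} elements, expected {len(baseline)}"
        )
        for pos, (expected, actual) in enumerate(zip(baseline, current)):
            assert expected == actual, (
                f"run {run}: element {pos} differs: {expected!r} != {actual!r}"
            )


if __name__ == "__main__":
    # A deterministic stand-in passes the check without raising.
    assert_reproducible(lambda: ["Title", "First paragraph", "Second paragraph"])
```

In a real test the callable would wrap the partitioning call and return something like `[e.text for e in elements]`; that wiring is left out here because it depends on the project's fixtures.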

unstructured/partition/utils/sorting.py

Lines changed: 0 additions & 1 deletion
@@ -179,7 +179,6 @@ def _coords_ok(strict_points: bool):
                 key=lambda el: (
                     el.metadata.coordinates.points[0][1] if el.metadata.coordinates else float("inf"),
                     el.metadata.coordinates.points[0][0] if el.metadata.coordinates else float("inf"),
-                    el.id,
                 ),
             )
         else:
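The effect of dropping `el.id` from the key can be illustrated with a minimal sketch. It assumes that element ids differ from one run to the next, which the commit message implies but this diff does not show, and it models them with random UUIDs. `FakeElement`, `position_key`, `sort_elements` and `make_page` are hypothetical stand-ins, not the library's classes or functions, and the first point is assumed to be an `(x, y)` pair.

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class FakeElement:
    """Hypothetical stand-in for an element; not the library's class."""
    text: str
    top_left: Optional[Tuple[float, float]]  # assumed (x, y) of the first point
    # Assumption: ids change between runs, modelled here with a random UUID.
    id: str = field(default_factory=lambda: uuid.uuid4().hex)


def position_key(el: FakeElement) -> Tuple[float, float]:
    # Same shape as the key in the diff: index 1 first, then index 0,
    # with float("inf") when there are no coordinates.
    second = el.top_left[1] if el.top_left else float("inf")
    first = el.top_left[0] if el.top_left else float("inf")
    return (second, first)


def sort_elements(elements: List[FakeElement], use_id_tiebreak: bool = False) -> List[FakeElement]:
    if use_id_tiebreak:
        # Behaviour before the commit: ties on position are broken by the id.
        return sorted(elements, key=lambda el: (*position_key(el), el.id))
    # Behaviour after the commit: sorted() is stable, so ties keep input order.
    return sorted(elements, key=position_key)


def make_page() -> List[FakeElement]:
    return [
        FakeElement("a", (10.0, 20.0)),
        FakeElement("b", (10.0, 20.0)),  # same position as "a"
        FakeElement("c", (5.0, 20.0)),
        FakeElement("d", None),
    ]


if __name__ == "__main__":
    print([e.text for e in sort_elements(make_page())])
    print([e.text for e in sort_elements(make_page(), use_id_tiebreak=True)])
```

Without the id in the key, elements that share a position stay in the order in which they were passed in, because Python's `sorted()` is guaranteed to be stable. With the id as a tie-breaker, the relative order of `"a"` and `"b"` depends on whichever id happens to compare lower.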
