Commit 6689db0

awalker4 and claude authored
fix(PLU-373): make notion database test row-order insensitive (#714)
## Summary

- Notion's database query endpoint doesn't guarantee stable row order, but `test_notion_source_database` compared the downloaded HTML byte-for-byte. Every prior "fix" (#693, #605) was just reshuffling the same four rows.
- Add `unordered_table_html_equality_check`, which compares the header positionally and the data rows as a multiset of their text content, and wire it into the database test only (the page test isn't a table).
- On mismatch the new check prints the symmetric difference so the next real regression won't be a black box like https://github.com/Unstructured-IO/unstructured-ingest/actions/runs/26238030726/job/77216703553 was.
- Version bumped to `1.6.6-dev` to satisfy `check-version` without claiming a release slot for a test-only change.

## Test plan

- [x] `uncategorized_connectors_int_test` Notion tests pass (real verification; needs the CI's `NOTION_API_KEY`)
- [x] Smoke-tested locally against the real fixture: reordered rows compare equal, a tampered row fails with a readable diff

<!-- This is an auto-generated description by cubic. -->

---

## Summary by cubic

Make the Notion database integration test row-order insensitive and correct a misleading helper comment, eliminating flaky failures when Notion returns rows in different orders. Addresses PLU-373.

- **Bug Fixes**
  - Added `unordered_table_html_equality_check` to compare the header positionally and treat table rows as a multiset; prints a readable symmetric diff on mismatch.
  - Used in `test_notion_source_database` only; the page test remains unchanged.
  - Fixed the helper comment to state it scans all `<tr>` elements, not just the first table.
  - Bumped version to `1.6.6-dev` and updated `CHANGELOG.md`; test-only change with no runtime impact.

<sup>Written for commit 87d92cc. Summary will update on new commits. <a href="https://cubic.dev/pr/Unstructured-IO/unstructured-ingest/pull/714?utm_source=github">Review in cubic</a></sup>

<!-- End of auto-generated description by cubic. -->

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
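
The "symmetric difference" reporting mentioned in the summary can be illustrated with a small standalone sketch. It relies only on `collections.Counter` subtraction, which drops non-positive counts, so each side ends up holding the rows the other side lacks. The helper name `multiset_diff` and the row strings are hypothetical and are not taken from the commit or the Notion fixture.

```python
from collections import Counter


def multiset_diff(expected_rows, current_rows):
    """Return (only_in_expected, only_in_current) as Counters.

    Counter subtraction keeps only positive counts, so duplicates are
    respected: a row that appears twice on one side and once on the
    other is reported once.
    """
    expected_counts = Counter(expected_rows)
    current_counts = Counter(current_rows)
    return expected_counts - current_counts, current_counts - expected_counts


# Hypothetical row texts for illustration only.
expected = ["a 1", "b 2", "b 2", "c 3"]
reordered = ["c 3", "b 2", "a 1", "b 2"]
tampered = ["c 3", "b 2", "a 1", "b 9"]

# Same rows in a different order: nothing is reported on either side.
only_expected, only_current = multiset_diff(expected, reordered)
print(dict(only_expected), dict(only_current))

# One row changed: each side reports exactly the row the other lacks.
only_expected, only_current = multiset_diff(expected, tampered)
for row, n in only_expected.items():
    print(f"only in expected (x{n}): {row}")
for row, n in only_current.items():
    print(f"only in current (x{n}): {row}")
```

Comparing counts rather than sets matters when a table legitimately contains identical rows; a plain `set` comparison would not notice a dropped duplicate.
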
1 parent 2f899f5 commit 6689db0

4 files changed

Lines changed: 55 additions & 1 deletion

CHANGELOG.md

Lines changed: 6 additions & 0 deletions
@@ -1,3 +1,9 @@
+## [1.6.6-dev]
+
+### Fixes
+
+- **test(notion): make `test_notion_source_database` row-order insensitive.** Test-only change; no published behavior.
+
 ## [1.6.5]
 
 ### Fixes

test/integration/connectors/test_notion.py

Lines changed: 4 additions & 0 deletions
@@ -3,6 +3,9 @@
 import pytest
 
 from test.integration.connectors.utils.constants import SOURCE_TAG, UNCATEGORIZED_TAG
+from test.integration.connectors.utils.validation.equality import (
+    unordered_table_html_equality_check,
+)
 from test.integration.connectors.utils.validation.source import (
     SourceValidationConfigs,
     get_all_file_data,
@@ -59,6 +62,7 @@ def test_notion_source_database(temp_dir):
             exclude_fields_extend=["metadata.date_created", "metadata.date_modified"],
             predownload_file_data_check=source_filedata_display_name_set_check,
             postdownload_file_data_check=source_filedata_display_name_set_check,
+            file_equality_check=unordered_table_html_equality_check,
         ),
     )
 

test/integration/connectors/utils/validation/equality.py

Lines changed: 44 additions & 0 deletions
@@ -1,4 +1,5 @@
 import json
+from collections import Counter
 from pathlib import Path
 
 from bs4 import BeautifulSoup
@@ -47,6 +48,49 @@ def html_equality_check(expected_filepath: Path, current_filepath: Path) -> bool
     return expected_soup.text == current_soup.text
 
 
+def unordered_table_html_equality_check(
+    expected_filepath: Path, current_filepath: Path
+) -> bool:
+    # Equality check for HTML files whose rows arrive in arbitrary order.
+    # The first <tr> in the document is compared positionally as a header;
+    # remaining <tr>s are compared as a multiset of their text content. Used
+    # for connectors whose upstream API doesn't guarantee stable row ordering
+    # (e.g. Notion's database query response).
+    with expected_filepath.open() as expected_f:
+        expected_soup = BeautifulSoup(expected_f, "html.parser")
+    with current_filepath.open() as current_f:
+        current_soup = BeautifulSoup(current_f, "html.parser")
+
+    def split_rows(soup: BeautifulSoup) -> tuple[str, list[str]]:
+        rows = soup.find_all("tr")
+        if not rows:
+            return "", []
+        header = rows[0].get_text(" ", strip=True)
+        data = sorted(r.get_text(" ", strip=True) for r in rows[1:])
+        return header, data
+
+    expected_header, expected_data = split_rows(expected_soup)
+    current_header, current_data = split_rows(current_soup)
+
+    if expected_header != current_header:
+        print("table header differs:")
+        print(f"  expected: {expected_header}")
+        print(f"  current:  {current_header}")
+        return False
+    if expected_data != current_data:
+        expected_counts = Counter(expected_data)
+        current_counts = Counter(current_data)
+        only_in_expected = expected_counts - current_counts
+        only_in_current = current_counts - expected_counts
+        print("table rows differ (order-insensitive):")
+        for row, n in only_in_expected.items():
+            print(f"  only in expected (x{n}): {row}")
+        for row, n in only_in_current.items():
+            print(f"  only in current  (x{n}): {row}")
+        return False
+    return True
+
+
 def txt_equality_check(expected_filepath: Path, current_filepath: Path) -> bool:
     with expected_filepath.open() as expected_f:
         expected_text_lines = expected_f.readlines()
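
To see the behavior the test plan describes (reordered rows compare equal, a tampered row fails) without the Notion fixture, here is a rough in-memory sketch. It is not the helper from this commit: it swaps BeautifulSoup for the standard library's `html.parser` so it runs with no third-party packages, it takes HTML strings instead of file paths, and the names `RowTextParser`, `rows_match_unordered`, and `table` are hypothetical. It also assumes flat tables with no nested `<tr>` elements.

```python
from collections import Counter
from html.parser import HTMLParser


class RowTextParser(HTMLParser):
    """Collect the space-joined text of every <tr> in a document."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._chunks = None  # text chunks of the <tr> currently open

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "tr" and self._chunks is not None:
            self.rows.append(" ".join(self._chunks))
            self._chunks = None

    def handle_data(self, data):
        text = data.strip()
        if self._chunks is not None and text:
            self._chunks.append(text)


def rows_match_unordered(expected_html, current_html):
    """First <tr> must match positionally; the rest match as a multiset."""

    def split(html):
        parser = RowTextParser()
        parser.feed(html)
        parser.close()
        rows = parser.rows
        return (rows[0], Counter(rows[1:])) if rows else ("", Counter())

    return split(expected_html) == split(current_html)


def table(rows):
    """Build a minimal HTML table from tuples of cell text."""
    body = "".join(
        "<tr>" + "".join(f"<td>{cell}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table>{body}</table>"


# Hypothetical data, not the real fixture.
HEADER = ("Name", "Status")
expected = table([HEADER, ("alpha", "done"), ("beta", "open")])
reordered = table([HEADER, ("beta", "open"), ("alpha", "done")])
tampered = table([HEADER, ("beta", "open"), ("alpha", "closed")])
header_moved = table([("alpha", "done"), HEADER, ("beta", "open")])

print(rows_match_unordered(expected, reordered))
print(rows_match_unordered(expected, tampered))
print(rows_match_unordered(expected, header_moved))
```

The last case shows why the header is kept out of the multiset: a header row that drifts into the data rows is treated as a mismatch rather than a harmless reordering.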

unstructured_ingest/__version__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "1.6.5"  # pragma: no cover
+__version__ = "1.6.6-dev"  # pragma: no cover
