Commit a40df61

Migrate to uv, drop 3.9 and 3.10, fix tests (#335)
<!-- CURSOR_SUMMARY -->
> [!NOTE]
> **Medium Risk**
> Medium risk because it changes packaging/build tooling and CI execution (Poetry→uv, setuptools build) and adjusts split-PDF hook timeout/cleanup behavior, which can affect test stability and request handling.
>
> **Overview**
> Migrates the project from Poetry to `uv`: CI now installs via `uv sync --locked`, the Makefile runs lint/tests with `uv run`, and publishing is switched to `uv build`/`uv publish` with a hardened `scripts/publish.sh` (strict bash + Python >=3.11 guard). Python support is narrowed to 3.11+ (CI matrix and `pylintrc`), dependency versions are updated, and `poetry.lock`/`poetry.toml` are removed in favor of a setuptools-based `pyproject.toml` with dynamic versioning.
>
> Improves split-PDF behavior and test robustness: the split hook now propagates request timeouts into chunk requests, scales the outer future timeout by concurrency “waves”, and ensures per-operation state is cleaned up on both success and dummy-request failures; corresponding unit/integration tests were updated (including relaxed equivalence checks for `hi_res` OCR outputs and longer client timeouts). Adds regression-guard unit tests to enforce key packaging/CI/publish invariants and multipart file serialization, and removes an unused/disabled encryption test suite.
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit 3e0d3d3. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
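The overview's point about scaling the outer future timeout by concurrency "waves" can be illustrated with a minimal sketch. The split hook's diff is not part of this excerpt, so the function name, its parameters, and the exact formula below are assumptions for illustration, not the hook's actual code. The idea is that with `concurrency` requests in flight at once, `num_chunks` chunks finish in roughly `ceil(num_chunks / concurrency)` sequential waves, so the outer wait should cover that many per-request timeouts.

```python
import math


def outer_timeout_seconds(num_chunks: int, concurrency: int, per_request_timeout: float) -> float:
    """Hypothetical helper: scale a per-request timeout by the number of waves.

    A "wave" is one batch of up to `concurrency` chunk requests running in parallel.
    This is an assumed formula, not taken from the split hook itself.
    """
    if num_chunks <= 0:
        # Nothing to split: a single request's timeout is enough.
        return per_request_timeout
    waves = math.ceil(num_chunks / max(1, concurrency))
    return per_request_timeout * waves


# 12 chunks at concurrency 5 run in 3 waves (5 + 5 + 2).
example = outer_timeout_seconds(num_chunks=12, concurrency=5, per_request_timeout=60.0)
```

Under these assumptions a fixed outer timeout would be too short whenever the chunk count exceeds the concurrency level, which is the failure mode the scaling is meant to avoid.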
1 parent 0571f93 commit a40df61

18 files changed

+1380
-1703
lines changed

.github/workflows/ci.yaml

Lines changed: 37 additions & 9 deletions
@@ -19,67 +19,95 @@ concurrency:
 
 jobs:
   test_unit:
     strategy:
+      fail-fast: false
       matrix:
-        python-version: [ "3.9","3.10","3.11", "3.12" ]
+        python-version: [ "3.11", "3.12", "3.13" ]
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v4
+      - uses: astral-sh/setup-uv@v6
       - name: Set up Python ${{ matrix.python-version }}
         uses: actions/setup-python@v5
         with:
           python-version: ${{ matrix.python-version }}
       - name: Install dependencies
+        env:
+          UV_LOCKED: "1"
+          UV_PYTHON: ${{ matrix.python-version }}
         run: make install
       - name: Run unit tests
+        env:
+          UV_PYTHON: ${{ matrix.python-version }}
         run: |
           make test-unit
 
   lint:
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v4
+      - uses: astral-sh/setup-uv@v6
+      - name: Set up Python 3.13
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.13"
       - name: Install dependencies
+        env:
+          UV_LOCKED: "1"
+          UV_PYTHON: "3.13"
         run: make install
       - name: Lint
+        env:
+          UV_PYTHON: "3.13"
         run: |
           make lint
 
   test_integration:
     strategy:
+      fail-fast: false
       matrix:
-        python-version: [ "3.9","3.10","3.11", "3.12" ]
-    runs-on: ubuntu-latest
+        python-version: [ "3.11", "3.12", "3.13" ]
+    runs-on: opensource-linux-8core
     steps:
       - uses: actions/checkout@v4
+      - uses: astral-sh/setup-uv@v6
       - name: Set up Python ${{ matrix.python-version }}
         uses: actions/setup-python@v5
         with:
           python-version: ${{ matrix.python-version }}
       - name: Install dependencies
+        env:
+          UV_LOCKED: "1"
+          UV_PYTHON: ${{ matrix.python-version }}
         run: make install
       - name: Run integration tests
-        run: |
-          make test-integration-docker
         env:
+          UV_PYTHON: ${{ matrix.python-version }}
           UNSTRUCTURED_API_KEY: ${{ secrets.UNSTRUCTURED_API_KEY }}
+        run: |
+          make test-integration-docker
 
   test_contract:
     strategy:
+      fail-fast: false
       matrix:
-        python-version: [ "3.9","3.10","3.11", "3.12" ]
-    runs-on: ubuntu-latest
-    env:
-      POETRY_VIRTUALENVS_IN_PROJECT: "true"
+        python-version: [ "3.11", "3.12", "3.13" ]
+    runs-on: opensource-linux-8core
     steps:
       - uses: actions/checkout@v4
+      - uses: astral-sh/setup-uv@v6
       - name: Set up Python ${{ matrix.python-version }}
         uses: actions/setup-python@v5
         with:
           python-version: ${{ matrix.python-version }}
       - name: Install dependencies
+        env:
+          UV_LOCKED: "1"
+          UV_PYTHON: ${{ matrix.python-version }}
         run: |
           make install
       - name: Run contract tests
+        env:
+          UV_PYTHON: ${{ matrix.python-version }}
         run: |
           make test-contract
 

Makefile

Lines changed: 7 additions & 8 deletions
@@ -10,9 +10,8 @@ DOCKER_IMAGE ?= downloads.unstructured.io/unstructured-io/unstructured-api:lates
 ## install: installs all test, dev, and experimental requirements
 .PHONY: install
 install:
-	pip install -U poetry
 	python scripts/prepare_readme.py
-	poetry install
+	uv sync --locked
 
 ## install-speakeasy-cli: download the speakeasy cli tool
 .PHONY: install-speakeasy-cli
@@ -28,30 +27,30 @@ test: test-unit test-integration-docker
 
 .PHONY: test-unit
 test-unit:
-	PYTHONPATH=. poetry run pytest -n auto _test_unstructured_client -v -k "unit"
+	PYTHONPATH=. uv run pytest -n auto _test_unstructured_client -v -k "unit"
 
 .PHONY: test-contract
 test-contract:
-	PYTHONPATH=. poetry run pytest -n auto _test_contract -v
+	PYTHONPATH=. uv run pytest -n auto _test_contract -v
 
 # Assumes you have unstructured-api running on localhost:8000
 .PHONY: test-integration
 test-integration:
-	PYTHONPATH=. poetry run pytest -n auto _test_unstructured_client -v -k "integration"
+	PYTHONPATH=. uv run pytest -n auto _test_unstructured_client -v -k "integration"
 
 # Runs the unstructured-api in docker for tests
 .PHONY: test-integration-docker
 test-integration-docker:
 	-docker stop unstructured-api && docker kill unstructured-api
 	docker run --name unstructured-api -p 8000:8000 -d --rm ${DOCKER_IMAGE} --host 0.0.0.0 && \
 	curl -s -o /dev/null --retry 10 --retry-delay 5 --retry-all-errors http://localhost:8000/general/docs && \
-	PYTHONPATH=. poetry run pytest -n auto _test_unstructured_client -v -k "integration" && \
+	PYTHONPATH=. uv run pytest -n auto _test_unstructured_client -v -k "integration" && \
 	docker kill unstructured-api
 
 .PHONY: lint
 lint:
-	poetry run pylint --rcfile=pylintrc src
-	poetry run mypy src
+	uv run pylint --rcfile=pylintrc src
+	uv run mypy src
 
 #############
 # Speakeasy #

_test_contract/conftest.py

Lines changed: 1 addition & 2 deletions
@@ -7,8 +7,7 @@
 
 from unstructured_client import UnstructuredClient, utils
 
-# Python 3.9 workaround: eagerly import retries to avoid lazy import race condition
-# This prevents a KeyError in module lock when templates.py triggers lazy import of utils.retries
+# Eagerly import retries to avoid a lazy import race when templates.py first loads utils.retries.
 from unstructured_client.utils import retries  # noqa: F401
 
 FAKE_API_KEY = "91pmLBeETAbXCpNylRsLq11FdiZPTk"

_test_unstructured_client/integration/test_decorators.py

Lines changed: 100 additions & 43 deletions
@@ -1,5 +1,7 @@
 from __future__ import annotations
 
+from collections import Counter, defaultdict
+import math
 import tempfile
 from pathlib import Path
 from typing import Literal
@@ -22,6 +24,95 @@
 from unstructured_client._hooks.custom import split_pdf_hook
 
 FAKE_KEY = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
+TEST_TIMEOUT_MS = 300_000
+
+_HI_RES_STRATEGIES = ("hi_res", Strategy.HI_RES)
+
+
+def _allowed_delta(expected: int, *, absolute: int, ratio: float) -> int:
+    return max(absolute, math.ceil(expected * ratio))
+
+
+def _text_size(elements) -> int:
+    return sum(len((element.get("text") or "").strip()) for element in elements)
+
+
+def _elements_by_page(elements):
+    pages = defaultdict(list)
+    for element in elements:
+        pages[element["metadata"]["page_number"]].append(element)
+    return pages
+
+
+def _assert_hi_res_output_is_similar(resp_split, resp_single):
+    split_pages = _elements_by_page(resp_split.elements)
+    single_pages = _elements_by_page(resp_single.elements)
+
+    assert set(split_pages) == set(single_pages)
+
+    assert abs(len(resp_split.elements) - len(resp_single.elements)) <= _allowed_delta(
+        len(resp_single.elements),
+        absolute=4,
+        ratio=0.1,
+    )
+
+    split_type_counts = Counter(element["type"] for element in resp_split.elements)
+    single_type_counts = Counter(element["type"] for element in resp_single.elements)
+    assert set(split_type_counts) == set(single_type_counts)
+    for element_type, expected_count in single_type_counts.items():
+        assert abs(split_type_counts[element_type] - expected_count) <= _allowed_delta(
+            expected_count,
+            absolute=2,
+            ratio=0.2,
+        )
+
+    assert abs(_text_size(resp_split.elements) - _text_size(resp_single.elements)) <= _allowed_delta(
+        _text_size(resp_single.elements),
+        absolute=250,
+        ratio=0.2,
+    )
+
+    for page_number, single_page_elements in single_pages.items():
+        split_page_elements = split_pages[page_number]
+
+        assert abs(len(split_page_elements) - len(single_page_elements)) <= _allowed_delta(
+            len(single_page_elements),
+            absolute=2,
+            ratio=0.2,
+        )
+        assert abs(_text_size(split_page_elements) - _text_size(single_page_elements)) <= _allowed_delta(
+            _text_size(single_page_elements),
+            absolute=120,
+            ratio=0.3,
+        )
+
+
+def _assert_split_unsplit_equivalent(resp_split, resp_single, strategy, extra_exclude_paths=None):
+    """Compare split-PDF and single-request responses.
+
+    For hi_res (OCR-based), splitting changes per-page context so text and
+    OCR text can vary slightly. We still check page coverage, type distribution,
+    and text volume so split requests cannot silently drift too far.
+    For deterministic strategies (fast, etc.) we keep strict DeepDiff equality.
+    """
+    assert resp_split.status_code == resp_single.status_code
+    assert resp_split.content_type == resp_single.content_type
+
+    if strategy in _HI_RES_STRATEGIES:
+        _assert_hi_res_output_is_similar(resp_split, resp_single)
+    else:
+        assert len(resp_split.elements) == len(resp_single.elements)
+
+        excludes = [r"root\[\d+\]\['metadata'\]\['parent_id'\]"]
+        if extra_exclude_paths:
+            excludes.extend(extra_exclude_paths)
+
+        diff = DeepDiff(
+            t1=resp_split.elements,
+            t2=resp_single.elements,
+            exclude_regex_paths=excludes,
+        )
+        assert len(diff) == 0
 
 
 @pytest.mark.parametrize("concurrency_level", [1, 2, 5])
@@ -53,7 +144,7 @@ def test_integration_split_pdf_has_same_output_as_non_split(
     except requests.exceptions.ConnectionError:
         assert False, "The unstructured-api is not running on localhost:8000"
 
-    client = UnstructuredClient(api_key_auth=FAKE_KEY)
+    client = UnstructuredClient(api_key_auth=FAKE_KEY, timeout_ms=TEST_TIMEOUT_MS)
 
     with open(filename, "rb") as f:
         files = shared.Files(
@@ -100,18 +191,7 @@ def test_integration_split_pdf_has_same_output_as_non_split(
         request=req,
     )
 
-    assert len(resp_split.elements) == len(resp_single.elements)
-    assert resp_split.content_type == resp_single.content_type
-    assert resp_split.status_code == resp_single.status_code
-
-    diff = DeepDiff(
-        t1=resp_split.elements,
-        t2=resp_single.elements,
-        exclude_regex_paths=[
-            r"root\[\d+\]\['metadata'\]\['parent_id'\]",
-        ],
-    )
-    assert len(diff) == 0
+    _assert_split_unsplit_equivalent(resp_split, resp_single, strategy)
 
 
 @pytest.mark.parametrize(("filename", "expected_ok", "strategy"), [
@@ -136,7 +216,7 @@ def test_integration_split_pdf_with_caching(
     except requests.exceptions.ConnectionError:
         assert False, "The unstructured-api is not running on localhost:8000"
 
-    client = UnstructuredClient(api_key_auth=FAKE_KEY)
+    client = UnstructuredClient(api_key_auth=FAKE_KEY, timeout_ms=TEST_TIMEOUT_MS)
 
     with open(filename, "rb") as f:
         files = shared.Files(
@@ -183,19 +263,7 @@ def test_integration_split_pdf_with_caching(
         request=req
     )
 
-    assert len(resp_split.elements) == len(resp_single.elements)
-    assert resp_split.content_type == resp_single.content_type
-    assert resp_split.status_code == resp_single.status_code
-
-    diff = DeepDiff(
-        t1=resp_split.elements,
-        t2=resp_single.elements,
-        exclude_regex_paths=[
-            r"root\[\d+\]\['metadata'\]\['parent_id'\]",
-            r"root\[\d+\]\['element_id'\]",
-        ],
-    )
-    assert len(diff) == 0
+    _assert_split_unsplit_equivalent(resp_split, resp_single, strategy)
 
     # make sure the cache dir was cleaned if passed explicitly
     if cache_dir:
@@ -212,7 +280,7 @@ def test_long_pages_hi_res(filename):
         split_pdf_concurrency_level=15
     ), )
 
-    client = UnstructuredClient(api_key_auth=FAKE_KEY)
+    client = UnstructuredClient(api_key_auth=FAKE_KEY, timeout_ms=TEST_TIMEOUT_MS)
 
     response = client.general.partition(
         request=req,
@@ -231,7 +299,7 @@ def test_integration_split_pdf_for_file_with_no_name():
     except requests.exceptions.ConnectionError:
         assert False, "The unstructured-api is not running on localhost:8000"
 
-    client = UnstructuredClient(api_key_auth=FAKE_KEY)
+    client = UnstructuredClient(api_key_auth=FAKE_KEY, timeout_ms=TEST_TIMEOUT_MS)
 
     with open("_sample_docs/layout-parser-paper-fast.pdf", "rb") as f:
         files = shared.Files(
@@ -287,7 +355,7 @@ def test_integration_split_pdf_with_page_range(
     except requests.exceptions.ConnectionError:
         assert False, "The unstructured-api is not running on localhost:8000"
 
-    client = UnstructuredClient(api_key_auth=FAKE_KEY)
+    client = UnstructuredClient(api_key_auth=FAKE_KEY, timeout_ms=TEST_TIMEOUT_MS)
 
     filename = "_sample_docs/layout-parser-paper.pdf"
     with open(filename, "rb") as f:
@@ -351,7 +419,7 @@ def test_integration_split_pdf_strict_mode(
     except requests.exceptions.ConnectionError:
         assert False, "The unstructured-api is not running on localhost:8000"
 
-    client = UnstructuredClient(api_key_auth=FAKE_KEY)
+    client = UnstructuredClient(api_key_auth=FAKE_KEY, timeout_ms=TEST_TIMEOUT_MS)
 
     with open(filename, "rb") as f:
         files = shared.Files(
@@ -400,18 +468,7 @@ def test_integration_split_pdf_strict_mode(
         server_url="http://localhost:8000",
     )
 
-    assert len(resp_split.elements) == len(resp_single.elements)
-    assert resp_split.content_type == resp_single.content_type
-    assert resp_split.status_code == resp_single.status_code
-
-    diff = DeepDiff(
-        t1=resp_split.elements,
-        t2=resp_single.elements,
-        exclude_regex_paths=[
-            r"root\[\d+\]\['metadata'\]\['parent_id'\]",
-        ],
-    )
-    assert len(diff) == 0
+    _assert_split_unsplit_equivalent(resp_split, resp_single, strategy)
 
 
 @pytest.mark.asyncio
