Commit 4a77a8c

lawrence-u10d, claude, and badGarnet authored
fix: self-install pinned spaCy model at runtime with SHA256 verification (#4258)
## Summary

- Replace `en-core-web-sm` direct URL dependency in `pyproject.toml` with the `installer` library
- spaCy model is now downloaded and installed on first use with SHA256 hash verification
- Removes `[tool.uv.sources]` section, making the install more portable across package managers

## Test plan

- [ ] Verify `tokenize.py` downloads and installs the spaCy model on first use
- [ ] Verify SHA256 hash check rejects tampered wheels
- [ ] Verify existing NLP tokenization tests pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Introduces runtime download-and-install behavior into the NLP path and writes into `site-packages`, which can fail under restricted networking/permissions or in unusual multi-process environments despite locking and hash checks.
>
> **Overview**
> Updates NLP tokenization to **lazy-load and self-install** the pinned `en_core_web_sm` spaCy model on first use, downloading the wheel from GitHub and verifying it via **SHA256**, with a cross-process `FileLock` to avoid concurrent installs.
>
> Removes the `en-core-web-sm` wheel URL dependency and `[tool.uv.sources]` override, adding `installer` (for wheel installation) and `filelock` (for install locking) to dependencies; `uv.lock` is updated accordingly and the version is bumped to `0.21.2`.
>
> Adjusts the `Dockerfile` to trigger model installation during image build (via `uv run` importing `_get_nlp`) so the model is present before `HF_HUB_OFFLINE=1` is set.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit df62a9c. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Yao You <yao@unstructured.io>
1 parent 47b8b5e commit 4a77a8c

6 files changed

Lines changed: 157 additions & 24 deletions

CHANGELOG.md

Lines changed: 5 additions & 0 deletions
@@ -1,3 +1,8 @@
+## 0.21.2
+
+### Fixes
+- **Self-install pinned spaCy model at runtime with SHA256 verification**: Replace the `en-core-web-sm` direct URL dependency in `pyproject.toml` with the `installer` library. The spaCy model is now downloaded and installed on first use with hash verification, removing the need for `[tool.uv.sources]` and making the install more portable.
+
 ## 0.21.1
 
 - Bump version to create a new release
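
The hash verification this changelog entry mentions can be sketched as below. This is a minimal illustration under stated assumptions, not the code from the commit: `sha256_of_file`, `verify_file`, and `demo` are hypothetical names, the payload is made-up demo data, and the file is hashed in chunks, whereas the commit reads the whole wheel into memory before hashing.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops at the first empty read (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path: str, expected_sha256: str) -> None:
    """Raise RuntimeError if the file's digest differs from the pinned one."""
    actual = sha256_of_file(path)
    if actual != expected_sha256:
        raise RuntimeError(f"Hash mismatch: expected {expected_sha256}, got {actual}")


def demo() -> bool:
    """Return True if a modified copy of a file is rejected (hypothetical data)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "payload.whl")
        with open(path, "wb") as f:
            f.write(b"original bytes")
        pinned = sha256_of_file(path)
        verify_file(path, pinned)  # passes: content matches the pinned digest

        with open(path, "ab") as f:
            f.write(b"tampered")  # simulate a modified download
        try:
            verify_file(path, pinned)
        except RuntimeError:
            return True
    return False
```

A check along these lines could back the "rejects tampered wheels" item in the test plan without touching the network.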

Dockerfile

Lines changed: 2 additions & 1 deletion
@@ -71,8 +71,9 @@ ENV TESSDATA_PREFIX=/usr/local/share/tessdata
 ENV UV_COMPILE_BYTECODE=1
 ENV UV_PYTHON_DOWNLOADS=never
 
-# Install Python dependencies via uv (en-core-web-sm is declared in pyproject.toml)
+# Install Python dependencies via uv, then trigger spaCy model self-install while network is available
 RUN uv sync --locked --all-extras --no-group dev --no-group lint --no-group test --no-group release && \
+  uv run --no-sync $PYTHON -c "from unstructured.nlp.tokenize import _get_nlp; print('spaCy model loaded:', _get_nlp().meta['name'])" && \
   uv run --no-sync $PYTHON -c "from unstructured.partition.model_init import initialize; initialize()" && \
   uv run --no-sync $PYTHON -c "from unstructured_inference.models.tables import UnstructuredTableTransformerModel; model = UnstructuredTableTransformerModel(); model.initialize('microsoft/table-transformer-structure-recognition')"
 

pyproject.toml

Lines changed: 2 additions & 4 deletions
@@ -29,7 +29,7 @@ dependencies = [
     "langdetect>=1.0.9, <2.0.0",
     "lxml>=5.0.0, <7.0.0",
     "spacy>=3.7.0, <4.0.0",
-    "en-core-web-sm>=3.8.0, <4.0.0",
+    "installer>=0.7.0, <1.0.0",
     "numba>=0.60.0, <1.0.0",
     "numpy>=1.26.0, <3.0.0",
     "psutil>=7.2.2, <8.0.0",
@@ -43,6 +43,7 @@ dependencies = [
     "typing-extensions>=4.15.0, <5.0.0",
     "unstructured-client>=0.25.9, <1.0.0",
     "wrapt>=1.0.0, <2.0.0",
+    "filelock>=3.12.0,<4.0.0",
 ]
 
 [project.optional-dependencies]
@@ -181,9 +182,6 @@ release = [
     "twine>=6.0.0, <7.0.0",
 ]
 
-[tool.uv.sources]
-en-core-web-sm = { url = "https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl" }
-
 [tool.uv]
 required-environments = [
     "sys_platform == 'linux' and platform_machine == 'x86_64'",

unstructured/__version__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.21.1"  # pragma: no cover
+__version__ = "0.21.2"  # pragma: no cover

unstructured/nlp/tokenize.py

Lines changed: 134 additions & 8 deletions
@@ -1,25 +1,151 @@
 from __future__ import annotations
 
+import hashlib
+import importlib
+import logging
+import os
+import shutil
+import sys
+import sysconfig
+import tempfile
+import urllib.error
+import urllib.request
 from functools import lru_cache
 from typing import Final, List, Tuple
 
 import spacy
+from filelock import FileLock
+
+logger = logging.getLogger(__name__)
 
 CACHE_MAX_SIZE: Final[int] = 128
 
-try:
-    _nlp = spacy.load("en_core_web_sm")
-except OSError:
-    raise OSError(
-        "The spacy model 'en_core_web_sm' is required but not installed. "
-        "Install it with: python -m spacy download en_core_web_sm"
-    )
+_SPACY_MODEL_NAME: Final[str] = "en_core_web_sm"
+_SPACY_MODEL_VERSION: Final[str] = "3.8.0"
+_SPACY_MODEL_URL: Final[str] = (
+    f"https://github.com/explosion/spacy-models/releases/download/"
+    f"{_SPACY_MODEL_NAME}-{_SPACY_MODEL_VERSION}/"
+    f"{_SPACY_MODEL_NAME}-{_SPACY_MODEL_VERSION}-py3-none-any.whl"
+)
+_SPACY_MODEL_SHA256: Final[str] = "1932429db727d4bff3deed6b34cfc05df17794f4a52eeb26cf8928f7c1a0fb85"
+
+
+_DOWNLOAD_TIMEOUT_SECONDS: Final[int] = 120
+_INSTALL_LOCK_PATH: Final[str] = os.path.join(
+    tempfile.gettempdir(), f"{_SPACY_MODEL_NAME}.install.lock"
+)
+
+
+def _download_with_timeout(url: str, dest: str) -> None:
+    """Download a URL to a local file with a socket-level timeout."""
+    try:
+        with urllib.request.urlopen(url, timeout=_DOWNLOAD_TIMEOUT_SECONDS) as resp:
+            with open(dest, "wb") as out:
+                shutil.copyfileobj(resp, out)
+    except urllib.error.URLError as exc:
+        raise RuntimeError(
+            f"Failed to download spaCy model from {url}: {exc}. "
+            "Check your network connection and try again."
+        ) from exc
+
+
+def _install_spacy_model() -> None:
+    """Download and install the pinned spaCy model wheel using the `installer` library."""
+    from installer import install
+    from installer.destinations import SchemeDictionaryDestination
+    from installer.sources import WheelFile
+    from installer.utils import get_launcher_kind
+
+    with tempfile.TemporaryDirectory() as tmp:
+        whl_path = os.path.join(tmp, f"{_SPACY_MODEL_NAME}-{_SPACY_MODEL_VERSION}-py3-none-any.whl")
+        logger.info("Downloading spaCy model %s %s …", _SPACY_MODEL_NAME, _SPACY_MODEL_VERSION)
+        _download_with_timeout(_SPACY_MODEL_URL, whl_path)
+
+        with open(whl_path, "rb") as f:
+            sha256 = hashlib.sha256(f.read()).hexdigest()
+        if sha256 != _SPACY_MODEL_SHA256:
+            raise RuntimeError(
+                f"Hash mismatch for {_SPACY_MODEL_NAME}: "
+                f"expected {_SPACY_MODEL_SHA256}, got {sha256}"
+            )
+
+        # Install into a staging directory to avoid races with other processes
+        staging = os.path.join(tmp, "staging")
+        paths = sysconfig.get_paths()
+        staged_paths = paths.copy()
+        staged_paths["purelib"] = staging
+        staged_paths["platlib"] = staging
+
+        destination = SchemeDictionaryDestination(
+            staged_paths,
+            interpreter=sys.executable,
+            script_kind=get_launcher_kind(),
+        )
+        with WheelFile.open(whl_path) as source:
+            install(source=source, destination=destination, additional_metadata={})
+
+        # Move installed packages from staging into real site-packages.
+        # The caller holds _INSTALL_LOCK_PATH so no other process races here.
+        # Any dst that already exists is a remnant of a previous failed install
+        # (spacy.load() just failed), so remove it before moving to avoid
+        # shutil.move placing src *inside* an existing directory.
+        site_packages = paths["purelib"]
+        for item in os.listdir(staging):
+            src = os.path.join(staging, item)
+            dst = os.path.join(site_packages, item)
+            try:
+                if os.path.isdir(dst):
+                    shutil.rmtree(dst)
+                elif os.path.exists(dst):
+                    os.remove(dst)
+                shutil.move(src, dst)
+            except OSError as exc:
+                raise RuntimeError(
+                    f"Failed to install {_SPACY_MODEL_NAME} to {site_packages}: {exc}. "
+                    "Ensure the site-packages directory is writable, or pre-install the model "
+                    f"with: python -m spacy download {_SPACY_MODEL_NAME}"
+                ) from exc
+
+    logger.info("Installed %s %s", _SPACY_MODEL_NAME, _SPACY_MODEL_VERSION)
+
+
+def _load_spacy_model() -> spacy.language.Language:
+    try:
+        return spacy.load(_SPACY_MODEL_NAME)
+    except OSError:
+        pass
+
+    # Serialize model installation across processes with an exclusive file lock.
+    # A well-known path in the system temp dir is visible to all processes
+    # regardless of their working directory.
+    with FileLock(_INSTALL_LOCK_PATH, timeout=-1):
+        # Double-check: another process may have installed while we waited.
+        importlib.invalidate_caches()
+        try:
+            return spacy.load(_SPACY_MODEL_NAME)
+        except OSError:
+            pass
+        _install_spacy_model()
+        importlib.invalidate_caches()
+        try:
+            return spacy.load(_SPACY_MODEL_NAME)
+        except OSError as exc:
+            raise RuntimeError(
+                f"Installed {_SPACY_MODEL_NAME} but spacy.load() still failed. "
+                "Check site-packages permissions and installation integrity."
+            ) from exc
+
+
+@lru_cache(maxsize=1)
+def _get_nlp() -> spacy.language.Language:
+    """Load the spaCy model on first use and cache it for the lifetime of the process."""
+    return _load_spacy_model()
 
 
 def _process(text: str) -> spacy.tokens.Doc:
     """Run the spaCy pipeline once. All public functions extract what they need from the Doc."""
     # -- str() handles numpy.str_ from OCR pipelines --
-    return _nlp(str(text))
+    return _get_nlp()(str(text))
 
 
 def sent_tokenize(text: str) -> List[str]:

uv.lock

Lines changed: 13 additions & 10 deletions
Some generated files are not rendered by default.
