fix: replace NLTK with spaCy to remediate CVE-2025-14009#4252
lawrence-u10d wants to merge 6 commits into `main`
Conversation
Force-pushed e537d8d to c86aacd
NLTK's downloader uses `zipfile.extractall()` without path validation, enabling RCE via malicious packages (CVSS 10.0, no patch available). spaCy models install as pip packages, eliminating the vulnerable downloader entirely.

- Rewrite `unstructured/nlp/tokenize.py` to use spaCy `en_core_web_sm`
- Replace `nltk` with `spacy` + `en-core-web-sm` in requirements
- Update Dockerfiles to install the spaCy model instead of NLTK data
- Update Makefile target `install-nltk-models` → `install-spacy-model`
- Update all CI workflows to remove `NLTK_DATA` env/cache paths
- Delete `typings/nltk/` stubs (no longer needed)
- Rename test `mock_nltk.py` → `mock_nlp.py`

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed c86aacd to c16d567
- Fix `numpy.str_` compatibility in spaCy tokenizer for OCR pipelines
- Add a clear error message when the `en_core_web_sm` model is missing
- Update test expectations for spaCy's contextual POS tagging:
  - "break" correctly tagged as noun in "section break" context
  - "sit" correctly tagged as verb in "Lorem ipsum dolor sit amet"
- Simplify HTML join test to avoid classification sensitivity
- Add `regex` dependency

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Run `_nlp()` once per text input instead of 1+2N times for N sentences. Removes re-tokenization in `_pos_tag` that could alter token boundaries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update expected element types in `spring-weather.html.json` to reflect spaCy's different POS tagging behavior on HTML fragments. These are markup strings where neither tagger produces meaningful classifications.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Consolidates spaCy model installation into a single declaration in `pyproject.toml` with `tool.uv.sources`, removing manual install steps from the Makefile, Dockerfile, CI workflows, and the base-cache action.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix tokenize cache tests to not reference removed internal functions
- Update pptx hierarchy test for Linux x86 spaCy output: "There's" gets tokenized with 's tagged as POS (possessive) instead of VBZ (verb), changing the element from NarrativeText to Title
- Update metrics element type frequency fixtures accordingly

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 2 potential issues.
```python
def _process(text: str) -> spacy.tokens.Doc:
    """Run the spaCy pipeline once. All public functions extract what they need from the Doc."""
    # -- str() handles numpy.str_ from OCR pipelines --
    return _nlp(str(text))
```
Uncached `_process` causes triple spaCy pipeline execution per text
Medium Severity
The `_process` helper runs the full spaCy pipeline (tokenizer, tagger, parser, NER) but is not cached, while `word_tokenize`, `pos_tag`, and `_tokenize_for_cache` each have their own `lru_cache` and call `_process` independently. In the hot path (`is_possible_narrative_text`), all three are called for the same text, so the full spaCy pipeline runs three times per unique text element. NLTK's equivalent calls were orders of magnitude lighter. Adding `@lru_cache` to `_process` (safe since callers only read the `Doc`) would eliminate the redundant pipeline runs.
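A minimal stdlib sketch of the fix Bugbot suggests. The function names mirror the PR, but the bodies are simulated (a counter stands in for the spaCy pipeline), so this illustrates the caching pattern rather than the real code:

```python
from functools import lru_cache

# Counts how many times the "expensive pipeline" actually runs.
calls = {"n": 0}

@lru_cache(maxsize=128)
def _process(text: str) -> tuple[str, ...]:
    calls["n"] += 1  # real code would be: return _nlp(str(text))
    return tuple(text.split())

def word_tokenize(text: str) -> list[str]:
    return list(_process(text))

def pos_tag(text: str) -> list[tuple[str, str]]:
    return [(tok, "NN") for tok in _process(text)]

# Three calls for the same text, as in is_possible_narrative_text...
word_tokenize("Some narrative text")
pos_tag("Some narrative text")
_process("Some narrative text")
print(calls["n"])  # ...but the pipeline ran only once
```

Because `lru_cache` keys on the text argument, all three public entry points share a single `Doc` per unique input instead of recomputing it.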
```diff
     "lxml>=5.0.0, <7.0.0",
-    "nltk>=3.9.2, <4.0.0",
+    "spacy>=3.7.0, <4.0.0",
+    "en-core-web-sm>=3.8.0, <4.0.0",
```
Non-PyPI dependency breaks pip installation of published package
High Severity
`en-core-web-sm` is listed in `[project.dependencies]` but is not published on PyPI: it's only available from GitHub releases. The `[tool.uv.sources]` URL override works for local `uv sync`, but is not propagated to the built wheel's metadata. When the package is published to PyPI (via `twine`), `pip install unstructured` will fail because pip cannot resolve `en-core-web-sm>=3.8.0` from any index. This is a regression from `nltk`, which was fully PyPI-resolvable.
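A sketch of the split Bugbot describes, assuming the layout the PR's commit messages imply (the release URL follows spaCy's published-wheel naming convention and is illustrative):

```toml
[project]
dependencies = [
    "spacy>=3.7.0, <4.0.0",
    # Not on PyPI -- resolvable only through the uv-specific source below,
    # which pip never sees in the built wheel's metadata.
    "en-core-web-sm>=3.8.0, <4.0.0",
]

[tool.uv.sources]
en-core-web-sm = { url = "https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl" }
```

Since `[tool.uv.sources]` is a uv-only table, the wheel published to PyPI carries only the bare `en-core-web-sm>=3.8.0` requirement, which pip cannot resolve from any index.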
I would rather try to get nltk/nltk#3502 in (beg the maintainer 😅)


Summary
- NLTK's downloader uses `zipfile.extractall()` without path validation, which enables RCE via malicious packages. No patch exists.
- Replaces NLTK with spaCy (`en_core_web_sm`) across the entire codebase, eliminating the vulnerable downloader entirely since spaCy models install as pip packages.
- Text classification relies on `contains_verb()`, which checks for `VB*` tags; both NLTK and spaCy use Penn Treebank verb tags, so behavior is preserved.

Changes

- `unstructured/nlp/tokenize.py`: rewritten to use `spacy.load("en_core_web_sm")`
- `requirements/base.in` + `base.txt`: swap `nltk` for `spacy` + `en-core-web-sm`
- `Dockerfile` + `docker/rockylinux-9.2/Dockerfile`: remove `NLTK_DATA`, add `spacy download`
- `Makefile`: rename `install-nltk-models` → `install-spacy-model`
- CI workflows: remove `NLTK_DATA` env vars/cache paths, update make targets
- `typings/nltk/`: deleted (6 stub files)
- `test_unstructured/nlp/mock_nltk.py` → `mock_nlp.py`

Test plan

- `test_unstructured/nlp/test_tokenize.py`
- `test_unstructured/partition/test_text_type.py`
- Verify `contains_verb` behavior matches (Penn Treebank `VB*` tags)

🤖 Generated with Claude Code
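The `VB*` check the summary leans on can be sketched as follows. The signature here is illustrative (it takes pre-tagged tokens; the real `contains_verb` consumes the tagger's output), but the tag test is the Penn Treebank convention both NLTK and spaCy share:

```python
def contains_verb(tagged: list[tuple[str, str]]) -> bool:
    """True if any token carries a Penn Treebank verb tag
    (VB, VBD, VBG, VBN, VBP, VBZ) -- all begin with "VB"."""
    return any(tag.startswith("VB") for _, tag in tagged)

print(contains_verb([("Dogs", "NNS"), ("run", "VBP"), ("fast", "RB")]))  # True
print(contains_verb([("Section", "NN"), ("break", "NN")]))               # False
```

Because the check only inspects the tag prefix, it is indifferent to which tagger produced the tags, which is why swapping NLTK for spaCy preserves classification behavior as long as the taggers agree on verb/non-verb.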
Note
Medium Risk
Touches core NLP tokenization and POS-tagging behavior used in text classification, so small output differences can cascade into partitioning/fixture changes. Dependency/model loading changes may also impact environments that don't install `en_core_web_sm` correctly.

Overview

Replaces NLTK-based tokenization/POS tagging with spaCy (`en_core_web_sm`) to remove reliance on NLTK's downloader and mitigate CVE-2025-14009.

This updates runtime/CI/Docker install flows to stop downloading NLTK data (`NLTK_DATA`, `make install-nltk-models`) and instead ship the spaCy model as a dependency (via `pyproject.toml` + a `uv` source URL), removes the NLTK typing stubs, bumps the version to `0.20.9`, and refreshes tests/expected fixtures to match the new tokenization/classification outputs.

Written by Cursor Bugbot for commit b753ddc. This will update automatically on new commits.