fix: replace NLTK with spaCy to remediate CVE-2025-14009#4252
lawrence-u10d wants to merge 6 commits into `main`
Conversation
Force-pushed e537d8d to c86aacd
NLTK's downloader uses `zipfile.extractall()` without path validation, enabling RCE via malicious packages (CVSS 10.0, no patch available). spaCy models install as pip packages, eliminating the vulnerable downloader entirely.

- Rewrite `unstructured/nlp/tokenize.py` to use spaCy `en_core_web_sm`
- Replace `nltk` with `spacy` + `en-core-web-sm` in requirements
- Update Dockerfiles to install the spaCy model instead of NLTK data
- Update Makefile target `install-nltk-models` → `install-spacy-model`
- Update all CI workflows to remove `NLTK_DATA` env/cache paths
- Delete `typings/nltk/` stubs (no longer needed)
- Rename test `mock_nltk.py` → `mock_nlp.py`

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed c86aacd to c16d567
- Fix `numpy.str_` compatibility in spaCy tokenizer for OCR pipelines
- Add a clear error message when the `en_core_web_sm` model is missing
- Update test expectations for spaCy's contextual POS tagging:
  - "break" correctly tagged as noun in "section break" context
  - "sit" correctly tagged as verb in "Lorem ipsum dolor sit amet"
- Simplify HTML join test to avoid classification sensitivity
- Add `regex` dependency

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Run `_nlp()` once per text input instead of 1+2N times for N sentences. Removes re-tokenization in `_pos_tag` that could alter token boundaries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update expected element types in `spring-weather.html.json` to reflect spaCy's different POS tagging behavior on HTML fragments. These are markup strings where neither tagger produces meaningful classifications.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Consolidates spaCy model installation into a single declaration in `pyproject.toml` with `tool.uv.sources`, removing manual install steps from the Makefile, Dockerfile, CI workflows, and the base-cache action.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix tokenize cache tests to not reference removed internal functions
- Update pptx hierarchy test for Linux x86 spaCy output: "There's" gets tokenized with 's tagged as POS (possessive) instead of VBZ (verb), changing the element from NarrativeText to Title
- Update metrics element type frequency fixtures accordingly

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 2 potential issues.
```python
def _process(text: str) -> spacy.tokens.Doc:
    """Run the spaCy pipeline once. All public functions extract what they need from the Doc."""
    # -- str() handles numpy.str_ from OCR pipelines --
    return _nlp(str(text))
```
Uncached `_process` causes triple spaCy pipeline execution per text
Medium Severity
The `_process` helper runs the full spaCy pipeline (tokenizer, tagger, parser, NER) but is not cached, while `word_tokenize`, `pos_tag`, and `_tokenize_for_cache` each have their own `lru_cache` and call `_process` independently. In the hot path (`is_possible_narrative_text`), all three are called for the same text, so the full spaCy pipeline runs three times per unique text element. NLTK's equivalent calls were orders of magnitude lighter. Adding `@lru_cache` to `_process` (safe since callers only read the `Doc`) would eliminate the redundant pipeline runs.
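A minimal stdlib sketch of the fix Bugbot suggests. The function names mirror the PR, but the bodies are simulated (a counter stands in for the spaCy pipeline), so this illustrates the caching pattern rather than the real code:

```python
from functools import lru_cache

# Counts how many times the "expensive pipeline" actually runs.
calls = {"n": 0}

@lru_cache(maxsize=128)
def _process(text: str) -> tuple[str, ...]:
    calls["n"] += 1  # real code would be: return _nlp(str(text))
    return tuple(text.split())

def word_tokenize(text: str) -> list[str]:
    return list(_process(text))

def pos_tag(text: str) -> list[tuple[str, str]]:
    return [(tok, "NN") for tok in _process(text)]

# Three calls for the same text, as in is_possible_narrative_text...
word_tokenize("Some narrative text")
pos_tag("Some narrative text")
_process("Some narrative text")
print(calls["n"])  # ...but the pipeline ran only once
```

Because `lru_cache` keys on the text argument, all three public entry points share a single `Doc` per unique input instead of recomputing it.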
```diff
     "lxml>=5.0.0, <7.0.0",
-    "nltk>=3.9.2, <4.0.0",
+    "spacy>=3.7.0, <4.0.0",
+    "en-core-web-sm>=3.8.0, <4.0.0",
```
Non-PyPI dependency breaks pip installation of published package
High Severity
`en-core-web-sm` is listed in `[project.dependencies]` but is not published on PyPI: it's only available from GitHub releases. The `[tool.uv.sources]` URL override works for local `uv sync`, but is not propagated to the built wheel's metadata. When the package is published to PyPI (via `twine`), `pip install unstructured` will fail because pip cannot resolve `en-core-web-sm>=3.8.0` from any index. This is a regression from `nltk`, which was fully PyPI-resolvable.
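A sketch of the split Bugbot describes, assuming the layout the PR's commit messages imply (the release URL follows spaCy's published-wheel naming convention and is illustrative):

```toml
[project]
dependencies = [
    "spacy>=3.7.0, <4.0.0",
    # Not on PyPI -- resolvable only through the uv-specific source below,
    # which pip never sees in the built wheel's metadata.
    "en-core-web-sm>=3.8.0, <4.0.0",
]

[tool.uv.sources]
en-core-web-sm = { url = "https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl" }
```

Since `[tool.uv.sources]` is a uv-only table, the wheel published to PyPI carries only the bare `en-core-web-sm>=3.8.0` requirement, which pip cannot resolve from any index.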
I would rather try to get nltk/nltk#3502 in (beg the maintainer 😅)


Summary
- NLTK's downloader uses `zipfile.extractall()` without path validation, which enables RCE via malicious packages. No patch exists.
- Replaces NLTK with spaCy (`en_core_web_sm`) across the entire codebase, eliminating the vulnerable downloader entirely since spaCy models install as pip packages.
- Text classification relies on `contains_verb()`, which checks for `VB*` tags; both NLTK and spaCy use Penn Treebank verb tags, so behavior is preserved.

Changes

- `unstructured/nlp/tokenize.py`: rewritten to use `spacy.load("en_core_web_sm")`
- `requirements/base.in` + `base.txt`: swap `nltk` for `spacy` + `en-core-web-sm`
- `Dockerfile` + `docker/rockylinux-9.2/Dockerfile`: remove `NLTK_DATA`, add `spacy download`
- `Makefile`: rename `install-nltk-models` → `install-spacy-model`
- CI workflows: remove `NLTK_DATA` env vars/cache paths, update make targets
- `typings/nltk/`: deleted (6 stub files)
- `test_unstructured/nlp/mock_nltk.py` → `mock_nlp.py`

Test plan

- `test_unstructured/nlp/test_tokenize.py`
- `test_unstructured/partition/test_text_type.py`
- Verify `contains_verb` behavior matches (Penn Treebank `VB*` tags)

🤖 Generated with Claude Code
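The `VB*` check the summary leans on can be sketched as follows. The signature here is illustrative (it takes pre-tagged tokens; the real `contains_verb` consumes the tagger's output), but the tag test is the Penn Treebank convention both NLTK and spaCy share:

```python
def contains_verb(tagged: list[tuple[str, str]]) -> bool:
    """True if any token carries a Penn Treebank verb tag
    (VB, VBD, VBG, VBN, VBP, VBZ) -- all begin with "VB"."""
    return any(tag.startswith("VB") for _, tag in tagged)

print(contains_verb([("Dogs", "NNS"), ("run", "VBP"), ("fast", "RB")]))  # True
print(contains_verb([("Section", "NN"), ("break", "NN")]))               # False
```

Because the check only inspects the tag prefix, it is indifferent to which tagger produced the tags, which is why swapping NLTK for spaCy preserves classification behavior as long as the taggers agree on verb/non-verb.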
Note
Medium Risk
Touches core NLP tokenization and POS-tagging behavior used in text classification, so small output differences can cascade into partitioning/fixture changes. Dependency/model loading changes may also impact environments that don't install `en_core_web_sm` correctly.

Overview

Replaces NLTK-based tokenization/POS tagging with spaCy (`en_core_web_sm`) to remove reliance on NLTK's downloader and mitigate CVE-2025-14009.

This updates runtime/CI/Docker install flows to stop downloading NLTK data (`NLTK_DATA`, `make install-nltk-models`) and instead ship the spaCy model as a dependency (via `pyproject.toml` + a `uv` source URL), removes the NLTK typing stubs, bumps the version to `0.20.9`, and refreshes tests/expected fixtures to match the new tokenization/classification outputs.

Written by Cursor Bugbot for commit b753ddc. This will update automatically on new commits.