fix: replace NLTK with spaCy to remediate CVE-2025-14009 #4252

Closed
lawrence-u10d wants to merge 6 commits into main from fix/replace-nltk-with-spacy-cve-2025-14009

lawrence-u10d (Contributor) commented Feb 20, 2026

Summary

  • CVE-2025-14009 (CVSS 10.0) affects NLTK's downloader — zipfile.extractall() without path validation enables RCE via malicious packages. No patch exists.
  • Replaces NLTK with spaCy (en_core_web_sm) across the entire codebase, eliminating the vulnerable downloader entirely since spaCy models install as pip packages.
  • The only POS-tag consumer is contains_verb() which checks for VB* tags — both NLTK and spaCy use Penn Treebank verb tags, so behavior is preserved.
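
The first bullet attributes the issue to archive extraction without path validation. As a rough illustration only, the sketch below shows one generic way to validate member paths before extracting. It is not NLTK's code and not the fix proposed upstream; `safe_extract` is a hypothetical helper name, and the check covers only paths that resolve outside the destination directory.

```python
import os
import zipfile


def safe_extract(archive: zipfile.ZipFile, dest_dir: str) -> None:
    """Extract `archive` into `dest_dir`, refusing members that resolve outside it.

    Hypothetical helper for illustration; not part of NLTK or this PR.
    """
    root = os.path.realpath(dest_dir)
    for member in archive.namelist():
        # Resolve where this member would land on disk.
        target = os.path.realpath(os.path.join(root, member))
        # Reject anything whose resolved path is not inside the destination root.
        if os.path.commonpath([root, target]) != root:
            raise ValueError(f"unsafe path in archive: {member!r}")
    # Every member was checked, so extraction stays inside `root`.
    archive.extractall(root)
```

A check like this only constrains where files are written. It does not make the contents of a downloaded package trustworthy, which is part of why the PR removes the downloader instead.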

Changes

  • unstructured/nlp/tokenize.py — rewritten to use spacy.load("en_core_web_sm")
  • requirements/base.in + base.txt — swap nltk for spacy + en-core-web-sm
  • Dockerfile + docker/rockylinux-9.2/Dockerfile — remove NLTK_DATA, add spacy download
  • Makefile — rename install-nltk-models → install-spacy-model
  • All CI workflows — remove NLTK_DATA env vars/cache paths, update make targets
  • typings/nltk/ — deleted (6 stub files)
  • test_unstructured/nlp/mock_nltk.py → mock_nlp.py

Test plan

  • CI passes: test_unstructured/nlp/test_tokenize.py
  • CI passes: test_unstructured/partition/test_text_type.py
  • Docker image builds successfully
  • Verify contains_verb behavior matches (Penn Treebank VB* tags)
  • If any POS tag edge-case diffs surface in tests, update expected values (behavioral parity, not a bug)

🤖 Generated with Claude Code


Note

Medium Risk
Touches core NLP tokenization and POS-tagging behavior used in text classification, so small output differences can cascade into partitioning/fixture changes. Dependency/model loading changes may also impact environments that don’t install en_core_web_sm correctly.

Overview
Replaces NLTK-based tokenization/POS tagging with spaCy (en_core_web_sm) to remove reliance on NLTK’s downloader and mitigate CVE-2025-14009.

This updates runtime/CI/Docker install flows to stop downloading NLTK data (NLTK_DATA, make install-nltk-models) and instead ship the spaCy model as a dependency (via pyproject.toml + uv source URL), removes NLTK typing stubs, bumps to 0.20.9, and refreshes tests/expected fixtures to match the new tokenization/classification outputs.

Written by Cursor Bugbot for commit b753ddc.

lawrence-u10d force-pushed the fix/replace-nltk-with-spacy-cve-2025-14009 branch from e537d8d to c86aacd, February 20, 2026 16:39
NLTK's downloader uses zipfile.extractall() without path validation,
enabling RCE via malicious packages (CVSS 10.0, no patch available).
spaCy models install as pip packages, eliminating the vulnerable
downloader entirely.

- Rewrite unstructured/nlp/tokenize.py to use spaCy en_core_web_sm
- Replace nltk with spacy + en-core-web-sm in requirements
- Update Dockerfiles to install spaCy model instead of NLTK data
- Update Makefile target install-nltk-models → install-spacy-model
- Update all CI workflows to remove NLTK_DATA env/cache paths
- Delete typings/nltk/ stubs (no longer needed)
- Rename test mock_nltk.py → mock_nlp.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
lawrence-u10d force-pushed the fix/replace-nltk-with-spacy-cve-2025-14009 branch from c86aacd to c16d567, February 20, 2026 16:42
Comment thread unstructured/nlp/tokenize.py
Comment thread unstructured/nlp/tokenize.py Outdated
lawrence-u10d and others added 2 commits February 20, 2026 11:53
- Fix numpy.str_ compatibility in spaCy tokenizer for OCR pipelines
- Add clear error message when en_core_web_sm model is missing
- Update test expectations for spaCy's contextual POS tagging:
  - "break" correctly tagged as noun in "section break" context
  - "sit" correctly tagged as verb in "Lorem ipsum dolor sit amet"
  - Simplify HTML join test to avoid classification sensitivity
- Add regex dependency

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
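
The "clear error message when en_core_web_sm model is missing" item in the commit above could look roughly like the following. This is a sketch under assumptions, not the code in the PR: `load_model` and its injectable `loader` parameter are hypothetical, the wording of the message is invented, and it relies on `spacy.load` raising `OSError` when a model package cannot be found.

```python
from typing import Any, Callable, Optional

MODEL_NAME = "en_core_web_sm"


def load_model(
    name: str = MODEL_NAME,
    loader: Optional[Callable[[str], Any]] = None,
) -> Any:
    """Load a spaCy model, turning a missing-model failure into an actionable error.

    `loader` is a hypothetical injection point so the sketch can be exercised
    without spaCy installed; by default it falls back to `spacy.load`.
    """
    if loader is None:
        import spacy  # imported lazily, only when no loader is supplied

        loader = spacy.load
    try:
        return loader(name)
    except OSError as exc:
        # spacy.load raises OSError when the model package is not installed.
        raise RuntimeError(
            f"spaCy model {name!r} is not installed. "
            f"Install it, for example with: python -m spacy download {name}"
        ) from exc
```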
Run _nlp() once per text input instead of 1+2N times for N sentences.
Removes re-tokenization in _pos_tag that could alter token boundaries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread test_unstructured/nlp/test_tokenize.py
lawrence-u10d and others added 3 commits February 20, 2026 12:22
Update expected element types in spring-weather.html.json to reflect
spaCy's different POS tagging behavior on HTML fragments. These are
markup strings where neither tagger produces meaningful classifications.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Consolidates spaCy model installation into a single declaration in
pyproject.toml with tool.uv.sources, removing manual install steps
from Makefile, Dockerfile, CI workflows, and base-cache action.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix tokenize cache tests to not reference removed internal functions
- Update pptx hierarchy test for Linux x86 spaCy output: "There's" gets
  tokenized with 's as POS (possessive) instead of VBZ (verb), changing
  the element from NarrativeText to Title
- Update metrics element type frequency fixtures accordingly

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
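
To see why the `'s` tagging difference described above flips a classification, here is a minimal sketch of a VB*-prefix check over (token, tag) pairs. `has_verb_tag` is a hypothetical stand-in for the check inside `contains_verb()`, and the `EX` tag given to "There" is an illustrative assumption, not output copied from either tagger.

```python
from typing import Iterable, Tuple


def has_verb_tag(tagged: Iterable[Tuple[str, str]]) -> bool:
    """Return True if any Penn Treebank tag starts with "VB" (VB, VBD, VBG, VBN, VBP, VBZ)."""
    return any(tag.startswith("VB") for _, tag in tagged)


# Same surface text, two taggings of the clitic (tags here are illustrative).
as_verb = [("There", "EX"), ("'s", "VBZ")]        # 's read as "is"
as_possessive = [("There", "EX"), ("'s", "POS")]  # 's read as a possessive marker

print(has_verb_tag(as_verb))        # True
print(has_verb_tag(as_possessive))  # False
```

With no VB* tag present, a text no longer qualifies as containing a verb, which is consistent with the pptx element moving from NarrativeText to Title.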

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

def _process(text: str) -> spacy.tokens.Doc:
    """Run the spaCy pipeline once. All public functions extract what they need from the Doc."""
    # -- str() handles numpy.str_ from OCR pipelines --
    return _nlp(str(text))

Uncached _process causes triple spaCy pipeline execution per text

Medium Severity

The _process helper runs the full spaCy pipeline (tokenizer, tagger, parser, NER) but is not cached, while word_tokenize, pos_tag, and _tokenize_for_cache each have their own lru_cache and call _process independently. In the hot path (is_possible_narrative_text), all three are called for the same text, so the full spaCy pipeline runs three times per unique text element. NLTK's equivalent calls were orders of magnitude lighter. Adding @lru_cache to _process (safe since callers only read the Doc) would eliminate the redundant pipeline runs.

Comment thread pyproject.toml
"lxml>=5.0.0, <7.0.0",
"nltk>=3.9.2, <4.0.0",
"spacy>=3.7.0, <4.0.0",
"en-core-web-sm>=3.8.0, <4.0.0",

Non-PyPI dependency breaks pip installation of published package

High Severity

en-core-web-sm is listed in [project.dependencies] but is not published on PyPI — it's only available from GitHub releases. The [tool.uv.sources] URL override works for local uv sync, but is not propagated to the built wheel's metadata. When the package is published to PyPI (via twine), pip install unstructured will fail because pip cannot resolve en-core-web-sm>=3.8.0 from any index. This is a regression from nltk, which was fully PyPI-resolvable.

@badGarnet (Collaborator)

I would rather try to get nltk/nltk#3502 in (beg the maintainer 😅 )
spacy adds a lot of complexity into the code and it is a relatively massive package

@lawrence-u10d lawrence-u10d deleted the fix/replace-nltk-with-spacy-cve-2025-14009 branch February 20, 2026 21:07
@lawrence-u10d lawrence-u10d restored the fix/replace-nltk-with-spacy-cve-2025-14009 branch February 22, 2026 16:36