
Fix: replace nltk with spacy CVE 2025 14009 #4255

Merged: badGarnet merged 11 commits into main from fix/replace-nltk-with-spacy-cve-2025-14009 on Feb 22, 2026

@badGarnet
Collaborator

No description provided.

lawrence-u10d and others added 11 commits February 20, 2026 10:42
NLTK's downloader uses zipfile.extractall() without path validation,
enabling RCE via malicious packages (CVSS 10.0, no patch available).
spaCy models install as pip packages, eliminating the vulnerable
downloader entirely.

- Rewrite unstructured/nlp/tokenize.py to use spaCy en_core_web_sm
- Replace nltk with spacy + en-core-web-sm in requirements
- Update Dockerfiles to install spaCy model instead of NLTK data
- Update Makefile target install-nltk-models → install-spacy-model
- Update all CI workflows to remove NLTK_DATA env/cache paths
- Delete typings/nltk/ stubs (no longer needed)
- Rename test mock_nltk.py → mock_nlp.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
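The commit message above attributes the vulnerability to archive extraction without path validation. As a generic illustration of what such validation can look like, the sketch below lists archive members whose resolved path would fall outside the destination directory. This is not NLTK's or unstructured's code; `find_unsafe_members` and `build_demo_archive` are hypothetical helper names introduced only for this example.

```python
import io
import os
import tempfile
import zipfile


def find_unsafe_members(archive: zipfile.ZipFile, dest_dir: str) -> list[str]:
    """Return member names that would resolve outside dest_dir (hypothetical helper)."""
    dest_root = os.path.realpath(dest_dir)
    unsafe = []
    for name in archive.namelist():
        target = os.path.realpath(os.path.join(dest_root, name))
        # A safe target has dest_root as the common path it shares with the root.
        if os.path.commonpath([dest_root, target]) != dest_root:
            unsafe.append(name)
    return unsafe


def build_demo_archive() -> zipfile.ZipFile:
    """Build an in-memory archive with one benign entry and one traversal entry."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("corpora/words.txt", "benign payload")
        zf.writestr("../escape.txt", "would land outside the destination")
    buffer.seek(0)
    return zipfile.ZipFile(buffer)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as dest:
        # Only the traversal entry should be reported.
        print(find_unsafe_members(build_demo_archive(), dest))
```

A caller could refuse to extract whenever the returned list is non-empty. The commit takes a different route: it removes the downloader altogether by installing the model as a pip package.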
- Fix numpy.str_ compatibility in spaCy tokenizer for OCR pipelines
- Add clear error message when en_core_web_sm model is missing
- Update test expectations for spaCy's contextual POS tagging:
  - "break" correctly tagged as noun in "section break" context
  - "sit" correctly tagged as verb in "Lorem ipsum dolor sit amet"
  - Simplify HTML join test to avoid classification sensitivity
- Add regex dependency

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
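The "clear error message" item in the commit above can be sketched with the standard library alone. The sketch assumes the model is importable as the package `en_core_web_sm`, which holds for pip-installed spaCy models. The function name `require_model` and the message wording are hypothetical, not taken from `unstructured/nlp/tokenize.py`.

```python
import importlib.util


def require_model(package_name: str = "en_core_web_sm") -> None:
    """Raise a descriptive error if the model package cannot be imported.

    Hypothetical helper: checks for the package without importing it.
    """
    if importlib.util.find_spec(package_name) is None:
        raise RuntimeError(
            f"spaCy model '{package_name}' is not installed. "
            f"Install it as a package, e.g. `python -m spacy download {package_name}`."
        )


if __name__ == "__main__":
    try:
        require_model()
        print("model package found")
    except RuntimeError as exc:
        print(exc)
```

Checking up front lets the failure name the missing package and a remedy, rather than surfacing later as a bare import or load error deep inside a partitioning call.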
Run _nlp() once per text input instead of 1+2N times for N sentences.
Removes re-tokenization in _pos_tag that could alter token boundaries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
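The call-count arithmetic in the commit above (1+2N pipeline runs versus 1) can be illustrated without spaCy. In the sketch below, `CountingPipeline` is a hypothetical stand-in for `_nlp()` that splits on periods and gives every token the dummy tag `X`. Only the call pattern is meant to mirror the change. The "before" breakdown (one run to split sentences, then one tokenize run and one tag run per sentence) is an assumption about how the 1+2N figure arose.

```python
class CountingPipeline:
    """Hypothetical stand-in for a parser; counts how often it is invoked."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, text: str) -> list[list[tuple[str, str]]]:
        self.calls += 1
        sentences = [s.strip() for s in text.split(".") if s.strip()]
        # Each sentence becomes a list of (token, tag) pairs with a dummy tag.
        return [[(tok, "X") for tok in s.split()] for s in sentences]


def tag_repeatedly(nlp: CountingPipeline, text: str):
    """Assumed 'before' pattern: 1 run to split, then 2 runs per sentence."""
    sentences = [" ".join(tok for tok, _ in sent) for sent in nlp(text)]
    tagged = []
    for sentence in sentences:
        nlp(sentence)                    # re-run to tokenize the sentence
        tagged.append(nlp(sentence)[0])  # re-run again to tag it
    return tagged


def tag_once(nlp: CountingPipeline, text: str):
    """'After' pattern: one run; sentences and tags come from the same parse."""
    return nlp(text)


if __name__ == "__main__":
    text = "First sentence here. Second one. Third."
    before, after = CountingPipeline(), CountingPipeline()
    assert tag_repeatedly(before, text) == tag_once(after, text)
    print(before.calls, after.calls)
```

For this three-sentence input the stub runs 7 times in the first pattern and once in the second. Reading sentences and tags from one parse also means token boundaries cannot drift between steps, which is the second point the commit message makes.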
Update expected element types in spring-weather.html.json to reflect
spaCy's different POS tagging behavior on HTML fragments. These are
markup strings where neither tagger produces meaningful classifications.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Consolidates spaCy model installation into a single declaration in
pyproject.toml with tool.uv.sources, removing manual install steps
from Makefile, Dockerfile, CI workflows, and base-cache action.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix tokenize cache tests to not reference removed internal functions
- Update pptx hierarchy test for Linux x86 spaCy output: "There's" gets
  tokenized with 's as POS (possessive) instead of VBZ (verb), changing
  the element from NarrativeText to Title
- Update metrics element type frequency fixtures accordingly

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
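The pptx fixture change above turns on a single tag. Assuming, as a simplification, that narrative text is told apart from a title partly by whether any token carries a verb tag, the sketch below shows the effect of `'s` being tagged `POS` rather than `VBZ`. The verb tags are the standard Penn Treebank ones. The `classify` rule and the hand-written tags are hypothetical and much simpler than the library's actual checks.

```python
PENN_VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}


def contains_verb(tagged_tokens):
    """True if any (token, tag) pair carries a Penn Treebank verb tag."""
    return any(tag in PENN_VERB_TAGS for _, tag in tagged_tokens)


def classify(tagged_tokens):
    """Toy rule (assumption): a verb means NarrativeText, otherwise Title."""
    return "NarrativeText" if contains_verb(tagged_tokens) else "Title"


# Hand-written tags for "There's a plan"; not real tagger output.
as_verb = [("There", "EX"), ("'s", "VBZ"), ("a", "DT"), ("plan", "NN")]
as_possessive = [("There", "EX"), ("'s", "POS"), ("a", "DT"), ("plan", "NN")]

if __name__ == "__main__":
    print(classify(as_verb), classify(as_possessive))
```

Under this toy rule the same four tokens classify differently depending only on the tag assigned to `'s`, which is the kind of shift the updated fixtures record.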
…pdate (#4256)

This pull request includes updated ingest test fixtures.
Please review and merge if appropriate.

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Low Risk**
> Only golden test artifacts change (no runtime code paths), with risk limited to masking unintended output regressions if the new fixtures are incorrect.
> 
> **Overview**
> Updates ingest *expected output fixtures* (HTML + JSON) to reflect new document parsing/classification results.
> 
> Across multiple fixture sets (Azure HTML, PDF reprocess, and multilingual UDHR text), element `type`/HTML tag assignments shift (notably `UncategorizedText` ↔ `NarrativeText`, and some `p` ↔ `h1`), and composite element boundaries/IDs change (e.g., `multi-column-2p.pdf` splits/moves the DPR GitHub link into its own element).
> 
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit c5047da.</sup>
<!-- /CURSOR_SUMMARY -->

Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
@badGarnet badGarnet marked this pull request as ready for review February 22, 2026 18:20
@badGarnet badGarnet added this pull request to the merge queue Feb 22, 2026
Merged via the queue into main with commit 3db7b4f Feb 22, 2026
51 checks passed
@badGarnet badGarnet deleted the fix/replace-nltk-with-spacy-cve-2025-14009 branch February 22, 2026 18:48
3 participants