mem: exclude unused spaCy pipeline components to reduce model memory by KRRT7 · Pull Request #4296 · Unstructured-IO/unstructured

KRRT7 · 2026-03-24T12:31:59Z

Only tok2vec, tagger, and sentence splitting are used (pos_tag and sent_tokenize). Exclude ner, lemmatizer, and attribute_ruler when loading en_core_web_sm, keeping parser for accurate sentence boundary detection. Saves ~14 MiB peak memory per process.

Benchmark

Azure Standard_D8s_v5 — 8 vCPU Intel Xeon Platinum 8473C, 32 GiB RAM, Python 3.12.12

`test_benchmark_load_spacy_model`

Ref	Peak Memory	Allocations	Delta
`b6cf510684e5` (base)	53.3 MiB	179
`a3172f8eb66b` (head)	39.3 MiB	187	-26%

`test_benchmark_spacy_nlp_pipeline`

Ref	Peak Memory	Allocations	Delta
`b6cf510684e5` (base)	54.9 MiB	186
`a3172f8eb66b` (head)	42.1 MiB	139	-23%

Generated by codeflash compare

Reproduce the benchmark locally

# Full comparison (timing + memory):
uv run codeflash compare b6cf510684e594d6c18e19129b6b8da668072b2d a3172f8eb66bb42674fbc70e18b49f4dbe1dc30b --memory \
  --inject benchmarks/test_benchmark_spacy_load.py \
  --inject benchmarks/__init__.py \
  --inject pyproject.toml

Benchmark test source

"""Benchmark for spaCy model loading with component exclusion.

PR #4296 excludes ner, lemmatizer, and attribute_ruler when loading en_core_web_sm.
This benchmark exercises _load_spacy_model() which is the function that applies the
exclude list. It also runs the downstream NLP functions (pos_tag, sent_tokenize,
word_tokenize) to verify the pipeline still works correctly with excluded components.

Uses real spaCy — no mocking. The en_core_web_sm model must be installed.
"""

from __future__ import annotations

import importlib

import pytest


# Clear the lru_cache between runs so each invocation triggers a fresh model load,
# exercising the actual spacy.load() call with or without the exclude list.
@pytest.fixture(autouse=True)
def _clear_spacy_cache():
    """Reset cached spaCy model so each benchmark invocation reloads from disk."""
    from unstructured.nlp import tokenize

    tokenize._get_nlp.cache_clear()
    yield
    tokenize._get_nlp.cache_clear()


# Realistic text samples for downstream NLP functions
_SAMPLE_TEXTS = [
    "The quick brown fox jumps over the lazy dog. It was a sunny afternoon.",
    "Unstructured provides open-source tools for ingesting and pre-processing "
    "images and text documents. The library supports PDF, Word, HTML, and more.",
    "Natural language processing enables computers to understand human language. "
    "Applications include sentiment analysis, named entity recognition, and translation.",
    "In 2024, the global AI market was valued at $196 billion. Major players include "
    "Google, Microsoft, and OpenAI. The sector is expected to grow at 37% CAGR.",
    "Dr. Smith arrived at 3:00 p.m. and reviewed the patient's records. She noted "
    "that the blood pressure was 120/80 mmHg — within the normal range.",
]


def test_benchmark_load_spacy_model():
    """Benchmark _load_spacy_model — the function affected by the exclude= parameter."""
    from unstructured.nlp.tokenize import _load_spacy_model

    nlp = _load_spacy_model()
    assert nlp is not None


def test_benchmark_spacy_nlp_pipeline():
    """Benchmark the full NLP pipeline: load model + process text through pos_tag,
    sent_tokenize, and word_tokenize to confirm excluded components don't break
    downstream usage and to capture end-to-end memory impact."""
    from unstructured.nlp.tokenize import pos_tag, sent_tokenize, word_tokenize

    for text in _SAMPLE_TEXTS:
        tags = pos_tag(text)
        sents = sent_tokenize(text)
        tokens = word_tokenize(text)
        assert len(tags) > 0
        assert len(sents) > 0
        assert len(tokens) > 0

Test plan

codeflash compare --memory confirms -26% peak memory on _load_spacy_model (53.3 to 39.3 MiB)
Full NLP pipeline (pos_tag + sent_tokenize + word_tokenize) shows -23% peak memory (54.9 to 42.1 MiB)
pos_tag, sent_tokenize, word_tokenize all produce correct results with excluded components

badGarnet

The trade-off — sentence splitting quality:

Currently sent_tokenize() (line 173) gets sentence boundaries from the parser (dependency-parse-based, more accurate).
After this change, it uses the sentencizer (rule-based, splits on punctuation like .?!).
This is less accurate for edge cases (abbreviations like "Dr. Smith", numbered lists, etc.) but faster and lighter.

I think this is why we see the ingest test failure (some minor changes). I would put parser back just to be safe.

Only tok2vec, tagger, and sentence splitting are used (pos_tag and sent_tokenize). Exclude ner, parser, lemmatizer, attribute_ruler when loading en_core_web_sm, and add lightweight sentencizer to replace the dependency parser for sentence boundary detection. Saves ~12 MiB of model weights per process.

Per review feedback, removing parser and using sentencizer causes sentence splitting regressions. Keep parser loaded, only exclude ner, lemmatizer, and attribute_ruler.

KRRT7 · 2026-03-27T18:53:40Z

Good call — updated to keep parser loaded for accurate sentence boundaries. Now only excluding ner, lemmatizer, and attribute_ruler. Memory savings drop from ~12.7 MiB to ~7 MiB but we avoid the sentence splitting regression.

Also rebased onto main and bumped to 0.22.9.

badGarnet reviewed Mar 27, 2026

View reviewed changes

KRRT7 added 3 commits March 27, 2026 13:51

Bump version to 0.22.3 for changelog CI check

b565a50

fix: keep parser for accurate sentence boundaries

23c4fff

Per review feedback, removing parser and using sentencizer causes sentence splitting regressions. Keep parser loaded, only exclude ner, lemmatizer, and attribute_ruler.

KRRT7 force-pushed the mem/spacy-exclude-unused branch from 2291c23 to 23c4fff Compare March 27, 2026 18:53

merge: resolve conflicts with main, bump to 0.22.11

94fd162

cragwolfe enabled auto-merge March 31, 2026 17:52

badGarnet approved these changes Mar 31, 2026

View reviewed changes

cragwolfe added this pull request to the merge queue Mar 31, 2026

cragwolfe approved these changes Mar 31, 2026

View reviewed changes

Merged via the queue into Unstructured-IO:main with commit a3172f8 Mar 31, 2026
53 of 54 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mem: exclude unused spaCy pipeline components to reduce model memory#4296

mem: exclude unused spaCy pipeline components to reduce model memory#4296
cragwolfe merged 4 commits intoUnstructured-IO:mainfrom
KRRT7:mem/spacy-exclude-unused

KRRT7 commented Mar 24, 2026 •

edited

Loading

Uh oh!

badGarnet left a comment

Uh oh!

KRRT7 commented Mar 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

KRRT7 commented Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmark

Azure Standard_D8s_v5 — 8 vCPU Intel Xeon Platinum 8473C, 32 GiB RAM, Python 3.12.12

test_benchmark_load_spacy_model

test_benchmark_spacy_nlp_pipeline

Test plan

Uh oh!

badGarnet left a comment

Choose a reason for hiding this comment

Uh oh!

KRRT7 commented Mar 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

KRRT7 commented Mar 24, 2026 •

edited

Loading

`test_benchmark_load_spacy_model`

`test_benchmark_spacy_nlp_pipeline`