Skip to content

mem: exclude unused spaCy pipeline components to reduce model memory#4296

Merged
cragwolfe merged 4 commits intoUnstructured-IO:mainfrom
KRRT7:mem/spacy-exclude-unused
Mar 31, 2026
Merged

mem: exclude unused spaCy pipeline components to reduce model memory#4296
cragwolfe merged 4 commits intoUnstructured-IO:mainfrom
KRRT7:mem/spacy-exclude-unused

Conversation

@KRRT7
Copy link
Copy Markdown
Collaborator

@KRRT7 KRRT7 commented Mar 24, 2026

Only tok2vec, tagger, and sentence splitting are used (pos_tag and sent_tokenize). Exclude ner, lemmatizer, and attribute_ruler when loading en_core_web_sm, keeping parser for accurate sentence boundary detection. Saves ~14 MiB peak memory per process.

Benchmark

Azure Standard_D8s_v5 — 8 vCPU Intel Xeon Platinum 8473C, 32 GiB RAM, Python 3.12.12

test_benchmark_load_spacy_model

Ref Peak Memory Allocations Delta
b6cf510684e5 (base) 53.3 MiB 179
a3172f8eb66b (head) 39.3 MiB 187 -26%

test_benchmark_spacy_nlp_pipeline

Ref Peak Memory Allocations Delta
b6cf510684e5 (base) 54.9 MiB 186
a3172f8eb66b (head) 42.1 MiB 139 -23%

Generated by codeflash compare

Reproduce the benchmark locally
# Full comparison (timing + memory):
uv run codeflash compare b6cf510684e594d6c18e19129b6b8da668072b2d a3172f8eb66bb42674fbc70e18b49f4dbe1dc30b --memory \
  --inject benchmarks/test_benchmark_spacy_load.py \
  --inject benchmarks/__init__.py \
  --inject pyproject.toml
Benchmark test source
"""Benchmark for spaCy model loading with component exclusion.

PR #4296 excludes ner, lemmatizer, and attribute_ruler when loading en_core_web_sm.
This benchmark exercises _load_spacy_model() which is the function that applies the
exclude list. It also runs the downstream NLP functions (pos_tag, sent_tokenize,
word_tokenize) to verify the pipeline still works correctly with excluded components.

Uses real spaCy — no mocking. The en_core_web_sm model must be installed.
"""

from __future__ import annotations

import importlib

import pytest


# Clear the lru_cache between runs so each invocation triggers a fresh model load,
# exercising the actual spacy.load() call with or without the exclude list.
@pytest.fixture(autouse=True)
def _clear_spacy_cache():
    """Reset cached spaCy model so each benchmark invocation reloads from disk."""
    from unstructured.nlp import tokenize

    tokenize._get_nlp.cache_clear()
    yield
    tokenize._get_nlp.cache_clear()


# Realistic text samples for downstream NLP functions
_SAMPLE_TEXTS = [
    "The quick brown fox jumps over the lazy dog. It was a sunny afternoon.",
    "Unstructured provides open-source tools for ingesting and pre-processing "
    "images and text documents. The library supports PDF, Word, HTML, and more.",
    "Natural language processing enables computers to understand human language. "
    "Applications include sentiment analysis, named entity recognition, and translation.",
    "In 2024, the global AI market was valued at $196 billion. Major players include "
    "Google, Microsoft, and OpenAI. The sector is expected to grow at 37% CAGR.",
    "Dr. Smith arrived at 3:00 p.m. and reviewed the patient's records. She noted "
    "that the blood pressure was 120/80 mmHg — within the normal range.",
]


def test_benchmark_load_spacy_model():
    """Benchmark _load_spacy_model — the function affected by the exclude= parameter."""
    from unstructured.nlp.tokenize import _load_spacy_model

    nlp = _load_spacy_model()
    assert nlp is not None


def test_benchmark_spacy_nlp_pipeline():
    """Benchmark the full NLP pipeline: load model + process text through pos_tag,
    sent_tokenize, and word_tokenize to confirm excluded components don't break
    downstream usage and to capture end-to-end memory impact."""
    from unstructured.nlp.tokenize import pos_tag, sent_tokenize, word_tokenize

    for text in _SAMPLE_TEXTS:
        tags = pos_tag(text)
        sents = sent_tokenize(text)
        tokens = word_tokenize(text)
        assert len(tags) > 0
        assert len(sents) > 0
        assert len(tokens) > 0

Test plan

  • codeflash compare --memory confirms -26% peak memory on _load_spacy_model (53.3 to 39.3 MiB)
  • Full NLP pipeline (pos_tag + sent_tokenize + word_tokenize) shows -23% peak memory (54.9 to 42.1 MiB)
  • pos_tag, sent_tokenize, word_tokenize all produce correct results with excluded components

Copy link
Copy Markdown
Collaborator

@badGarnet badGarnet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The trade-off — sentence splitting quality:

  • Currently sent_tokenize() (line 173) gets sentence boundaries from the parser (dependency-parse-based, more accurate).
  • After this change, it uses the sentencizer (rule-based, splits on punctuation like .?!).
  • This is less accurate for edge cases (abbreviations like "Dr. Smith", numbered lists, etc.) but faster and lighter.

I think this is why we see the ingest test failure (some minor changes). I would put parser back just to be safe.

KRRT7 added 3 commits March 27, 2026 13:51
Only tok2vec, tagger, and sentence splitting are used (pos_tag and
sent_tokenize). Exclude ner, parser, lemmatizer, attribute_ruler when
loading en_core_web_sm, and add lightweight sentencizer to replace the
dependency parser for sentence boundary detection.

Saves ~12 MiB of model weights per process.
Per review feedback, removing parser and using sentencizer causes
sentence splitting regressions. Keep parser loaded, only exclude
ner, lemmatizer, and attribute_ruler.
@KRRT7 KRRT7 force-pushed the mem/spacy-exclude-unused branch from 2291c23 to 23c4fff Compare March 27, 2026 18:53
@KRRT7
Copy link
Copy Markdown
Collaborator Author

KRRT7 commented Mar 27, 2026

Good call — updated to keep parser loaded for accurate sentence boundaries. Now only excluding ner, lemmatizer, and attribute_ruler. Memory savings drop from ~12.7 MiB to ~7 MiB but we avoid the sentence splitting regression.

Also rebased onto main and bumped to 0.22.9.

@cragwolfe cragwolfe enabled auto-merge March 31, 2026 17:52
@cragwolfe cragwolfe added this pull request to the merge queue Mar 31, 2026
Merged via the queue into Unstructured-IO:main with commit a3172f8 Mar 31, 2026
53 of 54 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants