mem: exclude unused spaCy pipeline components to reduce model memory #4296
Merged
cragwolfe merged 4 commits into Unstructured-IO:main on Mar 31, 2026
Conversation
badGarnet (Collaborator) reviewed on Mar 27, 2026 and left a comment:
The trade-off — sentence splitting quality:
- Currently sent_tokenize() (line 173) gets sentence boundaries from the parser (dependency-parse-based, more accurate).
- After this change, it uses the sentencizer (rule-based, splits on punctuation like .?!).
- This is less accurate for edge cases (abbreviations like "Dr. Smith", numbered lists, etc.) but faster and lighter.
I think this is why we see the ingest test failure (some minor changes). I would put parser back just to be safe.
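The class of edge case described above can be illustrated with a deliberately naive punctuation-based splitter. This is a stdlib-only sketch, not spaCy's sentencizer (whose tokenizer may treat some abbreviations differently); the function name `naive_sentence_split` is hypothetical and exists only for this illustration.

```python
import re


def naive_sentence_split(text: str) -> list[str]:
    """Split after '.', '?' or '!' followed by whitespace.

    Hypothetical helper: a stand-in for "rule-based, splits on
    punctuation", not the behavior of any spaCy component.
    """
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [p for p in parts if p]


sentences = naive_sentence_split("Dr. Smith arrived. He sat down.")
print(sentences)  # ['Dr.', 'Smith arrived.', 'He sat down.']
```

The abbreviation "Dr." is split off as its own sentence, which is the kind of boundary error a dependency-parse-based splitter is meant to avoid.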
Only tok2vec, tagger, and sentence splitting are used (pos_tag and sent_tokenize). Exclude ner, parser, lemmatizer, attribute_ruler when loading en_core_web_sm, and add lightweight sentencizer to replace the dependency parser for sentence boundary detection. Saves ~12 MiB of model weights per process.
Per review feedback, removing parser and using sentencizer causes sentence splitting regressions. Keep parser loaded, only exclude ner, lemmatizer, and attribute_ruler.
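A minimal sketch of what the revised loading could look like, assuming the model is loaded through `spacy.load` with its `exclude` keyword. The function name `load_spacy_model` and the injectable `loader` parameter are assumptions made for illustration; the actual helper in the repository may be structured differently.

```python
from typing import Any, Callable, Optional, Sequence

# Components the PR description lists as unused; the parser stays loaded.
EXCLUDED_COMPONENTS = ("ner", "lemmatizer", "attribute_ruler")


def load_spacy_model(
    name: str = "en_core_web_sm",
    exclude: Sequence[str] = EXCLUDED_COMPONENTS,
    loader: Optional[Callable[..., Any]] = None,
) -> Any:
    """Load a spaCy pipeline without the excluded components.

    `loader` is a hypothetical hook so the function can be exercised
    without spaCy installed; it defaults to `spacy.load`.
    """
    if loader is None:
        import spacy  # imported lazily so this module loads without spaCy

        loader = spacy.load
    return loader(name, exclude=list(exclude))
```

With spaCy and en_core_web_sm installed, `load_spacy_model().pipe_names` should no longer list the excluded components, while `parser` remains available for sentence boundaries.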
2291c23 to 23c4fff
Author (Collaborator):
Good call, updated to keep […]. Also rebased onto main and bumped to 0.22.9.
badGarnet approved these changes on Mar 31, 2026
cragwolfe approved these changes on Mar 31, 2026
Merged via the queue into Unstructured-IO:main with commit a3172f8 on Mar 31, 2026
53 of 54 checks passed
Only tok2vec, tagger, and sentence splitting are used (pos_tag and sent_tokenize). Exclude ner, lemmatizer, and attribute_ruler when loading en_core_web_sm, keeping parser for accurate sentence boundary detection. Saves ~14 MiB peak memory per process.

Benchmark
Azure Standard_D8s_v5 — 8 vCPU Intel Xeon Platinum 8473C, 32 GiB RAM, Python 3.12.12
test_benchmark_load_spacy_model: b6cf510684e5 (base) vs a3172f8eb66b (head)
test_benchmark_spacy_nlp_pipeline: b6cf510684e5 (base) vs a3172f8eb66b (head)
Generated by codeflash compare
Reproduce the benchmark locally
# Full comparison (timing + memory):
uv run codeflash compare b6cf510684e594d6c18e19129b6b8da668072b2d a3172f8eb66bb42674fbc70e18b49f4dbe1dc30b --memory \
  --inject benchmarks/test_benchmark_spacy_load.py \
  --inject benchmarks/__init__.py \
  --inject pyproject.toml

Benchmark test source
Test plan
codeflash compare --memory confirms -26% peak memory on _load_spacy_model (53.3 to 39.3 MiB)
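The reported figures are consistent with each other, as a quick arithmetic check shows. The helper `percent_change` is a hypothetical name used only here; the two inputs are the peak-memory values quoted in the test plan.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


base_mib = 53.3  # peak memory at the base commit, from the test plan
head_mib = 39.3  # peak memory at the head commit, from the test plan

saved_mib = base_mib - head_mib
change = percent_change(base_mib, head_mib)
print(f"saved {saved_mib:.1f} MiB ({change:.0f}%)")  # saved 14.0 MiB (-26%)
```

This matches both the "~14 MiB" in the description and the "-26%" in the test plan.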